I have a dataframe with column names that start with a set list of prefixes. I want to get the sum of the values in the dataframe grouped by columns that start with the same prefix.
import pandas as pd

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]],
                  columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab','wx']
df

#    abc  abd  wxy  wxz
# 0    1    2    3    4
# 1    1    2    3    4
# 2    1    2    3    4
# 3    1    2    3    4
The only way I could figure out how to do it was to loop through the prefix list, select the columns whose names start with that prefix, and then sum the results.
results = []
for p in prefixes:
    # Sum every value in the columns whose name starts with this prefix
    results.append([p, df.loc[:, df.columns.str.startswith(p)].values.sum()])
results = pd.DataFrame(results)
results.set_index(keys=[0], drop=True).T

#    ab  wx
# 1  12  28
I hoped there was a more elegant way to do it, perhaps with groupby(), but I couldn’t figure it out.
Answer
First, determine which prefix each column belongs to. We then use that mapping to perform a groupby on the columns.
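# Label each column with the first prefix it starts with
grouper = [next(p for p in prefixes if c.startswith(p)) for c in df.columns]
# Note: axis=1 in groupby is deprecated since pandas 2.1
u = df.groupby(grouper, axis=1).sum()

#    ab  wx
# 0   3   7
# 1   3   7
# 2   3   7
# 3   3   7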
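On recent pandas releases, where `groupby(..., axis=1)` is deprecated, the same per-row sums can be obtained by transposing, grouping on the index, and transposing back. This is a minimal sketch, assuming the `df` and `prefixes` from the question and that every column starts with one of the prefixes; `prefix_of` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]] * 4, columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab', 'wx']

def prefix_of(col):
    # Hypothetical helper: first prefix that the column name starts with.
    # Raises StopIteration if no prefix matches.
    return next(p for p in prefixes if col.startswith(p))

# Transpose so column names become the index, group those rows by prefix,
# then transpose back to the original orientation.
per_row = df.T.groupby(prefix_of).sum().T
print(per_row)
```

Passing a function to `groupby` applies it to each index label, so no separate `grouper` list is needed here.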
Almost there. Now sum each group and transpose to get a single row:
u.sum().to_frame().T

#    ab  wx
# 0  12  28
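If only the grand totals are needed, one possible shortcut is to sum each column first and then group the resulting Series by prefix, which avoids grouping along columns altogether. This is a sketch under the same assumptions as above (the question's `df` and `prefixes`, every column matching some prefix); `totals` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]] * 4, columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab', 'wx']

# Label each column with the first prefix it starts with
grouper = [next(p for p in prefixes if c.startswith(p)) for c in df.columns]

# df.sum() gives one total per column (a Series indexed by column name);
# grouping that Series by the prefix labels adds up the totals per prefix.
totals = df.sum().groupby(grouper).sum().to_frame().T
print(totals)
```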
Another option is to use np.char.startswith and argmax to vectorize:
import numpy as np

# For each column, find the position of the first matching prefix
idx = np.char.startswith(
    df.columns.values[:, None].astype(str), prefixes).argmax(1)

(pd.Series(df.groupby(idx, axis=1).sum().sum().values, index=prefixes)
   .to_frame()
   .transpose())

#    ab  wx
# 0  12  28
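To see what the vectorized version is doing, it can help to look at the intermediate boolean matrix. The sketch below rebuilds the question's data, names the intermediate `mask` (a name introduced here), and adds a guard that is not in the answer: `argmax` returns 0 for a row with no `True` at all, so a column matching no prefix would otherwise be counted under the first prefix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]] * 4, columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab', 'wx']

cols = df.columns.to_numpy().astype(str)

# mask[i, j] is True when column i starts with prefix j
# (the column vector broadcasts against the list of prefixes)
mask = np.char.startswith(cols[:, None], prefixes)

# argmax gives the position of the first True in each row
idx = mask.argmax(axis=1)

# Guard (an addition, not part of the answer): fail loudly if some
# column matches no prefix instead of assigning it to prefixes[0].
assert mask.any(axis=1).all(), "some column matches no prefix"

# Sum each column, group the totals by prefix position, relabel by prefix
totals = df.sum().groupby(idx).sum()
totals.index = [prefixes[i] for i in totals.index]
print(totals.to_frame().T)
```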