I have a dataframe with column names that start with a set list of prefixes. I want to get the sum of the values in the dataframe grouped by columns that start with the same prefix.
df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]], columns=['abc', 'abd', 'wxy', 'wxz']) prefixes = ['ab','wx'] df abc abd wxy wxz 0 1 2 3 4 1 1 2 3 4 2 1 2 3 4 3 1 2 3 4
The only way I could figure out how to do it was to loop through the prefix list, get the columns from the dataframe that startwith that string, and then sum the results.
results = [] for p in prefixes: results.append([p, df.loc[:, df.columns.str.startswith(p)].values.sum()]) results = pd.DataFrame(results,) results.set_index(keys=[0], drop=True).T ab wx 1 12 28
I hoped there was a more elegant way to do it, perhaps with groupby(), but I couldn’t figure it out.
Advertisement
Answer
First, it is necessary to determine what columns contain what prefix. We then use this to perform a groupby
.
grouper = [next(p for p in prefixes if p in c) for c in df.columns] u = df.groupby(grouper, axis=1).sum() ab wx 0 3 7 1 3 7 2 3 7 3 3 7
Almost there, now,
u.sum().to_frame().T ab wx 0 12 28
Another option is using np.char.startswith
and argmax
to vectorize:
idx = np.char.startswith( df.columns.values[:, None].astype(str), prefixes).argmax(1) (pd.Series(df.groupby(idx, axis=1).sum().sum().values, index=prefixes) .to_frame() .transpose()) ab wx 0 12 28