The goal is to organize the data based on multiple columns and put contents of other columns in one cell, for example,
df:
name type link subs
id
1 x tx 10 sub1
2 x tx 10 sub2
3 y rx 11 sub3
4 y rx 11 sub4
5 y rx 11 sub5
To organize the table based on name/type/link, expect to get:
name type link subs
id
1 2 x tx 10 sub1 sub2
3 4 5 y rx 11 sub3 sub4 sub5
I know it is possible to organize the data stream based on multiple columns,
df.groupby(['name','type', 'link']).reset_index()
But I don’t know how to deal with the other two columns and put them in one cell. Thank you for any suggestions.
Advertisement
Answer
You can use groupby(...)[column].agg(...)
since this this is a reduction/aggregation. To aggregate, you simply want to join the contents into a single string so you can use something like " ".join
as your aggregation function like so:
new_df = (
df.reset_index() # we want to operate on the "id" as well
.astype({"id": str}) # " ".join only works on strings, so make "id" string dtype
.groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]
.agg(" ".join)
)
print(new_df)
name type link id subs
0 x tx 10 1 2 sub1 sub2
1 y rx 11 3 4 5 sub3 sub4 sub5
notes:
- passing
as_index=False
to the groupby statement puts the columns that we grouped by['name', 'type', 'link']
back into the dataframe as columns (if we didn’t specify they would make up theIndex
- in this line
.groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]
we don’t NEED to specify the columns since they’re the only remaining columns in our dataframe to work with. This snippet will run without selecting them manually, I just prefer explicit > implicit
At the end of the day, these data are mainly only useful to look at. Putting multiple values into a single string into a cell of a table makes those values very difficult to work with in the future.
To reinstate the ordering of your original dataframe you can:
- set
"id"
as the index (as your original dataframe had) - call
reindex
to reorder the columns of thenew_df
to be that of the originaldf
- ALTERNATIVELY, you can simply select the columns using bracket notation with the previous
df.columns
ordered_new_df = new_df.set_index("id").reindex(df.columns, axis="columns")
# alternative (both lines lead to same output)
ordered_new_df = new_df.set_index("id")[df.columns]
print(ordered_new_df)
name type link subs
id
1 2 x tx 10 sub1 sub2
3 4 5 y rx 11 sub3 sub4 sub5
Note that you’ll also get the “correct” ordering by just calling new_df.set_index("id")
however the methods I showed are a more explicit.