The goal is to organize the data based on multiple columns and put contents of other columns in one cell, for example,
df:
name type link subs id 1 x tx 10 sub1 2 x tx 10 sub2 3 y rx 11 sub3 4 y rx 11 sub4 5 y rx 11 sub5
To organize the table based on name/type/link, expect to get:
name type link subs id 1 2 x tx 10 sub1 sub2 3 4 5 y rx 11 sub3 sub4 sub5
I know it is possible to organize the data stream based on multiple columns,
df.groupby(['name','type', 'link']).reset_index()
But I don’t know how to deal with the other two columns and put them in one cell. Thank you for any suggestions.
Advertisement
Answer
You can use groupby(...)[column].agg(...)
since this this is a reduction/aggregation. To aggregate, you simply want to join the contents into a single string so you can use something like " ".join
as your aggregation function like so:
new_df = ( df.reset_index() # we want to operate on the "id" as well .astype({"id": str}) # " ".join only works on strings, so make "id" string dtype .groupby(['name','type', 'link'], as_index=False)[["id", "subs"]] .agg(" ".join) ) print(new_df) name type link id subs 0 x tx 10 1 2 sub1 sub2 1 y rx 11 3 4 5 sub3 sub4 sub5
notes:
- passing
as_index=False
to the groupby statement puts the columns that we grouped by['name', 'type', 'link']
back into the dataframe as columns (if we didn’t specify they would make up theIndex
- in this line
.groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]
we don’t NEED to specify the columns since they’re the only remaining columns in our dataframe to work with. This snippet will run without selecting them manually, I just prefer explicit > implicit
At the end of the day, these data are mainly only useful to look at. Putting multiple values into a single string into a cell of a table makes those values very difficult to work with in the future.
To reinstate the ordering of your original dataframe you can:
- set
"id"
as the index (as your original dataframe had) - call
reindex
to reorder the columns of thenew_df
to be that of the originaldf
- ALTERNATIVELY, you can simply select the columns using bracket notation with the previous
df.columns
ordered_new_df = new_df.set_index("id").reindex(df.columns, axis="columns") # alternative (both lines lead to same output) ordered_new_df = new_df.set_index("id")[df.columns] print(ordered_new_df) name type link subs id 1 2 x tx 10 sub1 sub2 3 4 5 y rx 11 sub3 sub4 sub5
Note that you’ll also get the “correct” ordering by just calling new_df.set_index("id")
however the methods I showed are a more explicit.