The goal is to organize the data based on multiple columns and put contents of other columns in one cell, for example,
df:
name type link subs id 1 x tx 10 sub1 2 x tx 10 sub2 3 y rx 11 sub3 4 y rx 11 sub4 5 y rx 11 sub5
To organize the table based on name/type/link, expect to get:
name type link subs id 1 2 x tx 10 sub1 sub2 3 4 5 y rx 11 sub3 sub4 sub5
I know it is possible to organize the data stream based on multiple columns,
df.groupby(['name','type', 'link']).reset_index()
But I don’t know how to deal with the other two columns and put them in one cell. Thank you for any suggestions.
Advertisement
Answer
You can use groupby(...)[column].agg(...) since this this is a reduction/aggregation. To aggregate, you simply want to join the contents into a single string so you can use something like " ".join as your aggregation function like so:
new_df = (
    df.reset_index()        # we want to operate on the "id" as well
    .astype({"id": str})    # " ".join only works on strings, so make "id" string dtype
    .groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]
    .agg(" ".join)
)
print(new_df)
  name type  link     id            subs
0    x   tx    10    1 2       sub1 sub2
1    y   rx    11  3 4 5  sub3 sub4 sub5
notes:
- passing as_index=Falseto the groupby statement puts the columns that we grouped by['name', 'type', 'link']back into the dataframe as columns (if we didn’t specify they would make up theIndex
- in this line .groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]we don’t NEED to specify the columns since they’re the only remaining columns in our dataframe to work with. This snippet will run without selecting them manually, I just prefer explicit > implicit
At the end of the day, these data are mainly only useful to look at. Putting multiple values into a single string into a cell of a table makes those values very difficult to work with in the future.
To reinstate the ordering of your original dataframe you can:
- set "id"as the index (as your original dataframe had)
- call reindexto reorder the columns of thenew_dfto be that of the originaldf
- ALTERNATIVELY, you can simply select the columns using bracket notation with the previous df.columns
ordered_new_df = new_df.set_index("id").reindex(df.columns, axis="columns")
# alternative (both lines lead to same output)
ordered_new_df = new_df.set_index("id")[df.columns]
print(ordered_new_df)
      name type  link            subs
id
1 2      x   tx    10       sub1 sub2
3 4 5    y   rx    11  sub3 sub4 sub5
Note that you’ll also get the “correct” ordering by just calling new_df.set_index("id") however the methods I showed are a more explicit.