Skip to content
Advertisement

pandas data frame, group by multiple cols and put other columns’ contents in one

The goal is to organize the data based on multiple columns and put contents of other columns in one cell, for example,

df:

    name type link subs
id
1    x    tx   10   sub1
2    x    tx   10   sub2
3    y    rx   11   sub3
4    y    rx   11   sub4 
5    y    rx   11   sub5 

To organize the table based on name/type/link, expect to get:

        name type link subs
id
1 2      x   tx   10   sub1 sub2
3 4 5    y   rx   11   sub3 sub4 sub5

I know it is possible to organize the data stream based on multiple columns,

df.groupby(['name','type', 'link']).reset_index()

But I don’t know how to deal with the other two columns and put them in one cell. Thank you for any suggestions.

Advertisement

Answer

You can use groupby(...)[column].agg(...) since this this is a reduction/aggregation. To aggregate, you simply want to join the contents into a single string so you can use something like " ".join as your aggregation function like so:

new_df = (
    df.reset_index()        # we want to operate on the "id" as well
    .astype({"id": str})    # " ".join only works on strings, so make "id" string dtype
    .groupby(['name','type', 'link'], as_index=False)[["id", "subs"]]
    .agg(" ".join)
)

print(new_df)
  name type  link     id            subs
0    x   tx    10    1 2       sub1 sub2
1    y   rx    11  3 4 5  sub3 sub4 sub5

notes:

  • passing as_index=False to the groupby statement puts the columns that we grouped by ['name', 'type', 'link'] back into the dataframe as columns (if we didn’t specify they would make up the Index
  • in this line .groupby(['name','type', 'link'], as_index=False)[["id", "subs"]] we don’t NEED to specify the columns since they’re the only remaining columns in our dataframe to work with. This snippet will run without selecting them manually, I just prefer explicit > implicit

At the end of the day, these data are mainly only useful to look at. Putting multiple values into a single string into a cell of a table makes those values very difficult to work with in the future.


To reinstate the ordering of your original dataframe you can:

  • set "id" as the index (as your original dataframe had)
  • call reindex to reorder the columns of the new_df to be that of the original df
  • ALTERNATIVELY, you can simply select the columns using bracket notation with the previous df.columns
ordered_new_df = new_df.set_index("id").reindex(df.columns, axis="columns")

# alternative (both lines lead to same output)
ordered_new_df = new_df.set_index("id")[df.columns]

print(ordered_new_df)
      name type  link            subs
id
1 2      x   tx    10       sub1 sub2
3 4 5    y   rx    11  sub3 sub4 sub5

Note that you’ll also get the “correct” ordering by just calling new_df.set_index("id") however the methods I showed are a more explicit.

Advertisement