pandas data frame, group by multiple cols and put other columns’ contents in one

Question

The goal is to group the data by multiple columns and put the contents of the other columns into one cell, for example, df: To group the table by name/type/link, expect to get: I know it is possible to group the data frame by multiple columns, but I don't know how to deal with the othe…

Accepted Answer

You can use groupby(...)[column].agg(...), since this is a reduction/aggregation. To aggregate, you simply want to join the contents into a single string, so you can use something like " ".join as your aggregation function, like so:

    new_df = (
        df.reset_index()        # we want to operate on the "id" as well
        .astype({"id": str})    # " ".join only works on strings, so make "id" string dtype
        .groupby(['name', 'type', 'link'], as_index=False)[["id", "subs"]]
        .agg(" ".join)
    )
    print(new_df)

      name type  link     id            subs
    0    x   tx    10    1 2       sub1 sub2
    1    y   rx    11  3 4 5  sub3 sub4 sub5

Notes:

- Passing as_index=False to the groupby statement puts the columns that we grouped by, ['name', 'type', 'link'], back into the dataframe as columns (if we didn't specify it, they would make up the Index).
- In the line .groupby(['name', 'type', 'link'], as_index=False)[["id", "subs"]] we don't NEED to specify the columns, since they're the only remaining columns in our dataframe to work with. This snippet will run without selecting them manually; I just prefer explicit > implicit.
- At the end of the day, these data are mainly only useful to look at. Putting multiple values into a single string in a cell of a table makes those values very difficult to work with in the future.

To reinstate the ordering of your original dataframe you can:

- set "id" as the index (as your original dataframe had)
- call reindex to reorder the columns of new_df to be that of the original df

ALTERNATIVELY, you can simply select the columns using bracket notation with the previous df.columns:

    ordered_new_df = new_df.set_index("id").reindex(df.columns, axis="columns")

    # alternative (both lines lead to same output)
    ordered_new_df = new_df.set_index("id")[df.columns]
    print(ordered_new_df)

          name type  link            subs
    id
    1 2      x   tx    10       sub1 sub2
    3 4 5    y   rx    11  sub3 sub4 sub5

Note that you'll also get the "correct" ordering by just calling new_df.set_index("id"); however, the methods I showed are more explicit.
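Since the question's original table is not reproduced here, the following is a minimal self-contained sketch of the approach. The sample data is an assumption, reconstructed from the printed output above, and the names keys, joined, listed and restored are only illustrative. It also shows one possible way around the "hard to work with" problem: aggregating into lists instead of strings, which can later be expanded back with explode (passing several columns to explode assumes pandas 1.3 or newer).

```python
import pandas as pd

# Hypothetical sample data, reconstructed from the printed output above;
# the original table from the question is not available.
df = pd.DataFrame(
    {
        "name": ["x", "x", "y", "y", "y"],
        "type": ["tx", "tx", "rx", "rx", "rx"],
        "link": [10, 10, 11, 11, 11],
        "subs": ["sub1", "sub2", "sub3", "sub4", "sub5"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

keys = ["name", "type", "link"]

# Same approach as the answer: join each group's values into one string.
joined = (
    df.reset_index()
    .astype({"id": str})  # " ".join only accepts strings
    .groupby(keys, as_index=False)[["id", "subs"]]
    .agg(" ".join)
)
print(joined)

# Alternative sketch: keep each group's values as a Python list, so no
# string parsing is needed later and "id" keeps its integer values.
listed = (
    df.reset_index()
    .groupby(keys, as_index=False)[["id", "subs"]]
    .agg(list)
)
print(listed)

# The lists can be expanded back to one row per original value.
restored = listed.explode(["id", "subs"]).reset_index(drop=True)
print(restored)
```

Whether strings or lists are the better fit depends on what the grouped table is for: joined strings are easier to read or export, while lists are easier to process further.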


Answer