So i have a dataframe with two columns: artistID and genre:
artistID genre 0 52 rock 1 63 pop 2 73 salsa 3 94 reggaeton 4 6177 rock 5 64 salsa 6 862 metal 7 52 pop 8 63 hiphop 9 64 jazz 10 52 metal 11 63 electro 12 73 latino 13 94 trap 14 6177 pop 15 64 latino 15 456 hiphop
And what I want to do is to group by the column artistID (so the resulting datafdrame has as many rows as artistID there are in this dataframe), and the second column of the new dataframe I want it to be like a list or an array or whatever it is convenient of all the unique genres to which each artistID has been taged. So I want the resultind dataframe to look like this:
artistID genre 0 52 [rock, pop, metal] 1 63 [pop, electro, hiphop] 2 73 [salsa, latino] 3 94 [reggaeton, trap] 4 6177 [rock, pop] 5 64 [salsa, jazz, latino] 6 862 [metal] 7 456 [hiphop]
How can I do this ?
I also must say, this dataframe is just an wxample, my real dataframe has almost 200.000 rows and 20.000 different artistID
Advertisement
Answer
Use Groupby.agg:
In [2237]: df.groupby('artistID')['genre'].agg(set).reset_index()
OR:
In [2240]: df.groupby('artistID')['genre'].apply(lambda x: list(set(x)))
Out[2237]: 
   artistID                   genre
0        52      [rock, pop, metal]
1        63  [pop, hiphop, electro]
2        64   [salsa, jazz, latino]
3        73         [salsa, latino]
4        94       [reggaeton, trap]
5       456                [hiphop]
6       862                 [metal]
7      6177             [rock, pop]