Skip to content
Advertisement

python – find duplicates in a column, replace values in another column for that duplicate

I have a dataframe that consists of of video game titles on various platforms. it contains, among other values the name, critic’s average score and user’s average score. Many of them are missing scores for the user, critic and/or ESRB rating.

What i’d like to do is replace the missing rating, critic and user scores with those for the same game on a different platform (assuming they exist) i’m not quite sure how to approach this.(note – i don’t want to drop the duplicate names, because they aren’t truly duplicate rows)

here is a sample chunk of the dataframe (i’ve removed some unrelated columns to make it manageable):

                                            name platform  critic_score      user_score rating  
0                                     Wii Sports      Wii          76.0   0          8.0      E  
1                              Super Mario Bros.      NES           NaN   1          NaN    NaN  
2                                 Mario Kart Wii      Wii          82.0   2          8.3      E  
3                              Wii Sports Resort      Wii          80.0   3          8.0      E  
4                       Pokemon Red/Pokemon Blue       GB           NaN   4          NaN    NaN  
5                                         Tetris       GB           NaN   5          NaN    NaN  
6                          New Super Mario Bros.       DS          89.0   6          8.5      E  
7                                       Wii Play      Wii          58.0   7          6.6      E  
8                      New Super Mario Bros. Wii      Wii          87.0   8          8.4      E  
9                                      Duck Hunt      NES           NaN   9          NaN    NaN  
10                                    Nintendogs       DS           NaN   10         NaN    NaN  
11                                 Mario Kart DS       DS          91.0   11         8.6      E  
12                   Pokemon Gold/Pokemon Silver       GB           NaN   12         NaN    NaN  
13                                       Wii Fit      Wii          80.0   13         7.7      E  
14                            Kinect Adventures!     X360          61.0   14         6.3      E  
15                                  Wii Fit Plus      Wii          80.0   15         7.4      E  
16                            Grand Theft Auto V      PS3          97.0   16         8.2      M  
17                 Grand Theft Auto: San Andreas      PS2          95.0   17         9.0      M  
18                             Super Mario World     SNES           NaN   18         NaN    NaN  
19  Brain Age: Train Your Brain in Minutes a Day       DS          77.0   19         7.9      E  
20                 Pokemon Diamond/Pokemon Pearl       DS           NaN   20         NaN    NaN  
21                              Super Mario Land       GB           NaN   21         NaN    NaN  
22                           Super Mario Bros. 3      NES           NaN   22         NaN    NaN  
23                            Grand Theft Auto V     X360          97.0   23         8.1      M  
24                   Grand Theft Auto: Vice City      PS2          95.0   24         8.7      M  
25                 Pokemon Ruby/Pokemon Sapphire      GBA           NaN   25         NaN    NaN  
26   Brain Age 2: More Training in Minutes a Day       DS          77.0   26         7.1      E  
27                   Pokemon Black/Pokemon White       DS           NaN   27         NaN    NaN  
28                        Gran Turismo 3: A-Spec      PS2          95.0   28         8.4      E  
29                Call of Duty: Modern Warfare 3     X360          88.0   29         3.4      M  

now, there don’t happen to be any duplicates that stick out in this head 30 lines, but for instance i have 007: Quantum of Solace on the PS3, Wii, DS, PC and x360. between all of the platforms i have a mean rating for both users and critics, as well as a rating.

as requested – here is a sample of some duplicated values:

index   name    platform    critic_Score    user_score  rating
3862    Frozen: Olaf's Quest    DS  NaN NaN NaN
3358    Frozen: Olaf's Quest    3DS NaN NaN NaN
1785    007: Quantum of Solace  PS3 65  6.6 T
3120    007: Quantum of Solace  Wii 54  7.5 T
9507    007: Quantum of Solace  DS  65  NaN T
4475    007: Quantum of Solace  PS2 NaN NaN NaN
1285    007: Quantum of Solace  X360    65  7.1 T
14658   007: Quantum of Solace  PC  70  6.3 T
2243    007: The World is not Enough    PS  61  6.7 T
1204    007: The World is not Enough    N64 NaN NaN NaN

i’ve separated my duplicates into their own dataframe (df1 is my original games dataframe, df2 is the duplicates dataframe):

df2 = df[df.duplicated(['name'],False)]
df2 = df2.sort_values(['name'])

so i can see my duplicates and their values, but of course i don’t wanna fill in 8500 missing values from duplicates by hand.

I can find the duplicated names, but i don’t know how to fill the NaN values with the “good” values from the other platform?

i’m at a loss for how to begin this and would appreciate any input into a direction.

now – to add another step to it – in my example above of the 007 game – the critic and user scores aren’t the same across platforms (the ps3 game got a 65, the wii game got a 54 and PC a 70) calculating the mean of the 3 should be the ideal solution, but i’ll settle for ANY of the platforms if that is too complex (as you might have guessed, i am very new to python)

I appreciate any time and effort you have to share on my behalf.

Regards,

Jared

Advertisement

Answer

I’m pretty sure pandas.DataFrame.groupby is what you need:

df.groupby("name").mean()

If you want to join these results with you dataframe, you can use:

df.join(df.groupby("name").mean(), on = "name", rsuffix = "_mean")?
Advertisement