I have a dataframe that consists of of video game titles on various platforms. it contains, among other values the name, critic’s average score and user’s average score. Many of them are missing scores for the user, critic and/or ESRB rating.
What i’d like to do is replace the missing rating, critic and user scores with those for the same game on a different platform (assuming they exist) i’m not quite sure how to approach this.(note – i don’t want to drop the duplicate names, because they aren’t truly duplicate rows)
here is a sample chunk of the dataframe (i’ve removed some unrelated columns to make it manageable):
name platform critic_score user_score rating 0 Wii Sports Wii 76.0 0 8.0 E 1 Super Mario Bros. NES NaN 1 NaN NaN 2 Mario Kart Wii Wii 82.0 2 8.3 E 3 Wii Sports Resort Wii 80.0 3 8.0 E 4 Pokemon Red/Pokemon Blue GB NaN 4 NaN NaN 5 Tetris GB NaN 5 NaN NaN 6 New Super Mario Bros. DS 89.0 6 8.5 E 7 Wii Play Wii 58.0 7 6.6 E 8 New Super Mario Bros. Wii Wii 87.0 8 8.4 E 9 Duck Hunt NES NaN 9 NaN NaN 10 Nintendogs DS NaN 10 NaN NaN 11 Mario Kart DS DS 91.0 11 8.6 E 12 Pokemon Gold/Pokemon Silver GB NaN 12 NaN NaN 13 Wii Fit Wii 80.0 13 7.7 E 14 Kinect Adventures! X360 61.0 14 6.3 E 15 Wii Fit Plus Wii 80.0 15 7.4 E 16 Grand Theft Auto V PS3 97.0 16 8.2 M 17 Grand Theft Auto: San Andreas PS2 95.0 17 9.0 M 18 Super Mario World SNES NaN 18 NaN NaN 19 Brain Age: Train Your Brain in Minutes a Day DS 77.0 19 7.9 E 20 Pokemon Diamond/Pokemon Pearl DS NaN 20 NaN NaN 21 Super Mario Land GB NaN 21 NaN NaN 22 Super Mario Bros. 3 NES NaN 22 NaN NaN 23 Grand Theft Auto V X360 97.0 23 8.1 M 24 Grand Theft Auto: Vice City PS2 95.0 24 8.7 M 25 Pokemon Ruby/Pokemon Sapphire GBA NaN 25 NaN NaN 26 Brain Age 2: More Training in Minutes a Day DS 77.0 26 7.1 E 27 Pokemon Black/Pokemon White DS NaN 27 NaN NaN 28 Gran Turismo 3: A-Spec PS2 95.0 28 8.4 E 29 Call of Duty: Modern Warfare 3 X360 88.0 29 3.4 M
now, there don’t happen to be any duplicates that stick out in this head 30 lines, but for instance i have 007: Quantum of Solace on the PS3, Wii, DS, PC and x360. between all of the platforms i have a mean rating for both users and critics, as well as a rating.
as requested – here is a sample of some duplicated values:
index name platform critic_Score user_score rating 3862 Frozen: Olaf's Quest DS NaN NaN NaN 3358 Frozen: Olaf's Quest 3DS NaN NaN NaN 1785 007: Quantum of Solace PS3 65 6.6 T 3120 007: Quantum of Solace Wii 54 7.5 T 9507 007: Quantum of Solace DS 65 NaN T 4475 007: Quantum of Solace PS2 NaN NaN NaN 1285 007: Quantum of Solace X360 65 7.1 T 14658 007: Quantum of Solace PC 70 6.3 T 2243 007: The World is not Enough PS 61 6.7 T 1204 007: The World is not Enough N64 NaN NaN NaN
i’ve separated my duplicates into their own dataframe (df1 is my original games dataframe, df2 is the duplicates dataframe):
df2 = df[df.duplicated(['name'],False)] df2 = df2.sort_values(['name'])
so i can see my duplicates and their values, but of course i don’t wanna fill in 8500 missing values from duplicates by hand.
I can find the duplicated names, but i don’t know how to fill the NaN values with the “good” values from the other platform?
i’m at a loss for how to begin this and would appreciate any input into a direction.
now – to add another step to it – in my example above of the 007 game – the critic and user scores aren’t the same across platforms (the ps3 game got a 65, the wii game got a 54 and PC a 70) calculating the mean of the 3 should be the ideal solution, but i’ll settle for ANY of the platforms if that is too complex (as you might have guessed, i am very new to python)
I appreciate any time and effort you have to share on my behalf.
Regards,
Jared
Advertisement
Answer
I’m pretty sure pandas.DataFrame.groupby
is what you need:
df.groupby("name").mean()
If you want to join these results with you dataframe, you can use:
df.join(df.groupby("name").mean(), on = "name", rsuffix = "_mean")?