I have a pandas dataframe:
Server Clock 1 Clock 2 Power diff 0 PhysicalWindows1 3400 3300.0 58.5 100.0 1 PhysicalWindows1 3400 3500.0 63.0 100.0 2 PhysicalWindows1 3400 2900.0 25.0 500.0 3 PhysicalWindows2 3600 3300.0 83.8 300.0 4 PhysicalWindows2 3600 3500.0 65.0 100.0 5 PhysicalWindows2 3600 2900.0 10.0 700.0 6 PhysicalLinux1 2600 NaN NaN NaN 7 PhysicalLinux1 2600 NaN NaN NaN 8 Test 2700 2700.0 30.0 0.0
Basically, I would like to average the Power for each server but only if the difference is minimum. For example, if you look at the ‘PhysicalWindows1’ server, I have 3 rows, two have a diff of 100, and one has a diff of 500. Since I have two rows with a diff of 100, I would like to average out my Power of 58.5 and 63.0. For ‘PhysicalWindows2’, since there is only one row which has the least diff, we return the power for that one row – 65. If NaN, return Nan, and if there is only one match, return the power for that one match.
My resultant dataframe would look like this:
Server Clock 1 Power 0 PhysicalWindows1 3400 (58.5+63.0)/2 1 PhysicalWindows2 3600 65.0 2 PhysicalLinux1 2600 NaN 3 Test 2700 30.0
Advertisement
Answer
Use groupby
with dropna=False
to avoid to remove PhysicalLinux1 and sort=True
to sort index level (lowest diff on top) then drop_duplicates
to keep only one instance of (Server, Clock 1):
out = (df.groupby(['Server', 'Clock 1', 'diff'], dropna=False, sort=True)['Power'] .mean().droplevel('diff').reset_index().drop_duplicates(['Server', 'Clock 1'])) # Output Server Clock 1 Power 0 PhysicalLinux1 2600 NaN 1 PhysicalWindows1 3400 60.75 3 PhysicalWindows2 3600 65.00 6 Test 2700 30.00