pandas groupby dataframes, calculate diffs between consecutive rows

Tags: , , ,



Using pandas, I open some csv files in a loop and set the index to the cycleID column, except the cycleID column is not unique. See below:

for filename in all_files:
    abfdata = pd.read_csv(filename, index_col=None, header=0)
    abfdata = abfdata.set_index("cycleID", drop=False)

for index, row in abfdata.iterrows():
    print(row['cycleID'], row['mean'])

This prints the 2 columns (cycleID and mean) of the dataframe I am interested in for further computations:

1 1.5020712104685252e-11
1 6.56683605063102e-12
2 1.3993315187144084e-11
2 -8.670502467042485e-13
3 7.0270625256163566e-12
3 9.509995221868016e-12
4 1.2901435995915644e-11
4 9.513106448422182e-12

The objective is to use the rows corresponding to the same cycleID and calculate the difference between the mean column values. So, if there are 8 rows in the table, the final array or list would store 4 values.

I want to make it scalable as well where there can be 3 or more rows with the same cycleIDs. In that case, each cycleID could have 2 or more mean differences.


Update: Instead of creating a new ques about it, I thought I’d add here. I used the diff and groupby approach as mentioned in the solution. It works great but I have this extra need to save one of the mean values (odd row or even row doesn’t matter) in a new column and make that part of the new data frame as well. How do I do that?

Answer

You can use groupby

s2= df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)

output

1   -8.453876e-12
3   -1.486037e-11
5   2.482933e-12
7   -3.388330e-12
8   3.000000e-12

UPDATE

d = [[1, 1.5020712104685252e-11],
[1, 6.56683605063102e-12],
[2, 1.3993315187144084e-11],
[2, -8.670502467042485e-13],
[3, 7.0270625256163566e-12],
[3, 9.509995221868016e-12],
[4, 1.2901435995915644e-11],
[4, 9.513106448422182e-12]]

df = pd.DataFrame(d, columns=['cycleID', 'mean'])



df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
df2['mean'] = df['mean'].iloc[df2.index]
       difference    mean
1   -8.453876e-12   6.566836e-12
3   -1.486037e-11   -8.670502e-13
5   2.482933e-12    9.509995e-12
7   -3.388330e-12   9.513106e-12


Source: stackoverflow