Skip to content
Advertisement

Average for similar looking data in a column using Pandas

I’m working on a large data with more than 60K rows.

I have continuous measurement of current in a column. A code is measured for a second where the equipment measures it for 14/15/16/17 times, depending on the equipment speed and then the measurement moves to the next code and again measures for 14/15/16/17 times and so forth. Every time measurement moves from one code to another, there is a jump of more than 0.15 on the current measurement

The data with top 48 rows is as follows,

Index Curr(mA)
0 1.362476
1 1.341721
2 1.362477
3 1.362477
4 1.355560
5 1.348642
6 1.327886
7 1.341721
8 1.334804
9 1.334804
10 1.348641
11 1.362474
12 1.348644
13 1.355558
14 1.334805
15 1.362477
16 1.556172
17 1.542336
18 1.549252
19 1.528503
20 1.549254
21 1.528501
22 1.556173
23 1.556172
24 1.542334
25 1.556172
26 1.542336
27 1.542334
28 1.556170
29 1.535415
30 1.542334
31 1.729109
32 1.749863
33 1.749861
34 1.749861
35 1.736024
36 1.770619
37 1.742946
38 1.763699
39 1.749861
40 1.749861
41 1.763703
42 1.756781
43 1.742946
44 1.736026
45 1.756781
46 1.964308
47 1.957395

I want to write a script where similar data of 14/15/16/17 times is averaged in a separate column for each code measurement .. I have been thinking of doing this with pandas..

I want the data to look like

Index Curr(mA)
0 1.34907
1 1.54556
2 1.74986

Need some help to get this done. Please help

Advertisement

Answer

First get the indexes of every row where there’s a jump. Use Pandas’ DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it’s greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.

indices = df.index[df['Curr(mA)'].diff() > 0.15]

The next steps depend on if there are more columns in the source dataframe that you want in the output, or if it’s really just curr(mA) and index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comphrension.

[df['Curr(mA)'].mean() for df in np.split(df, indices)]

> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]

To get it to match your desired output above (same thing but as one-column dataframe rather than list) convert the list to pd.Series and reset_index().

pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)

index   0
0   0   1.349073
1   1   1.545564
2   2   1.749863
3   3   1.960851
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement