I’m working on a large data with more than 60K rows.
I have continuous measurement of current in a column. A code is measured for a second where the equipment measures it for 14/15/16/17 times, depending on the equipment speed and then the measurement moves to the next code and again measures for 14/15/16/17 times and so forth. Every time measurement moves from one code to another, there is a jump of more than 0.15 on the current measurement
The data with top 48 rows is as follows,
Index | Curr(mA) |
---|---|
0 | 1.362476 |
1 | 1.341721 |
2 | 1.362477 |
3 | 1.362477 |
4 | 1.355560 |
5 | 1.348642 |
6 | 1.327886 |
7 | 1.341721 |
8 | 1.334804 |
9 | 1.334804 |
10 | 1.348641 |
11 | 1.362474 |
12 | 1.348644 |
13 | 1.355558 |
14 | 1.334805 |
15 | 1.362477 |
16 | 1.556172 |
17 | 1.542336 |
18 | 1.549252 |
19 | 1.528503 |
20 | 1.549254 |
21 | 1.528501 |
22 | 1.556173 |
23 | 1.556172 |
24 | 1.542334 |
25 | 1.556172 |
26 | 1.542336 |
27 | 1.542334 |
28 | 1.556170 |
29 | 1.535415 |
30 | 1.542334 |
31 | 1.729109 |
32 | 1.749863 |
33 | 1.749861 |
34 | 1.749861 |
35 | 1.736024 |
36 | 1.770619 |
37 | 1.742946 |
38 | 1.763699 |
39 | 1.749861 |
40 | 1.749861 |
41 | 1.763703 |
42 | 1.756781 |
43 | 1.742946 |
44 | 1.736026 |
45 | 1.756781 |
46 | 1.964308 |
47 | 1.957395 |
I want to write a script where similar data of 14/15/16/17 times is averaged in a separate column for each code measurement .. I have been thinking of doing this with pandas..
I want the data to look like
Index | Curr(mA) |
---|---|
0 | 1.34907 |
1 | 1.54556 |
2 | 1.74986 |
Need some help to get this done. Please help
Advertisement
Answer
First get the indexes of every row where there’s a jump. Use Pandas’ DataFrame.diff()
to get the difference between the value in each row and the previous row, then check to see if it’s greater than 0.15
with >
. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on if there are more columns in the source dataframe that you want in the output, or if it’s really just curr(mA)
and index. In the latter case, you can use np.split()
to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comphrension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)] > [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (same thing but as one-column dataframe rather than list) convert the list to pd.Series
and reset_index()
.
pd.Series( [df['Curr(mA)'].mean() for df in np.split(df, indices)] ).reset_index(drop=True) index 0 0 0 1.349073 1 1 1.545564 2 2 1.749863 3 3 1.960851