I have a dataframe extracted from a csv file. I want to iterate a data process where only some of the columns’s data is the mean of n rows, while the rest of the columns is the first row for each iteration.
For example, the data extracted from the csv consisted of 100 rows and 6 columns. I have a variable n_AVE = 6, which tells the code to average the data per 6 rows.
rawDf = pd.read_csv(outputFilePath / 'Raw_data.csv', encoding='CP932') OUT: TIME A B C D E 0 2021/3/4 148 0 142 0 1 [0] 1 2021/3/5 148 0 142 0 1 2 2021/3/6 150 0 148 0 1 3 2021/3/7 150 0 148 0 1 4 2021/3/8 151 0 148 0 1 5 2021/3/9 151 0 148 0 1 .... 91 2021/4/30 195 5 180 0 1 [5] 92 2021/5/1 195 5 180 0 1 93 2021/5/2 195 5 180 0 1 94 2021/5/3 200 5 180 0 1 95 2021/5/4 200 0 200 0 1 96 2021/5/5 200 5 200 0 1 [6] 97 2021/5/6 200 5 200 1 1 98 2021/5/7 200 5 200 1 1 99 2021/5/8 205 5 210 1 1 100 2021/5/9 205 5 210 1 1
Take only the first row of [TIME, D, E] columns
Average the data per n_AVE (6) from [A, B, C] columns.
I want to create a new dataframe which looks like this
OUT: TIME A B C D E 0 2021/3/4 149.66 0 146 0 1 .... 5 2021/4/30 197.5 4.166 186.66 0 1 6 2021/5/5 168.33 5 170 0 1
The code is like this:
for x in range(0,len(rawDf.index), n_AVE): df = pd.DataFrame([rawDf.iloc[[x],0], rawDf.iloc[x:(x + n_AVE),1:3].mean(), rawDf.iloc[x,4:5]])
But the code is not working because apparently when I use pandas.mean(), the dataframe’s format changed into like this
df2 = rawDf.iloc[0:6,1:3].mean() print(df2) OUT: index 0 0 A 149.66 1 B 0.0 2 C 146.0 [3 rows x 2 columns]
How to use pandas.mean() without losing the old format?
Or should I not use pandas.mean() and just create my own averaging code?
Advertisement
Answer
You can group the dataframe by the grouper np.arange(len(df)) // 6
which groups the dataframe every six rows, then aggregate the columns using the desired aggregation functions to get the result, optionally reindex
along axis=1
to reorder the columns
d = { 'A': 'mean', 'B': 'mean', 'C': 'mean', 'TIME': 'first', 'D': 'first', 'E': 'first' } df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)
Define aggegation functions using columns index:
d = { **dict.fromkeys(df.columns[[0, 4, 5]], 'first'), **dict.fromkeys(df.columns[[1, 2, 3]], 'mean' ) } df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)
Result
TIME A B C D E 0 2021/3/4 149.666667 0.000000 146.000000 0 1 1 2021/4/30 197.500000 4.166667 186.666667 0 1 2 2021/5/6 202.500000 5.000000 205.000000 1 1