I have a pandas dataframe with several dates and several values for each date. I’m trying to sum the values for each date, then divide by the number of records for that same date.
Example:
date | value |
---|---|
2022-09-16 | 1 |
2022-09-16 | 2 |
2022-09-16 | 3 |
2022-09-15 | 6 |
2022-09-15 | 2 |
2022-09-15 | 2 |
2022-09-14 | 7 |
The expected result would be a new dataframe containing, for each date, the sum of its values divided by the count of its values. E.g.:
date | value |
---|---|
2022-09-16 | 2 |
2022-09-15 | 3.3333 |
2022-09-14 | 7 |
2022-09-16 would be:
(1+2+3)/3
My current code for this function is the following so far:
dfyieldd = pd.read_sql_query(getmacrofactorvalues, conn)
dfyieldd.to_csv('dfyieldd_raw.csv')
resultseries = pd.Series(data=dfyieldd['rawvalue'])
resultzscored = zscoreoutliers(resultseries)
dfyieldd['value'] = resultzscored
dfyieldd.to_csv('dfyieldd_zcored.csv')
dfyieldd = dfyieldd.set_index('datadate')  # set_index returns a new frame, so assign it back
dfyieldd.index = pd.to_datetime(dfyieldd.index)
My question is whether there is any way to do this without a for-each loop. I could achieve it by looping through the dates, but that doesn’t feel like the proper way to deal with it.
Thanks
Answer
Use pandas.DataFrame.groupby to aggregate, then pandas.Series.div to divide the columns.
Try this:
out = (
    dfyieldd.groupby('date')['value'].agg(['sum', 'count'])
    .assign(value=lambda x: x.pop('sum').div(x.pop('count')))
    .reset_index()
)
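To see what that chain does step by step, here is a minimal sketch on the sample data from the question. The frame `df` and the names `agg` and `result` are assumptions made up for illustration; the real data would come from the SQL query.

```python
import pandas as pd

# Sample data from the question (hypothetical stand-in for dfyieldd)
df = pd.DataFrame({
    "date": ["2022-09-16"] * 3 + ["2022-09-15"] * 3 + ["2022-09-14"],
    "value": [1, 2, 3, 6, 2, 2, 7],
})

# Step 1: one row per date, holding the sum and the count of its values
agg = df.groupby("date")["value"].agg(["sum", "count"])

# Step 2: divide the two columns element-wise, then restore 'date' as a column
result = agg["sum"].div(agg["count"]).rename("value").reset_index()
print(result)
```

Splitting the work into two steps makes the intermediate sum/count table easy to inspect, which can help when the numbers look off.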
Or simply use mean on the grouped values, since the sum divided by the count is exactly the mean:
out = dfyieldd.groupby('date', as_index=False)['value'].mean()
# Output:
print(out)
         date     value
0  2022-09-14  7.000000
1  2022-09-15  3.333333
2  2022-09-16  2.000000
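The code in the question moves the dates into the index (`datadate`) and converts them to datetimes, so there may be no `date` column to group on. A minimal sketch for that case, assuming the index is named `datadate` and holds plain dates with no time of day (the frame `df` and the name `daily_mean` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the frame after set_index + pd.to_datetime
df = pd.DataFrame(
    {"value": [1, 2, 3, 6, 2, 2, 7]},
    index=pd.to_datetime(
        ["2022-09-16"] * 3 + ["2022-09-15"] * 3 + ["2022-09-14"]
    ),
)
df.index.name = "datadate"

# Group on the index level instead of a column, then average each group
daily_mean = df.groupby(level="datadate")["value"].mean()
print(daily_mean)
```

If the timestamps carried a time of day, grouping on `df.index.normalize()` instead would probably be needed so that all records from the same day fall into one group.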