The dataframe looks like this.
datetime hostname sessions 0 2020-10-27 00:00:05 server001 22 1 2020-10-27 00:00:10 server001 25 2 2020-10-27 00:00:15 server001 21 3 2020-10-27 01:00:05 server001 30 4 2020-10-27 01:00:10 server001 30 5 2020-10-27 01:00:15 server001 35 6 2020-10-27 00:00:05 server002 15 7 2020-10-27 00:00:10 server002 10 8 2020-10-27 00:00:15 server002 11 9 2020-10-27 01:00:05 server002 19 10 2020-10-27 01:00:10 server002 22 11 2020-10-27 01:00:15 server002 18
What I am trying to show the average sessions per hour by individual hostname.
So I would get something back like this.
datetime hostname sessions 0 2020-10-27 00:00:00 server001 23 1 2020-10-27 01:00:00 server001 32 2 2020-10-27 00:00:00 server002 12 3 2020-10-27 01:00:00 server002 20
I think I’m getting my grouping wrong as when trying this what I end up with is typically the largest average value per hour for any given hostname ordered in date by hour.
For example I may see something like
hostname datetime sessions 0 2020-10-27 server001 00:00:00 23 1 2020-10-27 01:00:00 32 2 2020-10-27 server002 02:00:00 12 3 2020-10-27 server003 03:00:00 20
Rather than the full 24 hours per hostname listed.
The code I tried was:
df = df.groupby(['hostname']).resample( 'H', on='datetime' ).agg({'sessions': 'mean'}).round(0).astype(int)
What do I need to do to get the desired result?
Advertisement
Answer
Here is an example based on the data you have provided. I have added the steps to make dates into datetime (in case they were objects) and to set datetime as a datetimeindex
in order to use resample
. It would go something like this:
import pandas as pd import numpy as np d ={'datetime' :['2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15','2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15'], 'hostname':['server001','server001','server001','server001','server001','server001','server002','server002','server002','server002','server002','server002'], 'sessions':[ 22,25,21 ,30,30,35,15,10, 11,19,22,18]} df = pd.DataFrame(data=d) df['datetime'] = pd.to_datetime(df['datetime']) df = df.set_index(pd.DatetimeIndex(df['datetime'])) df.resample('H').mean()
Actually, you can modify this example to fit other purposes. As I understood your question, you want to calculate hourly mean number of sessions. Check the resample
-function if you need other groupby.s
The alternative to doing this is to seaprate date
and time
and then take the mean:
df['datetime'] = pd.to_datetime(df['datetime']) df['Date'] = [x.strftime('%Y-%m-%d') for x in df['datetime'].tolist()] df['Time'] = ['%s:00' % x.strftime('%H') for x in df['datetime'].tolist()] df_1 = df.groupby(['Date', 'Time', 'hostname']).mean()
which gives