The dataframe looks like this.
datetime hostname sessions
0 2020-10-27 00:00:05 server001 22
1 2020-10-27 00:00:10 server001 25
2 2020-10-27 00:00:15 server001 21
3 2020-10-27 01:00:05 server001 30
4 2020-10-27 01:00:10 server001 30
5 2020-10-27 01:00:15 server001 35
6 2020-10-27 00:00:05 server002 15
7 2020-10-27 00:00:10 server002 10
8 2020-10-27 00:00:15 server002 11
9 2020-10-27 01:00:05 server002 19
10 2020-10-27 01:00:10 server002 22
11 2020-10-27 01:00:15 server002 18
What I am trying to show the average sessions per hour by individual hostname.
So I would get something back like this.
datetime hostname sessions
0 2020-10-27 00:00:00 server001 23
1 2020-10-27 01:00:00 server001 32
2 2020-10-27 00:00:00 server002 12
3 2020-10-27 01:00:00 server002 20
I think I’m getting my grouping wrong as when trying this what I end up with is typically the largest average value per hour for any given hostname ordered in date by hour.
For example I may see something like
hostname datetime sessions
0 2020-10-27 server001 00:00:00 23
1 2020-10-27 01:00:00 32
2 2020-10-27 server002 02:00:00 12
3 2020-10-27 server003 03:00:00 20
Rather than the full 24 hours per hostname listed.
The code I tried was:
df = df.groupby(['hostname']).resample(
'H', on='datetime'
).agg({'sessions': 'mean'}).round(0).astype(int)
What do I need to do to get the desired result?
Advertisement
Answer
Here is an example based on the data you have provided. I have added the steps to make dates into datetime (in case they were objects) and to set datetime as a datetimeindex
in order to use resample
. It would go something like this:
import pandas as pd
import numpy as np
d ={'datetime' :['2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15','2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15'],
'hostname':['server001','server001','server001','server001','server001','server001','server002','server002','server002','server002','server002','server002'],
'sessions':[ 22,25,21 ,30,30,35,15,10, 11,19,22,18]}
df = pd.DataFrame(data=d)
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index(pd.DatetimeIndex(df['datetime']))
df.resample('H').mean()
Actually, you can modify this example to fit other purposes. As I understood your question, you want to calculate hourly mean number of sessions. Check the resample
-function if you need other groupby.s
The alternative to doing this is to seaprate date
and time
and then take the mean:
df['datetime'] = pd.to_datetime(df['datetime'])
df['Date'] = [x.strftime('%Y-%m-%d') for x in df['datetime'].tolist()]
df['Time'] = ['%s:00' % x.strftime('%H') for x in df['datetime'].tolist()]
df_1 = df.groupby(['Date', 'Time', 'hostname']).mean()
which gives