Pandas Grouping by Hostname. Average of Sessions(on host) by Hour

Tags: , , , ,



The dataframe looks like this.

              datetime   hostname  sessions
0  2020-10-27 00:00:05  server001        22
1  2020-10-27 00:00:10  server001        25
2  2020-10-27 00:00:15  server001        21
3  2020-10-27 01:00:05  server001        30
4  2020-10-27 01:00:10  server001        30
5  2020-10-27 01:00:15  server001        35
6  2020-10-27 00:00:05  server002        15
7  2020-10-27 00:00:10  server002        10
8  2020-10-27 00:00:15  server002        11
9  2020-10-27 01:00:05  server002        19
10 2020-10-27 01:00:10  server002        22
11 2020-10-27 01:00:15  server002        18

What I am trying to show the average sessions per hour by individual hostname.

So I would get something back like this.

              datetime   hostname  sessions
0  2020-10-27 00:00:00  server001        23
1  2020-10-27 01:00:00  server001        32
2  2020-10-27 00:00:00  server002        12
3  2020-10-27 01:00:00  server002        20

I think I’m getting my grouping wrong as when trying this what I end up with is typically the largest average value per hour for any given hostname ordered in date by hour.

For example I may see something like

                hostname   datetime     sessions
0  2020-10-27  server001   00:00:00           23
1  2020-10-27              01:00:00           32
2  2020-10-27  server002   02:00:00           12
3  2020-10-27  server003   03:00:00           20

Rather than the full 24 hours per hostname listed.

The code I tried was:

df = df.groupby(['hostname']).resample(
        'H', on='datetime'
        ).agg({'sessions': 'mean'}).round(0).astype(int)

What do I need to do to get the desired result?

Answer

Here is an example based on the data you have provided. I have added the steps to make dates into datetime (in case they were objects) and to set datetime as a datetimeindex in order to use resample. It would go something like this:

import pandas as pd
import numpy as np
d ={'datetime' :['2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15','2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15'],
   'hostname':['server001','server001','server001','server001','server001','server001','server002','server002','server002','server002','server002','server002'],
   'sessions':[ 22,25,21 ,30,30,35,15,10, 11,19,22,18]}       
df = pd.DataFrame(data=d)
df['datetime'] =  pd.to_datetime(df['datetime'])
df = df.set_index(pd.DatetimeIndex(df['datetime']))
df.resample('H').mean()

Actually, you can modify this example to fit other purposes. As I understood your question, you want to calculate hourly mean number of sessions. Check the resample-function if you need other groupby.s

The alternative to doing this is to seaprate date and time and then take the mean:

df['datetime'] =  pd.to_datetime(df['datetime'])
df['Date'] = [x.strftime('%Y-%m-%d') for x in df['datetime'].tolist()]
df['Time'] = ['%s:00' % x.strftime('%H') for x in df['datetime'].tolist()]
df_1 = df.groupby(['Date', 'Time', 'hostname']).mean()

which gives

enter image description here



Source: stackoverflow