
Pandas groupby datetime columns by periods

I have the following dataframe:

import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
              columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])


>>> 9:00:00     9:05:00       09:10:00  09:15:00    09:20:00    09:25:00    09:30:00 ....
a       1          2             3         4           7           9           5
b       2          6             5         4           9           8           2
c       3          5             3         21         12           6           7
d       1          7             8         4           3           4           3

I would like to get, for each row (e.g. a, b, c, d, …), the mean value between specific hours. The hours run from 9 to 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, and between 13 and 15 (or any period I decide on).

I first tried to convert the column names to datetime format, since I thought that would make this easier:

df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")

but then the column names came out with a fake date, e.g. "1900-01-01 09:00:00"…

Also, the column headers had dtype object, so I felt a bit lost…

My end goal is to be able to calculate new columns holding the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11, etc.).


Answer

If you need a regular period, e.g. every 2 hours, use resample:

df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")

df1 = df.resample('2H', axis=1).mean()
print (df1)
   1900-01-01 08:00:00
0             4.428571
1             5.142857
2             8.142857
3             4.285714
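
In recent pandas releases the axis=1 argument of resample is deprecated, so the same 2-hour grouping can be sketched by transposing first. This is only a minimal sketch that assumes the sample frame from the question; passing pd.Timedelta(hours=2) instead of the '2H' string is my own choice, made to sidestep differences in alias spelling between versions.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5],
              [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7],
              [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)

# Parse the column names as times (pandas fills in the dummy date 1900-01-01).
df.columns = pd.to_datetime(df.columns, format="%H:%M:%S")

# Transpose so the times become the index, resample in 2-hour bins,
# take the mean, then transpose back to the original orientation.
df1 = df.T.resample(pd.Timedelta(hours=2)).mean().T
print(df1)
```

Because the bins are anchored at midnight, all seven sample columns fall into the bin that starts at 08:00:00, as in the output above.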

If you need custom periods, it is possible to use cut:

df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")

bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])] 
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
   9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  
0                 1                 2                 3                 4   
1                 2                 6                 5                 4   
2                 3                 5                 3                21   
3                 1                 7                 8                 4   

   9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  
0                 7                 9                 5  
1                 9                 8                 2  
2                12                 6                 7  
3                 3                 4                 3  

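Finally, take the mean per column label (note that the level argument of mean was removed in pandas 2.0, so this line needs an older version). The NaN columns appear because the column labels are categoricals, so every category is kept even when no column falls into it: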

df2 = df.mean(level=0, axis=1)
print (df2)
   9:00:00-11:00:00  5:00:00-9:00:00  11:00:00-12:00:00  12:00:00-23:59:59
0          4.428571              NaN                NaN                NaN
1          5.142857              NaN                NaN                NaN
2          8.142857              NaN                NaN                NaN
3          4.285714              NaN                NaN                NaN

To avoid the NaN columns, convert the column names to strings:

df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
   9:00:00-11:00:00
0          4.428571
1          5.142857
2          8.142857
3          4.285714

EDIT: The same solution with timedeltas, because the column names have the format HH:MM:SS:

df.columns = pd.to_timedelta(df.columns)
print (df)
   0 days 09:00:00  0 days 09:05:00  0 days 09:10:00  0 days 09:15:00  
0                1                2                3                4   
1                2                6                5                4   
2                3                5                3               21   
3                1                7                8                4   

   0 days 09:20:00  0 days 09:25:00  0 days 09:30:00  
0                7                9                5  
1                9                8                2  
2               12                6                7  
3                3                4                3 

bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])] 
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
   9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  
0                 1                 2                 3                 4   
1                 2                 6                 5                 4   
2                 3                 5                 3                21   
3                 1                 7                 8                 4   

   9:00:00-11:00:00  9:00:00-11:00:00  9:00:00-11:00:00  
0                 7                 9                 5  
1                 9                 8                 2  
2                12                 6                 7  
3                 3                 4                 3 

# missing values because no times exist between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
   9:00:00-11:00:00  11:00:00-12:00:00
0          4.428571                NaN
1          5.142857                NaN
2          8.142857                NaN
3          4.285714                NaN

df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
   9:00:00-11:00:00
0          4.428571
1          5.142857
2          8.142857
3          4.285714
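
Since mean(level=0, axis=1) is no longer available from pandas 2.0 on, the custom-period step can also be sketched with a groupby on the transposed frame. This is a minimal sketch under the same assumptions as above (sample frame from the question, same bins); the names edges, periods and df_mean are only illustrative. It leaves the original column names untouched and uses the period labels just as grouping keys.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5],
              [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7],
              [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)

# Period edges and human-readable labels, as in the answer.
bins = ['9:00:00', '11:00:00', '12:00:00']
edges = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]

# Assign every column to a period; right=False makes the bins [start, end).
periods = pd.cut(pd.to_timedelta(df.columns), bins=edges,
                 labels=labels, right=False)

# Group the transposed frame by the period labels and transpose back.
# np.asarray turns the categorical into plain labels, so periods
# without any column do not show up as all-NaN columns.
df_mean = df.T.groupby(np.asarray(periods)).mean().T
print(df_mean)
```

With the sample data only the 9:00:00-11:00:00 period has columns, so the result has a single column, matching df3 above.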
User contributions licensed under: CC BY-SA