I have the following problem. Suppose I have a wide data Frame consisting of three columns (mock example follows below). Essentially, it consists of three factors, A
, B
and C
for which I have certain values for each business day within a time range.
import pandas as pd import numpy as np index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022') index = np.repeat(index_d,3) values = np.random.randn(3*len(index_d), 1) columns_v = len(index_d)*["A","B","C"] df = pd.DataFrame() df["x"] = np.asarray(index) df["y"] = values df["factor"] = np.asarray([columns_v]).T
I would like to plot the business weekly averages of the the three factors along time. A business week goes from Monday to Friday. However, in the example above I start within a week and end within a week. That means the first weekly averages consist only of the data points on 5th, 6th and 7th of October. Similar for the last week. Ideally, the output should have the form
dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date() dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date() dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date() dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date() d = 3*[dt1, dt2, dt3, dt4] values = np.random.randn(len(d), 1) factors = 4*["A","B","C"] df_output = pd.DataFrame() df_output["time"] = d df_output["values"] = values df_output["factors"] = factors
I can then plot the weekly averages using seaborn as a lineplot with hue
. Important to note is that the respective time value for weekly average is always the last business day in that week (Friday except for the last, where it is a Thursday).
I was thinking of groupby. However, given that my real data is much larger and has possibly some NaN
I’m not sure how to do it. In particular with regards to the random start / end points that don’t need to be Monday / Friday.
Advertisement
Answer
Try as follows:
res = df.groupby([pd.Grouper(key='x', freq='W-FRI'),df.factor])['y'].mean() .reset_index(drop=False) res = res.rename(columns={'x':'time','factor':'factors','y':'values'}) res['time'] = res.time.map(pd.merge_asof(df.x, res.time, left_on='x', right_on='time', direction='forward') .groupby('time').last()['x']).astype(str) print(res) time factors values 0 2022-10-07 A 0.171228 1 2022-10-07 B -0.250432 2 2022-10-07 C -0.126960 3 2022-10-14 A 0.455972 4 2022-10-14 B 0.582900 5 2022-10-14 C 0.104652 6 2022-10-21 A -0.526221 7 2022-10-21 B 0.371007 8 2022-10-21 C 0.012099 9 2022-10-27 A -0.123510 10 2022-10-27 B -0.566441 11 2022-10-27 C -0.652455
Plot data:
import seaborn as sns import matplotlib.pyplot as plt sns.set_theme() fig, ax = plt.subplots(figsize=(8,5)) ax = sns.lineplot(data=res, x='time', y='values', hue='factors') sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1)) plt.show()
Result:
Explanation
- First, apply
df.groupby
. Grouping byfactor
is of course easy; for the dates we can usepd.Grouper
withfreq
parameter set toW-FRI
(each week through to Friday), and then we want to get themean
for columny
(NaN
values will just be ignored). - In the next step, let’s use
df.rename
to rename the columns. - We are basically done now, except for the fact that
pd.Grouper
will use each Friday (even if it isn’t present in the actual set). E.g.:
print(res.time.unique()) ['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000' '2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']
- If you are OK with this, you can just start plotting (but see below). If you would like to get
'2022-10-27'
instead of'2022-10-28'
, we can combineSeries.map
applied to columntime
withpd.merge_asof
,and perform anothergroupby
to getlast
in columnx
. I.e. this will get us the closest match to each Friday within each week (so, in fact just Friday in all cases, except the last:2022-10-17
). - In either scenario, before plotting, make sure to turn the datetime values into strings:
res['time'] = res['time'].astype(str)
!