Combining weeks 52 and 0 with Python Datetime

I have a Pandas DataFrame with daily data that I’m trying to group by week number to sum some columns, and I notice that when years do not begin on Sunday, the data for the week spanning the end of one year and the beginning of the next do not cleanly sum, instead being broken into two groups. My code to create a week number column is as follows:

df = pd.DataFrame([
  {'Date': 2020-03-01, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
  {'Date': 2020-03-02, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
  ...
])


df['week_num'] = df['Date'].dt.strftime('%Y-%U')
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()

JavaScript
​x
 
df = pd.DataFrame([
  {'Date': 2020-03-01, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
  {'Date': 2020-03-02, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
  ...
])
​
​
df['week_num'] = df['Date'].dt.strftime('%Y-%U')
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
​

When I look at my grouped data, I see the following:

...,
{
  'week_num': '2020-52',
  'start_of_week': '2020-12-27T00:00:00.000Z',
  'end_of_week': '2021-01-02T00:00:00.000Z',
  'population': 1000,
  'admission': 100,
  'release': 150
},
{
  'week_num': '2021-00',
  'start_of_week': '2020-12-27T00:00:00.000Z',
  'end_of_week': '2021-01-02T00:00:00.000Z',
  'population': 200,
  'admission': 120,
  'release': 90
},
...

JavaScript
 
...,
{
  'week_num': '2020-52',
  'start_of_week': '2020-12-27T00:00:00.000Z',
  'end_of_week': '2021-01-02T00:00:00.000Z',
  'population': 1000,
  'admission': 100,
  'release': 150
},
{
  'week_num': '2021-00',
  'start_of_week': '2020-12-27T00:00:00.000Z',
  'end_of_week': '2021-01-02T00:00:00.000Z',
  'population': 200,
  'admission': 120,
  'release': 90
},
...
​

In reality, the summed values for the week from 2020-12-27 to 2021-01-02 should be a single group, population=1000+200=1200, admission=100+120=220, release=150+90=240.

Is there a clean way to create a week_num column that takes into account this end-of-year hiccup and sums a single week?

Edit

It appears the issue may derive from the application of a Week class to extract start and end of week, as follows:

class Week:
    def __init__(self, year, week_num):
        week_start = str(year) + " " + str(week_num) + " 0"
        week_end = str(year) + " " + str(week_num) + " 6"
        self.first_date = dt.datetime.strptime(week_start, '%Y %U %w')
        self.last_date = dt.datetime.strptime(week_end, '%Y %U %w')

...

by_week["start_of_week"] = by_week["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)

JavaScript
 
class Week:
    def __init__(self, year, week_num):
        week_start = str(year) + " " + str(week_num) + " 0"
        week_end = str(year) + " " + str(week_num) + " 6"
        self.first_date = dt.datetime.strptime(week_start, '%Y %U %w')
        self.last_date = dt.datetime.strptime(week_end, '%Y %U %w')
​
...
​
by_week["start_of_week"] = by_week["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)
​

Final from answer

df["week_num"] = df["Date"].dt.strftime("%G-%V")
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
by_week["start_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d+"1", "%G-%V%u"))
by_week["end_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d+"7", "%G-%V%u"))

JavaScript
 
df["week_num"] = df["Date"].dt.strftime("%G-%V")
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
by_week["start_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d+"1", "%G-%V%u"))
by_week["end_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d+"7", "%G-%V%u"))
​

Answer

%U breaks the week around new year as per documentation https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

Replace %U with %V and %Y with %G for the original code to work as expected (from python 3.6 onward).

For python < 3.6 use isocalendar():

df = pd.DataFrame([
    {'Date': pd.Timestamp('2020-12-20') + pd.Timedelta(n, 'day'), 'population': 100, 'admission': 10, 'release': 10,
     'feature_id': 'abc'}
    for n in range(21)
])

df = pd.concat([df, (df['Date'] + pd.Timedelta(1, 'day')).dt.isocalendar()], axis=1)
df['start_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 1), axis=1)
df['end_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 7), axis=1)
df["iso_week_num"] = (df['Date'] + pd.Timedelta(1, 'day')).apply(lambda t: f'{t.isocalendar().year}-{t.isocalendar().week}')

by_iso_week = df.groupby(["year", "week", "feature_id"])[["population", "admission", "release"]].sum().reset_index()

df["start_of_week2"] = df["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)

JavaScript
 
df = pd.DataFrame([
    {'Date': pd.Timestamp('2020-12-20') + pd.Timedelta(n, 'day'), 'population': 100, 'admission': 10, 'release': 10,
     'feature_id': 'abc'}
    for n in range(21)
])
​
df = pd.concat([df, (df['Date'] + pd.Timedelta(1, 'day')).dt.isocalendar()], axis=1)
df['start_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 1), axis=1)
df['end_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 7), axis=1)
df["iso_week_num"] = (df['Date'] + pd.Timedelta(1, 'day')).apply(lambda t: f'{t.isocalendar().year}-{t.isocalendar().week}')
​
by_iso_week = df.groupby(["year", "week", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
​
df["start_of_week2"] = df["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)
​

Advertisement

Answer