I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.
The top of the first table, table_a, looks like this:
COMPANY_ID  DATE                 MEASURE
1           2010-01-01 00:00:00  10
1           2010-01-02 00:00:00  10
1           2010-01-03 00:00:00  10
1           2010-01-04 00:00:00  10
1           2010-01-05 00:00:00  10
Here is the code to create the table:
import pandas as pd

table_a = pd.concat([
    pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                  'COMPANY_ID': 1, 'MEASURE': 10}),
    pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                  'COMPANY_ID': 2, 'MEASURE': 10}),
])
The second table, table_b, looks like this:
COMPANY  END_DATE
1        2010-03-01 00:00:00
1        2010-06-02 00:00:00
2        2010-03-01 00:00:00
2        2010-06-02 00:00:00
and the code to create it is:
table_b = pd.DataFrame({
    'END_DATE': pd.to_datetime(['03/01/2010', '06/02/2010', '03/01/2010', '06/02/2010']),
    'COMPANY': (1, 1, 2, 2),
})
I want to get the sum of the MEASURE column for each COMPANY_ID for the 30-day period ending at each END_DATE in table_b.
This is (I think) the SQL equivalent:

select b.COMPANY, b.END_DATE, sum(a.MEASURE) as MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY_ID = b.COMPANY
  and a.DATE < b.END_DATE
  and a.DATE > b.END_DATE - 30
group by b.COMPANY, b.END_DATE;
Answer
Well, I can think of a few ways:
- Essentially blow up the dataframe by merging on the exact field (company), then filter on the 30-day windows after the merge. Should be fast but could use lots of memory.
- Move the merging and filtering on the 30-day window into a groupby(). Results in a merge for each group, so slower, but should use less memory.
Option #1
Suppose your data looks like the following (I expanded your sample data):
print df

    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows

   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15
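For reference, here is one way to construct this expanded sample; the exact construction is my own, and only the values come from the printouts above (numpy is imported here because the snippets below use np.timedelta64):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'company': [0]*7 + [1]*6,
    'date': pd.to_datetime(
        ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10]*7 + [5]*6,
})

windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(
        ['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15']),
})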
Create a beginning date for the 30-day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                       - np.timedelta64(30, 'D'))
print windows

   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15
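Equivalently, a sketch assuming a reasonably recent pandas: the same column can be computed without dropping to NumPy.

windows['beg_date'] = windows['end_date'] - pd.Timedelta(days=30)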
Now do a merge and then select rows where date falls between beg_date and end_date:
df = df.merge(windows, on='company', how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df

    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15
You can compute the 30-day window sums by grouping on company and end_date:
print df.groupby(['company','end_date']).sum()

                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
Option #2

Move all of the merging into a groupby(). This should be better on memory, but I would expect it to be much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                       - np.timedelta64(30, 'D'))

def cond_merge(g, windows):
    g = g.merge(windows, on='company', how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge, windows)

company  end_date
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15
Another option

Now, if your windows never overlap (as in the example data), you could do the following as an alternative that doesn't blow up the dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows, on=['company','date'], how='outer')
print df

    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15
This merge essentially inserts your window end dates into the dataframe; backfilling the end dates (by group) then gives you a structure from which you can easily create your summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df

    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]')
                  - np.timedelta64(30, 'D'))
print df

    company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()

                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
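Since this trick silently gives wrong answers when windows do overlap, a cheap guard may be worth running first. This is a sketch of my own, reusing the beg_date column computed earlier:

# Within each company, sort the windows and check that no window begins
# before the previous one has ended.
windows_sorted = windows.sort_values(['company', 'end_date'])
prev_end = windows_sorted.groupby('company')['end_date'].shift()
assert not (prev_end >= windows_sorted['beg_date']).any(), "windows overlap"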
Another alternative is to resample your first dataframe to daily data and then compute rolling sums with a 30-day window, selecting the dates you are interested in at the end. This could be quite memory intensive too.
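A minimal sketch of that idea, assuming a modern pandas and Python 3 (the old pd.rolling_sum has since been replaced by the .rolling() method) and reusing the df and windows frames from above. Note that a time-based '30D' window includes the end date itself, so its boundaries can differ slightly from the beg_date/end_date filter used earlier:

def rolling_window_sums(g, end_dates):
    # Resample one company's measures to daily frequency; days with no
    # rows become 0, so the rolling sum sees a regular daily series.
    daily = g.set_index('date')['measure'].resample('D').sum()
    # Trailing 30-day sum ending on (and including) each day.
    sums = daily.rolling('30D').sum()
    # Pick out the sums at the requested window end dates.
    return sums.reindex(end_dates)

for company, g in df.groupby('company'):
    ends = windows.loc[windows['company'] == company, 'end_date']
    print(company)
    print(rolling_window_sums(g, ends))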