How to do/workaround a conditional join in python Pandas?

I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.

The first table, table_a, has one row per company per date: a COMPANY_ID, a measured value, and the date it was recorded.
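
For concreteness, a minimal stand-in for table_a can be built like this (the column names measure and date, and all of the sample values, are assumptions for illustration; only COMPANY_ID is named explicitly in the question):

```
import pandas as pd

# Illustrative stand-in for table_a: one measured value per company per date.
table_a = pd.DataFrame({
    'COMPANY_ID': [1, 1, 2, 2],
    'measure':    [10, 10, 20, 20],
    'date': pd.to_datetime(['2010-01-05', '2010-01-25',
                            '2010-01-05', '2010-01-25']),
})
print(table_a.head())
```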

The second table, table_b, pairs each COMPANY_ID with one or more END_DATE values, each marking the end of a 30-day window.
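
A matching stand-in for table_b, again with invented values (one window end date per company here; more are added in the answer below):

```
import pandas as pd

# Illustrative stand-in for table_b: one window end date per company.
table_b = pd.DataFrame({
    'COMPANY_ID': [1, 2],
    'END_DATE': pd.to_datetime(['2010-01-31', '2010-01-31']),
})
print(table_b)
```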

I want to be able to get the sum of the ‘measure’ column for each ‘COMPANY_ID’ for each 30-day period prior to the ‘END_DATE’ in table_b.

The SQL equivalent would be (I think) a join of table_a to table_b on the company, with an additional range condition keeping only rows whose date falls between END_DATE minus 30 days and END_DATE, followed by a GROUP BY on COMPANY_ID and END_DATE with SUM(measure).


Answer

Well, I can think of a few ways:

  1. Essentially blow up the dataframe by just merging on the exact field (company)… then filter on the 30-day windows after the merge.
  • should be fast but could use lots of memory
  2. Move the merging and filtering on the 30-day window into a groupby().
  • results in a merge for each group, so slower but should use less memory

Option #1

Suppose your data looks like the following (I expanded your sample data):

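Since the expanded sample isn't reproduced here, the sketches below use this stand-in, built with the assumed column names from the question (the values are invented; each company gets several measure dates and two back-to-back 30-day windows):

```
import pandas as pd

# Expanded stand-in data: several measure dates per company, and two
# non-overlapping 30-day windows per company in table_b.
table_a = pd.DataFrame({
    'COMPANY_ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'measure':    [10, 10, 10, 10, 20, 20, 20, 20],
    'date': pd.to_datetime(['2010-01-05', '2010-01-25',
                            '2010-02-15', '2010-03-02'] * 2),
})
table_b = pd.DataFrame({
    'COMPANY_ID': [1, 1, 2, 2],
    'END_DATE': pd.to_datetime(['2010-01-31', '2010-03-02'] * 2),
})
```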

Create a beginning date for the 30-day windows:

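A sketch of that step, continuing with the stand-in data above (a pd.DateOffset would work just as well as a Timedelta):

```
# Start of each 30-day window: 30 days before its END_DATE.
table_b['beg_date'] = table_b['END_DATE'] - pd.Timedelta(days=30)
```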

Now do a merge and then select rows based on whether the date falls between beg_date and end_date:

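Roughly like this; treating the window as (beg_date, END_DATE] is one reasonable choice of boundaries, adjust to taste:

```
# Merge on the company key alone (this is the "blow up" step), then keep only
# the rows whose date falls inside each 30-day window.
df = table_a.merge(table_b, on='COMPANY_ID')
df = df[(df['date'] > df['beg_date']) & (df['date'] <= df['END_DATE'])]
print(df)
```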

You can then compute the 30-day window sums by grouping on company and end_date:

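Under the same assumed column names, that is just:

```
# One summed measure per company per window end date.
window_sums = df.groupby(['COMPANY_ID', 'END_DATE'])['measure'].sum()
print(window_sums)
```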

Option #2

Move all of the merging and filtering into a groupby(). This should be better on memory, but I would expect it to be much slower:

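One way to sketch this, reusing table_b (including the beg_date column added above); the merge and the window filter now happen once per company inside the apply (how='cross' needs pandas 1.2 or newer):

```
def window_sums_for(group):
    # Only this company's windows are paired against its measures.
    windows = table_b.loc[table_b['COMPANY_ID'] == group.name,
                          ['beg_date', 'END_DATE']]
    merged = group[['date', 'measure']].merge(windows, how='cross')
    in_window = ((merged['date'] > merged['beg_date']) &
                 (merged['date'] <= merged['END_DATE']))
    return merged[in_window].groupby('END_DATE')['measure'].sum()

window_sums = table_a.groupby('COMPANY_ID').apply(window_sums_for)
print(window_sums)
```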

Another option

If your windows never overlap (as in the example data), you can do the following instead; it doesn’t blow up a dataframe and is pretty fast:

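A sketch of that merge with the assumed schema: the window end dates are given a date column of their own so they can be outer-merged into table_a as marker rows (marker rows carry a missing measure):

```
# Window end dates, with a copy of END_DATE as 'date' so they line up with
# table_a's date column in the merge.
ends = table_b[['COMPANY_ID', 'END_DATE']].copy()
ends['date'] = ends['END_DATE']

# Outer merge on (COMPANY_ID, date): measure rows keep a missing END_DATE
# unless they fall exactly on a window end; unmatched end dates come in as
# marker rows with a missing measure.
df = table_a.merge(ends, on=['COMPANY_ID', 'date'], how='outer')
df = df.sort_values(['COMPANY_ID', 'date'])
```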

This merge essentially inserts your window end dates into the dataframe; back-filling the end dates (by group) then gives you a structure from which the summation windows are easy to build:

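Continuing the sketch, back-fill each company's END_DATE so every measure row picks up the next window end at or after its date, then drop the marker rows and sum (this only gives the right answer when the windows sit back to back, as they do in the stand-in data):

```
# Tag each measure row with the next window end date at or after it.
df['END_DATE'] = df.groupby('COMPANY_ID')['END_DATE'].bfill()

# Drop the marker rows (missing measure) and sum per company per window.
window_sums = (df.dropna(subset=['measure'])
                 .groupby(['COMPANY_ID', 'END_DATE'])['measure'].sum())
print(window_sums)
```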

Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select just the end dates you are interested in. This could be quite memory intensive too.
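
In current pandas the old rolling_sum function is spelled .rolling(...).sum(); a sketch of the idea (memory-hungry because every company gets a dense daily series, and note that the 30-row window here includes the end date itself):

```
# Dense daily series per company (missing days sum to 0), then a 30-day
# rolling sum; finally pick off the window end dates of interest.
daily = (table_a.set_index('date')
                .groupby('COMPANY_ID')['measure']
                .resample('D').sum())
rolling30 = (daily.groupby(level='COMPANY_ID')
                  .rolling(30, min_periods=1).sum()
                  .droplevel(0))  # drop the duplicated company level

idx = pd.MultiIndex.from_frame(
    table_b[['COMPANY_ID', 'END_DATE']].rename(columns={'END_DATE': 'date'}))
print(rolling30.reindex(idx))
```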

User contributions licensed under: CC BY-SA