How to do/workaround a conditional join in python Pandas?

I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.

The first table, table_a, has one row per company per date: a COMPANY_ID, a measured value, and the date it was recorded.
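
For concreteness, a minimal stand-in for table_a can be built like this (the column names measure and date, and all of the sample values, are assumptions for illustration; only COMPANY_ID is named explicitly in the question):

```
import pandas as pd

# Illustrative stand-in for table_a: one measured value per company per date.
table_a = pd.DataFrame({
    'COMPANY_ID': [1, 1, 2, 2],
    'measure':    [10, 10, 20, 20],
    'date': pd.to_datetime(['2010-01-05', '2010-01-25',
                            '2010-01-05', '2010-01-25']),
})
print(table_a.head())
```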

The second table, table_b, pairs each COMPANY_ID with one or more END_DATE values, each marking the end of a 30-day window.
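
A matching stand-in for table_b, again with invented values (one window end date per company here; more are added in the answer below):

```
import pandas as pd

# Illustrative stand-in for table_b: one window end date per company.
table_b = pd.DataFrame({
    'COMPANY_ID': [1, 2],
    'END_DATE': pd.to_datetime(['2010-01-31', '2010-01-31']),
})
print(table_b)
```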

I want to be able to get the sum of the ‘measure’ column for each ‘COMPANY_ID’ for each 30-day period prior to the ‘END_DATE’ in table_b.

The SQL equivalent would be (I think) a join of table_a to table_b on the company, with an additional range condition keeping only rows whose date falls between END_DATE minus 30 days and END_DATE, followed by a GROUP BY on COMPANY_ID and END_DATE with SUM(measure).


Answer

Well, I can think of a few ways:

  1. Essentially blow up the dataframe by just merging on the exact field (company)… then filter on the 30-day windows after the merge.
  • should be fast but could use lots of memory
  2. Move the merging and filtering on the 30-day window into a groupby().
  • results in a merge for each group, so slower but should use less memory

Option #1

Suppose your data looks like the following (I expanded your sample data):

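Since the expanded sample isn't reproduced here, the sketches below use this stand-in, built with the assumed column names from the question (the values are invented; each company gets several measure dates and two back-to-back 30-day windows):

```
import pandas as pd

# Expanded stand-in data: several measure dates per company, and two
# non-overlapping 30-day windows per company in table_b.
table_a = pd.DataFrame({
    'COMPANY_ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'measure':    [10, 10, 10, 10, 20, 20, 20, 20],
    'date': pd.to_datetime(['2010-01-05', '2010-01-25',
                            '2010-02-15', '2010-03-02'] * 2),
})
table_b = pd.DataFrame({
    'COMPANY_ID': [1, 1, 2, 2],
    'END_DATE': pd.to_datetime(['2010-01-31', '2010-03-02'] * 2),
})
```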

Create a beginning date for the 30-day windows:

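A sketch of that step, continuing with the stand-in data above (a pd.DateOffset would work just as well as a Timedelta):

```
# Start of each 30-day window: 30 days before its END_DATE.
table_b['beg_date'] = table_b['END_DATE'] - pd.Timedelta(days=30)
```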

Now do a merge and then select rows based on whether the date falls between beg_date and end_date:

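Roughly like this; treating the window as (beg_date, END_DATE] is one reasonable choice of boundaries, adjust to taste:

```
# Merge on the company key alone (this is the "blow up" step), then keep only
# the rows whose date falls inside each 30-day window.
df = table_a.merge(table_b, on='COMPANY_ID')
df = df[(df['date'] > df['beg_date']) & (df['date'] <= df['END_DATE'])]
print(df)
```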

You can then compute the 30-day window sums by grouping on company and end_date:

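Under the same assumed column names, that is just:

```
# One summed measure per company per window end date.
window_sums = df.groupby(['COMPANY_ID', 'END_DATE'])['measure'].sum()
print(window_sums)
```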

Option #2

Move all of the merging and filtering into a groupby(). This should be better on memory, but I would expect it to be much slower:

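One way to sketch this, reusing table_b (including the beg_date column added above); the merge and the window filter now happen once per company inside the apply (how='cross' needs pandas 1.2 or newer):

```
def window_sums_for(group):
    # Only this company's windows are paired against its measures.
    windows = table_b.loc[table_b['COMPANY_ID'] == group.name,
                          ['beg_date', 'END_DATE']]
    merged = group[['date', 'measure']].merge(windows, how='cross')
    in_window = ((merged['date'] > merged['beg_date']) &
                 (merged['date'] <= merged['END_DATE']))
    return merged[in_window].groupby('END_DATE')['measure'].sum()

window_sums = table_a.groupby('COMPANY_ID').apply(window_sums_for)
print(window_sums)
```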

Another option

If your windows never overlap (as in the example data), you can do the following instead; it doesn’t blow up a dataframe and is pretty fast:

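A sketch of that merge with the assumed schema: the window end dates are given a date column of their own so they can be outer-merged into table_a as marker rows (marker rows carry a missing measure):

```
# Window end dates, with a copy of END_DATE as 'date' so they line up with
# table_a's date column in the merge.
ends = table_b[['COMPANY_ID', 'END_DATE']].copy()
ends['date'] = ends['END_DATE']

# Outer merge on (COMPANY_ID, date): measure rows keep a missing END_DATE
# unless they fall exactly on a window end; unmatched end dates come in as
# marker rows with a missing measure.
df = table_a.merge(ends, on=['COMPANY_ID', 'date'], how='outer')
df = df.sort_values(['COMPANY_ID', 'date'])
```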

This merge essentially inserts your window end dates into the dataframe; back-filling the end dates (by group) then gives you a structure from which the summation windows are easy to build:

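Continuing the sketch, back-fill each company's END_DATE so every measure row picks up the next window end at or after its date, then drop the marker rows and sum (this only gives the right answer when the windows sit back to back, as they do in the stand-in data):

```
# Tag each measure row with the next window end date at or after it.
df['END_DATE'] = df.groupby('COMPANY_ID')['END_DATE'].bfill()

# Drop the marker rows (missing measure) and sum per company per window.
window_sums = (df.dropna(subset=['measure'])
                 .groupby(['COMPANY_ID', 'END_DATE'])['measure'].sum())
print(window_sums)
```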

Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select just the end dates you are interested in. This could be quite memory intensive too.
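
In current pandas the old rolling_sum function is spelled .rolling(...).sum(); a sketch of the idea (memory-hungry because every company gets a dense daily series, and note that the 30-row window here includes the end date itself):

```
# Dense daily series per company (missing days sum to 0), then a 30-day
# rolling sum; finally pick off the window end dates of interest.
daily = (table_a.set_index('date')
                .groupby('COMPANY_ID')['measure']
                .resample('D').sum())
rolling30 = (daily.groupby(level='COMPANY_ID')
                  .rolling(30, min_periods=1).sum()
                  .droplevel(0))  # drop the duplicated company level

idx = pd.MultiIndex.from_frame(
    table_b[['COMPANY_ID', 'END_DATE']].rename(columns={'END_DATE': 'date'}))
print(rolling30.reindex(idx))
```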

User contributions licensed under: CC BY-SA