I have two Pandas event dataframes. The first is for longer events, while the second is for shorter events. The start and end times of the events define them. Every long event is a “container” of at least one short event. Each short event is part of only one long event interval.
I try to figure out how many short events/intervals are in each long event.
Minimal example:
# long events start | end | tag 0.5 | 0.7 | AAA 0.7 | 1.2 | BBB 1.2 | 2 | CCC # short events start | end | tag 0.5 | 0.55| a 0.55 | 0.7 | aa 0.7 | 1.2 | b 1.2 | 1.3 | c 1.3 | 1.4 | cc 1.4 | 1.5 | ccc 1.5 | 1.6 | cccc 1.6 | 2 | ccccc
The desired output is:
start | end | tag | count of sub events 0.5 | 0.7 | AAA | 2 0.7 | 1.2 | BBB | 1 1.2 | 2 | CCC | 5
I attempted to use “apply”, so I created a function that looked like this:
def count_records_in_time_range(df, start, end): return len(df[(df['start'] >= start) & (df['end'] <= end)])
But I’m not sure how to proceed; should I use this function on the short events tables or the long events tables? How can I calculate the desired count per record in one table (the long events dataframe) based on the records in the other table (the short events dataframe)?
Advertisement
Answer
We can use numpy broadcasting
here
s, e = long[['start', 'end']].values[None, :].T long['count'] = ((s <= short['start'].values) & (e >= short['end'].values)).sum(1)
start end tag count 0 0.5 0.7 AAA 2 1 0.7 1.2 BBB 1 2 1.2 2.0 CCC 5