I have two dataframes of different length. dfSamples (63012375 rows) and dfFixations (200000 rows).
dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})
I would like to check each value in dfSamples if it is within any two ranges given in dfFixations and then assign a label to this value. I have found this: Check if value in a dataframe is between two values in another dataframe, but the loop solution is terribly slow and I cannot make any other solution work.
Working (but very slow) example:
labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
for i, fixation in dfFix.iterrows():
log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
labels[log_range] = 'fixation'
labels[labels != 'fixation'] = 'no_fixation'
dfSamples['labels'] = labels
Following this example: Performance of Pandas apply vs np.vectorize to create new column from existing columns I have tried to vectorize this but with no success.
def check_range(samples, tstart, tend):
log_range = (samples > tstart) & (samples < tend)
return log_range
fixations = list(map(check_range, dfSamples['tSample'], dfFix['tStart'], dfFix['tEnd']))
Would appreciate any help!
Advertisement
Answer
Use IntervalIndex.from_arrays
with IntervalIndex.get_indexer
, if not match is returned -1
, so checked and set ouput in numpy.where
:
i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],
dfFixations['tEnd'],
closed="both")
pos = i.get_indexer(dfSamples['tSample'])
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
print (dfSamples)
tSample labels
0 4 fixation
1 6 fixation
2 8 fixation
3 10 no_fixation
4 12 fixation
5 14 fixation
Performance: In ideal nice sorted not overlap data, in real should be performance different, the best test it.
dfSamples = pd.DataFrame({'tSample':range(10000)})
dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})
In [165]: %%timeit
labels = np.empty_like(dfSamples['tSample']).astype(np.chararray) :
for i, fixation in dfFixations.iterrows(): :
log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd']) :
labels[log_range] = 'fixation' :
labels[labels != 'fixation'] = 'no_fixation' :
dfSamples['labels'] = labels :
:
:
1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [168]: %%timeit
ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") :
dfSamples["labels1"] = np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation") :
:
315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [170]: %%timeit
ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") :
contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0) :
dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation") :
:
82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [166]: %%timeit
s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") :
pos = s.get_indexer(dfSamples['tSample']) :
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation") :
:
27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)