Check if value in pandas dataframe is within any two values of two other columns in another dataframe

Question

I have two dataframes of different length. dfSamples (63012375 rows) and dfFixations (200000 rows). I would like to check each value in dfSamples if it is within any two ranges given in dfFixations and then assign a label to this value. I have found this: Check if value in a dataframe is between two values in another dataframe, but the

Accepted Answer

Use IntervalIndex.from_arrays with IntervalIndex.get_indexer , if not match is returned -1, so checked and set ouput in numpy.where:i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],                                 dfFixations['tEnd'],                                  closed="both")pos = i.get_indexer(dfSamples['tSample'])dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")print (dfSamples)   tSample       labels0        4     fixation1        6     fixation2        8     fixation3       10  no_fixation4       12     fixation5       14     fixationPerformance: In ideal nice sorted not overlap data, in real should be performance different, the best test it.dfSamples = pd.DataFrame({'tSample':range(10000)})  dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})    In [165]: %%timeit     ...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)     ...: for i, fixation in dfFixations.iterrows():     ...:     log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])     ...:     labels[log_range] = 'fixation'     ...: labels[labels != 'fixation'] = 'no_fixation'     ...: dfSamples['labels'] = labels     ...:      ...: 1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)In [168]: %%timeit     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")     ...: dfSamples["labels1"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")     ...: 315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)In [170]: %%timeit     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")     ...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)     ...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation")     ...: 82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)In [166]: %%timeit     ...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")     ...: pos = s.get_indexer(dfSamples['tSample'])     ...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")     ...: 27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Advertisement

Answer