Extracting latest values in a Dask dataframe with non-unique index column dates

Question

I&#8217;m quite familiar with pandas dataframes but I&#8217;m very new to Dask so I&#8217;m still trying to wrap my head around parallelizing my code. I&#8217;ve obtained my desired results using pandas and pandarallel already so what I&#8217;m trying to figure out is if I can scale up the task or speed it up…

Accepted Answer

The snippet below shows that it&#8217;s a very similar syntax:import dask# generate dask dataframeddf = dask.datasets.timeseries(freq="500ms", partition_freq="1h")# generate a pandas dataframedf = ddf.partitions[0].compute()  # pandas df for example# sample datesdate_minus60 = "2000-01-01 00:00:00.000"date_curr = "2000-01-01 00:00:02.000"# pandas codelast_index_pandas = df.loc[date_minus60:date_curr].index[-1]last_values_pandas = df.loc[last_index_pandas]# dask codelast_index_dask = ddf.loc[date_minus60:date_curr].compute().index[-1]last_values_dask = ddf.loc[last_index_dask].compute()# check equality of the resultsprint(last_values_pandas == last_values_dask)Note that the distinction is in two .compute steps in dask version, since two lazy values need to be computed: first is to find out the correct index location and second is to get the actual value. Also this assumes that the data is already indexed by the timestamp, if it is not, it&#8217;s best to index the data before loading into dask since .set_index is in general a slow operation.However, depending on what you are really after this is probably not a great use of dask. If the underlying idea is to do fast lookups, then a better solution is to use indexed databases (including specialised time-series databases).Finally, the snippet above is using unique index. If the actual data has non-unique indexes, then the requirement on selection by largest id is something that should be handled once the last_values_dask is computed, by using something like this (pseudo code, not expected to work right away):def get_largest_id(last_values):    return last_values.sort_values('id').tail(1)last_values_dask = get_largest_id(last_values_dask)There is scope for designing a better pipeline if the lookup is for batches (rather than specific sample dates).

Advertisement

Answer