When I was using dask=1.2.2 with pyarrow=0.11.1 I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I can no longer save the index when I call the to_parquet method with the partition_on and write_index arguments. Here is a minimal example that shows the issue:
```python
from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

REPORT_DATE_TEST = pd.to_datetime('2019-01-01').date()
path = Path('/home/ludwik/Documents/YieldPlanet/research/trials/')

observations_nr = 3
dtas = range(0, observations_nr)
rds = [REPORT_DATE_TEST - timedelta(days=days) for days in dtas]

data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [REPORT_DATE_TEST] * observations_nr,
}).set_index('dta')

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

file_name = 'trial.parquet'
data_to_export_dask.to_parquet(path / file_name,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               write_index=True)

# compute() materializes the dask frame so the actual rows are printed
data_read = dd.read_parquet(path / file_name, engine='pyarrow')
print(data_read.compute())
```
Which gives:
|   | stay_date  | dta | report_date |
|---|------------|-----|-------------|
| 0 | 2019-01-01 | 2   | 2018-12-30  |
| 0 | 2019-01-01 | 1   | 2018-12-31  |
| 0 | 2019-01-01 | 0   | 2019-01-01  |
As the output shows, dta is no longer the index: it comes back as an ordinary column and the original index is lost. I did not see this behavior described anywhere in the dask documentation.
Does anyone know how to save the index while partitioning the parquet data?
Answer
The issue was in pyarrow's backend. I filed a bug report on their JIRA: https://issues.apache.org/jira/browse/ARROW-7782
As stated by pavithraes, this issue was fixed in pyarrow 1.0.0. Thanks for letting me know! :)
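For anyone pinned to an older pyarrow, a workaround is to move the index into an ordinary column before writing and restore it after reading. This is a minimal sketch, assuming the same data_to_export_dask, path, and file_name as in the question above; it is not the upstream fix itself:

```python
import dask.dataframe as dd

# Move the 'dta' index into a regular column so partitioning cannot drop it.
data_with_column = data_to_export_dask.reset_index()

data_with_column.to_parquet(path / file_name,
                            engine='pyarrow',
                            compression='snappy',
                            partition_on=['report_date'],
                            write_index=False)

# Restore 'dta' as the index after reading the partitioned dataset back.
data_read = dd.read_parquet(path / file_name, engine='pyarrow')
data_read = data_read.set_index('dta')
```

Note that set_index on a dask dataframe triggers a shuffle, so on large datasets it can be cheaper to keep dta as a plain column until after any filtering.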