When I was using dask=1.2.2 with pyarrow=0.11.1 I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I can no longer save the index when I call the to_parquet method with the partition_on and write_index arguments. Here is a minimal example that shows the issue:
```python
from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

REPORT_DATE_TEST = pd.to_datetime('2019-01-01').date()
path = Path('/home/ludwik/Documents/YieldPlanet/research/trials/')

observations_nr = 3
dtas = range(0, observations_nr)
rds = [REPORT_DATE_TEST - timedelta(days=days) for days in dtas]

data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [REPORT_DATE_TEST] * observations_nr,
}).set_index('dta')

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

file_name = 'trial.parquet'
data_to_export_dask.to_parquet(path / file_name,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               write_index=True)

# compute() materializes the dask frame so the actual rows are printed
data_read = dd.read_parquet(path / file_name, engine='pyarrow')
print(data_read.compute())
```
Which gives:
|   | stay_date  | dta | report_date |
|---|------------|-----|-------------|
| 0 | 2019-01-01 | 2   | 2018-12-30  |
| 0 | 2019-01-01 | 1   | 2018-12-31  |
| 0 | 2019-01-01 | 0   | 2019-01-01  |
As the output shows, dta is no longer the index: it comes back as an ordinary column and the original index is lost. I did not see this behavior described anywhere in the dask documentation.
Does anyone know how to save the index while partitioning the parquet data?
Answer
The issue was in pyarrow's backend. I filed a bug report on their JIRA: https://issues.apache.org/jira/browse/ARROW-7782
As stated by pavithraes, this issue was fixed in pyarrow 1.0.0. Thanks for letting me know! :)
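For anyone pinned to an older pyarrow, a workaround is to move the index into an ordinary column before writing and restore it after reading. This is a minimal sketch, assuming the same data_to_export_dask, path, and file_name as in the question above; it is not the upstream fix itself:

```python
import dask.dataframe as dd

# Move the 'dta' index into a regular column so partitioning cannot drop it.
data_with_column = data_to_export_dask.reset_index()

data_with_column.to_parquet(path / file_name,
                            engine='pyarrow',
                            compression='snappy',
                            partition_on=['report_date'],
                            write_index=False)

# Restore 'dta' as the index after reading the partitioned dataset back.
data_read = dd.read_parquet(path / file_name, engine='pyarrow')
data_read = data_read.set_index('dta')
```

Note that set_index on a dask dataframe triggers a shuffle, so on large datasets it can be cheaper to keep dta as a plain column until after any filtering.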