I want to start by saying this is the first time I've worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They all follow the same folder structure; an example is below:
`/Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet`
The file name `000.parquet` is always the same, irrespective of folder.
I saved all of the file locations using the following function:
```python
import os

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r
```
This generates a list of all file locations, exactly like in the folder example above.
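For reference, a minimal usage sketch (the `/Forecasting` root path is an assumption taken from the example above):

```python
# Hypothetical call: collect every downloaded file path under the root folder.
data_files = list_files("/Forecasting")
print(len(data_files))  # expecting 2615 paths, each ending in 000.parquet
```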
The next thing I tried was using Dask to read all of the parquet files into a Dask dataframe, but it doesn't seem to work.
```python
import dask.dataframe as dd

dask_df = dd.read_parquet(data_files)
```
I keep getting this error and I'm not sure how to fix it, although I understand where the issue is. It's because the files contain the columns `export_country` and `import_country`, which are also partitions:
```text
ValueError: No partition-columns should be written in the file unless they are ALL written in the file.
```
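As a quick sanity check on that diagnosis, one option is to inspect the schema of a single file with pyarrow (path taken from the example above) and confirm the partition columns really are written inside the file:

```python
import pyarrow.parquet as pq

# If export_country and import_country appear here, they are stored in the
# file itself as well as in the directory names, which is what the
# ValueError complains about.
schema = pq.read_schema(
    "/Forecasting/as_of_date=2022-02-01/type=full/"
    "export_country=Spain/import_country=France/000.parquet"
)
print(schema.names)
```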
Another solution I tried was iterating through each parquet file with pandas and combining everything into one dataframe.
```python
import pandas as pd

df = pd.DataFrame()
for f in data_files:
    data = pd.read_parquet(f, engine='pyarrow')
    df = df.append(data)
```
This takes ages, and eventually my kernel dies because it runs out of RAM.
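For what it's worth, a lower-overhead pandas variant (still a sketch, and it may also run out of RAM for 2615 files) is to read everything into a list and concatenate once, instead of repeatedly appending, since each append copies the growing dataframe:

```python
import pandas as pd

# Read each file once, then combine in a single concatenation.
frames = [pd.read_parquet(f, engine="pyarrow") for f in data_files]
df = pd.concat(frames, ignore_index=True)
```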
Answer
A variation of @Learning is a mess's answer, but using `dd.concat`:
```python
from dask.dataframe import read_parquet, concat

dask_df = concat([read_parquet(f) for f in data_files])
```
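Since each file is read on its own, the cross-file partition-column check that raised the ValueError shouldn't be triggered. The result is still lazy; as a usage sketch, assuming the combined data fits in memory, you can materialize it with:

```python
# Trigger the actual reads and collect everything into one pandas dataframe.
pdf = dask_df.compute()
```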