
Retrieving data from multiple parquet files into one dataframe (Python)

I want to start by saying this is the first time I have worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They all follow the same folder structure; an example is below:

/Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet

The file name 000.parquet is always the same, irrespective of folder.

I saved all of the file locations using the following function:

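The original listing function was lost from the page; a minimal sketch using `glob`, assuming the files were downloaded under a local `Forecasting` folder (the function name is mine):

```python
import glob

def list_parquet_files(root: str) -> list[str]:
    # Recursively collect every .parquet file under the root folder,
    # sorted so the order is reproducible.
    return sorted(glob.glob(f"{root}/**/*.parquet", recursive=True))

files = list_parquet_files("Forecasting")
```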

This generates a list of all file locations, exactly like in the folder example above.

The next thing I tried was using Dask to read all of the parquet files into a Dask dataframe, but it doesn't seem to work.


I keep getting an error and I'm not sure how to fix it, although I understand where the issue is: the files contain the columns export_country and import_country, which are also partition keys in the folder structure.


Another solution I tried was iterating through each parquet file with pandas and combining everything into one dataframe.


This takes ages, and eventually my kernel dies because it runs out of RAM.


Answer

A variation of @Learning is a mess’s answer, but using dd.concat:
