
How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).

First, I can read a single parquet file locally like this:

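A minimal sketch of that, using pyarrow's read_table (the file path is a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder path to a single local parquet file.
path = 'my-file.parquet'
table = pq.read_table(path)
df = table.to_pandas()
```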

I can also read a directory of parquet files locally like this:

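A sketch assuming pyarrow's ParquetDataset, which accepts a directory path (the directory name is a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder path to a local directory of parquet files.
dataset = pq.ParquetDataset('my-parquet-dir/')
table = dataset.read()
df = table.to_pandas()
```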

Both work like a charm. Now I want to achieve the same remotely with files stored in an S3 bucket. I was hoping that something like this would work:

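That is, passing the S3 URI straight to ParquetDataset (a sketch; the URI is the one from the error message below):

```python
import pyarrow.parquet as pq

# Hand the S3 URI directly to ParquetDataset.
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket')
```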

But it does not:

OSError: Passed non-file path: s3n://dsn/to/my/bucket

After reading pyarrow’s documentation thoroughly, this does not seem to be possible at the moment, so I came up with the following solution:

Reading a single file from S3 and getting a pandas dataframe:

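A sketch of the approach, assuming boto3's download_fileobj into an in-memory buffer (the bucket name and key are placeholders):

```python
import io
import boto3
import pyarrow.parquet as pq

# Download the object into an in-memory buffer.
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3.Object('my-bucket', 'path/to/my-file.parquet').download_fileobj(buffer)

# pyarrow can read directly from the file-like buffer.
table = pq.read_table(buffer)
df = table.to_pandas()
```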

And here is my hacky, not-so-optimized solution to create a pandas dataframe from an S3 folder path:

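A sketch of that workaround, assuming list_objects_v2 to enumerate the keys under a prefix (the bucket name, prefix, and helper function are placeholders of mine):

```python
import io
import boto3
import pandas as pd
import pyarrow.parquet as pq

# Placeholder bucket and prefix.
bucket_name = 'my-bucket'
prefix = 'path/to/my/folder/'

s3 = boto3.resource('s3')
client = boto3.client('s3')

def download_s3_parquet_file(bucket, key):
    """Download one S3 object into an in-memory buffer."""
    buffer = io.BytesIO()
    s3.Object(bucket, key).download_fileobj(buffer)
    return buffer

# List the objects under the prefix and keep only the parquet files.
response = client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
keys = [obj['Key'] for obj in response['Contents'] if obj['Key'].endswith('.parquet')]

# Read each file into its own dataframe, then concatenate them all.
buffers = [download_s3_parquet_file(bucket_name, key) for key in keys]
dfs = [pq.read_table(buffer).to_pandas() for buffer in buffers]
df = pd.concat(dfs, ignore_index=True)
```

Note that list_objects_v2 returns at most 1,000 keys per call, and every file is downloaded and converted separately, which is part of why this is not-so-optimized.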

Is there a better way to achieve this? Maybe some kind of connector for pandas using pyarrow? I would like to avoid using pyspark, but if there is no other solution, then I would take it.


Answer

You should use the s3fs module, as proposed by yjk21. However, the result of calling ParquetDataset is a pyarrow.parquet.ParquetDataset object, not a DataFrame. To get the pandas DataFrame, you'll want to apply .read_pandas().to_pandas() to it:

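A sketch of that, with placeholder bucket and folder names:

```python
import pyarrow.parquet as pq
import s3fs

# s3fs provides a filesystem object that ParquetDataset can use.
s3 = s3fs.S3FileSystem()

# Placeholder bucket/folder path.
dataset = pq.ParquetDataset('s3://my-bucket/path/to/folder', filesystem=s3)
df = dataset.read_pandas().to_pandas()
```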