With these vaex and pyarrow versions: when reading a TSV file and exporting it to Arrow, the Arrow table couldn't be loaded properly by pyarrow.read_table(). E.g., given a file s2t.tsv: The file looks like this: And when I tried exporting the TSV to Arrow as such, then reading it back: It throws the following error: Is there some additional
Tag: parquet
Databricks – Autoloader – Not Terminating?
I'm new to Databricks and I have several Azure Blob .parquet locations I'm pulling data from, and I want to put them through Auto Loader so I can run `create table … using delta location ""` in SQL in another step. (Each parquet file is in its own directory at the parent blob dir, so we will iterate over all dirs in the
How to keep dtypes when reading a parquet file (read_parquet()) in pandas?
Code: As you can see here, [{'b': 1}] becomes [{'b': 1.0}]. How can I keep dtypes even when reading the parquet file? Answer You can try to use pyarrow.parquet.read_table and pyarrow.Table.to_pandas with integer_object_nulls (see the doc) a 0 [{'b': 1}] 1 [{'b': None}] On the other hand, it looks like pandas.read_parquet with use_nullable_dtypes doesn't work. a 0 [{'b': 1.0}] 1
How to use pyarrow parquet with multiprocessing
I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing. The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely. My only guess is that the env is somehow different, but all the environment variables should be the same in the child process and the parent process. I’ve
Retrieving data from multiple parquet files into one dataframe (Python)
I want to start by saying this is the first time I have worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They follow the same folder structure, and I am putting an example below: /Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet The file name 000.parquet is always
How to retrieve the isAdjustedToUTC flag value for a TIMESTAMP column in a parquet file?
I have a parquet file with a number of columns of type converted_type (legacy): TIMESTAMP_MICROS. I want to check if the flag isAdjustedToUTC is true. I can get it this way: This gives me either true or false as a string. Is there another way to retrieve the value of isAdjustedToUTC without using a regex? Answer As far as I can
Losing index information when using dask.dataframe.to_parquet() with partitioning
When I was using dask=1.2.2 with pyarrow 0.11.1, I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: Which gives: I did not see that described anywhere in the dask documentation. Does
How to read a Parquet file into Pandas DataFrame?
How to read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the