With these vaex and pyarrow versions: when reading a TSV file and exporting it to Arrow, the Arrow table couldn't be loaded properly by pyarrow.read_table(). E.g., given a file s2t.tsv: The file looks like this: And when I tried exporting the TSV to Arrow as such, then reading it back: It throws the following error: Is there some additional
Tag: parquet
Databricks – Autoloader – Not Terminating?
I'm new to Databricks and I have several Azure Blob .parquet locations I'm pulling data from, and I want to put them through Auto Loader so I can run `create table … using delta location ""` in SQL in another step. (Each parquet file is in its own directory at the parent blob dir, so we will iterate over all dirs in the
How to keep dtypes when reading a parquet file (read_parquet()) in pandas?
Code: As you can see here, [{'b': 1}] becomes [{'b': 1.0}]. How can I keep dtypes even when reading the parquet file? Answer You can try to use pyarrow.parquet.read_table and pyarrow.Table.to_pandas with integer_object_nulls (see the doc) a 0 [{'b': 1}] 1 [{'b': None}] On the other hand, it looks like pandas.read_parquet with use_nullable_dtypes doesn't work. a 0 [{'b': 1.0}] 1
How to use pyarrow parquet with multiprocessing
I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing. The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely. My only guess is that the env is somehow different, but all the environment variables should be the same in the child process and the parent process. I’ve
Retrieving data from multiple parquet files into one dataframe (Python)
I want to start by saying this is the first time I have worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They follow the same folder structure, and I am putting an example below: /Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet The file name 000.parquet is always
How to retrieve the isAdjustedToUTC flag value for a TIMESTAMP column in a parquet file?
I have a parquet file with a number of columns of type converted_type (legacy): TIMESTAMP_MICROS. I want to check if the flag isAdjustedToUTC is true. I can get it this way: This gives me either true or false as a string. Is there another way to retrieve the value of isAdjustedToUTC without using a regex? Answer As far as I can
Losing index information when using dask.dataframe.to_parquet() with partitioning
When I was using dask=1.2.2 with pyarrow 0.11.1, I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: Which gives: I did not see that described anywhere in the dask documentation. Does
How to read a Parquet file into Pandas DataFrame?
How to read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the