On these vaex and pyarrow versions:

```
>>> vaex.__version__
{'vaex': '4.12.0', 'vaex-core': '4.12.0', 'vaex-viz': '0.5.3', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
>>> pyarrow.__version__
'8.0.0'
```
When reading a TSV file with vaex and exporting it to Arrow, the resulting file could not be loaded by `pyarrow.parquet.read_table()`. For example, given a file `s2t.tsv` created as follows:
```
$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv
```
The file looks like this:
```
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
```
When I try exporting the TSV to Arrow like this, then reading it back:
```
import vaex
import pyarrow as pa

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')
pa.parquet.read_table('s2t.parquet')
```
It throws the following error:
```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338
   2339         self._dataset = ds.FileSystemDataset(
-> 2340             [fragment], schema=schema or fragment.physical_schema,
   2341             format=parquet_format,
   2342             filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
Are there additional args/kwargs that should be passed when exporting or reading the Parquet file?
Or is the export to Arrow bugged/broken somehow?
Answer
According to https://github.com/vaexio/vaex/issues/2228, use

```
df.export_parquet("file.parquet")  # or df.export("file.parquet")
```

to export to the Parquet format, which can then be read by

```
pa.parquet.read_table("file.parquet")
```