On these vaex and pyarrow versions:
>>> vaex.__version__
{'vaex': '4.12.0',
 'vaex-core': '4.12.0',
 'vaex-viz': '0.5.3',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.18.0'}

>>> pyarrow.__version__
'8.0.0'
When reading a tsv file and exporting it to arrow, the arrow table couldn't be properly loaded by pyarrow.parquet.read_table(). For example, given a file s2t.tsv:
$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv
The file looks like this:
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
test-1	1-best
foobar	poo bear
And when I tried exporting the tsv to arrow as such, then reading it back:
import vaex
import pyarrow as pa

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')

pa.parquet.read_table('s2t.parquet')
It throws the following error:
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338
   2339         self._dataset = ds.FileSystemDataset(
-> 2340             [fragment], schema=schema or fragment.physical_schema,
   2341             format=parquet_format,
   2342             filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Are there some additional args/kwargs that should be added when exporting or reading the parquet files?
Or is the export to arrow bugged/broken somehow?
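
One way to narrow this down is to look at the file's magic bytes: Parquet files start and end with b'PAR1', while files in the Arrow IPC file format start with b'ARROW1' (the Arrow IPC streaming format has no magic prefix at all). A minimal check, assuming the s2t.parquet produced by the export above:

# Inspect the first bytes of the exported file to see which format it
# really is: b'PAR1' means Parquet, b'ARROW1' means the Arrow IPC file
# format, anything else is likely the Arrow IPC streaming format.
with open('s2t.parquet', 'rb') as f:
    magic = f.read(6)

if magic.startswith(b'PAR1'):
    print('Parquet file')
elif magic.startswith(b'ARROW1'):
    print('Arrow IPC file')
else:
    print('Not Parquet; first bytes:', magic)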
Answer
According to https://github.com/vaexio/vaex/issues/2228, the format is determined by the export method, not by the file extension: df.export_arrow() always writes Arrow IPC data, which is why the Parquet magic bytes are missing. Either of
df.export_parquet("file.parquet")
# or
df.export("file.parquet")
will export to the right format, which can then be read by
pa.parquet.read_table("file.parquet")
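
For completeness, a minimal end-to-end sketch of the fix, using the same s2t.tsv from the question (the filenames are just the question's examples):

import vaex
import pyarrow.parquet as pq

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)

# export_parquet writes actual Parquet data; the generic export() picks
# the writer based on the file extension instead.
df.export_parquet('s2t.parquet')

table = pq.read_table('s2t.parquet')
print(table.schema)

Files that were already written with export_arrow() are still valid Arrow IPC data; they can be opened with pyarrow's IPC readers (pa.ipc.open_stream() or pa.ipc.open_file(), depending on which IPC variant your vaex version writes) rather than the Parquet reader.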