Code:
In [31]: df = pd.DataFrame({"a": [[{"b": 1}], [{"b": np.nan}]]})
In [32]: df
Out[32]:
a
0 [{'b': 1}]
1 [{'b': nan}]
In [33]: df.dtypes
Out[33]:
a object
dtype: object
In [34]: df.to_parquet("a.parquet")
In [35]: pd.read_parquet("a.parquet")
Out[35]:
a
0 [{'b': 1.0}]
1 [{'b': None}]
As you can see, the integer in [{'b': 1}] comes back as a float, [{'b': 1.0}], after the round trip through Parquet.
How can I preserve the original dtypes when reading the Parquet file back?
Answer
You can try reading the file with pyarrow.parquet.read_table and converting it with pyarrow.Table.to_pandas, passing integer_object_nulls=True (see the pyarrow documentation for Table.to_pandas):
import pyarrow.parquet as pq
pq.read_table("a.parquet").to_pandas(integer_object_nulls=True)
| | a |
|---|---|
| 0 | [{'b': 1}] |
| 1 | [{'b': None}] |
On the other hand, pandas.read_parquet with use_nullable_dtypes=True does not seem to help here: the nested integer still comes back as a float.
df = pd.DataFrame({"a": [[{"b": 1}], [{"b": None}]]})
df.to_parquet("a.parquet")
pd.read_parquet("a.parquet", use_nullable_dtypes=True)
| | a |
|---|---|
| 0 | [{'b': 1.0}] |
| 1 | [{'b': None}] |
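If you have to stay with pd.read_parquet, one possible workaround is to post-process the nested values after reading. The sketch below is not part of pandas or pyarrow: restore_ints is a hypothetical helper, and it rests on the assumption that every float with no fractional part was originally an integer. That assumption is lossy, since a genuine 2.0 would also become 2, so only use it where it holds for your data.

```python
import math
from collections.abc import Iterable


def restore_ints(value):
    """Hypothetical helper: recursively turn integral floats back into ints.

    Assumption: any finite float without a fractional part was an int
    before the Parquet round trip. None and NaN are left untouched.
    """
    if isinstance(value, dict):
        return {key: restore_ints(item) for key, item in value.items()}
    # Lists, tuples, and array-like iterables are rebuilt as plain lists.
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        return [restore_ints(item) for item in value]
    if isinstance(value, float) and math.isfinite(value) and value.is_integer():
        return int(value)
    return value


# Values shaped like the result of pd.read_parquet in the example above.
rows = [[{"b": 1.0}], [{"b": None}]]
restored = [restore_ints(row) for row in rows]
print(restored)  # [[{'b': 1}], [{'b': None}]]
```

On a real frame this could be applied per column, for example with df["a"].map(restore_ints).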