When I read my DataFrame back from a CSV file, the data types of some columns have changed. Is there a way I can avoid this?
Answer
CSV files do not have a datatype definition header or anything similar. So when you read a CSV, pandas tries to guess the types, and this can change the datatypes. You have two possible solutions:
- Provide the datatypes when you call read_csv, using the dtype and parse_dates keywords (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
- Use a different file format that stores the data together with a schema (e.g. Parquet)
For example:
import pandas as pd

date = pd.to_datetime('01-01-2020')
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'b', 'd'], 'col3': [date, date, date, date]})
print('original \n', df.dtypes)

# Plain read: pandas infers the types, so the datetime column comes back as object
df.to_csv('testtype.csv', index=False)
df_csv = pd.read_csv('testtype.csv')
print('simple csv read \n', df_csv.dtypes)

# Same call again: without dtype or parse_dates the result is unchanged
df_csv = pd.read_csv('testtype.csv')
print('csv datatypes \n', df_csv.dtypes)

# parse_dates=[2] tells pandas to parse the third column as dates
df_csv = pd.read_csv('testtype.csv', parse_dates=[2])
print('csv with parse dates \n', df_csv.dtypes)

# Parquet stores the schema with the data (needs pyarrow or fastparquet installed)
df.to_parquet('testtype.pqt')
df_pqt = pd.read_parquet('testtype.pqt')
print('parquet \n', df_pqt.dtypes)
which outputs:
original
col1 int64
col2 object
col3 datetime64[ns]
dtype: object

simple csv read
col1 int64
col2 object
col3 object
dtype: object

csv datatypes
col1 int64
col2 object
col3 object
dtype: object

csv with parse dates
col1 int64
col2 object
col3 datetime64[ns]
dtype: object

parquet
col1 int64
col2 object
col3 datetime64[ns]
dtype: object
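The example above uses parse_dates but never actually passes dtype, so here is a minimal sketch of the first option with both keywords combined. The target types (int32 for col1, category for col2) are only assumptions chosen to make the effect visible, and the CSV is written to an in-memory buffer rather than to testtype.csv so the snippet runs without touching the disk.

```python
import io

import pandas as pd

date = pd.to_datetime('01-01-2020')
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ['a', 'b', 'b', 'd'],
                   'col3': [date, date, date, date]})

# Write the CSV to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# dtype fixes the types of the listed columns (illustrative choices);
# parse_dates names the columns that should be parsed as datetimes.
df_typed = pd.read_csv(
    buf,
    dtype={'col1': 'int32', 'col2': 'category'},
    parse_dates=['col3'],
)
print(df_typed.dtypes)
```

Columns not listed in dtype are still inferred by pandas, so it is probably safest to list every column whose type matters to you.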