When I read my DataFrame back from the csv file, the data types have changed. Is there a way I can avoid this?
Answer
csv files do not have a datatype definition header or anything similar. So when you read a csv, pandas tries to guess the types, and the guess can differ from the original datatypes. You have two possible solutions:
- Provide the datatypes when you call read_csv, using the dtype and parse_dates keywords (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
- Use a different file format that stores the data together with a schema (e.g. parquet)
For example:
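    import pandas as pd

    date = pd.to_datetime('01-01-2020')
    df = pd.DataFrame({'col1': [1, 2, 3, 4],
                       'col2': ['a', 'b', 'b', 'd'],
                       'col3': [date, date, date, date]})
    print('original \n', df.dtypes)

    # Round trip through csv: the date column comes back as object
    df.to_csv('testtype.csv', index=False)
    df_csv = pd.read_csv('testtype.csv')
    print('simple csv read \n', df_csv.dtypes)

    # dtype pins the plain columns; the date column is still read as object
    df_csv = pd.read_csv('testtype.csv', dtype={'col1': 'int64', 'col2': 'object'})
    print('csv datatypes \n', df_csv.dtypes)

    # parse_dates restores the datetime column (index 2 is col3)
    df_csv = pd.read_csv('testtype.csv', parse_dates=[2])
    print('csv with parse dates \n', df_csv.dtypes)

    # parquet stores the schema with the data (needs pyarrow or fastparquet)
    df.to_parquet('testtype.pqt')
    df_pqt = pd.read_parquet('testtype.pqt')
    print('parquet \n', df_pqt.dtypes)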
This outputs:
    original
    col1             int64
    col2            object
    col3    datetime64[ns]
    dtype: object
    simple csv read
    col1     int64
    col2    object
    col3    object
    dtype: object
    csv datatypes
    col1     int64
    col2    object
    col3    object
    dtype: object
    csv with parse dates
    col1             int64
    col2            object
    col3    datetime64[ns]
    dtype: object
    parquet
    col1             int64
    col2            object
    col3    datetime64[ns]
    dtype: object
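If you want to stay with csv, the two keywords from the first solution can be combined in a single read_csv call. The following is a minimal sketch, assuming the same three columns as above. It writes to an in-memory buffer (io.StringIO) instead of a file, which is my own choice to keep the example self-contained, and the date is written as '2020-01-01' to avoid any day/month ambiguity.

```python
import io

import pandas as pd

date = pd.to_datetime("2020-01-01")
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["a", "b", "b", "d"],
    "col3": [date] * 4,
})

# Write the csv to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# dtype pins the plain columns; parse_dates handles the date column.
df_csv = pd.read_csv(
    buffer,
    dtype={"col1": "int64", "col2": "object"},
    parse_dates=["col3"],
)

print(df_csv.dtypes)
```

The drawback compared with parquet is that you have to keep this column list in your code and update it whenever the columns change.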