I have this code that gives this warning:
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning:
Columns (21,22,23) have mixed types.Specify dtype option on import or set low_memory=False
I have searched both Google and Stack Overflow, and people seem to offer two kinds of solutions:
1. low_memory=False
2. converters
The problem with #1 is that it merely silences the warning and does not solve the underlying problem (correct me if I am wrong).
The problem with #2 is that converters might do things we don’t want. Some say they are inefficient too, but I don’t know.
I have come up with a simpler solution:
- Find the predominant data type of each problematic column.
- Pass the dtype option when reading the data.
For example, in my case most of the values in the problematic columns are supposed to be strings, so I passed this:
import pandas as pd
mixed_cols = {'Col_21': str, 'Col_22': str, 'Col_23': str}
df = pd.read_csv('police_killings_MPV.csv', dtype=mixed_cols)
I don’t get the warning anymore but will this preserve the data? Since I can’t check 6000 values in each of the three columns manually, will this convert any integer or float to string without modifying it (3.09 –> “3.09”)? What happens to NaN values?
Answer
You have several options for reading your file. Take this sample data:
>>> %cat data.csv
Col_21
12
242.24
-232e-3
empty
.90832
Case 1: let pandas infer the data type
# df = pd.read_csv('data.csv')
>>> df
Col_21
0 12
1 242.24
2 -232e-3
3 empty
4 .90832
>>> df.info()
0 Col_21 5 non-null object
Case 2: declare additional strings as NaN markers and let pandas infer the data type
# df = pd.read_csv('data.csv', na_values='empty')
>>> df
Col_21
0 12.00000
1 242.24000
2 -0.23200
3 NaN
4 0.90832
>>> df.info()
0 Col_21 4 non-null float64
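If you do not know in advance which strings to list in na_values, one way to find them is to read the column as text and see which values fail numeric parsing. This is only a sketch: the inline sample stands in for data.csv, and the names raw, parsed and offenders are made up for illustration.

```python
import io

import pandas as pd

# Inline copy of the sample file so the snippet runs on its own.
csv_text = "Col_21\n12\n242.24\n-232e-3\nempty\n.90832\n"

# Read everything as text and turn off the default NaN markers,
# so no value is converted or dropped at this stage.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)["Col_21"]

# errors="coerce" turns anything that is not a number into NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# The original text of the values that could not be parsed, with counts.
offenders = raw[parsed.isna()].value_counts()
print(offenders)
```

On a real file, the strings listed in offenders are the candidates to review before deciding whether they belong in na_values.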
Case 3: declare additional strings as NaN markers but keep the data as plain text
# df = pd.read_csv('data.csv', na_values='empty', dtype={'Col_21': str})
>>> df
Col_21
0 12
1 242.24
2 -232e-3
3 NaN
4 .90832
>>> df.info()
0 Col_21 4 non-null object
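To check the two points from the question (is 3.09 kept as "3.09", and what happens to NaN), you could try something like the following. It is a rough sketch under a few assumptions: the CSV is an inline sample with an extra id column and an extra 'NA' row that I added for illustration, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Inline sample; the 'id' column and the 'NA' row are added for illustration.
csv_text = "id,Col_21\n1,12\n2,242.24\n3,-232e-3\n4,empty\n5,.90832\n6,NA\n"

# dtype=str keeps each cell exactly as written in the file.
# pandas' default NaN markers (such as 'NA' or an empty field) still become NaN.
df_str = pd.read_csv(io.StringIO(csv_text), dtype={"Col_21": str})
kept = df_str["Col_21"].dropna().tolist()
print(kept)  # ['12', '242.24', '-232e-3', 'empty', '.90832']
print(df_str["Col_21"].isna().sum())  # 1, the 'NA' row

# keep_default_na=False (with no na_values) turns off NaN detection,
# so 'NA' also survives as plain text.
df_raw = pd.read_csv(
    io.StringIO(csv_text), dtype={"Col_21": str}, keep_default_na=False
)
print(df_raw["Col_21"].tolist())

# Convert to numbers later, once you have decided how to treat odd values.
# errors="coerce" maps text that is not a number (here 'empty') to NaN.
numeric = pd.to_numeric(df_str["Col_21"], errors="coerce")
print(numeric)
```

So with dtype=str the text of each cell is not altered; missing values are still stored as NaN unless you switch that off, and the numeric conversion becomes an explicit step you control.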