I have this code that gives this warning:
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (21,22,23) have mixed types.Specify dtype option on import or set low_memory=False
I have searched on both Google and Stack Overflow, and people seem to suggest two kinds of solutions:
- low_memory = False
- converters
The problem with #1 is that it merely silences the warning but does not solve the underlying problem (correct me if I am wrong).
The problem with #2 is that converters might do things we don't want; some say they are also inefficient, but I don't know.
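For reference, this is roughly what those two suggestions look like in code. It is only an illustrative sketch using the column names from my data, not something I am endorsing:

import pandas as pd

# Suggestion 1: read the whole file in one pass so pandas infers one dtype
# per column; this avoids the warning, but a column can still hold mixed values.
df = pd.read_csv('police_killings_MPV.csv', low_memory=False)

# Suggestion 2: run a converter function on each problematic column while parsing.
# Here the converter just forces every cell to str; a real one might clean values instead.
df = pd.read_csv(
    'police_killings_MPV.csv',
    converters={'Col_21': str, 'Col_22': str, 'Col_23': str},
)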
I have come up with a simpler solution:
- Find the general data type of the problematic column (a quick check is sketched after the example code below)
- Pass the dtype option while reading the data.
E.g. in my case, most of the elements in the problematic columns are supposed to be strings, so I have passed this:
mixed_cols = {'Col_21': str, 'Col_22': str, 'Col_23': str}
df = pd.read_csv('police_killings_MPV.csv', dtype=mixed_cols)
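For step 1 (finding what is actually in a problematic column), this is the kind of quick check I mean; just a rough sketch, with map(type) looking at the Python type of every cell:

import pandas as pd

# Read once without dtype hints, then inspect which concrete types ended up in the column.
raw = pd.read_csv('police_killings_MPV.csv', low_memory=False)

# Each cell is replaced by its Python type; value_counts() then shows how many
# cells are str, float (which includes NaN), int, and so on.
print(raw['Col_21'].map(type).value_counts())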
I don't get the warning anymore, but will this preserve the data? Since I can't check 6000 values in each of the three columns manually, will this convert any integer or float to a string without modifying it (3.09 -> "3.09")? What happens to NaN values?
Answer
You have several choices for how to read your file:
>>> %cat data.csv
Col_21
12
242.24
-232e-3
empty
.90832
Case 1: let Pandas determine the datatype
>>> df = pd.read_csv('data.csv')
>>> df
    Col_21
0       12
1   242.24
2  -232e-3
3    empty
4   .90832
>>> df.info()
...
 0   Col_21  5 non-null      object
...
Case 2: tell Pandas which strings to recognize as NaN values and let it determine the datatype
>>> df = pd.read_csv('data.csv', na_values='empty')
>>> df
      Col_21
0   12.00000
1  242.24000
2   -0.23200
3        NaN
4    0.90832
>>> df.info()
...
 0   Col_21  4 non-null      float64
...
Case 3: tell Pandas which strings to recognize as NaN values but keep the data as plain text
>>> df = pd.read_csv('data.csv', na_values='empty', dtype={'Col_21': str})
>>> df
    Col_21
0       12
1   242.24
2  -232e-3
3      NaN
4   .90832
>>> df.info()
...
 0   Col_21  4 non-null      object
...
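To come back to the question about preserving the data: with Case 3 the cells stay exactly as they appear in the file, and only the na_values strings become NaN. The short sketch below, built on the Case 3 read above, shows one way to verify this and to convert to numbers later if needed (pd.to_numeric with errors='coerce' is just one option):

import pandas as pd

# Case 3 read: every cell is kept as plain text, 'empty' becomes NaN.
df = pd.read_csv('data.csv', na_values='empty', dtype={'Col_21': str})

# Non-missing cells are plain Python str, untouched by parsing
# (e.g. '-232e-3' stays '-232e-3', it is not rewritten as -0.232).
print(df['Col_21'].map(type).value_counts())

# An explicit conversion can still be done afterwards; NaN stays NaN, and any
# value that cannot be parsed as a number would also become NaN with errors='coerce'.
print(pd.to_numeric(df['Col_21'], errors='coerce'))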