
Python Pandas Mixed Type Warning – “dtype” preserves data?

I have this code that gives this warning:

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: 
Columns (21,22,23) have mixed types.Specify dtype option on import or set low_memory=False

I have searched both Google and Stack Overflow, and people seem to suggest two kinds of solutions:

  1. low_memory = False
  2. converters

The problem with #1 is that it merely silences the warning without solving the underlying problem (correct me if I am wrong).

The problem with #2 is that converters might do things we don’t want. Some say they are inefficient too, but I don’t know.

I have come up with a simpler solution:

  • Find the general data type of each problematic column.
  • Pass the dtype option while reading the data.

For example, in my case most of the elements in the problematic columns are supposed to be strings, so I passed this:

import pandas as pd

mixed_cols = {'Col_21': str, 'Col_22': str, 'Col_23': str}
df = pd.read_csv('police_killings_MPV.csv', dtype=mixed_cols)

I don’t get the warning anymore but will this preserve the data? Since I can’t check 6000 values in each of the three columns manually, will this convert any integer or float to string without modifying it (3.09 –> “3.09”)? What happens to NaN values?

Answer

You have several options for reading your file:

>>> %cat data.csv
Col_21
12
242.24
-232e-3
empty
.90832

Case 1: let pandas infer the data type

>>> df = pd.read_csv('data.csv')
>>> df
    Col_21
0       12
1   242.24
2  -232e-3
3    empty
4   .90832

>>> df.info()
...
 0   Col_21  5 non-null      object
...
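
Since checking thousands of values by hand is not practical, one way to see why a column ended up as text is to ask pandas which cells fail numeric parsing. The following is only a sketch: it inlines the sample file through `io.StringIO`, and the variable names (`as_number`, `offenders`) are illustrative, not part of the original answer.

```python
import io
import pandas as pd

# Same sample as data.csv above, inlined so the snippet is self-contained.
csv_text = "Col_21\n12\n242.24\n-232e-3\nempty\n.90832\n"
df = pd.read_csv(io.StringIO(csv_text))

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_number = pd.to_numeric(df["Col_21"], errors="coerce")

# Cells that held a value originally but failed numeric parsing.
offenders = df.loc[df["Col_21"].notna() & as_number.isna(), "Col_21"]
print(offenders.tolist())
```

On the real file, the same mask applied to each flagged column should list the handful of stray values responsible for the text dtype.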

Case 2: declare the strings that mean NaN and let pandas infer the data type

>>> df = pd.read_csv('data.csv', na_values='empty')
>>> df
      Col_21
0   12.00000
1  242.24000
2   -0.23200
3        NaN
4    0.90832

>>> df.info()
...
 0   Col_21  4 non-null      float64
...

Case 3: declare the strings that mean NaN, but keep the data as plain text

>>> df = pd.read_csv('data.csv', na_values='empty', dtype={'Col_21': str})
>>> df
    Col_21
0       12
1   242.24
2  -232e-3
3      NaN
4   .90832

>>> df.info()
...
 0   Col_21  4 non-null      object
...
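
Case 3 suggests that dtype=str keeps each value exactly as it is written in the file (3.09 stays "3.09"), while anything listed in na_values becomes NaN. A small sketch to check this without inspecting values manually follows. The sample data is inlined, with a 3.09 row added for illustration, and the names `raw_values`, `kept`, and `literal` are hypothetical. Note that pandas' default missing-value markers (blank fields, "NA", "NaN" and similar) are still read as NaN unless keep_default_na=False is passed.

```python
import io
import pandas as pd

csv_text = "Col_21\n12\n242.24\n-232e-3\nempty\n.90832\n3.09\n"
raw_values = csv_text.splitlines()[1:]  # the text exactly as written in the file

# Read as plain text, treating 'empty' as missing.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Col_21": str}, na_values="empty")

# Every non-missing cell should match the original text character for character.
kept = df["Col_21"].dropna().tolist()
assert kept == [v for v in raw_values if v != "empty"]
print(kept)

# With the default NaN markers disabled and no na_values, nothing is converted.
literal = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
assert literal["Col_21"].tolist() == raw_values
```
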
User contributions licensed under: CC BY-SA