Why does pandas.DataFrame.merge return dataframes with different column types than the input dataframes?

Slightly expanding "Example 1: Merge on Multiple Columns with Different Names" gives the following Python code, which uses pandas.DataFrame.merge:

JavaScript

The resulting output (I’ve added line numbers):

JavaScript

Notice that the types of the a2 and d columns in the resulting df_merge dataframe (lines 24 through 27) have changed from the original int64 to float64. Why do the types need to change?

Even the example in the manual for df1.merge(df2, how='left', on='a') shows a 3.0 where I would have expected the value to stay an int64:

JavaScript

But the manual doesn't explain why. The question "How to left merge two dataframes with nan without changing types from integer to float types" indicates that NaN values seem to be a factor, but it doesn't answer my specific question here: why does the type conversion happen?

If I change df1 to remove the last row:

JavaScript

Then the output becomes what I would expect:

JavaScript
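The observation above can be sketched with a pair of small frames. The names `left`, `right`, `key`, `x` and `y` are hypothetical stand-ins, not the data from the question. When every key on the left has a match on the right, no missing values are introduced and the integer dtypes survive the merge:

```python
import pandas as pd

# Hypothetical frames: every key in `left` also appears in `right`.
left = pd.DataFrame({"key": [1, 2], "x": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "y": [100, 200]})

merged_full = left.merge(right, how="left", on="key")

# No row is left unmatched, so nothing has to be filled with NaN
# and every column keeps its int64 dtype.
print(merged_full.dtypes)
```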

Answer

The reason for this is that NaN is of type float.

JavaScript
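A quick way to check this claim, assuming NumPy is installed (pandas depends on it); the variable names are only illustrative:

```python
import numpy as np

# NaN is an IEEE 754 floating-point value, so np.nan is a plain Python float.
nan_type = type(np.nan)
print(nan_type)  # <class 'float'>

# An integer array cannot hold NaN, so NumPy upcasts to float64
# as soon as NaN is mixed in with integers.
mixed = np.array([1, 2, np.nan])
print(mixed.dtype)  # float64
```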

There are good reasons why this is the case which I can explain in comments if needed.

So when the merge leaves rows without a match, pandas fills the missing values with NaN and the column is automatically upcast to float64. This is because every value in a column must share one dtype, and the default int64 dtype has no way to represent NaN.
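The upcast can be reproduced with a minimal sketch. The frames and column names below are hypothetical, not the ones from the question; the only requirement is that one key on the left has no match on the right:

```python
import pandas as pd

# Hypothetical frames: key 3 exists in `left` but not in `right`.
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"key": [1, 2], "y": [100, 200]})

merged = left.merge(right, how="left", on="key")

print(right["y"].dtype)   # int64
print(merged["y"].dtype)  # float64, because the row for key 3 holds NaN
print(merged["x"].dtype)  # int64, because this column has no missing values
```

Only the column that received a NaN changes type; columns that are fully populated keep their original dtype.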

If you specifically want int after the merge, use the fillna method to choose an integer that replaces the missing values, then cast the column back to an integer type (fillna on its own leaves the dtype as float64). For example, people sometimes use -1 for counts. A simple example:

JavaScript

Result is:

JavaScript
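As a rough sketch of that approach, using hypothetical frames and -1 as an assumed sentinel value (choose one that cannot collide with real data):

```python
import pandas as pd

# Hypothetical frames: key 3 has no match, so the merge introduces a NaN in "y".
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"key": [1, 2], "y": [100, 200]})

merged = left.merge(right, how="left", on="key")

# Replace the missing value with the sentinel, then cast back to an integer dtype.
merged["y"] = merged["y"].fillna(-1).astype("int64")

print(merged["y"].tolist())  # [100, 200, -1]
print(merged["y"].dtype)     # int64
```

The cast is only safe once every NaN has been replaced; calling astype("int64") on a column that still contains NaN raises an error.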
User contributions licensed under: CC BY-SA