Slightly expanding the Example 1: Merge on Multiple Columns with Different Names, results in the following Python code using Pandas pandas.DataFrame.merge:
# Create and view first DataFrame: df1 = pd.DataFrame({'a1': [0, 0, 1, 1, 2], 'b': [0, 0, 1, 1, 1], 'c': [11, 8, 10, 6, 6]}) print(df1) print(df1.dtypes) print() # Create and view second DataFrame: df2 = pd.DataFrame({'a2': [0, 1, 1, 1, 3], 'b': [0, 0, 0, 1, 1], 'd': [22, 24, 25, 33, 37]}) print(df2) print(df2.dtypes) print() # Merge df1 and df2: df_merge = pd.merge(df1, df2, how='left', left_on=['a1', 'b'], right_on=['a2', 'b']) print(df_merge) print(df_merge.dtypes)
The resulting output (I’ve added line numbers):
1 a1 b c 2 0 0 0 11 3 1 0 0 8 4 2 1 1 10 5 3 1 1 6 6 4 2 1 6 7 a1 int64 8 b int64 9 c int64 10 dtype: object 11 12 a2 b d 13 0 0 0 22 14 1 1 0 24 15 2 1 0 25 16 3 1 1 33 17 4 3 1 37 18 a2 int64 19 b int64 20 d int64 21 dtype: object 22 23 a1 b c a2 d 24 0 0 0 11 0.00 22.00 25 1 0 0 8 0.00 22.00 26 2 1 1 10 1.00 33.00 27 3 1 1 6 1.00 33.00 28 4 2 1 6 NaN NaN 29 a1 int64 30 b int64 31 c int64 32 a2 float64 33 d float64 34 dtype: object
Notice the type of a2
and d
columns in the resulting df_merge
dataframe on lines 24 through 27 have changed from the original int64
to float64
. Why would it need to change the types?
Even the example in the manual at df1.merge(df2, how=’left’, on=’a’) shows a 3.0
where I would have expected it to stay an int64
:
df1.merge(df2, how='left', on='a') a b c 0 foo 1 3.0 1 bar 2 NaN
But it doesn’t explain why. I see How to left merge two dataframes with nan without changing types from integer to float types indicates that NaN
‘s seem to be a factor, but doesn’t answer my specific question here as to why the type conversion happens.
If I change df1
to remove the last row:
df1 = pd.DataFrame({'a1': [0, 0, 1, 1, ], 'b': [0, 0, 1, 1, ], 'c': [11, 8, 10, 6, ]})
Then the output becomes what I would expect:
a1 b c a2 d 0 0 0 11 0 22 1 0 0 8 0 22 2 1 1 10 1 33 3 1 1 6 1 33 a1 int64 b int64 c int64 a2 int64 d int64 dtype: object
Advertisement
Answer
The reason for this is that NaN
is of type float.
import numpy as np print(type(np.nan)) <class 'float'>
There are good reasons why this is the case which I can explain in comments if needed.
So when you do the merge and there are missing values filled with NaN
then the column type will be automatically changed to float
. This is because all rows of that column must be of the same data type.
If you specifically wanted to have int
after the merge then you’d need to use the fillna
method and define what integers should replace the missing values. E.g. people sometimes use -1 for counts. A simple example:
import numpy as np import pandas as pd df = pd.DataFrame({ 'a': [1, 2, 3, np.nan, 5], 'b': [1, np.nan, 3, 4, 5] }) df = df.fillna(value=-1).astype(int) print(df.dtypes)
Result is:
a int64 b int64 dtype: object