I am curious why a simple concatenation of two dataframes in pandas:
    initId.shape  # (66441, 1)
    initId.isnull().sum()  # 0

    ypred.shape  # (66441, 1)
    ypred.isnull().sum()  # 0
of the same shape and both without NaN values
    foo = pd.concat([initId, ypred], join='outer', axis=1)
    foo.shape  # (83384, 2)
    foo.isnull().sum()  # 16943
can result in a lot of NaN values if joined.
How can I fix this and prevent NaN values from being introduced? Trying to reproduce it with
    import pandas as pd

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.
Answer
I think the problem is that the index values differ: wherever concat cannot align rows by index label, you get NaN:
    import pandas as pd

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
    print(aaa)
        prediction
    4            0
    5            1
    8            0
    7            1
    10           0
    12           0

    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print(pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    0          NaN          0.0
    1          NaN          0.0
    2          NaN          1.0
    3          NaN          0.0
    4          0.0          1.0
    5          1.0          1.0
    7          1.0          NaN
    8          0.0          NaN
    10         0.0          NaN
    12         0.0          NaN
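One way to confirm that mismatched indexes are the cause is to compare the two indexes before concatenating. This is a minimal sketch on the toy frames above; the names same_index, n_union and only_one_side are illustrative and not from the original post. With unique labels, an outer concat along axis=1 yields one row per label in the union of the indexes, which would presumably explain why the frames in the question grow from 66441 to 83384 rows.

```python
import pandas as pd

# Same toy frames as in the answer: aaa has a custom index, bbb the default one.
aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# False means concat(axis=1) has to align on labels and may introduce NaN.
same_index = aaa.index.equals(bbb.index)

# An outer join produces one row per label in the union of both indexes.
n_union = len(aaa.index.union(bbb.index))

# Labels present in only one of the two frames end up with a NaN in the result.
only_one_side = aaa.index.symmetric_difference(bbb.index)

print(same_index)
print(n_union)
print(list(only_one_side))
```

If same_index is False and n_union is larger than the length of either frame, the extra rows and NaN values come from label alignment rather than from the data itself.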
The solution is reset_index, if the index values are not needed:
    aaa.reset_index(drop=True, inplace=True)
    bbb.reset_index(drop=True, inplace=True)

    print(aaa)
       prediction
    0           0
    1           1
    2           0
    3           1
    4           0
    5           0

    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print(pd.concat([aaa, bbb], axis=1))
       prediction  groundTruth
    0           0            0
    1           1            0
    2           0            1
    3           1            0
    4           0            1
    5           0            1
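If the original frames should keep their indexes, the reset can also be done on temporary copies inside the concat call. This is a sketch of that variant, not part of the original answer; the names frames and combined are illustrative. It pairs rows purely by position, so it assumes both frames are already in the same row order.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# reset_index(drop=True) without inplace returns a new frame with a fresh
# 0..n-1 index, so aaa and bbb themselves keep their original indexes.
frames = [df.reset_index(drop=True) for df in (aaa, bbb)]
combined = pd.concat(frames, axis=1)

print(combined)
print(combined.isnull().sum().sum())  # total number of NaN values
```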
EDIT: If you need the same index as aaa and both DataFrames have the same length, use:
    bbb.index = aaa.index
    print(pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    4            0            0
    5            1            0
    8            0            1
    7            1            0
    10           0            1
    12           0            1
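Assigning to bbb.index changes bbb in place. A possible alternative that leaves bbb untouched is to relabel a copy with set_index, guarded by an explicit length check. This is a sketch under the same assumption as above (rows correspond by position); bbb_aligned and combined are illustrative names, not from the original answer.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# Positional pairing only makes sense when both frames have the same length.
if len(aaa) != len(bbb):
    raise ValueError("frames differ in length; cannot pair rows by position")

# set_index with an Index object returns a relabelled copy; bbb is unchanged.
bbb_aligned = bbb.set_index(aaa.index)
combined = pd.concat([aaa, bbb_aligned], axis=1)

print(combined)
```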