I have two dataframes like this
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'key': list('AAABBCCAAC'),
'prop1': list('xyzuuyxzzz'),
'prop2': list('mnbnbbnnnn')
})
df2 = pd.DataFrame({
'key': list('ABBCAA'),
'prop1': [np.nan] * 6,
'prop2': [np.nan] * 6,
'keep_me': ['stuff'] * 6
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 C x n
7 A z n
8 A z n
9 C z n
key prop1 prop2 keep_me
0 A NaN NaN stuff
1 B NaN NaN stuff
2 B NaN NaN stuff
3 C NaN NaN stuff
4 A NaN NaN stuff
5 A NaN NaN stuff
I now want to populate columns prop1
and prop2
in df2
using the values of df1
. For each key, we will have more or equal rows in df1
than in df2
(in the example above: 5 times A
vs 3 times A
, 2 times B
vs 2 times B
and 3 times C
vs 1 time C
). For each key, I want to fill df2
using the first n
rows per key from df1
.
So, my expected outcome for df2
would be:
key prop1 prop2 keep_me
0 A x m stuff
1 B u n stuff
2 B u b stuff
3 C y b stuff
4 A y n stuff
5 A z b stuff
As key
is not unique, I cannot simple build a dictionary and then use .map
.
I was hoping that something along these lines would work:
pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')
but that fails with
ValueError: Shape of passed values is (5, 22), indices imply (5, 10)
as – I guess – the index contains non-unique values.
How can I get my desired output?
Advertisement
Answer
Because duplicates in key
values possible solution is create new counter columns in both DataFrame
s by GroupBy.cumcount
, so possible replace missing values from df2
with align by MultiIndex
created by key
and g
columns with DataFrame.fillna
:
df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()
print (df1)
key prop1 prop2 g
0 A x m 0
1 A y n 1
2 A z b 2
3 B u n 0
4 B u b 1
5 C y b 0
6 C x n 1
7 A z n 3
8 A z n 4
9 C z n 2
print (df2)
key prop1 prop2 keep_me g
0 A NaN NaN stuff 0
1 B NaN NaN stuff 0
2 B NaN NaN stuff 1
3 C NaN NaN stuff 0
4 A NaN NaN stuff 1
5 A NaN NaN stuff 2
df = (df2.set_index(['key','g'])
.fillna(df1.set_index(['key','g']))
.reset_index(level=1, drop=True)
.reset_index())
print (df)
key prop1 prop2 keep_me
0 A x m stuff
1 B u n stuff
2 B u b stuff
3 C y b stuff
4 A y n stuff
5 A z b stuff