I have two dataframes like this
import pandas as pd import numpy as np df1 = pd.DataFrame({ 'key': list('AAABBCCAAC'), 'prop1': list('xyzuuyxzzz'), 'prop2': list('mnbnbbnnnn') }) df2 = pd.DataFrame({ 'key': list('ABBCAA'), 'prop1': [np.nan] * 6, 'prop2': [np.nan] * 6, 'keep_me': ['stuff'] * 6 }) key prop1 prop2 0 A x m 1 A y n 2 A z b 3 B u n 4 B u b 5 C y b 6 C x n 7 A z n 8 A z n 9 C z n key prop1 prop2 keep_me 0 A NaN NaN stuff 1 B NaN NaN stuff 2 B NaN NaN stuff 3 C NaN NaN stuff 4 A NaN NaN stuff 5 A NaN NaN stuff
I now want to populate columns prop1
and prop2
in df2
using the values of df1
. For each key, we will have more or equal rows in df1
than in df2
(in the example above: 5 times A
vs 3 times A
, 2 times B
vs 2 times B
and 3 times C
vs 1 time C
). For each key, I want to fill df2
using the first n
rows per key from df1
.
So, my expected outcome for df2
would be:
key prop1 prop2 keep_me 0 A x m stuff 1 B u n stuff 2 B u b stuff 3 C y b stuff 4 A y n stuff 5 A z b stuff
As key
is not unique, I cannot simple build a dictionary and then use .map
.
I was hoping that something along these lines would work:
pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')
but that fails with
ValueError: Shape of passed values is (5, 22), indices imply (5, 10)
as – I guess – the index contains non-unique values.
How can I get my desired output?
Advertisement
Answer
Because duplicates in key
values possible solution is create new counter columns in both DataFrame
s by GroupBy.cumcount
, so possible replace missing values from df2
with align by MultiIndex
created by key
and g
columns with DataFrame.fillna
:
df1['g'] = df1.groupby('key').cumcount() df2['g'] = df2.groupby('key').cumcount() print (df1) key prop1 prop2 g 0 A x m 0 1 A y n 1 2 A z b 2 3 B u n 0 4 B u b 1 5 C y b 0 6 C x n 1 7 A z n 3 8 A z n 4 9 C z n 2 print (df2) key prop1 prop2 keep_me g 0 A NaN NaN stuff 0 1 B NaN NaN stuff 0 2 B NaN NaN stuff 1 3 C NaN NaN stuff 0 4 A NaN NaN stuff 1 5 A NaN NaN stuff 2
df = (df2.set_index(['key','g']) .fillna(df1.set_index(['key','g'])) .reset_index(level=1, drop=True) .reset_index()) print (df) key prop1 prop2 keep_me 0 A x m stuff 1 B u n stuff 2 B u b stuff 3 C y b stuff 4 A y n stuff 5 A z b stuff