I have 2 dataframes and I wish to grab IDs matching with DF2 into df1 merged as separate columns. There are multiple columns to be added due to df2 having many different country names for a specific ID.
df1 looks like below:
ID URL A example.com/1 B example.com/2 C example.com/3 D example.com/4
df2 is like this:
ID country URL A usa example.com/usa-A B uk example.com/uk-B C canada example.com/canada-C A uk example.com/uk-A C usa example.com/usa-C
What I am expecting df1 to look like:
ID URL uk USA Canada A example.com/1 example.com/uk-A example.com/usa-A NaN B example.com/2 example.com/uk-B NaN NaN C example.com/3 NaN example.com/usa-C example.com/canada-C D example.com/4 NaN NaN NaN
I wish to bring if DF1 ID A is found in DF2 ID against a country then bring the country URL up next to df1 ID in a specific country column.
The way I am trying to achieve this is using a for loop with a map call below:
final = pd.DataFrame()
for a in countries_list:
     b = df2.loc[(df2["country"] == a)]
     df1["country"] = df1['id'].map(df2.set_index('id')['url'])
     final = pd.concat([final, df1])
It runs for a certain amount of countries and then start throwing InvalidIndexError: Reindexing only valid with uniquely valued Index objects which I tried to overcome using a reset_index() function on both df1 and df2 but still after a certain amount of iterations, it throws me the same error.
Can someone suggest a more efficient way to do this or any way i could run it over all possible iterations?
Thanks,
Advertisement
Answer
Try as follows:
res = df.merge(df2.pivot(index='ID',columns='country',values='URL'), 
               left_on='ID', right_index=True, how='left')
print(res)
  ID            URL                canada                uk                usa
0  A  example.com/1                   NaN  example.com/uk-A  example.com/usa-A
1  B  example.com/2                   NaN  example.com/uk-B                NaN
2  C  example.com/3  example.com/canada-C               NaN  example.com/usa-C
3  D  example.com/4                   NaN               NaN                NaN
Explanation
- First, use df.pivotondf2. We get:
print(df2.pivot(index='ID',columns='country',values='URL')) country canada uk usa ID A NaN example.com/uk-A example.com/usa-A B NaN example.com/uk-B NaN C example.com/canada-C NaN example.com/usa-C
- Next, use df.mergeto mergedfand pivoteddf2, joining the two onID, and passingleftto thehowparameter to “use only keys from left frame”. (Leavehow='left'out, if you are not interested in the row forIDwith onlyNaNvalues.)
If you’re set on a particular column order, use e.g.:
res = res.loc[:,['ID','URL','uk','usa','canada']]