Skip to content

Pandas merge stop at first match like vlookup instead of duplicating

I have two tables, PO data and commodity code data. Some genius decided that some material group codes should be the same as they are differentiated at a lower level by GL accounts. Because of that, I can’t merge on material groups, as I’ll get duplicate rows.

Assume the following:

import pandas as pd

d1 = {'PO':[123456,654321,971358], 'matgrp': ["1001",'803A',"803B"]}
d2 = {'matgrp':["1001", "1001", "803A", "803B"], 'commodity':['foo - 10001', 'bar - 10002', 'spam - 100003','eggs - 10003']}

pos = pd.DataFrame(data=d1)
mat_grp = pd.DataFrame(data=d2)

merged = pd.merge(pos, mat_grp, how='left', on='matgrp')
      PO    matgrp  commodity
0   123456  1001    foo - 10001
1   123456  1001    bar - 10002
2   654321  803A    spam - 100003
3   971358  803B    eggs - 10003

As you can see, PO 123456 shows up twice, as there are multiple rows for material 1001 in the material groups table.

The desired behavior is that merge only merges once, finds the first entry for the material group, adds it, and nothing else, like how vlookup works. The long commodity code might be incorrect in some cases (always showing the first one), that’s an acceptable inaccuracy.

ps.: while suggestions are welcome how to tackle this problem outside of the scope of this question (like merging on GL accounts, which is not feasible for other reasons) assume the following: The available data is a PO list from SAP ME81N and an Excel file with the list of material groups/commodity codes.



pandas’ merge behaves (mostly) like a SQL merge and will provide all combinations of matching keys. If you only want the first item, simply remove it from the data you feed to merge.

Use drop_duplicates on mat_grp:

merged = pd.merge(pos, mat_grp.drop_duplicates('matgrp'), how='left', on='matgrp')


       PO matgrp      commodity
0  123456   1001    foo - 10001
1  654321   803A  spam - 100003
2  971358   803B   eggs - 10003