Update pandas cell in one Dataframe from looked up value in second Dataframe
I have a case where I need to update a cell in one Dataframe, ‘Stock‘, which holds records of stock on-hand, looking up its value in a second Dataframe, ‘Items‘, which is the table of all items. This is a simplified example of the Dataframes with relevant fields.
Stock
item_no qty 0 9H.111 101 1 9H.222 230 2 MODEL_B 136 3 9H.444 344 4 MODEL_E 505 5 9H.666 332
Items
item_no model_no 0 9H.111 MODEL_A 1 9H.222 MODEL_B 2 9H.333 MODEL_B 3 9H.444 MODEL_C 4 9H.555 MODEL_D 5 9H.666 MODEL_E 6 9H.777 MODEL_D 7 9H.888 MODEL_F
The challenge
I have previously done this in PostgreSQL but would like to see if I can do all the processing in Pandas (I plan to link to the PostgreSQL table of items). If we look at the Stock table the item_no column should only have item numbers, see Items Dataframe (table), but sometimes the users put in the model_no instead of the item number. So in the Stock dataframe, row 2 incorrectly has the value MODEL_B.
What’s needed
What is needed to be done is to:
- get the value
MODEL_Bfrom theitem_nocolumn in the Stock dataframe - find that in the
model_nocolumn of the Items dataframe - then get the value from the
item_nofield of the Items dataframe - use that value to replace the (incorrect) model number value in the
item_nocolumn of the Stock dataframe
It gets a little more challenging… a model may have more than one part number:
1 9H.222 MODEL_B 2 9H.333 MODEL_B
In this case the ‘highest’ part number, in this case 9H.333, needs to be used. In SQL I use the MAX() operator.
I would like to perform this using ‘set’ operations in pandas (not looping), similar to running a query in SQL. So this would mean (?) joining the two dataframes on the fields stock.item_no <-> items.model_no (?) – I’m not sure how to go about it hence the question marks.
Generate Dataframes
This code will generate the dataframes discussed above.
stock = pd.DataFrame({
'item_no': ['9H.111', '9H.222', 'MODEL_B', '9H.444', 'MODEL_E', '9H.666'],
'qty': [101, 230, 136, 344, 505, 332],
})
items = pd.DataFrame({
'item_no': ['9H.111', '9H.222', '9H.333', '9H.444', '9H.555', '9H.666', '9H.777', '9H.888'],
'model_no': ['MODEL_A', 'MODEL_B', 'MODEL_B', 'MODEL_C', 'MODEL_D', 'MODEL_E', 'MODEL_D', 'MODEL_F']
})
display(stock)
display(items)
Advertisement
Answer
You can use:
# keep max item_no per model_no
# convert to mapping Series
mapper = (items.sort_values(by='item_no')
.drop_duplicates(subset='model_no', keep='last')
.set_index('model_no')['item_no']
)
# identify rows with model_no
m = stock['item_no'].isin(mapper.index)
# replace the values in place
stock.loc[m, 'item_no'] = stock.loc[m, 'item_no'].map(mapper)
updated stock:
item_no qty 0 9H.111 101 1 9H.222 230 2 9H.333 136 3 9H.444 344 4 9H.666 505 5 9H.666 332