Update pandas cell in one Dataframe from looked up value in second Dataframe
I have a case where I need to update a cell in one Dataframe, ‘Stock‘, which holds records of stock on-hand, looking up its value in a second Dataframe, ‘Items‘, which is the table of all items. This is a simplified example of the Dataframes with relevant fields.
Stock
item_no qty 0 9H.111 101 1 9H.222 230 2 MODEL_B 136 3 9H.444 344 4 MODEL_E 505 5 9H.666 332
Items
item_no model_no 0 9H.111 MODEL_A 1 9H.222 MODEL_B 2 9H.333 MODEL_B 3 9H.444 MODEL_C 4 9H.555 MODEL_D 5 9H.666 MODEL_E 6 9H.777 MODEL_D 7 9H.888 MODEL_F
The challenge
I have previously done this in PostgreSQL but would like to see if I can do all the processing in Pandas (I plan to link to the PostgreSQL table of items). If we look at the Stock table the item_no
column should only have item numbers, see Items Dataframe (table), but sometimes the users put in the model_no
instead of the item number. So in the Stock
dataframe, row 2 incorrectly has the value MODEL_B
.
What’s needed
What is needed to be done is to:
- get the value
MODEL_B
from theitem_no
column in the Stock dataframe - find that in the
model_no
column of the Items dataframe - then get the value from the
item_no
field of the Items dataframe - use that value to replace the (incorrect) model number value in the
item_no
column of the Stock dataframe
It gets a little more challenging… a model may have more than one part number:
1 9H.222 MODEL_B 2 9H.333 MODEL_B
In this case the ‘highest’ part number, in this case 9H.333
, needs to be used. In SQL I use the MAX() operator.
I would like to perform this using ‘set’ operations in pandas (not looping), similar to running a query in SQL. So this would mean (?) joining the two dataframes on the fields stock.item_no
<-> items.model_no
(?) – I’m not sure how to go about it hence the question marks.
Generate Dataframes
This code will generate the dataframes discussed above.
stock = pd.DataFrame({ 'item_no': ['9H.111', '9H.222', 'MODEL_B', '9H.444', 'MODEL_E', '9H.666'], 'qty': [101, 230, 136, 344, 505, 332], }) items = pd.DataFrame({ 'item_no': ['9H.111', '9H.222', '9H.333', '9H.444', '9H.555', '9H.666', '9H.777', '9H.888'], 'model_no': ['MODEL_A', 'MODEL_B', 'MODEL_B', 'MODEL_C', 'MODEL_D', 'MODEL_E', 'MODEL_D', 'MODEL_F'] }) display(stock) display(items)
Advertisement
Answer
You can use:
# keep max item_no per model_no # convert to mapping Series mapper = (items.sort_values(by='item_no') .drop_duplicates(subset='model_no', keep='last') .set_index('model_no')['item_no'] ) # identify rows with model_no m = stock['item_no'].isin(mapper.index) # replace the values in place stock.loc[m, 'item_no'] = stock.loc[m, 'item_no'].map(mapper)
updated stock
:
item_no qty 0 9H.111 101 1 9H.222 230 2 9H.333 136 3 9H.444 344 4 9H.666 505 5 9H.666 332