I have a dataset of house sales with timestamped Periods(per quarter). I want to adjust the price according to the house pricing index change per region. I have a separate dataframe with 3 columns, the Quarter, the Region and the % change in price. I am currently achieving this by iterating over both dataframes. Is there a better way?
Minimal Example;
import pandas as pd houses_df = pd.DataFrame({'HousePrice' : [100000, 250000, 125000, 320000], 'Period' : ['2020Q1', '2020Q2', '2020Q1', '2020Q3'], 'Region' : ['NY-West', 'NY-East', 'NY-West', 'NY-East']}) HPindex_df = pd.DataFrame({'Periods' : ['2020Q1', '2020Q2', '2020Q3'] * 2, 'Regions' : ['NY-West', 'NY-West', 'NY-West', 'NY-East', 'NY-East', 'NY-East'], 'PriceIndex' : [1.1, 1.13, 0.87, 1.35, 1.21, 1.11]}) for index, row in houses_df.iterrows(): for row1 in HPindex_df.itertuples(): a = row1 if row['Region'] == row1.Regions and row['Period'] == row1.Periods: houses_df.loc[index, 'HousePrice'] = houses_df.loc[index, 'HousePrice'] * row1.PriceIndex print(houses_df) HousePrice Period Region 0 110000.0 2020Q1 NY-West 1 302500.0 2020Q2 NY-East 2 137500.0 2020Q1 NY-West 3 355200.0 2020Q3 NY-East print(HPindex_df) Periods Regions PriceIndex 0 2020Q1 NY-West 1.10 1 2020Q2 NY-West 1.13 2 2020Q3 NY-West 0.87 3 2020Q1 NY-East 1.35 4 2020Q2 NY-East 1.21 5 2020Q3 NY-East 1.11
I don’t think this is the best way to do this, since this is a naive implementation. Is there a (vectorized/pandas built-in functions)-way to achieve this?
Advertisement
Answer
Use DataFrame.merge
with left_on
and right_on
, then get all 4 column in output:
df = houses_df.merge(HPindex_df, left_on=['Period','Region'], right_on=['Periods','Regions'], how='left') df['HousePrice'] = df['HousePrice'] * df['PriceIndex'] print (df) HousePrice Period Region Periods Regions PriceIndex 0 110000.0 2020Q1 NY-West 2020Q1 NY-West 1.10 1 302500.0 2020Q2 NY-East 2020Q2 NY-East 1.21 2 137500.0 2020Q1 NY-West 2020Q1 NY-West 1.10 3 355200.0 2020Q3 NY-East 2020Q3 NY-East 1.11
For avoid it is possible use rename
:
d = {'Periods':'Period','Regions':'Region'} df = houses_df.merge(HPindex_df.rename(columns=d), on=['Period','Region'], how='left') df['HousePrice'] = df['HousePrice'] * df['PriceIndex'] print (df) HousePrice Period Region PriceIndex 0 110000.0 2020Q1 NY-West 1.10 1 302500.0 2020Q2 NY-East 1.21 2 137500.0 2020Q1 NY-West 1.10 3 355200.0 2020Q3 NY-East 1.11
Or DataFrame.join
with DataFrame.set_index
:
df = houses_df.join(HPindex_df.set_index(['Periods','Regions']), on=['Period','Region']) df['HousePrice'] = df['HousePrice'] * df['PriceIndex'] print (df) HousePrice Period Region PriceIndex 0 110000.0 2020Q1 NY-West 1.10 1 302500.0 2020Q2 NY-East 1.21 2 137500.0 2020Q1 NY-West 1.10 3 355200.0 2020Q3 NY-East 1.11