Skip to content
Advertisement

How to vectorize pandas operation

I have a dataset of house sales with timestamped Periods(per quarter). I want to adjust the price according to the house pricing index change per region. I have a separate dataframe with 3 columns, the Quarter, the Region and the % change in price. I am currently achieving this by iterating over both dataframes. Is there a better way?

Minimal Example;

import pandas as pd
houses_df = pd.DataFrame({'HousePrice' : [100000, 250000, 125000, 320000],
                          'Period' : ['2020Q1', '2020Q2', '2020Q1', '2020Q3'],
                          'Region' : ['NY-West', 'NY-East', 'NY-West', 'NY-East']})

HPindex_df = pd.DataFrame({'Periods' : ['2020Q1', '2020Q2', '2020Q3'] * 2,
                           'Regions' : ['NY-West', 'NY-West', 'NY-West', 'NY-East', 'NY-East', 'NY-East'],
                          'PriceIndex' : [1.1, 1.13, 0.87, 1.35, 1.21, 1.11]})
for index, row in houses_df.iterrows():
    for row1 in HPindex_df.itertuples():
        a = row1
        if row['Region'] == row1.Regions and row['Period'] == row1.Periods:
            houses_df.loc[index, 'HousePrice'] = houses_df.loc[index, 'HousePrice'] * row1.PriceIndex

print(houses_df)
   HousePrice  Period   Region
0    110000.0  2020Q1  NY-West
1    302500.0  2020Q2  NY-East
2    137500.0  2020Q1  NY-West
3    355200.0  2020Q3  NY-East

print(HPindex_df)
  Periods  Regions  PriceIndex
0  2020Q1  NY-West        1.10
1  2020Q2  NY-West        1.13
2  2020Q3  NY-West        0.87
3  2020Q1  NY-East        1.35
4  2020Q2  NY-East        1.21
5  2020Q3  NY-East        1.11

I don’t think this is the best way to do this, since this is a naive implementation. Is there a (vectorized/pandas built-in functions)-way to achieve this?

Advertisement

Answer

Use DataFrame.merge with left_on and right_on, then get all 4 column in output:

df = houses_df.merge(HPindex_df, 
                     left_on=['Period','Region'], 
                     right_on=['Periods','Regions'], 
                     how='left')
df['HousePrice'] = df['HousePrice'] * df['PriceIndex']
print (df)

   HousePrice  Period   Region Periods  Regions  PriceIndex
0    110000.0  2020Q1  NY-West  2020Q1  NY-West        1.10
1    302500.0  2020Q2  NY-East  2020Q2  NY-East        1.21
2    137500.0  2020Q1  NY-West  2020Q1  NY-West        1.10
3    355200.0  2020Q3  NY-East  2020Q3  NY-East        1.11

For avoid it is possible use rename:

d = {'Periods':'Period','Regions':'Region'}
df = houses_df.merge(HPindex_df.rename(columns=d), on=['Period','Region'], how='left')
df['HousePrice'] = df['HousePrice'] * df['PriceIndex']
print (df)

   HousePrice  Period   Region  PriceIndex
0    110000.0  2020Q1  NY-West        1.10
1    302500.0  2020Q2  NY-East        1.21
2    137500.0  2020Q1  NY-West        1.10
3    355200.0  2020Q3  NY-East        1.11

Or DataFrame.join with DataFrame.set_index:

df = houses_df.join(HPindex_df.set_index(['Periods','Regions']), on=['Period','Region'])
df['HousePrice'] = df['HousePrice'] * df['PriceIndex']
print (df)
   HousePrice  Period   Region  PriceIndex
0    110000.0  2020Q1  NY-West        1.10
1    302500.0  2020Q2  NY-East        1.21
2    137500.0  2020Q1  NY-West        1.10
3    355200.0  2020Q3  NY-East        1.11
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement