I have a large dataframe (DF1) that contains a variable containing UK postcode data. Inevitably there are some typos in the data. However, after some work with regular expressions, I have created a second database that contains corrected versions of the postcode data (but only for those rows where the original postcode was incorrect) – DF2. (N.B. the index values are not necessarily consecutive.)
id postcode remark 0 1 L93AP Normal 2 2 LD38AH Normal 4 3 SO224ER Normal 6 4 SO21 Too short 8 5 DN379HJ Normal 10 6 M21ORH Zero replaced with O (oh) 12 7 NP745G S replaced with 5 14 8 SE136R2 Z replaced with 2 16 9 BN251ESBN Too long 18 10 TD152EH Normal
The dataframe containing the corrected data is:
0 1 2 3 pcCorrected 10 M21 0 R H M210RH 12 NP7 4 S G NP74SG 14 SE13 6 R Z SE136RZ
I want to combine the 2 databases such that the new values in the pcCorrected column in DF2 replace the old postcode values in the DF1 dataframe but, for other cells, the existing postcode values remain in tact. The final database should look like:
id postcode remark 0 1 L93AP Normal 2 2 LD38AH Normal 4 3 SO224ER Normal 6 4 SO21 Too short 8 5 DN379HJ Normal 10 6 M210RH Normal 12 7 NP74SG Normal 14 8 SE136RZ Normal 16 9 BN251ESBN Too long 18 10 TD152EH Normal
The databases are quite large (>1 million rows). Does this action have a name and what is the most efficient way to do this?
Advertisement
Answer
You can try merge by indexes , create mask by notnull and add new values by loc:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
mask = pd.notnull(df['pcCorrected'])
print mask
0     False
2     False
4     False
6     False
8     False
10     True
12     True
14     True
16    False
18    False
Name: pcCorrected, dtype: bool
df.loc[mask, 'remark'] = 'Normal'
df.loc[mask, 'postcode'] = df['pcCorrected']
print df[['id','postcode','remark']]
    id   postcode     remark
0    1      L93AP     Normal
2    2     LD38AH     Normal
4    3    SO224ER     Normal
6    4       SO21  Too short
8    5    DN379HJ     Normal
10   6     M210RH     Normal
12   7     NP74SG     Normal
14   8    SE136RZ     Normal
16   9  BN251ESBN   Too long
18  10    TD152EH     Normal