I have a dataframe with columns like this:
   A                           B
0  - 5923FoxRd                 5923 Fox Rd
1  631 Newhaven Ave            Modesto
2  Saratoga Street, Suite 200  Saratoga Street, Suite 200
I want to create a list of the values from A that match the values in B. The list should look like [- 5923FoxRd, Saratoga Street, Suite 200…]. What is the easiest way to do this?
Answer
To make a little go a long way, do the following:
- Create a new series for each column and pass the regex pattern \W+ to str.replace() to strip every non-word character
- Use str.lower() so the comparison ignores case
- Create replace lists to normalize drive to dr, avenue to ave, etc.
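# Strip non-word characters and lowercase both columns.
# regex=True is needed on newer pandas versions, where the pattern is literal by default.
s1 = df['A'].str.replace(r'\W+', '', regex=True).str.lower()
s2 = df['B'].str.replace(r'\W+', '', regex=True).str.lower()
# Keep the original values of A on rows where the normalized strings match.
lst = [*df[s1==s2]['A']]
lst
Out[1]: ['- 5923FoxRd', 'Saratoga Street, Suite 200']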
This is what s1 and s2 look like:
print(s1,s2)

0                 5923foxrd
1            631newhavenave
2    saratogastreetsuite200
Name: A, dtype: object

0                 5923foxrd
1                   modesto
2    saratogastreetsuite200
Name: B, dtype: object
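For reference, the steps so far can be run end to end. This is a minimal sketch that assumes the frame holds exactly the three rows shown in the question. The DataFrame construction is not part of the original answer and is rebuilt here from that sample.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question (an assumption).
df = pd.DataFrame({
    "A": ["- 5923FoxRd", "631 Newhaven Ave", "Saratoga Street, Suite 200"],
    "B": ["5923 Fox Rd", "Modesto", "Saratoga Street, Suite 200"],
})

# Drop every run of non-word characters, then lowercase.
# regex=True is passed explicitly because newer pandas versions
# treat the pattern as a literal string by default.
s1 = df["A"].str.replace(r"\W+", "", regex=True).str.lower()
s2 = df["B"].str.replace(r"\W+", "", regex=True).str.lower()

# Keep the original values of A on rows where the normalized forms agree.
lst = df.loc[s1 == s2, "A"].tolist()
print(lst)  # ['- 5923FoxRd', 'Saratoga Street, Suite 200']
```

Using df.loc with a boolean mask and .tolist() is equivalent to the [*df[s1==s2]['A']] form in the answer; either works.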
From there, you might want to add some replacement values to normalize your data even further, like:
# Long forms and the abbreviations they should be reduced to (same order in both lists).
to_replace = ['drive', 'avenue', 'street']
replaced = ['dr', 'ave', 'str']

# Same cleanup as before, then substitute each long form with its abbreviation.
s1 = df['A'].str.replace(r'\W+', '', regex=True).str.lower().replace(to_replace, replaced, regex=True)
s2 = df['B'].str.replace(r'\W+', '', regex=True).str.lower().replace(to_replace, replaced, regex=True)
lst = [*df[s1==s2]['A']]
lst
print(s1,s2)

0              5923foxrd
1         631newhavenave
2    saratogastrsuite200
Name: A, dtype: object

0              5923foxrd
1                modesto
2    saratogastrsuite200
Name: B, dtype: object
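If the same cleanup is needed in several places, it may be convenient to wrap it in a helper. The sketch below is one possible way to do that, not part of the original answer: the function name normalize, the ABBREVIATIONS dict, and the two sample rows are all hypothetical and chosen only to show a case where the abbreviation step changes the result.

```python
import pandas as pd

# Hypothetical abbreviation table; extend it to fit your own data.
ABBREVIATIONS = {"drive": "dr", "avenue": "ave", "street": "str"}


def normalize(series: pd.Series, abbreviations: dict) -> pd.Series:
    """Strip non-word characters, lowercase, then abbreviate long forms."""
    cleaned = series.str.replace(r"\W+", "", regex=True).str.lower()
    for long_form, short_form in abbreviations.items():
        # Plain substring replacement: the long form is shortened
        # wherever it occurs, including inside longer words.
        cleaned = cleaned.str.replace(long_form, short_form, regex=False)
    return cleaned


# Invented sample rows, for illustration only.
df = pd.DataFrame({
    "A": ["12 Oak Drive", "9 Elm Avenue"],
    "B": ["12 oak dr.", "9 Elm St"],
})

mask = normalize(df["A"], ABBREVIATIONS) == normalize(df["B"], ABBREVIATIONS)
matches = df.loc[mask, "A"].tolist()
print(matches)  # ['12 Oak Drive']
```

Because whitespace is removed before the substitution, word boundaries are gone, so a name that merely contains "street" or "drive" is shortened too. That is harmless when both columns go through the same function, but worth keeping in mind when inspecting the normalized values.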