I have a pandas df of addresses like this:
df['address']

0. ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina
1. Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina
2. State of South Carolina, County of Greenville, City of Hampton on the southern side
I want to extract the city name from each address. Expected results:
Travelers Rest
Greenville
Hampton
My code is below:
df['city'] = df['address'].str.extract(r'\b(?:City of?) (.+?(?=[,]))')
My results:
Travelers Rest
Greenville
City of Hampton on the
However, when the city name isn't followed by a comma, the regex picks up the rest of the string. If I don't end my regex with a comma, I won't get the full city name in some cases. How can I resolve this?
Answer
One option for the example data is to match, after "City of", one or more words that each start with a capital letter A-Z followed by non-whitespace characters other than a comma:
\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)
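To make the parts of that pattern easier to follow, here is a minimal sketch that spells it out with `re.VERBOSE` from Python's standard library. The helper name `extract_city` and the shortened sample strings are assumptions made for illustration only.

```python
import re

# Verbose form of the pattern; whitespace and comments inside it are ignored.
CITY_PATTERN = re.compile(
    r"""
    \bCity\s+of\s+            # the words "City of", starting at a word boundary
    (                         # capture group 1: the city name
        [A-Z][^\s,]+          # a capitalized word without whitespace or commas
        (?:\s+[A-Z][^\s,]+)*  # optionally followed by more capitalized words
    )
    """,
    re.VERBOSE,
)


def extract_city(text):
    """Hypothetical helper: return the captured city name, or None if absent."""
    match = CITY_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_city("in the City of Travelers Rest, County of Greenville"))
print(extract_city("City of Hampton on the southern side"))
```

The capture stops at the first comma or at the first word that does not start with a capital letter, which is why "on the southern side" is left out. The flip side is that a capitalized word directly after the city name would be captured as well, so this approach relies on the text following the name starting in lowercase or with a comma.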
import pandas as pd

data = [
    "ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina",
    "Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
    "State of South Carolina, County of Greenville, City of Hampton on the southern side"
]

df = pd.DataFrame(data, columns=["address"])
# Capture the capitalized word(s) that follow "City of", stopping at a comma or a lowercase word
df["city"] = df["address"].str.extract(r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)")
print(df)
Output
                                          address            city
0  ALL that certain piece, parcel or tract of lan  Travelers Rest
1  Townes Street on the West, in the City of Gree      Greenville
2  State of South Carolina, County of Greenville,         Hampton
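One detail worth knowing when applying this to a larger column: rows that contain no "City of" phrase come back as missing values rather than raising an error. The sketch below assumes a small made-up Series (the third entry is hypothetical and has no city) and uses `expand=False`, which makes `str.extract` return a Series when the pattern has a single capture group.

```python
import pandas as pd

# Hypothetical sample: the last entry contains no "City of" phrase.
addresses = pd.Series([
    "in the City of Travelers Rest, County of Greenville",
    "City of Hampton on the southern side",
    "County of Greenville, State of South Carolina",
])

pattern = r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)"

# With one capture group, expand=False yields a Series instead of a DataFrame.
cities = addresses.str.extract(pattern, expand=False)

print(cities.tolist())
# Count the rows where no city was found.
print(cities.isna().sum())
```

Checking `cities.isna()` this way is a quick method to find the addresses the pattern does not cover before relying on the extracted column.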