Pandas str.extract() regex to extract city info

I have a pandas df of addresses like this:

df['address']

0. ALL that certain piece, parcel or tract of land situate, lying and being in the City 
   of Travelers Rest, County of Greenville, State of South Carolina

1. Townes Street on the West, in the City of Greenville, County of Greenville, State of 
   South Carolina

2. State of South Carolina, County of Greenville, City of Hampton on the southern side

I want to extract the city name from each row. Expected results:

Travelers Rest
Greenville
Hampton

My code is below:

df['city'] = df['address'].str.extract(r'\b(?:City of?) (.+?(?=[,]))')

My results:

Travelers Rest
Greenville
City of Hampton on the...

However, when the city name isn't followed by a comma, the match picks up the rest of the string. If I don't end my regex with a comma, I won't get the full city name in some cases. How can I resolve this?

Answer

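One option for the example data is to match "City of" and then capture one or more words that each start with a capital letter A-Z followed by one or more non-whitespace characters other than a comma. The match then stops at the first comma or at the first word that does not start with a capital letter: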

\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)
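
To make the pieces of the pattern easier to follow, here is a minimal sketch that spells it out in verbose mode with Python's standard `re` module. The names `CITY_PATTERN` and `extract_city` are illustrative only and are not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so each part can be commented.
CITY_PATTERN = re.compile(
    r"""
    \bCity\s+of\s+           # the words 'City of', allowing repeated whitespace
    (                        # group 1: the city name
      [A-Z][^\s,]+           # a capitalized word with no whitespace or comma
      (?:\s+[A-Z][^\s,]+)*   # zero or more further capitalized words
    )
    """,
    re.VERBOSE,
)


def extract_city(address):
    """Return the captured city name, or None when the pattern does not match."""
    match = CITY_PATTERN.search(address)
    return match.group(1) if match else None


print(extract_city("in the City   of Travelers Rest, County of Greenville"))
print(extract_city("County of Greenville, City of Hampton on the southern side"))
```

Because the capture relies on capitalization, it would also swallow a capitalized word that directly follows the city name with no comma in between (for example "City of Greenville County"), so it suits data shaped like the examples here rather than arbitrary addresses.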

import pandas as pd

data = [
    "ALL that certain piece, parcel or tract of land situate, lying and being in the City   of Travelers Rest, County of Greenville, State of South Carolina",
    "Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
    "State of South Carolina, County of Greenville, City of Hampton on the southern side"
]

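df = pd.DataFrame(data, columns=["address"])
# Capture the capitalized word(s) that follow "City of", stopping at a comma or a lowercase word
df["city"] = df["address"].str.extract(r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)")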
print(df)

Output

                                             address            city
0  ALL that certain piece, parcel or tract of lan...  Travelers Rest
1  Townes Street on the West, in the City of Gree...      Greenville
2  State of South Carolina, County of Greenville,...         Hampton