I have a pandas df of addresses like this:
df['address']
0    ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina
1    Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina
2    State of South Carolina, County of Greenville, City of Hampton on the southern side
I want to extract the city name from each address. Expected results:
Travelers Rest
Greenville
Hampton
My code is below:
df['city'] = df['address'].str.extract(r'\b(?:City of?) (.+?(?=[,]))')
My results:
Travelers Rest
Greenville
City of Hampton on the...
However, when the city name isn't followed by a comma, the match picks up the rest of the string. If I don't end my regex with a comma, I won't get the full city name in some cases. How can I resolve this?
Answer
One option for the example data is to match one or more consecutive words after "City of", where each word starts with a capital letter A-Z followed by one or more non-whitespace characters other than a comma:
\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)
import pandas as pd

data = [
    "ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina",
    "Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
    "State of South Carolina, County of Greenville, City of Hampton on the southern side"
]
df = pd.DataFrame(data, columns=["address"])
# Capture the run of capitalized words that follows "City of"
df["city"] = df["address"].str.extract(r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)")
print(df)
Output
                                             address            city
0  ALL that certain piece, parcel or tract of lan...  Travelers Rest
1  Townes Street on the West, in the City of Gree...      Greenville
2  State of South Carolina, County of Greenville,...         Hampton
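To see how each part of the pattern contributes, here is a minimal sketch that spells out the same regex in verbose form using only the standard `re` module. The names `CITY_PATTERN` and `extract_city` are made up for this illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that each
# part can carry a comment (whitespace inside the pattern is ignored).
CITY_PATTERN = re.compile(
    r"""
    \bCity\s+of\s+             # the words "City of", starting at a word boundary
    (                          # group 1: the city name
        [A-Z][^\s,]+           # a capitalized word without spaces or commas
        (?:\s+[A-Z][^\s,]+)*   # optionally more capitalized words
    )
    """,
    re.VERBOSE,
)


def extract_city(address):
    """Return the city name after "City of", or None if nothing matches."""
    match = CITY_PATTERN.search(address)
    return match.group(1) if match else None


addresses = [
    "ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina",
    "Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
    "State of South Carolina, County of Greenville, City of Hampton on the southern side",
]

cities = [extract_city(a) for a in addresses]
print(cities)
```

The match stops at "Hampton" in the third address because the next word, "on", does not start with a capital letter, so no trailing comma is needed. The same caveat applies as for the answer's pattern: it relies on the word after the city name being lowercase or preceded by a comma, and it will not capture city names that contain lowercase words.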