I’m cleaning a dataset with pandas. I have a column called “Country” where some rows contain digits or extra information in parentheses that I need to remove, for example: Australia1, Perú (country), 3Costa Rica, etc. To do this, I take the column and apply a regex to it.
pattern = r"([a-zA-Z]+[\s]*[a-aZ-Z]+)(?:[(]*.*[)]*)"
df['Country'] = df['Country'].str.extract(pattern)
But I have a problem with this regex: it cannot match names such as “United States of America”, because it only captures “United ”. How can I repeat the pattern of the first group an unlimited number of times so that it matches the whole name?
Thanks!
Answer
In this situation, I will clean the data step by step.
import io
import pandas as pd

df_str = '''
Country
Australia1
Perú (country)
3Costa Rica
United States of America
'''
# One value per line. The default separator works here because no name
# contains a comma (recent pandas versions reject sep='\n').
df = pd.read_csv(io.StringIO(df_str.strip()))

# handle the data
df['Country'] = (df['Country']
    .str.replace(r'\d+', '', regex=True)  # remove digits
    .str.split('(').str[0]                # keep the text before `(`
    .str.strip()                          # strip surrounding whitespace
)
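If you would rather do it with a single pattern, as the question asks, one option is to repeat a "whitespace plus word" group after the first word. The following is only a sketch using the standard `re` module: the names `NAME_PATTERN` and `extract_name` are made up for illustration, and it assumes that country names consist only of letters and spaces, so names with hyphens or apostrophes would be cut short.

```python
import re

# Hypothetical pattern: one word of letters, followed by any number of
# "whitespace + word" repetitions. [^\W\d_] matches a word character that
# is neither a digit nor an underscore, i.e. a letter (including "ú").
NAME_PATTERN = r"([^\W\d_]+(?:\s+[^\W\d_]+)*)"

def extract_name(raw):
    """Return the first run of letter-only words in raw, or None."""
    match = re.search(NAME_PATTERN, raw)
    return match.group(1) if match else None

samples = ["Australia1", "Perú (country)", "3Costa Rica", "United States of America"]
cleaned = [extract_name(s) for s in samples]
print(cleaned)
```

The same pattern string should also work on the column itself with `df['Country'].str.extract(NAME_PATTERN, expand=False)`, which returns a Series holding the first capture group.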