I have a DataFrame with an id column
that contains the following sample values:
16620625 5686
16310427-5502
16501010 4957
16110430 8679
16990624/4174
16230404.1177
16820221/3388
I want to standardise them to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?
Here's my code:
df['id']
df.replace(" ", "-")
Answer
You can use the DataFrame.replace() function with a regular expression, like this:
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
Here's a complete example with sample data:
import pandas as pd

df = pd.DataFrame({'id': [
    '16620625 5686',
    '16310427-5502',
    '16501010 4957',
    '16110430 8679',
    '16990624/4174',
    '16230404.1177',
    '16820221/3388']})

# normalize matching strings: 8 digits + one non-digit delimiter + 4 digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)
Output:
              id
0  16620625-5686
1  16310427-5502
2  16501010-4957
3  16110430-8679
4  16990624-4174
5  16230404-1177
6  16820221-3388
If a value does not match the regular expression for the expected format, it is left unchanged.
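To illustrate that last point, here is a minimal sketch that applies the same pattern to the id column only, using Series.str.replace, and then flags rows that still do not have the target format. The two malformed sample values and the valid column name are assumptions made up for this illustration; they are not part of the original data.

```python
import pandas as pd

# The last two values are hypothetical malformed ids, added only for illustration
df = pd.DataFrame({'id': [
    '16620625 5686',
    '16990624/4174',
    '1662062 5686',   # only 7 leading digits, so the pattern does not match
    'unknown']})

# Same pattern as above: 8 digits, one non-digit delimiter, 4 digits
pattern = r'^(\d{8})\D(\d{4})$'

# Restrict the replacement to the id column instead of the whole DataFrame
df['id'] = df['id'].str.replace(pattern, r'\1-\2', regex=True)

# Flag rows that are in the XXXXXXXX-XXXX format after the replacement
# ('valid' is a hypothetical helper column)
df['valid'] = df['id'].str.match(r'^\d{8}-\d{4}$')
print(df)
```

The rows that did not match keep their original text and get valid set to False, so they can be reviewed or cleaned separately.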