I have a complex regex pattern to match mixed dates for a csv column in pandas df. I would like to replace everything except the regex pattern match with “” . I have tried pretty much all the negation cases (^ ?! and others). But I keep replacing the regex match with “” (empty string). My Code:
JavaScript
x
7
1
import pandas as pd
2
df.read_csv('path')
3
df=DataFrame(df)
4
df.columns=['Date']
5
Date=df.Date
6
df['Date']=df['Date'].str.replace(r'^((b(0?[1-9]|[12]d|30|31)[^wdrn:](0?[1-9]|1[0-2])[^wdrn:](d{4}|d{2})b)|(b(0?[1-9]|1[0-2])[^wdrn:](0?[1-9]|[12]d|30|31)[^wdrn:](d{4}|d{2})b))','')
7
Some examples of my data:
JavaScript
1
20
20
1
Date
2
21/04/2004
3
[N/F]
4
6/07/2004
5
{}
6
[N/F]
7
6/10/2004
8
16/06/2004
9
{}
10
21/06/2004
11
[N/F]
12
1/03/2018
13
23/03/17
14
{}
15
{}
16
4/04/2006
17
19/05/2006
18
"**3/04/2006/-2/06
19
2006**"
20
Expected Output
JavaScript
1
18
18
1
21/04/2004
2
3
6/07/2004
4
5
6
6/10/2004
7
16/06/2004
8
9
21/06/2004
10
11
1/03/2018
12
23/03/17
13
14
15
4/04/2006
16
19/05/2006
17
3/04/2006
18
I would appreciate your help. Many thanks.
Advertisement
Answer
I have simplified your regex a little and am extracting rather than replacing:
Loading your data to a DataFrame:
JavaScript
1
6
1
import pandas as pd
2
import numpy as np
3
4
df = pd.read_csv('data.csv')
5
print(df)
6
Gives:
JavaScript
1
19
19
1
Date
2
0 21/04/2004
3
1 [N/F]
4
2 6/07/2004
5
3 {}
6
4 [N/F]
7
5 6/10/2004
8
6 16/06/2004
9
7 {}
10
8 21/06/2004
11
9 [N/F]
12
10 1/03/2018
13
11 23/03/17
14
12 {}
15
13 {}
16
14 4/04/2006
17
15 19/05/2006
18
16 **3/04/2006/-2/06n2006**
19
Now extract anything that can be parsed as a date:
JavaScript
1
5
1
pattern = r'(([1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(20[01][0-9]|[0-9]{2}))'
2
df['extracted_date'] = df['Date'].astype(str).str.extract(pattern)[0]
3
df = df.fillna('')
4
print(df)
5
Which returns:
JavaScript
1
19
19
1
Date extracted_date
2
0 21/04/2004 21/04/2004
3
1 [N/F]
4
2 6/07/2004 6/07/2004
5
3 {}
6
4 [N/F]
7
5 6/10/2004 6/10/2004
8
6 16/06/2004 16/06/2004
9
7 {}
10
8 21/06/2004 21/06/2004
11
9 [N/F]
12
10 1/03/2018 1/03/2018
13
11 23/03/17 23/03/17
14
12 {}
15
13 {}
16
14 4/04/2006 4/04/2006
17
15 19/05/2006 19/05/2006
18
16 **3/04/2006/-2/06n2006** 3/04/2006
19