I have a complex regex pattern to match mixed dates for a csv column in pandas df. I would like to replace everything except the regex pattern match with “” . I have tried pretty much all the negation cases (^ ?! and others). But I keep replacing the regex match with “” (empty string). My Code:
import pandas as pd df.read_csv('path') df=DataFrame(df) df.columns=['Date'] Date=df.Date df['Date']=df['Date'].str.replace(r'^((b(0?[1-9]|[12]d|30|31)[^wdrn:](0?[1-9]|1[0-2])[^wdrn:](d{4}|d{2})b)|(b(0?[1-9]|1[0-2])[^wdrn:](0?[1-9]|[12]d|30|31)[^wdrn:](d{4}|d{2})b))','')
Some examples of my data:
Date 21/04/2004 [N/F] 6/07/2004 {} [N/F] 6/10/2004 16/06/2004 {} 21/06/2004 [N/F] 1/03/2018 23/03/17 {} {} 4/04/2006 19/05/2006 "**3/04/2006/-2/06 2006**"
Expected Output
21/04/2004 6/07/2004 6/10/2004 16/06/2004 21/06/2004 1/03/2018 23/03/17 4/04/2006 19/05/2006 3/04/2006
I would appreciate your help. Many thanks.
Advertisement
Answer
I have simplified your regex a little and am extracting rather than replacing:
Loading your data to a DataFrame:
import pandas as pd import numpy as np df = pd.read_csv('data.csv') print(df)
Gives:
Date 0 21/04/2004 1 [N/F] 2 6/07/2004 3 {} 4 [N/F] 5 6/10/2004 6 16/06/2004 7 {} 8 21/06/2004 9 [N/F] 10 1/03/2018 11 23/03/17 12 {} 13 {} 14 4/04/2006 15 19/05/2006 16 **3/04/2006/-2/06n2006**
Now extract anything that can be parsed as a date:
pattern = r'(([1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(20[01][0-9]|[0-9]{2}))' df['extracted_date'] = df['Date'].astype(str).str.extract(pattern)[0] df = df.fillna('') print(df)
Which returns:
Date extracted_date 0 21/04/2004 21/04/2004 1 [N/F] 2 6/07/2004 6/07/2004 3 {} 4 [N/F] 5 6/10/2004 6/10/2004 6 16/06/2004 16/06/2004 7 {} 8 21/06/2004 21/06/2004 9 [N/F] 10 1/03/2018 1/03/2018 11 23/03/17 23/03/17 12 {} 13 {} 14 4/04/2006 4/04/2006 15 19/05/2006 19/05/2006 16 **3/04/2006/-2/06n2006** 3/04/2006