Finding regex patterns regardless of spaces

Question

I have the following strings (they are rows of a pandas data frame):

    2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,05 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32

What I need is to extract the following information from these lines (the comma is the decimal separator here): order_id date

Accepted Answer

Input df:

       vals
    0  2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
    1  4.7861901.25 9/2020 8 9,16
    2  2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
    3  4.7861901.25 10/2020 94 ,32

Because several rows of the expected df are combined in a single row of the original df, it is better to first join the whole vals column into a single string:

    import re
    import pandas as pd

    # One record per line is not guaranteed, so work on one joined string
    str1 = "\n".join(df['vals'].values)
    print(str1)

    2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
    4.7861901.25 9/2020 8 9,16
    2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
    4.7861901.25 10/2020 94 ,32

Now use findall to get all the final records. The three required columns are in separate capture groups:

- order_id is ([\d.]+). As it contains no spaces, it is straightforward.
- date is (\d\s?\d?\s?/\s?(?:\d\s?){3}\d), which tolerates a single space between any two characters of the date.
- sum is ([\d\s]+,\s?\d\s?\d), which expects two digits after the comma.

    # Groups: order_id, date, sum (spaces are still present at this point)
    req_vals = re.findall(r'([\d.]+)\s*(\d\s?\d?\s?/\s?(?:\d\s?){3}\d)\s*([\d\s]+,\s?\d\s?\d)', str1)
    req_vals

    [('2.5807003.49', '9/2020', '24,54'),
     ('4.7103181.69', '9 /2020', '172,0 5'),
     ('4.7197189.46', '09/2020', '172,0 5'),
     ('4.7861901.25', '9/2020', '8 9,16'),
     ('2.5807003.49', '10/2020', '35,65'),
     ('4.7103181.69', '10/2020', '185,50'),
     ('4.7197189.46', '1 0/2020', '185,5 0'),
     ('4.7861901.25', '10/2020', '94 ,32')]

Lastly, the spaces can be removed in the output dataframe:

    # Strip every whitespace character from all three columns
    final_df = (pd.DataFrame(req_vals, columns=['order_id', 'date', 'sum'])
                .replace(r'\s', '', regex=True))
    final_df

           order_id     date     sum
    0  2.5807003.49   9/2020   24,54
    1  4.7103181.69   9/2020  172,05
    2  4.7197189.46  09/2020  172,05
    3  4.7861901.25   9/2020   89,16
    4  2.5807003.49  10/2020   35,65
    5  4.7103181.69  10/2020  185,50
    6  4.7197189.46  10/2020  185,50
    7  4.7861901.25  10/2020   94,32
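For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the answer, writes the same pattern in re.VERBOSE form so each capture group can carry a comment, and then goes one step beyond the answer by converting sum to a float and date to a timestamp. The names RECORD, records and tidy, and the type conversion at the end, are my own additions for illustration and are not part of the original answer.

```python
import re
import pandas as pd

# Sample data as shown in the answer's input frame
df = pd.DataFrame({"vals": [
    "2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5",
    "4.7861901.25 9/2020 8 9,16",
    "2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0",
    "4.7861901.25 10/2020 94 ,32",
]})

# Same pattern as in the answer, spread out with re.VERBOSE
RECORD = re.compile(r"""
    ([\d.]+)                          # order_id: digits and dots, no spaces
    \s*
    (\d\s?\d?\s?/\s?(?:\d\s?){3}\d)   # date: one or two month digits, slash, four year digits
    \s*
    ([\d\s]+,\s?\d\s?\d)              # sum: integer part, comma, two decimal digits
""", re.VERBOSE)

# Join all rows first, because one row may hold several records
records = RECORD.findall("\n".join(df["vals"]))

# Build the frame and strip the stray whitespace from every column
tidy = (pd.DataFrame(records, columns=["order_id", "date", "sum"])
          .replace(r"\s", "", regex=True))

# Optional extra step (assumption: numeric and datetime types are wanted)
tidy["sum"] = tidy["sum"].str.replace(",", ".", regex=False).astype(float)
tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%Y")

print(tidy)
```

The conversion is kept as a separate last step so that it can be dropped if the values should stay as strings, as they do in the answer's output.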

Answer