Finding regex patterns regardless of spaces

Tags: , ,



There are strings (which are rows of a pandas data frame):

2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,05 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32

What I need is to extract the following information from these lines (comma is decimal separator here):

order_id date sum
2.5807003.49 09/2020 24,54
4.7103181.69 09/2020 172,05
4.7197189.46 09/2020 172,05
4.7861901.25 09/2020 89,16
2.5807003.49 10/2020 35,65
4.7103181.69 10/2020 185,50
4.7197189.46 10/2020 185,50
4.7861901.25 10/2020 94,32

  • There are only 4 distinct order_id and they always have the same structure;
  • month in date can have one or two digits;
  • date itself may (or may not) contain spaces! E.g. 9 /2020 or 9/2020, 10/2020 or 1 0/2020;
  • sum always has two digits after , but, unfortunately, may (or not) also contain spaces: 24,54, 172,0 5, 185,5 0, 94 ,32.

This is how I thought about it: First step would be to split orders. Obviously, the information related to a given order_id is anchored by , — there are 4 orders and 4 commas in each line. So, find comma, then go forward two digits (regardless of spaces) after it, and finally look back till the beginning of line or the end of previous order. But even here I’m stuck since I don’t understand how to catch two digits after comma regardless of spaces (if any).

Answer

Input df

    vals
0   2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
1   4.7861901.25 9/2020 8 9,16
2   2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
3   4.7861901.25 10/2020 94 ,32

Now, as multiple rows in the expected df is combined in a single row in original df, it is better to first convert the whole vals column to a single string

str1 = "n".join(df['vals'].values)
str1

2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
4.7861901.25 10/2020 94 ,32

Now using findall get all the final records. All the three required columns are in separate capture groups. order_id is ([d.]+). As it has no space, it is straight forward. date is (ds?d?s?/s?(?:ds?){3}d) where space can be anywhere in the date. sum is [ds]+,s?ds?d) which has two digits after the comma.

req_vals = re.findall(r'([d.]+)s*(ds?d?s?/s?(?:ds?){3}d)s*([ds]+,s?ds?d)',str1)
req_vals
[('2.5807003.49', '9/2020', '24,54'),
 ('4.7103181.69', '9 /2020', '172,0 5'),
 ('4.7197189.46', '09/2020', '172,0 5'),
 ('4.7861901.25', '9/2020', '8 9,16'),
 ('2.5807003.49', '10/2020', '35,65'),
 ('4.7103181.69', '10/2020', '185,50'),
 ('4.7197189.46', '1 0/2020', '185,5 0'),
 ('4.7861901.25', '10/2020', '94 ,32')]

Lastly, In Output dataframe, space can be removed.

final_df = (pd.DataFrame(req_vals, columns=['order_id', 'date', 'sum'])
            .replace(r's', '', regex=True))
final_df

      order_id      date    sum
0   2.5807003.49    9/2020  24,54
1   4.7103181.69    9/2020  172,05
2   4.7197189.46    09/2020 172,05
3   4.7861901.25    9/2020  89,16
4   2.5807003.49    10/2020 35,65
5   4.7103181.69    10/2020 185,50
6   4.7197189.46    10/2020 185,50
7   4.7861901.25    10/2020 94,32


Source: stackoverflow