There are strings (which are rows of a pandas data frame):
2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,05 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32
What I need is to extract the following information from these lines (comma is decimal separator here):
order_id date sum
2.5807003.49 09/2020 24,54
4.7103181.69 09/2020 172,05
4.7197189.46 09/2020 172,05
4.7861901.25 09/2020 89,16
2.5807003.49 10/2020 35,65
4.7103181.69 10/2020 185,50
4.7197189.46 10/2020 185,50
4.7861901.25 10/2020 94,32
- There are only 4 distinct
order_idand they always have the same structure; - month in
datecan have one or two digits; - date itself may (or may not) contain spaces! E.g.
9 /2020or9/2020,10/2020or1 0/2020; sumalways has two digits after,but, unfortunately, may (or not) also contain spaces:24,54,172,0 5,185,5 0,94 ,32.
This is how I thought about it:
First step would be to split orders.
Obviously, the information related to a given order_id is anchored by , — there are 4 orders and 4 commas in each line.
So, find comma, then go forward two digits (regardless of spaces) after it, and finally look back till the beginning of line or the end of previous order.
But even here I’m stuck since I don’t understand how to catch two digits after comma regardless of spaces (if any).
Advertisement
Answer
Input df
vals 0 2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5 1 4.7861901.25 9/2020 8 9,16 2 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 3 4.7861901.25 10/2020 94 ,32
Now, as multiple rows in the expected df is combined in a single row in original df, it is better to first convert the whole vals column to a single string
str1 = "n".join(df['vals'].values) str1 2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32
Now using findall get all the final records. All the three required columns are in separate capture groups. order_id is ([d.]+). As it has no space, it is straight forward. date is (ds?d?s?/s?(?:ds?){3}d) where space can be anywhere in the date. sum is [ds]+,s?ds?d) which has two digits after the comma.
req_vals = re.findall(r'([d.]+)s*(ds?d?s?/s?(?:ds?){3}d)s*([ds]+,s?ds?d)',str1)
req_vals
[('2.5807003.49', '9/2020', '24,54'),
('4.7103181.69', '9 /2020', '172,0 5'),
('4.7197189.46', '09/2020', '172,0 5'),
('4.7861901.25', '9/2020', '8 9,16'),
('2.5807003.49', '10/2020', '35,65'),
('4.7103181.69', '10/2020', '185,50'),
('4.7197189.46', '1 0/2020', '185,5 0'),
('4.7861901.25', '10/2020', '94 ,32')]
Lastly, In Output dataframe, space can be removed.
final_df = (pd.DataFrame(req_vals, columns=['order_id', 'date', 'sum'])
.replace(r's', '', regex=True))
final_df
order_id date sum
0 2.5807003.49 9/2020 24,54
1 4.7103181.69 9/2020 172,05
2 4.7197189.46 09/2020 172,05
3 4.7861901.25 9/2020 89,16
4 2.5807003.49 10/2020 35,65
5 4.7103181.69 10/2020 185,50
6 4.7197189.46 10/2020 185,50
7 4.7861901.25 10/2020 94,32