There are strings (which are rows of a pandas
data frame):
2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,05 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32
What I need is to extract the following information from these lines (comma is decimal separator here):
order_id date sum
2.5807003.49 09/2020 24,54
4.7103181.69 09/2020 172,05
4.7197189.46 09/2020 172,05
4.7861901.25 09/2020 89,16
2.5807003.49 10/2020 35,65
4.7103181.69 10/2020 185,50
4.7197189.46 10/2020 185,50
4.7861901.25 10/2020 94,32
- There are only 4 distinct
order_id
and they always have the same structure; - month in
date
can have one or two digits; - date itself may (or may not) contain spaces! E.g.
9 /2020
or9/2020
,10/2020
or1 0/2020
; sum
always has two digits after,
but, unfortunately, may (or not) also contain spaces:24,54
,172,0 5
,185,5 0
,94 ,32
.
This is how I thought about it:
First step would be to split orders.
Obviously, the information related to a given order_id
is anchored by ,
— there are 4 orders and 4 commas in each line.
So, find comma, then go forward two digits (regardless of spaces) after it, and finally look back till the beginning of line or the end of previous order.
But even here I’m stuck since I don’t understand how to catch two digits after comma regardless of spaces (if any).
Advertisement
Answer
Input df
vals 0 2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5 1 4.7861901.25 9/2020 8 9,16 2 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 3 4.7861901.25 10/2020 94 ,32
Now, as multiple rows in the expected df is combined in a single row in original df, it is better to first convert the whole vals
column to a single string
str1 = "n".join(df['vals'].values) str1 2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94 ,32
Now using findall
get all the final records. All the three required columns are in separate capture groups. order_id
is ([d.]+)
. As it has no space, it is straight forward. date
is (ds?d?s?/s?(?:ds?){3}d)
where space can be anywhere in the date. sum
is [ds]+,s?ds?d)
which has two digits after the comma.
req_vals = re.findall(r'([d.]+)s*(ds?d?s?/s?(?:ds?){3}d)s*([ds]+,s?ds?d)',str1) req_vals [('2.5807003.49', '9/2020', '24,54'), ('4.7103181.69', '9 /2020', '172,0 5'), ('4.7197189.46', '09/2020', '172,0 5'), ('4.7861901.25', '9/2020', '8 9,16'), ('2.5807003.49', '10/2020', '35,65'), ('4.7103181.69', '10/2020', '185,50'), ('4.7197189.46', '1 0/2020', '185,5 0'), ('4.7861901.25', '10/2020', '94 ,32')]
Lastly, In Output dataframe, space can be removed.
final_df = (pd.DataFrame(req_vals, columns=['order_id', 'date', 'sum']) .replace(r's', '', regex=True)) final_df order_id date sum 0 2.5807003.49 9/2020 24,54 1 4.7103181.69 9/2020 172,05 2 4.7197189.46 09/2020 172,05 3 4.7861901.25 9/2020 89,16 4 2.5807003.49 10/2020 35,65 5 4.7103181.69 10/2020 185,50 6 4.7197189.46 10/2020 185,50 7 4.7861901.25 10/2020 94,32