[update:] Accepted answer suggests, this can not be done with the python re
library in one step. If you know otherwise, please comment.
I’m reverse-engineering a massive ETL pipeline, I’d like to extract the full data lineage from stored procedures and views.
I’m struggling with the following regexp.
import re select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`" # this captures every occurrence of the source fields, but not the target okay_pattern = r"(?i)((`[a-z0-9_]+`.`[a-z0-9_]+`)[ ,)=<>]).*?" # this captures the target too, but captures only the first input field wrong_pattern = r"(?i)((((`[a-z0-9_]+`.`[a-z0-9_]+`)[ ,)=<>]).*?AS (`[a-z0-9_]+)`).*?)" re.findall(okay_pattern, select_clause) re.findall(wrong_pattern, select_clause)
TLDR: I’d like to capture
[aaa, bbb, XXX], [eee, fff, ..., ooo, YYY], [ppp, ZZZ]
from a string like
"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."
where a,b,e,f,h
match one pattern, X,Y,Z
match another, and the first pattern might occur up to ~20 times, before the second one appears, which always appears alone.
I’m open to solutions with the sqlglot
, sql-metadata
, or sqlparse
libraries as well, it is just regex is better documented.
(Probably I’m code golfing, and I should do this in several steps, starting with splitting the string into individual expressions.)
Advertisement
Answer
You may use this regex with 3 capture and 1 non-capture groups:
(w+).+(w+)(?:.+(w+))?
Code:
import re s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....' print (re.findall(r'(w+).+(w+)(?:.+(w+))?', s))
Output:
[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]