I have the following strings:
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02
I want to have it separated:
myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)'
or
myRegex = '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'
I want all numbers, exact matches for na, nan and none (in upper and lower case), and the empty string "" in the first group, like:
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]
This would be wrong:
[2020-04-02No][thing and Sons]
I want
[2020-04-02][Nothing and Sons]
How do I write a regex that checks for exact matches like 'none', but case-insensitively (it should also recognize 'None', 'nOne', etc.)?
https://regex101.com/r/HvnZ47/3
Answer
What about the following, used with re.I:
(None|NaN?|[-\d]+)?(.*)
https://regex101.com/r/d4XPPb/3
Explanation:
(None|NaN?|[-\d]+)?
- Either None
- Or NaN, where the last N is optional (due to ?), so it also matches NA
- Or digits and dashes, one or more times
- The whole group () is optional due to the trailing ?, which means it might not be there
(.*)
- Any characters up to the end of the line
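As a quick check of the pattern above, here is a minimal sketch that applies it with Python's re module to a few of the strings from the question. The names PATTERN and split_line are made up for this illustration, and the sketch assumes the lost backslash in the pattern was \d.

```python
import re

# The answer's pattern, assuming the digit class is [-\d]; re.I makes the
# None / NaN / NA keywords case-insensitive.
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)


def split_line(line):
    """Return (group1, group2); group1 is '' when the optional prefix is absent."""
    m = PATTERN.match(line)  # always matches, since both groups may be empty
    return (m.group(1) or "", m.group(2))


samples = [
    "2020-10-2125Chavez and Sons",
    "NaNRobinson, Mcmahon and Atkins",
    "2020-04-02Nothing and Sons",
    "Carpenter and Sons",
    "NoneEconomy and Sons",
    "2020-04-02",
]

for s in samples:
    g1, g2 = split_line(s)
    print(f"[{g1}][{g2}]")
```

Because group 1 is optional, m.group(1) is None when no prefix is found, which is why the helper falls back to an empty string.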
However, there can still be edge cases. Consider the following lines:
National Geographic
---Test
They would be parsed as
[Na][tional Geographic]
[---][Test]
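One possible way to tighten the regex for the "National Geographic" case is sketched below. It is not part of the original answer: it assumes Python 3.6+ (for the scoped inline flag (?i:...)) and assumes that a keyword prefix is always followed by an upper-case letter or the end of the line. The names STRICT and strict_split are hypothetical.

```python
import re

# Hypothetical stricter variant: (?i:...) makes only the keywords
# case-insensitive, while the lookahead (?=[A-Z]|$) stays case-sensitive and
# requires an upper-case letter (or end of line) right after the keyword.
STRICT = re.compile(r"((?i:none|nan?)(?=[A-Z]|$)|[-\d]+)?(.*)")


def strict_split(line):
    """Return (group1, group2); group1 is '' when no prefix is recognised."""
    m = STRICT.match(line)
    return (m.group(1) or "", m.group(2))


for s in ["National Geographic", "NAEconomy and Sons", "nOneEconomy and Sons", "---Test"]:
    g1, g2 = strict_split(s)
    print(f"[{g1}][{g2}]")
```

With this variant "National Geographic" is left whole, because the character after "Na" is lower case. A line such as "---Test" is still split into [---][Test], since dashes remain part of the numeric alternative.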
An alternative:
From here we could keep making the regex more complex; however, I think it would be a lot simpler to implement custom parsing without regex. Loop over the lines and, for each one:
- If it starts with a digit (or a dash), parse all leading digits and dashes into group 1 and the rest into group 2 (i.e. when you hit any other character, switch groups).
- Take the first 4 characters of the string and, if they are "none" (compared case-insensitively), split them out. At the same time, ensure that the 5th character is upper case:
line[:4].lower() == "none" and line[4].isupper()
- Do the same for NaN and NA:
line[:3].lower() == "nan" and line[3].isupper()
line[:2].lower() == "na" and line[2].isupper()
The above should produce more accurate results and should be a lot easier to read.
Example code:
with open("/tmp/data") as f:
    lines = f.readlines()

results = []
for line in lines:
    # Remove surrounding whitespace and the trailing \n
    line = line.strip()
    if not line:
        # Skip blank lines so the checks below never index an empty string
        continue
    # Slices such as line[4:5] (rather than line[4]) avoid an IndexError
    # when the line is exactly "none", "nan" or "na"
    if line[0].isdigit() or line[0] == "-":
        # Consume leading digits and dashes, stopping at the end of the line
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    elif line[:4].lower() == "none" and line[4:5].isupper():
        results.append((line[:4], line[4:]))
    elif line[:3].lower() == "nan" and line[3:4].isupper():
        results.append((line[:3], line[3:]))
    elif line[:2].lower() == "na" and line[2:3].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group1 is missing! Everything is group2
        results.append((None, line))

for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")
Data:
$ cat /tmp/data
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic
Output:
$ python ~/tmp/so.py
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]