I have the following strings:
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02
I want to split each one into two groups, using a regex such as:
myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)'
or
myRegex = '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'
I want all numbers, exact matches for na, nan and none (in upper or lower case), and the empty string "" in the first group, like:
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]
This would be wrong:
[2020-04-02No][thing and Sons]
I want
[2020-04-02][Nothing and Sons]
How do I write a regex that matches exact words like 'none' case-insensitively (it should also recognize 'None', 'nOne', etc.)?
https://regex101.com/r/HvnZ47/3
Answer
What about the following with re.I:
(None|NaN?|[-\d]+)?(.*)
https://regex101.com/r/d4XPPb/3
Explanation:
- (None|NaN?|[-\d]+)?
  - Either None
  - Or NaN, in which the last N is optional (due to ?), so it also matches NA
  - Or digits and dashes, one or more times
  - The whole group () is optional due to the trailing ?, which means it might not be there
- (.*) matches any characters to the end
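As a quick illustration, the pattern above could be applied to the sample strings from the question roughly like this. This is only a sketch: the helper name split_line and the samples list are made up for the example, and a missing first group is turned into an empty string to match the desired output format.

```python
import re

# Pattern from the answer, compiled with re.I so "none", "NoNe", "nan", "NA" etc. match.
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)


def split_line(line):
    """Hypothetical helper: return (group1, group2), with '' when group 1 is absent."""
    m = PATTERN.match(line)  # always matches, because both groups may be empty
    return (m.group(1) or "", m.group(2))


# Sample strings taken from the question
samples = [
    "2020-10-2125Chavez and Sons",
    "NaNRobinson, Mcmahon and Atkins",
    "2020-04-02Nothing and Sons",
    "Carpenter and Sons",
    "NoneEconomy and Sons",
    "2020-04-02",
]

for s in samples:
    g1, g2 = split_line(s)
    print(f"[{g1}][{g2}]")
```

Because the first group is optional, a line with no recognizable prefix simply ends up entirely in the second group.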
However, there can still be edge cases. Consider the following:
National Geographic
---Test
would be parsed as
[Na][tional Geographic]
[---][Test]
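One possible way to tighten the regex for the first of these cases is sketched below. It is not part of the original answer: it assumes Python 3.6+ (for the scoped inline flag (?i:...)) and assumes that a real none/nan/na prefix is always followed by an upper-case letter or the end of the string. The names TIGHTER and split_tighter are hypothetical.

```python
import re

# (?i:...) makes only the keyword case-insensitive; the lookahead stays
# case-sensitive, so it can require an upper-case letter (or end of string)
# right after the keyword.
TIGHTER = re.compile(r"((?i:none|nan?)(?=[A-Z]|$)|[-\d]+)?(.*)")


def split_tighter(line):
    """Hypothetical helper: return (group1, group2), with '' when group 1 is absent."""
    m = TIGHTER.match(line)
    return (m.group(1) or "", m.group(2))


for s in ["National Geographic", "NoNeEconomy and Sons", "NAEconomy and Sons", "---Test"]:
    g1, g2 = split_tighter(s)
    print(f"[{g1}][{g2}]")
```

With this variant "National Geographic" is no longer split after "Na", but "---Test" is still split into "---" and "Test", and a name such as "NANCY" would still be cut after "NAN", so edge cases remain.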
An alternative:
From here we could keep making the regex more complex, but I think it would be a lot simpler for you to implement custom parsing without regex. Loop over the characters in each line and:
- If the line starts with a digit, parse all digits and dashes into group 1 and the rest into group 2 (i.e. when you hit any other character, change group)
- Take the first 4 chars of the string and, if they are "none" (compared case-insensitively), split them out. At the same time, ensure that the 5th character is upper case: line[:4].lower() == "none" and line[4].isupper()
- Similar to the above step, but for NaN and NA:
  - line[:3].lower() == "nan" and line[3].isupper()
  - line[:2].lower() == "na" and line[2].isupper()
The above should produce more accurate results and should be a lot easier to read.
Example code:
with open("/tmp/data") as f:
    lines = f.readlines()

results = []
for line in lines:
    # Remove surrounding whitespace, including the trailing \n
    line = line.strip()
    if not line:
        # Skip blank lines so that line[0] below cannot raise IndexError
        continue
    # Slices such as line[4:5] return "" on short lines instead of raising IndexError
    if line[0].isdigit() or line[0] == "-":
        # Consume the leading run of digits and dashes into group 1
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    elif line[:4].lower() == "none" and line[4:5].isupper():
        results.append((line[:4], line[4:]))
    elif line[:3].lower() == "nan" and line[3:4].isupper():
        results.append((line[:3], line[3:]))
    elif line[:2].lower() == "na" and line[2:3].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group1 is missing! Everything is group2
        results.append((None, line))

for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")
Data:
$ cat /tmp/data
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic
Output:
$ python ~/tmp/so.py
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]