I have the following strings:
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02
I want to split each of them into two groups:
myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)' or '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'
I want all numbers, exact matches of na, nan and none (in any mix of upper and lower case), and the empty string "" to go into the first group, like:
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]
This would be wrong:
[2020-04-02No][thing and Sons]
I want
[2020-04-02][Nothing and Sons]
How do I write a regex that checks for exact matches like 'none' case-insensitively (so it also recognizes 'None', 'nOne', etc.)?
https://regex101.com/r/HvnZ47/3
Answer
What about the following with re.I:
(None|NaN?|[-\d]+)?(.*)
https://regex101.com/r/d4XPPb/3
Explanation:
(None|NaN?|[-\d]+)?
- Either None
- Or NaN, where the last N is optional (due to the ?), so it also matches NA
- Or digits and dashes, one or more times
- The whole group () is optional due to the trailing ?, which means it might not be there
(.*)
- Any character to the end
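To make the pattern above concrete, here is a minimal sketch of applying it from Python. The helper name split_with_regex is only an illustration, and the empty string is used for a missing first group to match the [][...] format in the question.

```python
import re

# Pattern from the explanation above; re.I makes "None", "nOne", "NAN", "na" all match.
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)

def split_with_regex(line):
    """Return (group1, group2); group1 is "" when the optional prefix is absent."""
    # The match always succeeds because both groups may match nothing.
    m = PATTERN.match(line)
    return (m.group(1) or "", m.group(2))

for s in ["2020-04-02Nothing and Sons", "nOneEconomy and Sons", "Carpenter and Sons"]:
    g1, g2 = split_with_regex(s)
    print(f"[{g1}][{g2}]")
```

Because the first group is optional and tried before (.*), a line with no recognizable prefix simply falls through to the second group.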
However, there can still be edge cases. Consider the following:
National Geographic
---Test
would be parsed as
[Na][tional Geographic]
[---][Test]
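If you want to stay with a regex, one possible way to tighten it is sketched below. This is not the pattern from the answer above but an assumption-laden variant: it assumes the second group always starts with an ASCII upper-case letter (or is empty) after a keyword, and that a numeric prefix always starts with a digit. The names STRICT and split_strict are hypothetical.

```python
import re

# Hypothetical tightened pattern (an illustration, not the original answer's regex):
#  - (?i:none|nan?) matches the keywords case-insensitively (scoped inline flag),
#  - (?=[A-Z]|$) requires an upper-case ASCII letter or end of string right after,
#  - \d[-\d]* requires the numeric prefix to start with a digit.
STRICT = re.compile(r"((?i:none|nan?)(?=[A-Z]|$)|\d[-\d]*)?(.*)")

def split_strict(line):
    """Return (group1, group2); group1 is "" when no prefix is recognized."""
    m = STRICT.match(line)
    return (m.group(1) or "", m.group(2))

for s in ["National Geographic", "---Test", "NAEconomy and Sons"]:
    g1, g2 = split_strict(s)
    print(f"[{g1}][{g2}]")
```

With these assumptions, National Geographic and ---Test both end up entirely in the second group, while NAEconomy and Sons is still split after NA.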
An alternative:
From here we could keep making the regex more complex, but I think it would be a lot simpler for you to implement custom parsing without regex. Loop over the characters in each line and:
- If the line starts with a digit (or a dash), parse all digits and dashes into group 1 and the rest into group 2 (i.e. when you hit any other character, switch groups).
- Take the first 4 characters of the string and, if they are "none" (case-insensitive), split them out. At the same time, ensure that the 5th character is upper case:
line[:4].lower() == "none" and line[4].isupper()
- Similar to the above step, but for NaN and NA:
line[:3].lower() == "nan" and line[3].isupper()
line[:2].lower() == "na" and line[2].isupper()
The above should produce more accurate results and should be a lot easier to read.
Example code:
with open("/tmp/data") as f:
    lines = f.readlines()

results = []
for line in lines:
    # Remove surrounding whitespace and the trailing \n
    line = line.strip()
    # line[:1] is "" for an empty line, so this never raises IndexError
    if line[:1].isdigit() or line[:1] == "-":
        # Group 1 is the leading run of digits and dashes
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    # The len() checks guard the index lookups on short lines
    elif len(line) > 4 and line[:4].lower() == "none" and line[4].isupper():
        results.append((line[:4], line[4:]))
    elif len(line) > 3 and line[:3].lower() == "nan" and line[3].isupper():
        results.append((line[:3], line[3:]))
    elif len(line) > 2 and line[:2].lower() == "na" and line[2].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group 1 is missing; everything is group 2
        results.append((None, line))

for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")
Data:
$ cat /tmp/data
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic
Output:
$ python ~/tmp/so.py
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]