Regex OR does not work – I do not know what is wrong in my pattern

Question

I have the following strings:

I want to have them separated:

I want all numbers, exact matches for (na, nan, none) in upper and lower case, and "" in the first group, like:

This would be wrong:

I want:

How do I write a regex which checks for exact matches like 'none' without being case sensitive (it should also recognize 'None', 'nOne', etc.)?

https://regex101.com/r/HvnZ47/3

Accepted Answer

What about the following, with re.I:

    (None|NaN?|[-\d]+)?(.*)

https://regex101.com/r/d4XPPb/3

Explanation:

(None|NaN?|[-\d]+)?
- Either None
- Or NaN, in which the last N is optional (due to ?), so it also matches NA
- Or digits and dashes, one or more times
- The whole group () is optional due to the trailing ?, which means it might not be there

(.*)
- Any character to the end

However, there can still be edge cases. Consider the following:

    National Geographic
    ---Test

would be parsed as

    [Na][tional Geographic]
    [---][Test]

An alternative:

From here we can keep on making the regex more complex; however, I think it would be a lot simpler for you to implement custom parsing without regex. Loop over the characters in each line and:

- If it starts with a digit (or a dash, as in the code below), parse all digits and dashes into group 1 and the rest into group 2 (i.e. when you hit any other character, change group).
- Take the first 4 chars of the string and, if they are "none" (compared case-insensitively), split them out. At the same time ensure that the 5th character is upper case: line[:4].lower() == "none" and line[4].isupper()
- Similar to the above step, but for NA and NaN:
  line[:3].lower() == "nan" and line[3].isupper()
  line[:2].lower() == "na" and line[2].isupper()

The above should produce more accurate results and should be a lot easier to read.

Example code:

    with open("/tmp/data") as f:
        lines = f.readlines()

    results = []
    for line in lines:
        # Remove surrounding spaces and the trailing \n
        line = line.strip()
        if not line:
            # Skip blank lines (line[0] would raise IndexError)
            continue
        if line[0].isdigit() or line[0] == "-":
            # Consume the leading run of digits and dashes
            i = 0
            while i < len(line) and (line[i].isdigit() or line[i] == "-"):
                i += 1
            results.append((line[:i], line[i:]))
        # Slices such as line[4:5] act like line[4] but do not raise on short lines
        elif line[:4].lower() == "none" and line[4:5].isupper():
            results.append((line[:4], line[4:]))
        elif line[:3].lower() == "nan" and line[3:4].isupper():
            results.append((line[:3], line[3:]))
        elif line[:2].lower() == "na" and line[2:3].isupper():
            results.append((line[:2], line[2:]))
        else:
            # Assume group1 is missing! Everything is group2
            results.append((None, line))

    for g1, g2 in results:
        print(f"[{g1 or ''}][{g2}]")

Data:

    $ cat /tmp/data
    2020-10-2125Chavez and Sons
    2020-05-02Bean Inc
    NaNRobinson, Mcmahon and Atkins
    2020-04-25Hill-Fisher
    2020-04-02Nothing and Sons
    52457Carpenter and Sons
    0Carpenter and Sons
    Carpenter and Sons
    NoneEconomy and Sons
    NoNeEconomy and Sons
    2020-04-02
    NAEconomy and Sons
    ---Test
    National Geographic

Output:

    $ python ~/tmp/so.py
    [2020-10-2125][Chavez and Sons]
    [2020-05-02][Bean Inc]
    [NaN][Robinson, Mcmahon and Atkins]
    [2020-04-25][Hill-Fisher]
    [2020-04-02][Nothing and Sons]
    [52457][Carpenter and Sons]
    [0][Carpenter and Sons]
    [][Carpenter and Sons]
    [None][Economy and Sons]
    [NoNe][Economy and Sons]
    [2020-04-02][]
    [NA][Economy and Sons]
    [---][Test]
    [][National Geographic]
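
For comparison, the regex from the start of this answer can also be tried directly in Python. This is only a minimal sketch: it assumes the character class in the pattern is meant to be `[-\d]+` (digits and dashes), and the helper name `split_line` is made up for illustration. It reproduces the "National Geographic" edge case described above, which is the reason the custom parsing may be preferable.

```python
import re

# Pattern from the answer, assuming the class is [-\d]+ (digits and dashes).
# re.I makes None / NaN / NA match in any mix of upper and lower case.
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)


def split_line(line):
    """Hypothetical helper: return (group1, group2) for one line.

    group1 is '' when the optional prefix group did not match.
    """
    # Both groups can match empty text, so match() always succeeds.
    m = PATTERN.match(line)
    return (m.group(1) or "", m.group(2))


samples = [
    "2020-05-02Bean Inc",
    "NoNeEconomy and Sons",
    "Carpenter and Sons",
    "National Geographic",  # edge case: "Na" is taken as group 1
]
for sample in samples:
    g1, g2 = split_line(sample)
    print(f"[{g1}][{g2}]")
```

The last sample is split as `[Na][tional Geographic]`, because the regex alone has no way to require that the character after the prefix starts a new word.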

Answer