
How to have categorical regex groups with Python

I have a text that follows a pattern and must be split into categories. I thought of using groups to capture the parts of the text that match a particular category pattern, and then mapping each part to its category. Unfortunately, as far as I know, groups in a Python regex cannot share the same name, and I cannot think of an efficient way to distribute groups into categories. Nor can I reuse the same group, as the pattern for each group is unique.

example/desired output:

pattern = r"(?P<catA>[A-Z]{4})(?P<catB>\d\d.\d\d-)(?P<catA>[0-9]{4})(?P<catA>[A-Z]{3})(?P<catB>.*)"

text = " someNooise AABT12.20-1215BTTFFFF SomemoreNoize"

result = some_function(pattern, text)

print(result) # => [("AABT", 'catA'), ("12.20-", 'catB'), ("1215", 'catA'), ("BTT", 'catA'), ("FFFF", 'catB')]

# functionally similar outputs such as dictionaries, or other, are acceptable

The best I have come up with so far is to use unique group names and map them through a dictionary of category keys to lists of names, but this seems very inefficient and involves hardcoding names every time my pattern or the number of categories changes.

Another idea was to use the patterns themselves to generate names, but this becomes way too complex as the patterns evolve, and there is some risk of overlap.


Answer

You can install the PyPI regex library (pip install regex), use your current pattern without any modifications, and, after getting a match with regex.search, call match.capturesdict():

import regex

pattern = r"(?P<catA>[A-Z]{4})(?P<catB>\d\d.\d\d-)(?P<catA>[0-9]{4})(?P<catA>[A-Z]{3})(?P<catB>.*)"
text = " someNooise AABT12.20-1215BTTFFFF SomemoreNoize"
result = regex.search(pattern, text)
# capturesdict() maps each group name to the list of everything it captured
print(result.capturesdict())
# => {'catA': ['AABT', '1215', 'BTT'], 'catB': ['12.20-', 'FFFF SomemoreNoize']}
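
If the order of the pieces matters, as in the list of tuples the question asks for, the captures can be merged back into text order using the start offsets that the regex module also records for each capture. A minimal sketch, assuming the regex module is installed; categorized_captures is a hypothetical helper name, not part of the library:

```python
import regex

def categorized_captures(pattern, text):
    """Return [(captured_text, group_name), ...] ordered by position in text."""
    match = regex.search(pattern, text)
    if match is None:
        return []
    found = []
    for name, values in match.capturesdict().items():
        # starts(name) gives the start offset of every capture of that group,
        # in the same order as the captured strings.
        for start, value in zip(match.starts(name), values):
            found.append((start, value, name))
    # Sort on the start offset only, so pieces come out in text order.
    found.sort(key=lambda item: item[0])
    return [(value, name) for _, value, name in found]

pattern = r"(?P<catA>[A-Z]{4})(?P<catB>\d\d.\d\d-)(?P<catA>[0-9]{4})(?P<catA>[A-Z]{3})(?P<catB>.*)"
text = " someNooise AABT12.20-1215BTTFFFF SomemoreNoize"
print(categorized_captures(pattern, text))
# => [('AABT', 'catA'), ('12.20-', 'catB'), ('1215', 'catA'), ('BTT', 'catA'), ('FFFF SomemoreNoize', 'catB')]
```

Note that the trailing .* consumes the rest of the line, so the last piece is 'FFFF SomemoreNoize' rather than 'FFFF'; a tighter final sub-pattern would be needed to stop at the space.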


The PyPI regex module supports patterns with identically named capturing groups.
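
If adding a third-party dependency is not an option, a similar result can probably be had with the standard re module by generating the unique group names automatically instead of hardcoding them. The sketch below assumes the pattern is supplied as a list of (category, sub-pattern) pairs; the PARTS list, the generated names g0, g1, ..., and the categorize function are all illustrative choices, not something from the original answer:

```python
import re

# Each piece of the overall pattern, tagged with the category it belongs to.
PARTS = [
    ("catA", r"[A-Z]{4}"),
    ("catB", r"\d\d.\d\d-"),
    ("catA", r"[0-9]{4}"),
    ("catA", r"[A-Z]{3}"),
    ("catB", r".*"),
]

def categorize(parts, text):
    """Return [(matched_text, category), ...] in pattern order, or [] if no match."""
    # Give every piece a generated, unique group name (g0, g1, ...),
    # since re rejects patterns that reuse a group name.
    pattern = "".join(f"(?P<g{i}>{sub})" for i, (_, sub) in enumerate(parts))
    match = re.search(pattern, text)
    if match is None:
        return []
    # Map each generated group back to the category of its piece.
    return [(match.group(f"g{i}"), cat) for i, (cat, _) in enumerate(parts)]

text = " someNooise AABT12.20-1215BTTFFFF SomemoreNoize"
print(categorize(PARTS, text))
# => [('AABT', 'catA'), ('12.20-', 'catB'), ('1215', 'catA'), ('BTT', 'catA'), ('FFFF SomemoreNoize', 'catB')]
```

Because the names are derived from each piece's position in the list, changing the pattern or adding a category requires no renaming. The one caveat is that a sub-pattern must not itself define a named group that clashes with the generated g0, g1, ... names.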
