Ruby’s regular expressions have a feature called atomic grouping, (?>regexp), described here. Is there any equivalent in Python’s re module?
Answer
Python’s re module did not directly support this feature before Python 3.11 (which added (?>...) natively), but you can emulate it by using a zero-width lookahead assertion ((?=RE)), which matches from the current point and is never re-entered on backtracking, putting a named group ((?P<name>RE)) inside the lookahead, and then using a named backreference ((?P=name)) to match exactly whatever the zero-width assertion matched. Combined, this gives you the same semantics, at the cost of creating an additional matching group and a lot of syntax.
For example, the link you provided gives the Ruby example of
/"(?>.*)"/.match('"Quote"') #=> nil
We can emulate that in Python as such:
re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None
We can show that I’m doing something useful and not just spewing line noise, because if we change it so that the inner group doesn’t eat the final ", it still matches:
re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict() # => {'tmp': 'Quote'}
You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:
re.search(r'"(?=(.*))\1"', '"Quote"') # => None
(Full disclosure: I learned this trick from Perl’s perlre documentation, which mentions it under the documentation for (?>...).)
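The construction is mechanical enough to wrap in a small helper. The sketch below is only an illustration: the function name `atomic` and the default group name `_atom` are made up for this example and are not part of `re`. On Python 3.11 and later, where `re` accepts `(?>...)` directly, the last few lines check that the emulation agrees with the built-in syntax.

```python
import re
import sys

def atomic(pattern, name="_atom"):
    """Return a regex fragment that emulates (?>pattern).

    Hypothetical helper: capture inside a lookahead, then consume the
    captured text with a named backreference. `name` must be unique
    within the final pattern.
    """
    return "(?=(?P<%s>%s))(?P=%s)" % (name, pattern, name)

greedy = re.compile('"' + atomic(".*") + '"')
letters = re.compile('"' + atomic("[A-Za-z]*") + '"')

# .* swallows the closing quote and cannot give it back, so nothing matches.
no_match = greedy.search('"Quote"')
# [A-Za-z]* stops before the closing quote, so the whole string matches.
match = letters.search('"Quote"')

print(no_match)
print(match.group(0))

# Native atomic groups exist in re only from Python 3.11 onward.
if sys.version_info >= (3, 11):
    assert re.search(r'"(?>.*)"', '"Quote"') is None
    assert re.search(r'"(?>[A-Za-z]*)"', '"Quote"').group(0) == '"Quote"'
```

Note that the helper still adds one capturing group per use, so numeric backreferences elsewhere in the pattern shift accordingly.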
In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:
[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit

# Naive pattern: nested quantifiers backtrack exponentially on failure.
re_1 = re.compile(r'''\(
                        (
                          [^()]+            # x+
                        |
                          \( [^()]* \)
                        )+
                      \)
                   ''', re.X)

# Same pattern with x+ wrapped in the lookahead/backreference emulation.
re_2 = re.compile(r'''\(
                        (
                          (?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
                        |
                          \( [^()]* \)
                        )+
                      \)''', re.X)

print(timeit.timeit("re_1.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_1",
                    number=10))

print(timeit.timeit("re_2.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_2",
                    number=10))
We see a dramatic improvement:
[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05
Which only gets more dramatic as we extend the length of the search string.
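To see how the gap grows, one can time both patterns over a range of input lengths. This is a rough sketch rather than a careful benchmark: the names `NAIVE`, `ATOMIC` and `time_search` and the chosen lengths are assumptions for illustration, and the naive pattern is deliberately kept to short inputs because its running time grows roughly exponentially with the number of trailing characters.

```python
import re
import timeit

NAIVE = re.compile(r'''\(
                         (
                           [^()]+            # x+
                         |
                           \( [^()]* \)
                         )+
                       \)''', re.X)

ATOMIC = re.compile(r'''\(
                          (
                            (?=(?P<tmp>[^()]+))(?P=tmp)   # emulate (?> x+)
                          |
                            \( [^()]* \)
                          )+
                        \)''', re.X)

def time_search(regex, n, repeat=3):
    """Best-of-`repeat` wall time for one failed search over n trailing 'a's."""
    subject = '((()' + 'a' * n
    return min(timeit.repeat(lambda: regex.search(subject), number=1, repeat=repeat))

# Keep n small for the naive pattern so the script finishes quickly.
naive_times = [(n, time_search(NAIVE, n)) for n in (8, 12, 16)]
# The emulated atomic group can be pushed to much longer inputs.
atomic_times = [(n, time_search(ATOMIC, n)) for n in (16, 1000, 100000)]

for label, rows in (("naive", naive_times), ("atomic", atomic_times)):
    for n, seconds in rows:
        print("%-6s n=%-6d %.6f s" % (label, n, seconds))
```

Absolute numbers will depend on the machine and Python version; the point is only the difference in how the two columns grow.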