I want to be able to capture the value of an HTML attribute with a python regexp. currently I use
re.compile( r'=(["'].*?["'])', re.IGNORECASE | re.DOTALL )
My problem is that I want the regular expression to “remember” whether the attribute started with a single or a double quote.
I found the bug in my current approach with the following attribute
href="javascript:foo('bar')"
my regex catches
"javascript:foo('
Advertisement
Answer
You can capture the first quote and then use a backreference:
r'=((["']).*?2)'
However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.