I want to capture the value of an HTML attribute with a Python regular expression. Currently I use:
re.compile(r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL)
My problem is that I want the regular expression to “remember” whether the attribute started with a single or a double quote.
I found the bug in my current approach with the following attribute:
href="javascript:foo('bar')"
My regex captures only:
"javascript:foo('
Answer
You can capture the first quote and then use a backreference:
r'=((["\']).*?\2)'
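As a minimal sketch of the backreference in use, with the backslashes written out: group 2 captures the opening quote and `\2` requires the same character to close the value. The names `ATTR_VALUE` and `first_quoted_value` are made up for this illustration.

```python
import re

# Group 2 captures the opening quote; \2 only matches that same character,
# so a single quote inside a double-quoted value no longer ends the match.
ATTR_VALUE = re.compile(r'=((["\']).*?\2)', re.DOTALL)


def first_quoted_value(text):
    """Return the first quoted attribute value (quotes included), or None."""
    match = ATTR_VALUE.search(text)
    return match.group(1) if match else None


print(first_quoted_value('''href="javascript:foo('bar')"'''))
# prints "javascript:foo('bar')"
```

Group 1 still includes the surrounding quotes, as in the original pattern; moving the outer parentheses inside the quotes would capture the bare value instead.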
However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.
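For comparison, here is a rough sketch of the parser route using the standard library's `html.parser`. Note that this is an event-based parser rather than a DOM, chosen here only because it needs no installation; `AttrCollector` and `collect_attrs` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects (name, value) pairs from every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples with quotes removed
        self.attrs.extend(attrs)


def collect_attrs(html):
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.attrs


print(collect_attrs('''<a href="javascript:foo('bar')">link</a>'''))
# prints [('href', "javascript:foo('bar')")]
```

The parser handles the quoting itself, so the value comes back without its delimiters regardless of which quote style the markup used.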