Regular expression: match the opening quote with the closing one

I want to capture the value of an HTML attribute with a Python regular expression. Currently I use:

re.compile(r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL)

My problem is that I want the regular expression to “remember” whether the attribute started with a single or a double quote.

I found the bug in my current approach with the following attribute:

href="javascript:foo('bar')"

My regex captures only:

"javascript:foo('

Answer

You can capture the first quote and then use a backreference:

r'=((["\']).*?\2)'
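To see the backreference at work, here is a minimal sketch run against the attribute from the question. The names ATTR_VALUE and quoted_values are made up for illustration. Group 2 holds the opening quote and \2 requires the same character to close the value, so the single quotes inside the double-quoted value no longer end the match.

```python
import re

# Group 2 captures the opening quote character; \2 then demands the very
# same character as the closing quote. Group 1 is the whole quoted value.
# ATTR_VALUE and quoted_values are hypothetical names for this sketch.
ATTR_VALUE = re.compile(r'=((["\']).*?\2)', re.DOTALL)


def quoted_values(html):
    """Return every quoted attribute value found in html, quotes included."""
    return [m.group(1) for m in ATTR_VALUE.finditer(html)]


sample = '''<a href="javascript:foo('bar')" class='x'>'''
for value in quoted_values(sample):
    print(value)
```

Because the group is nested, the value you want is in group 1 and the quote character itself is in group 2.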

However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.
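As a rough illustration of the parser route, the sketch below uses html.parser from the Python standard library. It is an event-based parser rather than a true DOM parser, and the class name AttrCollector is invented for this example, but it shows the main benefit: the parser does the quote handling and hands back attribute values with the quotes already stripped.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute, value) triples from every start tag.

    AttrCollector is a hypothetical helper written for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the surrounding quotes
        # have already been removed by the parser.
        for name, value in attrs:
            self.found.append((tag, name, value))


collector = AttrCollector()
collector.feed('''<a href="javascript:foo('bar')">link</a>''')
print(collector.found)
```

A full DOM library would give you a tree to query instead of callbacks, but the attribute handling is the same idea.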