Skip to content
Advertisement

regular expression match starting clause with end

I want to be able to capture the value of an HTML attribute with a python regexp. currently I use

re.compile( r'=(["'].*?["'])', re.IGNORECASE | re.DOTALL )

My problem is that I want the regular expression to “remember” whether the attribute started with a single or a double quote.

I found the bug in my current approach with the following attribute

href="javascript:foo('bar')"

my regex catches

"javascript:foo('

Advertisement

Answer

You can capture the first quote and then use a backreference:

r'=((["']).*?2)'

However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser instead.

Advertisement