
Tag: html

Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this: Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags? Answer: Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn’t find …
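A minimal sketch of the capture-group approach the answer describes. The function name `extract_title` and the exact pattern are assumptions made for illustration, not the original poster's code; a regex like this is fine for simple pages but is not a general HTML parser.

```python
import re

def extract_title(html):
    """Return the text inside the first <title> element, or None if absent."""
    # The parentheses form capture group 1: only the text between the tags.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        # re.search returns None when the pattern is not found.
        return None
    return match.group(1).strip()

print(extract_title("<html><head><title>Example page</title></head></html>"))
```

Checking for `None` before calling `group(1)` avoids an `AttributeError` on pages that have no title element.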

Validate (X)HTML in Python

What’s the best way to go about validating that a document follows some version of HTML (preferably one that I can specify)? I’d like to be able to know where the failures occur, as in a web-based validator, except in a native Python app. Answer: XHTML is easy, use lxml. HTML is harder, since there’s traditionally not been as much interest
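As a rough illustration of reporting where a failure occurs, here is a sketch that uses the standard library's `xml.etree.ElementTree` rather than lxml. This is a swapped-in technique: it only checks that XHTML is well-formed XML and does not validate against a DTD or any specific HTML version. The helper name `find_wellformedness_error` is hypothetical.

```python
import xml.etree.ElementTree as ET

def find_wellformedness_error(xhtml_text):
    """Return (line, column, message) for the first XML parse error, or None."""
    try:
        ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple for the failure point.
        line, column = err.position
        return (line, column, str(err))
    return None

# A <p> that is never closed is reported with its location.
print(find_wellformedness_error("<html><body><p>oops</body></html>"))
```

Named HTML entities such as `&nbsp;` are not known to this parser without a DTD, so real-world XHTML may be flagged even when a full validator would accept it.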
