I want a regular expression to extract the title from an HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?
Answer
Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn’t find a match, so don’t call group() directly on its result):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
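If it helps, here is a slightly more defensive sketch of the same idea wrapped in a function. The name extract_title and the extra choices are my own assumptions, not part of the answer above: a non-greedy (.*?) so the match stops at the first closing tag, re.DOTALL so a title split across lines still matches, and [^>]* to tolerate attributes on the opening tag. It is still a regex, so it will not cope with every HTML page.

```python
import re

# Hypothetical helper, shown only as an illustration of the capture-group approach.
# (.*?) is non-greedy, so the match ends at the first </title>.
# re.DOTALL lets "." match newlines in case the title spans several lines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text inside the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        # re.search found nothing, so there is no group to read.
        return None
    # Collapse runs of whitespace (including newlines) and trim both ends.
    return ' '.join(match.group(1).split())


if __name__ == '__main__':
    print(extract_title('<html><head><TITLE>My Page</TITLE></head></html>'))
    print(extract_title('<p>no title here</p>'))
```

The caller can then check the return value against None instead of handling a match object.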