I want a regular expression to extract the title from an HTML page. Currently I have this:
    title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
    if title:
        title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?
Answer
Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn’t find a match, so don’t call group() on the result directly):
    title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

    if title_search:
        title = title_search.group(1)
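For reuse, the same idea could be wrapped in a small helper. This is only a sketch: the names extract_title and TITLE_RE are made up for illustration, and the pattern departs from the one above in two ways that are assumptions about what you want. It uses a non-greedy group (.*?) so the match stops at the first closing tag, and it adds re.DOTALL so a title that spans several lines is still found. It also assumes the opening tag has no attributes.

```python
import re

# Hypothetical helper, not part of the original answer.
# (.*?) is non-greedy: the match ends at the first </title>.
# re.DOTALL lets "." match newlines, so multi-line titles are captured.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text between <title> and </title>, or None if not found."""
    match = TITLE_RE.search(html)
    if match is None:
        # re.search found nothing, so there is no group to read.
        return None
    # group(1) is the first parenthesised group, i.e. the tag contents only.
    return match.group(1).strip()


print(extract_title('<html><head><TITLE>My page</TITLE></head></html>'))
print(extract_title('<p>no title here</p>'))
```

The first call prints My page and the second prints None. For anything beyond simple, well-formed pages, a real HTML parser is likely to be more robust than a regular expression.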