Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?

Answer

Put parentheses ( ) around the part of the pattern you want to capture, and call group(1) in Python to retrieve the captured string. re.search returns None when it finds no match, so don’t call group() directly on its result:

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
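
The same idea can be wrapped in a small function. This is only a sketch: the helper name extract_title is hypothetical, and the non-greedy (.*?) and re.DOTALL are additions that the answer above does not use. They are there on the assumption that the page may have a title spanning several lines, or more than one </title> on the same line.

```python
import re

# Hypothetical helper, not part of the original answer.
def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    # (.*?) is non-greedy, so the match stops at the first </title>.
    # re.DOTALL lets '.' match newlines inside a multi-line title.
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    # re.search returns None on failure, so check before calling group(1).
    return match.group(1).strip() if match else None

print(extract_title('<html><TITLE>My Page</TITLE></html>'))  # My Page
print(extract_title('<p>no title here</p>'))                 # None
```

For anything beyond a quick extraction, an HTML parser is likely to be more robust than a regular expression.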

Answer