
Python regex to extract html paragraph

I’m trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?


Answer

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

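Note that text=True filters out empty paragraphs: it only matches tags whose .string is not None. In Beautiful Soup 4.4.0 and later the same argument is called string, and text is the older name for it.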
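As for why the original call returns None: re.match only attempts a match at the very beginning of the string, so it fails unless the HTML starts with <p>. The sketch below illustrates this with a sample html string (an assumption, modelled on the data above) and shows re.findall with a non-greedy group as a regex-only alternative. It is a rough workaround for simple, single-line paragraphs without attributes, not a substitute for a real parser.

```python
import re

# Sample input (assumed for illustration); note it starts with a newline, not "<p>"
html = """
<div>
    <p>text1</p>
    <p></p>
    <p>text2</p>
</div>
"""

# re.match is anchored at position 0, so nothing matches here
match_result = re.match(r'<p>.{1,}</p>', html)
print(match_result)  # None

# re.findall scans the whole string; the non-greedy .+? stops at the first </p>,
# and "." does not match newlines by default, so each paragraph stays separate
paragraphs = re.findall(r'<p>(.+?)</p>', html)
print(paragraphs)  # ['text1', 'text2']
```

Because .+? requires at least one character, the empty <p></p> is skipped here as well. Paragraphs with attributes, nested tags, or text spanning several lines would need a more elaborate pattern, which is where the parser approach is more robust.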

User contributions licensed under: CC BY-SA