I’m trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Answer
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
...     <p>text1</p>
...     <p></p>
...     <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.
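
As for why the original call returned None: re.match only attempts a match at the very start of the string, so it fails whenever the HTML begins with anything other than <p>. The sketch below illustrates this with a made-up sample string (an assumption, not the asker's actual HTML) and uses re.findall to scan the whole string instead. The [^<]+ pattern is a simplification that breaks on nested tags or attributes, which is one reason a parser is the safer choice.

```python
import re

# Hypothetical sample input; the asker's real HTML is not shown in the question.
html = "<div>\n<p>text1</p>\n<p></p>\n<p>text2</p>\n</div>"

# re.match only tries position 0, and this string starts with "<div>", not "<p>".
at_start = re.match(r'<p>.{1,}</p>', html)

# re.findall scans the entire string. [^<]+ requires at least one non-"<"
# character, so the empty <p></p> is skipped.
found = re.findall(r'<p>([^<]+)</p>', html)

print(at_start)
print(found)
```

Here at_start is None while found holds the text of the two non-empty paragraphs.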