Python regex to extract HTML paragraphs

Question

I'm trying to extract paragraphs from HTML with a regex, but it returns no matches even though I know the HTML contains paragraphs. Why?

Accepted Answer

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
...     <p>text1</p>
...     <p></p>
...     <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that text=True helps to filter out empty paragraphs.
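If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration under stated assumptions, not a drop-in equivalent of the answer above: ParagraphExtractor and extract_paragraphs are hypothetical names, the sample markup is made up, and unlike find_all("p", text=True) this version also keeps paragraphs whose text is split across nested tags.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._buffer = None   # text chunks of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:          # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def close(self):
        super().close()
        self._flush()         # handle a trailing <p> that was never closed


def extract_paragraphs(html):
    """Return the stripped text of each non-empty paragraph in html."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample markup, including one empty paragraph.
sample = "<div><p>text1</p><p></p><p>text2</p></div>"
print(extract_paragraphs(sample))
```

On recent BeautifulSoup releases the same filter is spelled string=True; text=True is the older name for that argument.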

Answer