I’m trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Answer
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
<div>
<p>text1</p>
<p></p>
<p>text2</p>
</div>
"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True (spelled string=True in Beautiful Soup 4.4.0 and later) filters out empty paragraphs.
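As for why the original call returns None: re.match only attempts a match at the very beginning of the string, so it fails whenever the HTML starts with anything other than <p>. The sketch below illustrates this using only the standard library. The sample html string is a made-up stand-in, since the asker's real input is not shown, and the findall pattern is one possible workaround rather than a robust solution.

```python
import re

# Hypothetical sample input; the asker's actual HTML is not shown in the question.
html = "<div>\n<p>text1</p>\n<p></p>\n<p>text2</p>\n</div>"

# re.match anchors at position 0, and this string starts with "<div>", not "<p>".
first = re.match(r'<p>.{1,}</p>', html)
print(first)  # None

# re.findall scans the whole string. The non-greedy .*? stops at the nearest </p>,
# re.DOTALL lets "." cross newlines, and the filter drops empty paragraphs.
paragraphs = [p for p in re.findall(r'<p>(.*?)</p>', html, flags=re.DOTALL) if p]
print(paragraphs)  # ['text1', 'text2']
```

Even with these changes the pattern breaks on tags with attributes such as <p class="x"> or on nested markup, which is why a parser remains the safer choice.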