I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:
    import urllib.request
    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_data(self, data):
            print("Encountered some data:", data)


    url = "website"
    page = urllib.request.urlopen(url).read()

    parser = MyHTMLParser(strict=False)
    parser.feed(str(page))
If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?
Answer
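    import urllib.request

    # MyHTMLParser is the class defined in the question.
    html_code = urllib.request.urlopen("xxx")
    html_code_list = html_code.readlines()
    data = ""
    for line in html_code_list:
        # readlines() returns bytes in Python 3; UTF-8 is assumed here
        line = line.decode("utf-8", errors="replace").strip()

        # keep only the lines that start with an <h2> tag
        if line.startswith("<h2"):
            data = data + line

    hp = MyHTMLParser()
    hp.feed(data)
    hp.close()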
This extracts the data from the h2 tags, provided each h2 element sits on a single line that begins with the tag. Hope it helps.
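If the h2 elements span several lines or do not start a line, one possible alternative is to let the parser itself track which tag it is inside. The sketch below overrides handle_starttag() and handle_endtag() alongside handle_data() and only keeps data seen while a chosen tag is open. The names TagTextParser, extract_tag_text and sample are made up for illustration and are not part of the original code.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside one kind of tag (hypothetical helper)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.results = []   # one string per matched element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == self.target_tag:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while the target tag is open.
        if self.depth > 0:
            self.results[-1] += data


def extract_tag_text(html, tag):
    """Return the stripped text of every `tag` element in `html`."""
    parser = TagTextParser(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Illustrative input; a real page would come from urllib.request.urlopen().
sample = "<html><body><h2>First <b>title</b></h2><p>skip me</p><h2>Second</h2></body></html>"
print(extract_tag_text(sample, "h2"))  # ['First title', 'Second']
```

When feeding a downloaded page, decode the bytes first, for example page.decode("utf-8") if the page is UTF-8, because str(page) produces the b'...' representation rather than the HTML text. Also note that the strict argument was removed from HTMLParser in Python 3.5, so MyHTMLParser(strict=False) raises a TypeError on current versions.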