I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:
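import urllib.request
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text found between tags
        print("Encountered some data:", data)


url = "website"
# read() returns bytes; decoding as UTF-8 is an assumption about the page
page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
# The strict argument was removed in Python 3.5, so it is omitted here
parser = MyHTMLParser()
parser.feed(page)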
If I understand correctly, I can use the handle_data() method to get the data between tags. How do I specify which tags to get the data from, and how do I actually get at that data?
Answer
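import urllib.request

# urllib2 is Python 2 only; urllib.request is the Python 3 equivalent
html_code = urllib.request.urlopen("xxx")
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # readlines() yields bytes in Python 3; decoding as UTF-8 is an assumption
    line = line.decode("utf-8", errors="replace").strip()
    # Keep only the lines that begin with an <h2> tag
    if line.startswith("<h2"):
        data += line
# MyHTMLParser is the class defined in the question
hp = MyHTMLParser()
hp.feed(data)
hp.close()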
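This way the parser only ever sees the lines that start with <h2, so handle_data() prints just the text inside the h2 tags. Note that it only works when each h2 element sits on a single line that begins with the tag. Hope it helps.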
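If the page does not keep each heading on its own line, another option is to let the parser itself track which tag it is in, using handle_starttag() and handle_endtag() alongside handle_data(). Below is a minimal sketch of that idea, not a definitive implementation: the class name TagTextParser, its attributes, and the sample HTML string are all made up for illustration.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, target_tag):
        super().__init__()
        # HTMLParser reports tag names in lowercase
        self.target_tag = target_tag.lower()
        self.depth = 0        # number of target tags currently open
        self.results = []     # one string per outermost target element
        self._chunks = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost target element just closed: store its text
                self.results.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        # Only keep text while at least one target tag is open
        if self.depth > 0:
            self._chunks.append(data)


# Made-up sample input; in practice this would be the decoded page
sample_html = "<h1>Title</h1><h2>First <b>heading</b></h2><p>body</p><h2>Second</h2>"
parser = TagTextParser("h2")
parser.feed(sample_html)
parser.close()
print(parser.results)  # ['First heading', 'Second']
```

Storing the text in a list on the parser, rather than printing it in handle_data(), is what makes the data available to the rest of the program after feed() returns.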