I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:
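    import urllib.request
    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_data(self, data):
            print("Encountered some data:", data)

    url = "website"
    page = urllib.request.urlopen(url).read()
    # The strict argument was removed in Python 3.5, so it is omitted here.
    parser = MyHTMLParser()
    # read() returns bytes; decode them rather than calling str() on them.
    # UTF-8 is an assumption about the page's encoding.
    parser.feed(page.decode("utf-8"))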
If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?
Answer
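    import urllib.request

    # Assumes MyHTMLParser is the class defined in the question.
    html_code = urllib.request.urlopen("xxx")
    data = ""
    for raw_line in html_code.readlines():
        # Lines arrive as bytes in Python 3; UTF-8 is an assumption here.
        line = raw_line.decode("utf-8").strip()
        if line.startswith("<h2"):
            data += line

    hp = MyHTMLParser()
    hp.feed(data)
    hp.close()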
This way you can extract the data from the h2 tags, as long as each h2 element starts on its own line and fits on that line. Hope it helps.
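Another option is to let the parser itself keep track of which tag it is inside, instead of filtering lines before feeding them. The following is only a minimal sketch of that idea: the class name TagTextParser, its target_tag argument, and the sample_html string are made up for illustration, and it uses only handle_starttag(), handle_endtag(), and handle_data() from html.parser.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0      # greater than 0 while inside the target tag
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The element is closed: store its text and start fresh.
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while inside the target tag (nested tags included).
        if self.depth > 0:
            self.chunks.append(data)


# Illustrative input; in practice this would be the decoded page source.
sample_html = (
    "<html><body><h2>First <b>title</b></h2>"
    "<p>skip me</p><h2>Second</h2></body></html>"
)

parser = TagTextParser("h2")
parser.feed(sample_html)
parser.close()
print(parser.results)  # ['First title', 'Second']
```

The text ends up in parser.results rather than being printed from inside handle_data(), which answers the "how do I get the data" part: the handler methods store what they see on the parser object, and you read it back after feed() and close().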