Python and HTMLParser.handle_data() – How to get data from tags?

I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:

import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)


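url = "website"  # placeholder for the page URL
# read() returns bytes; decode to text (UTF-8 is assumed here) instead of
# calling str(), which would produce the "b'...'" representation
page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# the strict argument was removed in Python 3.5, so it is omitted
parser = MyHTMLParser()
parser.feed(page)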

If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?

Answer

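import urllib.request

# "xxx" is a placeholder for the page URL
html_code = urllib.request.urlopen("xxx")
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # readlines() returns bytes in Python 3; decode (UTF-8 assumed) before comparing
    line = line.decode("utf-8", errors="replace").strip()

    # keep only the lines that begin with an <h2> tag
    if line.startswith("<h2"):
        data = data + line

# MyHTMLParser is the class defined in the question
hp = MyHTMLParser()
hp.feed(data)
hp.close()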

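This feeds only the lines that start with an h2 tag to the parser, so handle_data() receives just the text of those headings. Note that it relies on each h2 tag starting at the beginning of a line. Hope it helps.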
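
Another option, which does not depend on how the HTML is split into lines, is to let the parser itself track which tag it is currently inside. The sketch below is only an illustration: the class name TagTextParser, its target_tag argument, and the sample string are made up for this example. It uses the standard handle_starttag() and handle_endtag() hooks of HTMLParser to switch collection on and off, and stores the text in a list rather than printing it.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside one kind of tag (hypothetical helper)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0      # > 0 while we are inside the target tag
        self.results = []   # one string per matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            if self.depth == 0:
                # start a new entry for this element
                self.results.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # called for every text node; keep it only inside the target tag
        if self.depth > 0:
            self.results[-1] += data


# made-up sample input, standing in for the downloaded page
sample = "<h1>Title</h1><h2>First <em>heading</em></h2><p>text</p><h2>Second</h2>"
parser = TagTextParser("h2")
parser.feed(sample)
parser.close()
print(parser.results)  # ['First heading', 'Second']
```

Because handle_data() is called for every text node regardless of the tag, the start and end handlers are what decide which data is kept. Text in nested tags (the em element above) is included in the enclosing h2 entry.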

User contributions licensed under: CC BY-SA