Python

I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:

import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered   some data:", data)


url = "website"
page = urllib.request.urlopen(url).read()

parser = MyHTMLParser(strict=False)
parser.feed(str(page))

If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?

Answer

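import urllib.request

# MyHTMLParser is the class defined in the question.
html_code = urllib.request.urlopen("xxx")
data = ""
for raw_line in html_code:
    # In Python 3 the response yields bytes, so decode before comparing
    # with a str (UTF-8 is assumed here; use the page's real encoding).
    line = raw_line.decode("utf-8", errors="replace").strip()

    # Keep only the lines that begin with an <h2> tag.
    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)
hp.close()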

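This way the parser only receives the lines that start with an h2 tag, so handle_data() prints just the h2 content. Note that this relies on each h2 element sitting on its own line in the page source. Hope it helps.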
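If the h2 elements are not neatly one per line, another option is to let the parser itself keep track of which tag it is in. The following is only a sketch of that idea: TagTextParser, its tag argument, and the sample string are made-up names for illustration, not part of the original code. It overrides handle_starttag() and handle_endtag() to know when it is inside the wanted tag, and only collects text in handle_data() while that is the case.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside one kind of tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag        # tag name to watch for, e.g. "h2"
        self.depth = 0        # > 0 while we are inside the wanted tag
        self.results = []     # one string per matched element
        self._buffer = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost wanted tag just closed: store its text.
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Ignore any text that is outside the wanted tag.
        if self.depth > 0:
            self._buffer.append(data)


# Made-up sample markup, used instead of a real download.
sample = "<h1>Title</h1><h2>First <b>heading</b></h2><p>text</p><h2>Second</h2>"
parser = TagTextParser("h2")
parser.feed(sample)
parser.close()
print(parser.results)  # ['First heading', 'Second']
```

For a real page, it is probably better to feed the parser page.decode("utf-8") (assuming the page is UTF-8) rather than str(page), since str() on a bytes object gives its b'...' representation instead of the decoded text.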

Python and HTMLParser.handle_data() – How to get data from tags?
