
Creating a custom web scraping tool to count unique words in Python

I’m trying to create a function that takes two arguments: a web URL and a search word. The function should print the number of times the word appears on the page.

I’m not sure what I’m doing wrong: when I run my code, I get neither an error nor any output.

from html.parser import HTMLParser
from urllib.request import urlopen

class customWebScraper(HTMLParser):
  def __init__(self, searchWord, desiredURL):
      HTMLParser.__init__(self)
      self.searchWord= ''
      self.desiredURL = ''


def scrapePage(searchWord, desiredURL):
  wordCount = 0
  if searchWord.count(searchWord) > 0:
      wordCount += 1
      print(wordCount)

searchWord= ''
desiredURL = ''

urlContents = urlopen(desiredURL).read().decode('utf-8')

parseURL = customWebScraper(searchWord, desiredURL)
parseURL.feed(urlContents)

So if a user types:

customWebScraper('name', 'http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm')

it should output: 6


Answer

Here’s a simple example script that defines the function you want.

from urllib.request import urlopen

class customWebScraper:
    def __init__(self, searchWord, desiredURL):
        self.searchWord = searchWord
        self.desiredURL = desiredURL

    def scrapePage(self):
        url_content = urlopen(self.desiredURL).read().decode('utf-8')
        return url_content.lower().count(self.searchWord.lower())



parseURL = customWebScraper('name', 'http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm')
count = parseURL.scrapePage()
print('"{}" appears in {} exactly {} times'.format(parseURL.searchWord, parseURL.desiredURL, count))

When I run it, the output is:

"name" appears in http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm exactly 6 times

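I assumed you wanted a case-insensitive match, because on the page you provided, name appears 6 times only if you also count occurrences inside longer words such as appName. Note that count() works on the raw page source, so it matches substrings anywhere in the HTML, not just whole words in the visible text.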
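If you would rather count only whole words in the page text, and keep using HTMLParser as in your original attempt, something along these lines might work. This is only a rough sketch: the names TextCollector, count_word_in_html and count_word_on_page are made up for illustration, and it assumes the page decodes as UTF-8 and that skipping script and style contents is what you want.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a <script> or <style> element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_word_in_html(html, search_word, whole_word=True):
    """Count case-insensitive occurrences of search_word in the text of html."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.chunks)
    pattern = re.escape(search_word)
    if whole_word:
        # \b keeps "name" from matching inside "appName".
        pattern = r"\b" + pattern + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_word_on_page(url, search_word, whole_word=True):
    """Fetch url (assumed to be UTF-8) and count search_word in its text."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return count_word_in_html(html, search_word, whole_word)
```

You would call it as count_word_on_page(desiredURL, searchWord). With whole_word=True the result will generally be lower than what count() gives on the raw source, since tag names, attributes, scripts and partial matches are no longer counted.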

User contributions licensed under: CC BY-SA