Tag: beautifulsoup

How to use BeautifulSoup to parse Google search results in Python

I am trying to parse the first page of Google search results, specifically the title and the small summary provided for each result. Here is what I have so far: The part I am stuck on now is going down the HTML path to parse the specific data that I want. Everything I have tried so far has just thrown an

Beautiful Soup Nested Tag Search

beautifulsoup html python

I am trying to write a Python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example: <p class="hello"> inside <div>). Every time I try finding such a tag using the page.findAll() method (page is a Beautiful Soup object containing the whole page), it
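As a minimal sketch of reaching a nested tag, the snippet below uses a CSS descendant selector to pick only the <p class="hello"> elements that sit inside a <div>, then counts their words. The sample markup and the variable names are assumptions made for illustration; the asker's real page is not shown in the excerpt.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page.
html = """
<div id="content">
  <p class="hello">Hello nested world</p>
  <p>ignored paragraph</p>
</div>
<p class="hello">outside the div</p>
"""

page = BeautifulSoup(html, "html.parser")

# "div p.hello" matches every <p class="hello"> that is a descendant of a <div>.
nested = page.select("div p.hello")

# Count whitespace-separated words in the matched paragraphs only.
word_count = sum(len(p.get_text().split()) for p in nested)
print(word_count)
```

An equivalent approach is to call find() for the outer <div> first and then find_all("p", class_="hello") on that result.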

Exact website links from Google through BeautifulSoup

beautifulsoup python

I want to search Google using BeautifulSoup and open the first link, but when I open the link it shows an error. I think the reason is that Google does not provide the exact link of the website; it adds several parameters to the URL. How can I get the exact URL? When I tried to use the cite tag it worked, but for big
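One way to recover the target address, sketched here under the assumption that the result link has the redirect form /url?q=<target>&..., is to read the q query parameter with the standard library. The helper name extract_target and the sample href are hypothetical; Google's markup changes over time, so the link format should be checked against the actual response.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination of a '/url?q=...' redirect link, else the href itself."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs percent-decodes values and returns a list per key.
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href


# Hypothetical redirect link of the assumed form.
href = "/url?q=https://example.com/page%3Fid%3D1&sa=U&ved=abc"
print(extract_target(href))
```

Links that are already direct pass through unchanged, so the helper can be applied to every href taken from the page.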

BeautifulSoup find_all limited to 50 results?

beautifulsoup python

I’m trying to get the results from a page using BeautifulSoup: I read this previous solution: Beautiful Soup findAll doesn’t find them all and I tried html.parser, lxml and html5lib, but none of them return more than 50 results. Any suggestions? Answer: Try using a CSS selector query.
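The answer's code is not included in the excerpt, so the following is only a sketch of what a CSS selector query looks like in Beautiful Soup. The markup, the result class, and the selector are assumptions for illustration. If the page only ships 50 items in its initial HTML, no selector will return more.

```python
from bs4 import BeautifulSoup

# Hypothetical page with 60 result items.
html = "<ul>" + "".join(
    f'<li class="result"><a href="/item/{i}">Item {i}</a></li>' for i in range(60)
) + "</ul>"

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns every match as a list.
links = soup.select("li.result > a")
print(len(links))
```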

Accessing the dataLayer (JS variable) when scraping with python

beautifulsoup javascript python python-2.7 web-scraping

I’m using Beautiful Soup to scrape a webpage. I want to access the dataLayer (a JavaScript variable) that is present on this webpage. How can I retrieve it using Python? Answer: You can parse it from the source with the help of re and json.loads to find the correct script tag that contains the JSON: Running it you see we
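The answer's own code is cut off in the excerpt. A minimal sketch of the re plus json.loads idea might look like the following, assuming the page assigns a JSON-compatible array literal to dataLayer in an inline script. The sample HTML and the regular expression are illustrative assumptions, and a literal that is valid JavaScript but not valid JSON would need different handling.

```python
import json
import re

# Hypothetical page source; the real one would come from the HTTP response body.
html = """
<html><head>
<script>var other = 1;</script>
<script>
  dataLayer = [{"pageType": "product", "price": 19.99}];
</script>
</head></html>
"""

# Capture the array literal assigned to dataLayer, up to the closing "];".
match = re.search(r"dataLayer\s*=\s*(\[.*?\])\s*;", html, re.DOTALL)

# json.loads turns the captured text into Python lists and dicts.
data = json.loads(match.group(1)) if match else []
print(data)
```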

Beautiful Soup if Class “Contains” or Regex?

beautifulsoup python regex web-scraping

If my class names are constantly different, say for example: Normally I could do: There are way too many class names to work with here, so a bunch of these are out. I know Python doesn’t have the “.contains” I would normally use, but it does have “in”. Though I haven’t been able to work out a way to
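Two ways to match on part of a class name are sketched below: passing a compiled regular expression as class_, and using a CSS attribute-substring selector. The class names in the question are not shown in the excerpt, so the ones here are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical class names that share a common prefix.
html = """
<div class="listing-col-line-3-11">a</div>
<div class="listing-col-block-1-22">b</div>
<div class="sidebar">c</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A regex passed as class_ is tested against each individual class name.
by_regex = soup.find_all("div", class_=re.compile(r"^listing-col-"))

# [class*="..."] matches when the class attribute contains the substring.
by_css = soup.select('div[class*="listing-col-"]')

print(len(by_regex), len(by_css))
```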

BeautifulSoup – find table with specified class on Wikipedia page

beautifulsoup python

I am trying to find a table in a Wikipedia page using BeautifulSoup and for some reason I don’t get the table. Can anyone tell me why I don’t get the table? My code: prints: None Answer: You shouldn’t select on jquery-tablesorter in the response you get from requests, because that class is applied dynamically after the page loads. If
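To illustrate the point about the dynamically added class, the sketch below parses static markup resembling what a plain HTTP fetch might return, before any JavaScript runs. The table content is invented, and the class list wikitable sortable is an assumption about the served HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: the jquery-tablesorter class is absent because
# it would only be added by JavaScript in a browser.
html = '<table class="wikitable sortable"><tr><th>Name</th></tr><tr><td>A</td></tr></table>'

soup = BeautifulSoup(html, "html.parser")

# Searching for the full browser-side class string finds nothing.
missing = soup.find("table", class_="wikitable sortable jquery-tablesorter")

# Searching on a class that is present in the served HTML works.
table = soup.find("table", class_="wikitable")

print(missing, table is not None)
```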

BeautifulSoup – search by text inside a tag

beautifulsoup python regex

Observe the following problem: For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag: Alright. Looks good. Let’s try
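One commonly cited cause of this behavior is that the string filter is compared against a tag's .string, which is None when the tag has more than one child (for example an <i> element plus a text node). A possible workaround, sketched here with made-up markup, is to filter with a function that inspects get_text() instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links: the first contains an <i> child before its text.
html = '<a href="/edit/1"><i class="icon"></i> Edit</a><a href="/edit/2">Edit</a>'

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Edit")

# Filters on .string, so the link with the nested <i> is typically missed.
by_string = soup.find_all("a", string=pattern)

# A function filter can look at all of the text inside the tag.
by_text = soup.find_all(lambda tag: tag.name == "a" and pattern.search(tag.get_text()))

print([a["href"] for a in by_text])
```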

Replacing tags of one kind with tags of another in BeautifulSoup

beautifulsoup html parsing python python-3.x

I have a collection of HTML files. I wish to iterate over them, one by one, editing the mark-up of a particular class. The code I wish to edit is of the following form, using the following class names: This can occur multiple times in the same document, with different text instead of “Put me Elsewhere”, but always the
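The original markup and class names are not shown in the excerpt, so the following is only a sketch of one way to swap a tag for another kind in place: assigning to the tag's name attribute. The tag names and the classes old-style and new-style are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real structure is elided in the question.
html = '<div><span class="old-style">Put me Elsewhere</span></div>'

soup = BeautifulSoup(html, "html.parser")

for span in soup.find_all("span", class_="old-style"):
    # Renaming the tag keeps its children and position in the tree.
    span.name = "p"
    # Attributes can be reassigned like dictionary entries.
    span["class"] = ["new-style"]

result = str(soup)
print(result)
```

When the replacement needs a different structure rather than just a different name, building a tag with soup.new_tag() and calling replace_with() on the old one is an alternative.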

UnicodeEncodeError: 'charmap' codec can't encode characters

beautifulsoup python urllib

I’m trying to scrape a website, but it gives me an error. I’m using the following code: And I’m getting the following error: What can I do to fix this? Answer: I fixed it by adding .encode("utf-8") to soup. That means that print(soup) becomes print(soup.encode("utf-8")).
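The error usually appears when the output encoding (a 'charmap' codec such as cp1252 on a Windows console) cannot represent some character in the text. The sketch below reproduces that with a plain string and shows that UTF-8 encoding succeeds. The sample text and the choice of cp1252 are assumptions, since the asker's page and console settings are not shown.

```python
# Hypothetical text containing characters that cp1252 cannot represent.
text = "Price: 100 \u20bd \u2713"

try:
    text.encode("cp1252")
    charmap_ok = True
except UnicodeEncodeError:
    # This is the same class of failure print() hits on a cp1252 console.
    charmap_ok = False

# UTF-8 can encode every Unicode character, so this does not raise.
utf8_bytes = text.encode("utf-8")

print(charmap_ok, utf8_bytes)
```

Printing the encoded value shows a bytes literal rather than readable text. On Python 3.7 and later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is another option that keeps the output as text.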