I am very new to this concept, but I am trying to learn how to use Python to manipulate HTML data.
I wrote a Python (ver. 3.4.1) script that fetches a URL and parses the returned HTML with BeautifulSoup (ver. 4).
    import urllib.request
    from bs4 import BeautifulSoup

    response = urllib.request.urlopen('http://www.walmart.ca/en/ip/xbox-one/6000187109065')
    html = response.read()
    soup = BeautifulSoup(html)
    print(soup.find_all('div', {"class" : "price-current"}))
In this example, I am attempting to obtain the price of the Xbox One. I chose this div because it is the one that displays the price to the user on the webpage. I am aware that there is a <span itemprop="price">$399.99</span> that is available to scrape, and scraping from it is fairly straightforward. I am more curious as to why the price won’t show up in the div that is supposed to contain it.
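For reference, here is a minimal sketch of scraping that span. The `SAMPLE_HTML` string and the `extract_price` helper are hypothetical stand-ins for illustration; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source as fetched without a browser;
# the real markup on the site may differ.
SAMPLE_HTML = """
<div class="price-current"></div>
<span itemprop="price">$399.99</span>
"""


def extract_price(html):
    """Return the text of the first <span itemprop="price">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", attrs={"itemprop": "price"})
    return span.get_text(strip=True) if span else None


print(extract_price(SAMPLE_HTML))
```

In this sample the div is empty while the span carries the price, which mirrors what I see in the fetched page.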
I hypothesize that it has something to do with HTTP headers, or perhaps some kind of POST/GET data that is sent automatically when browsing with a standard web browser. Could anyone explain why the prices do not show up and what I would have to do to get them to show as expected?
Answer
The price is loaded using JavaScript after the page itself has loaded, so it is not part of the HTML that urllib fetches. This is the request:
    Request URL: http://www.walmart.ca/ws/online/products
    Request Method: POST
    Form Data:
        products: [{"productid":"6000187109066","skus":[{"skuid":"6000187109066","status":"10"}]}]
        csrfToken: b08bfe580f3d9a0d893435fb
Since it includes a csrfToken, you will want to figure out how the token is generated or provided before making the POST request. They might be relying on session cookies as well.
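As a rough sketch of what reproducing that request could look like with the standard library: the helper names `build_price_request` and `make_opener` are made up for illustration, and the code assumes you have already obtained a valid token somehow (how to do that is not shown here, since it depends on the site). The request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Endpoint taken from the request shown above.
PRODUCTS_URL = "http://www.walmart.ca/ws/online/products"


def build_price_request(product_id, sku_id, csrf_token):
    """Build (but do not send) a POST request shaped like the one above.

    csrf_token is assumed to have been obtained elsewhere.
    """
    products = [{"productid": product_id,
                 "skus": [{"skuid": sku_id, "status": "10"}]}]
    form = {
        # The products field is sent as a compact JSON string.
        "products": json.dumps(products, separators=(",", ":")),
        "csrfToken": csrf_token,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(PRODUCTS_URL, data=body, method="POST")


def make_opener():
    """Return an opener that keeps cookies between requests,
    in case the token is tied to a session cookie."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


request = build_price_request("6000187109066", "6000187109066",
                              "b08bfe580f3d9a0d893435fb")
print(request.get_method(), request.full_url)
```

To actually send it, you would first load the product page with the opener from `make_opener()` so that any session cookies are stored, and then pass the request to the same opener's `open()` method. Whether the server accepts it depends on the token and cookies being valid.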
It appears that Walmart has an API, though: https://developer.walmartlabs.com/