Walmart Price Scraping with Python 3

I am very new to this concept, but I am trying to learn how to use Python to manipulate HTML data.

I wrote a Python (ver. 3.4.1) script that fetches a URL and returns the page's HTML, which I parse using BeautifulSoup (ver. 4).

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://www.walmart.ca/en/ip/xbox-one/6000187109065')
html = response.read()

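# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')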

print(soup.find_all('div', {"class" : "price-current"}))

In this example, I am attempting to obtain the price of the Xbox One. I chose this div because it is the one that displays the price to the user on the webpage. I am aware that there is a <span itemprop="price">$399.99</span> that is available to scrape. Scraping from that span is fairly straightforward; I am more curious about why the price does not show up in the div where it is supposed to be.
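
For reference, here is a minimal sketch of the span-based approach. It runs against a hypothetical inline fragment modelled on the markup described above, not the live page, and it uses the standard library's html.parser instead of BeautifulSoup so that it has no dependencies. The class name PriceSpanParser and the SAMPLE_HTML fragment are illustrative assumptions.

```python
from html.parser import HTMLParser

# Hypothetical fragment: an empty price div plus the span that holds the price.
SAMPLE_HTML = (
    '<div class="price-current"></div>'
    '<span itemprop="price">$399.99</span>'
)


class PriceSpanParser(HTMLParser):
    """Collects the text of every <span itemprop="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False  # True while inside a matching span
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("itemprop", "price") in attrs:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


parser = PriceSpanParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.prices)
```

In the fragment the div is empty while the span carries the text, which is the situation I see in the downloaded HTML.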

I hypothesize that it has something to do with HTTP headers, or perhaps some kind of POST/GET data that is sent automatically when browsing with a standard web browser. Could anyone explain why the prices do not show up and what I would have to do to get them to show as expected?

Answer

The price is loaded using JavaScript after the page loads, so it is not present in the HTML that urllib fetches. This is the request the browser makes:

Request URL: http://www.walmart.ca/ws/online/products
Request Method: POST

Form Data:
products: [{"productid":"6000187109066","skus":[[{"skuid":"6000187109066","status":"10"}]]}]
csrfToken: b08bfe580f3d9a0d893435fb

Since it includes a csrfToken, you will need to figure out how the token is generated or provided before making the POST request. They might be relying on session cookies as well.
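
As a rough sketch only, the captured request could be rebuilt with the standard library along these lines. The function name build_price_request and the placeholder token are assumptions made for illustration. How to obtain a valid token, whether the endpoint accepts a request shaped like this, and what the response looks like are all things I have not verified, so the request is built but not sent.

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

PRODUCTS_URL = "http://www.walmart.ca/ws/online/products"


def build_price_request(product_id, sku_id, csrf_token):
    """Build (but do not send) a POST that mirrors the captured form data."""
    products = [{
        "productid": product_id,
        # Nested list copied as-is from the captured request.
        "skus": [[{"skuid": sku_id, "status": "10"}]],
    }]
    form = {
        "products": json.dumps(products, separators=(",", ":")),
        "csrfToken": csrf_token,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(PRODUCTS_URL, data=body, method="POST")


# An opener with a cookie jar keeps any session cookies between requests,
# in case the site ties the token to a session (an assumption).
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar())
)

# "TOKEN_FROM_PAGE" is a placeholder; a real token has to come from the site.
request = build_price_request("6000187109066", "6000187109066", "TOKEN_FROM_PAGE")
# response = opener.open(request)  # left commented out: needs a valid token
```

Fetching the product page through the same opener first would let the cookie jar pick up any session cookies before the POST is made.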

It appears that Walmart has an API, though: https://developer.walmartlabs.com/
