Python

I am trying to scrape noon.com. Here is the product I want to scrape: https://www.noon.com/uae-en/face-and-beard-wash-multicolour-80ml/N22130693A/p?o=f7adb85c3296590b

I can get all of the product information except the ratings/review data. The issue is that the website loads the ratings data from the API endpoint https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list, which is called with a POST request.

I tried including headers and what I believe is the appropriate payload in the Scrapy request, but I get a 400 or 405 response, which Scrapy reports as "HTTP status code is not handled or not allowed".

This is how I am trying to pull the ratings data:

def start_requests(self):
    headers = {"authority": "www.noon.com",
    "method": "POST",
    "path": "/_svc/reviews/fetch/v1/product-reviews/list",
    "scheme": "https",
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache, max-age=0, must-revalidate, no-store",
    "content-type": "application/json",
    "origin": "https://www.noon.com",
    "referer": "https://www.noon.com/uae-en/face-and-beard-wash-multicolour-80ml/N22130693A/p?o=f7adb85c3296590b",
    "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    }
    url = "https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list"

    payload = [{"catalogCode":"noon","sku":"N22130693A","lang":None,"ratings":[1,2,3,4,5],"provideBreakdown":True,"page":1}]

    yield  scrapy.Request(url,method = "POST",body=json.dumps(payload),headers = headers,callback=self.parse)


def parse(self, response):
    data = json.loads(response.body)
    print(data)

Is there a solution for this issue? Any help would be appreciated.

Answer

I tried this and it works for me. If it doesn't work for you, you may have been IP-blocked and may need to use a proxy API. See whether this works for you:

def start_requests(self):
    return [scrapy.Request(
        url='https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list',
        method='POST',
        body='{"catalogCode":"noon","sku":"N22130693A","lang":null,"ratings":[1,2,3,4,5],"provideBreakdown":true,"page":1}',
        headers={
            'content-type': 'application/json'
        }
    )]

def parse(self, response):
    print(response.body)
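
Comparing the two requests, two differences stand out, and either one may explain the 400/405 (this is a guess, not something confirmed against the site). The question serializes the payload as a one-element list, while the request here sends a single JSON object. The question also sends "authority", "method", "path" and "scheme" as ordinary headers; these look like the HTTP/2 pseudo-headers that browser dev tools display, which are not normally set by hand. A minimal sketch of both fixes in plain Python follows; the helper names `build_review_body` and `drop_pseudo_headers` are hypothetical and not part of Scrapy:

```python
import json

# Names that browser dev tools list for HTTP/2 requests (":authority", etc.).
# Assumption: they should not be sent as regular request headers.
PSEUDO_HEADERS = {"authority", "method", "path", "scheme"}


def build_review_body(sku, page=1):
    """Serialize the reviews payload as one JSON object (a dict, not a list)."""
    payload = {
        "catalogCode": "noon",
        "sku": sku,
        "lang": None,              # becomes null in JSON
        "ratings": [1, 2, 3, 4, 5],
        "provideBreakdown": True,  # becomes true in JSON
        "page": page,
    }
    # Compact separators reproduce the body string used in the answer.
    return json.dumps(payload, separators=(",", ":"))


def drop_pseudo_headers(headers):
    """Return a copy of headers without the pseudo-header entries."""
    return {
        name: value
        for name, value in headers.items()
        if name.lstrip(":").lower() not in PSEUDO_HEADERS
    }


body = build_review_body("N22130693A")
headers = drop_pseudo_headers({
    "authority": "www.noon.com",
    "method": "POST",
    "content-type": "application/json",
})
print(body)
print(headers)
```

The resulting `body` and `headers` can be passed to `scrapy.Request(url, method='POST', body=body, headers=headers)` exactly as in the code above.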

Log output from my Scrapy run:

2020-12-23 13:12:35 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.noon.com/_svc/reviews/fetch/v1/product-reviews/list> (referer: None)
b'{"list":[],"summary":{"rating":5.0,"count":1,"commentCount":0},"breakdown":[{"rating":5.0,"count":1,"commentCount":0}],"languages":[],"pagination":{"totalPages":1,"page":1,"perPage":10}}'
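
To go from the raw bytes to usable values, the body can be decoded with `json.loads`. The sketch below works on the exact response shown in the log; the function name `summarize_reviews` and the keys of the returned dict are my own choices, and the field meanings (for example that "totalPages" counts pages of the "list" array) are inferred from this one response rather than from any API documentation:

```python
import json

# Response body copied verbatim from the log output above.
raw = (
    b'{"list":[],"summary":{"rating":5.0,"count":1,"commentCount":0},'
    b'"breakdown":[{"rating":5.0,"count":1,"commentCount":0}],'
    b'"languages":[],"pagination":{"totalPages":1,"page":1,"perPage":10}}'
)


def summarize_reviews(body):
    """Pull the rating summary and paging info out of a reviews response."""
    data = json.loads(body)  # json.loads accepts bytes as well as str
    summary = data.get("summary", {})
    pagination = data.get("pagination", {})
    return {
        "rating": summary.get("rating"),
        "count": summary.get("count"),
        "comment_count": summary.get("commentCount"),
        "reviews_on_page": len(data.get("list", [])),
        # Assumption: more pages exist while page < totalPages.
        "has_next_page": pagination.get("page", 1) < pagination.get("totalPages", 1),
    }


result = summarize_reviews(raw)
print(result)
```

Inside a spider, the same function could be called as `summarize_reviews(response.body)` in `parse`, and a follow-up request with `"page"` incremented could be issued while `has_next_page` is true.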
