
Invalid XPath in Scrapy (Python)

Hello, I'm trying to build a crawler using Scrapy. My crawler code is:

import scrapy
from shop.items import ShopItem


class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']



    def parse(self, response):
      items = ShopItem()
      title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
      sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
      product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
      category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()

      items['product_name'] = ''.join(title).strip()
      items['product_sale_price'] = ''.join(sale_price).strip()
      items['product_original_price'] = ''.join(product_original_price).strip()
      items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
      yield items


But when I run the command: scrapy crawl shopspider -o info.csv
the output contains information about only the first product, not all the products on the page.
So I removed the numbers between [ ] in the XPaths; for example, the XPath of the title became: //*[@id="content"]/div/div/ul/li/a/h3
but I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.


Answer

If you remove the indexes from your XPaths, they will match all the items on the page:

response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract() # Returns 7 items

However, note that this returns a list of strings, each containing the HTML of a selected element. If you want the text inside the element (which it looks like you do), append /text() to the XPath.

Also, the reason you only get one result is that you concatenate all the items into a single string when assigning them to the item:

items['product_name'] = ''.join(title).strip()

Here title is a list of strings, and ''.join(title) concatenates them all into a single string. The same logic applies to the other variables.
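
A small plain-Python sketch of that effect, with no Scrapy involved. The second and third product names are made up for illustration:

```python
# Pretend this is what .extract() returned for three products.
# Only the first name comes from the question; the others are hypothetical.
titles = ['Halo Skincare Organic Gift Set', 'Second Product', 'Third Product']

# Joining with an empty separator collapses the list into one string,
# so a single item (one CSV row) ends up holding every title.
joined = ''.join(titles).strip()

print(type(joined).__name__)
print(joined)
```

However many elements the XPath matches, the item receives one string, so the spider still yields exactly one row.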

If that's really what you want, you can disregard the following, but I believe a better approach is to loop over the products and yield one item per product.

My suggestion would be:

def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()

        yield items

Note that in your original code the category variable uses the same XPath as product_original_price. I kept that logic in the code, but it's probably a mistake.

User contributions licensed under: CC BY-SA