I’m trying to make a spider that gets some outdated urls from the database, parses them, and updates the data in the database. I need both the urls to scrape and their ids, so the pipeline that saves the scraped data can use them.
I made this code, but I don’t know why Scrapy changes the order of the scraped links; it looks random, so my code assigns the ids wrong. How can I assign the right id to every link?
    def start_requests(self):
        urls = self.get_urls_from_database()
        # urls looks like [('link1', 1), ('link2', 2), ('link3', 3)]
        for url in urls:
            # url ('link1', 1)
            self.links_ids.append(url[1])
            yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)
    def get_urls_from_database(self):
        self.create_connection()
        self.dbcursor.execute("""SELECT link, id FROM urls_table""")
        urls = self.dbcursor.fetchall()
        return urls
    def parse(self, response):
        item = ScrapyItem()
        link_id = self.links_ids[0]
        self.links_ids.remove(link_id)
        ...
        item['name'] = name
        item['price'] = price
        item['price_currency'] = price_currency
        item['link_id'] = link_id
        yield item
Because the links are not processed in order, the output is assigned to the wrong item in the database: the name of item 1 is saved as the name of item 3, the price of item 8 becomes the price of item 1, etc.
Answer
async
Scrapy appears to be scheduling GETs asynchronously.
Your code does not deal gracefully with that.
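For context, Scrapy keeps many requests in flight at once, so responses reach your callback in completion order, not submission order. A rough sketch of the relevant default setting (CONCURRENT_REQUESTS really does default to 16; the comments are mine):

    # settings.py -- the Scrapy default that drives this behaviour
    CONCURRENT_REQUESTS = 16  # up to 16 requests in flight at once
    # Whichever server answers first gets its parse() callback first,
    # regardless of the order start_requests() yielded the requests.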
naming
What you get from the DB is not urls,
but rather rows or pairs.
Rather than writing:
for url in urls:
and using [0] or [1] subscripts,
it would be more pythonic to unpack the two items:
for url, id in pairs:
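A minimal sketch of that rewrite, assuming pairs holds the fetched rows (I use id_ so the built-in id() is not shadowed):

    for url, id_ in pairs:           # unpack each row instead of indexing
        self.links_ids.append(id_)   # same behaviour as url[1], but readable
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)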
url → id
You attempt to recover an ID in this way:
link_id = self.links_ids[0]
Consider storing DB results in a dict
rather than a list:
        for url, id in pairs:
            self.url_to_id[url] = id
Then later, in the callback, you can just look up
the required ID with link_id = self.url_to_id[response.url].
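Putting that together, a minimal sketch of start_requests (it reuses your get_urls_from_database; url_to_id starts out empty):

    def start_requests(self):
        self.url_to_id = {}
        pairs = self.get_urls_from_database()
        for url, id_ in pairs:
            # remember which id belongs to which url
            self.url_to_id[url] = id_
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)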
iterating
Ok, let’s see what was happening in this loop:
    for url in urls:
        self.links_ids.append(url[1])
        yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)
And in the parse callback triggered by each of those requests, you wind up running this line:
self.links_ids.remove(link_id)
It appears you’re trying to use
a list that has either zero or one element
as a scalar variable,
which would only work if Scrapy behaved synchronously.
That is an odd usage; using e.g. the dict I suggested
would probably make you happier.
Furthermore, your code assumes callbacks will happen
in the sequence they were enqueued;
this is not the case.
A dict would sort out that difficulty for you.
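Concretely, the callback side of that sketch, keyed on response.url (this assumes none of your links redirect; after a redirect, response.url would no longer match the url you stored):

    def parse(self, response):
        item = ScrapyItem()
        # look up the id by the url that produced this response;
        # correct no matter what order the responses arrive in
        link_id = self.url_to_id[response.url]
        ...
        item['name'] = name
        item['price'] = price
        item['price_currency'] = price_currency
        item['link_id'] = link_id
        yield item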
