How to get a URL and row ID from the database before scraping, to use in the pipeline that stores data?

I’m trying to make a spider that gets some outdated URLs from the database, parses them, and updates the data in the database. I need the URLs to scrape and the IDs to use in the pipeline that saves the scraped data.

I made this code, but I don’t know why Scrapy changes the order of the scraped links. It looks random, so my code assigns the IDs incorrectly. How can I assign the right ID to every link?

    def start_requests(self):
        urls = self.get_urls_from_database()
        # urls looks like [('link1', 1), ('link2', 2), ('link3', 3)]
        for url in urls:
            # url ('link1', 1)
            self.links_ids.append(url[1])
            yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

    def get_urls_from_database(self):
        self.create_connection()
        self.dbcursor.execute("""SELECT link, id FROM urls_table""")
        urls = self.dbcursor.fetchall()
        return urls

    def parse(self, response):
        item = ScrapyItem()
        link_id = self.links_ids[0]
        self.links_ids.remove(link_id)

        ...

        item['name'] = name
        item['price'] = price
        item['price_currency'] = price_currency
        item['link_id'] = link_id

        yield item

Because the links are not processed in order, the output is assigned to the wrong item in the database: the name of item 1 is saved as the name of item 3, the price of item 8 becomes the price of item 1, and so on.


Answer

async

Scrapy schedules requests asynchronously, so responses can arrive in any order.

Your code does not deal gracefully with that.

naming

What you get from the DB is not URLs, but rather rows: (url, id) pairs.

Rather than writing:

        for url in urls:

and using [0] or [1] subscripts, it would be more Pythonic to unpack the two items:

        for url, id in pairs:

url → id

You attempt to recover an ID in this way:

        link_id = self.links_ids[0]

Consider storing DB results in a dict rather than a list:

        for url, id in pairs:
            self.url_to_id[url] = id

Then later you can just look up the required ID with link_id = self.url_to_id[url].
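
Putting those pieces together, a minimal sketch (it assumes url_to_id is initialised as an empty dict on the spider, renames id to row_id to avoid shadowing the builtin, and ignores redirects, which would make response.url differ from the stored key):

    def start_requests(self):
        pairs = self.get_urls_from_database()
        for url, row_id in pairs:
            # Remember which DB row each URL came from.
            self.url_to_id[url] = row_id
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Correct no matter what order the responses arrive in.
        link_id = self.url_to_id[response.url]
        item = ScrapyItem()
        item['link_id'] = link_id
        ...
        yield item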

iterating

OK, let’s see what is happening in this loop:

    for url in urls:
        self.links_ids.append(url[1])
        yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

Meanwhile, each time the parse callback runs, it executes this line:

        self.links_ids.remove(link_id)

It appears you’re trying to use a list that holds either zero or one element as a scalar variable, which can only work if Scrapy behaves synchronously. That is an odd usage; using e.g. the dict I suggested would probably make you happier.

Furthermore, your code assumes callbacks will happen in the sequence they were enqueued; this is not the case. A dict would sort out that difficulty for you.
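
If you would rather not keep any spider-level bookkeeping at all, Scrapy can carry the ID on the request itself via Request.meta; the meta dict is preserved across redirects and comes back on the response. A sketch:

    def start_requests(self):
        for url, row_id in self.get_urls_from_database():
            # Attach the row ID to this specific request.
            yield scrapy.Request(url=url, callback=self.parse,
                                 meta={'link_id': row_id}, dont_filter=True)

    def parse(self, response):
        item = ScrapyItem()
        # Read back the ID that travelled with the request.
        item['link_id'] = response.meta['link_id']
        ...
        yield item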

User contributions licensed under: CC BY-SA