How to get a URL and row ID from the database before scraping, to use in the pipeline that stores data?

I’m trying to make a spider that gets some outdated URLs from the database, parses them, and updates the data in the database. I need the URLs to scrape and the IDs to use in the pipeline that saves the scraped data.

I made this code, but I don’t know why Scrapy changes the order of the scraped links. It looks random, so my code assigns the IDs incorrectly. How can I assign the right ID to every link?

    def start_requests(self):
        urls = self.get_urls_from_database()
        # urls looks like [('link1', 1), ('link2', 2), ('link3', 3)]
        for url in urls:
            # url ('link1', 1)
            self.links_ids.append(url[1])
            yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

    def get_urls_from_database(self):
        self.create_connection()
        self.dbcursor.execute("""SELECT link, id FROM urls_table""")
        urls = self.dbcursor.fetchall()
        return urls

    def parse(self, response):
        item = ScrapyItem()
        link_id = self.links_ids[0]
        self.links_ids.remove(link_id)

        ...

        item['name'] = name
        item['price'] = price
        item['price_currency'] = price_currency
        item['link_id'] = link_id

        yield item

Because the links are not processed in order, the output is assigned to the wrong item in the database: the name of item 1 is saved as the name of item 3, the price of item 8 becomes the price of item 1, and so on.


Answer

async

Scrapy schedules requests asynchronously, so responses can arrive in any order.

Your code does not deal gracefully with that.

naming

What you get from the DB is not URLs, but rather rows: (url, id) pairs.

Rather than writing:

        for url in urls:

and using [0] or [1] subscripts, it would be more Pythonic to unpack the two items:

        for url, id in pairs:

url → id

You attempt to recover an ID in this way:

        link_id = self.links_ids[0]

Consider storing DB results in a dict rather than a list:

        for url, id in pairs:
            self.url_to_id[url] = id

Then later you can just look up the required ID with link_id = self.url_to_id[url].
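
Putting those pieces together, a minimal sketch (it assumes url_to_id is initialised as an empty dict on the spider, renames id to row_id to avoid shadowing the builtin, and ignores redirects, which would make response.url differ from the stored key):

    def start_requests(self):
        pairs = self.get_urls_from_database()
        for url, row_id in pairs:
            # Remember which DB row each URL came from.
            self.url_to_id[url] = row_id
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Correct no matter what order the responses arrive in.
        link_id = self.url_to_id[response.url]
        item = ScrapyItem()
        item['link_id'] = link_id
        ...
        yield item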

iterating

OK, let’s see what is happening in this loop:

    for url in urls:
        self.links_ids.append(url[1])
        yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

Meanwhile, each time the parse callback runs, it executes this line:

        self.links_ids.remove(link_id)

It appears you’re trying to use a list that holds either zero or one element as a scalar variable, which can only work if Scrapy behaves synchronously. That is an odd usage; using e.g. the dict I suggested would probably make you happier.

Furthermore, your code assumes callbacks will happen in the sequence they were enqueued; this is not the case. A dict would sort out that difficulty for you.
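
If you would rather not keep any spider-level bookkeeping at all, Scrapy can carry the ID on the request itself via Request.meta; the meta dict is preserved across redirects and comes back on the response. A sketch:

    def start_requests(self):
        for url, row_id in self.get_urls_from_database():
            # Attach the row ID to this specific request.
            yield scrapy.Request(url=url, callback=self.parse,
                                 meta={'link_id': row_id}, dont_filter=True)

    def parse(self, response):
        item = ScrapyItem()
        # Read back the ID that travelled with the request.
        item['link_id'] = response.meta['link_id']
        ...
        yield item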

User contributions licensed under: CC BY-SA