I’m trying to make a spider that gets some outdated URLs from a database, parses them and updates the data in the database. I need to get the URLs to scrape and the IDs to use in the pipeline that saves the scraped data.
I made this code, but I don’t know why Scrapy changes the order of the scraped links; it looks random, so my code assigns the IDs wrongly. How can I assign an ID to every link?
def start_requests(self):
    urls = self.get_urls_from_database()
    # urls looks like [('link1', 1), ('link2', 2), ('link3', 3)]
    for url in urls:
        # url ('link1', 1)
        self.links_ids.append(url[1])
        yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

def get_urls_from_database(self):
    self.create_connection()
    self.dbcursor.execute("""SELECT link, id FROM urls_table""")
    urls = self.dbcursor.fetchall()
    return urls

def parse(self, response):
    item = ScrapyItem()
    link_id = self.links_ids[0]
    self.links_ids.remove(link_id)
    item['name'] = name
    item['price'] = price
    item['price_currency'] = price_currency
    item['link_id'] = link_id
    yield item
Because the links are not processed in order, the output is assigned to the wrong item in the database: the name of item 1 is saved as the name of item 3, the price of item 8 becomes the price of item 1, etc.
Answer
async
Scrapy appears to be scheduling GETs asynchronously.
Your code does not deal gracefully with that.
naming
What you get from the DB is not urls, but rather rows or pairs.
Rather than writing for url in urls: and using [0] or [1] subscripts, it would be more pythonic to unpack the two items:

for url, id in pairs:
url → id
You attempt to recover an ID in this way:

link_id = self.links_ids[0]

Consider storing DB results in a dict rather than a list:

for url, id in pairs:
    self.url_to_id[url] = id

Then later you can just look up the required ID with link_id = self.url_to_id[url].
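But which url do you use for the lookup inside parse? You no longer have the loop variable there; the response, however, remembers the URL it was fetched from. A minimal sketch of the lookup (assuming no redirects rewrite the URL between request and response):

def parse(self, response):
    item = ScrapyItem()
    # response.url is the same string we used as the dict key
    # in start_requests. Caveat: if redirect middleware follows a
    # redirect, response.url is the final URL and may not match
    # the stored key.
    item['link_id'] = self.url_to_id[response.url]
    ...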
iterating
Ok, let’s see what was happening in this loop:

for url in urls:
    self.links_ids.append(url[1])
    yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)
Each request yielded in that loop eventually triggers parse, which winds up running this line:

self.links_ids.remove(link_id)

It appears you’re trying to use a list, one that has either zero or one elements, as a scalar variable; that only holds up in a setting where Scrapy behaves synchronously.
That is an odd usage; using e.g. the dict I suggested would probably make you happier.
Furthermore, your code assumes callbacks will happen in the sequence they were enqueued; this is not the case. A dict would sort out that difficulty for you.
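Putting it together, here is a hedged sketch of the reworked spider. The helper get_urls_from_database and the item fields come from your code; the extraction of name, price and price_currency is elided because your snippet does not show it:

def start_requests(self):
    self.url_to_id = {}
    # Build the url -> id mapping once, then schedule all requests.
    for url, id in self.get_urls_from_database():
        self.url_to_id[url] = id
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    item = ScrapyItem()
    # ... extract name, price, price_currency here, as in your code ...
    # The dict lookup is correct no matter what order callbacks fire in.
    item['link_id'] = self.url_to_id[response.url]
    yield item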