SQL optimization to increase batch insert using Scrapy

In my previous post, I asked how I could save items in bulk using Scrapy. The topic is here:

Buffered items and bulk insert to Mysql using scrapy

With the help of @Alexander, I can now keep 1000 items in a cache. However, my problem is that the cached items are being inserted into MySQL one by one when the cache is flushed. My only issue here is speed. I think the problem is caused by SQL code that I haven't been able to optimize enough.

The logic for saving to SQL is as follows:

Add the item to the products table; if its product_id doesn't already exist there, also add it to the new_products table. (I'm running a script in the background that deletes these rows from oldest to newest, so I have no problem there. In other words, a maximum of 50k rows are stored in total.)

MySQL is probably slowing down during insertion into the new_products table, because it has to check whether product_id already exists among the existing rows.

I would be very happy if you could suggest a method that lets me save 1000 items to the database at once.

The pipeline.py I am using:

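The original code block did not survive here. The following is a minimal sketch, not the actual pipeline.py, of a pipeline matching the behaviour described above: items are buffered and then inserted one at a time, with an existence check on product_id for each row. The connection settings, table names (products, new_products) and columns (product_id, price) are assumptions.

```python
# Illustrative reconstruction of the described pipeline, not the original code.
import pymysql


class MysqlPipeline:
    def __init__(self):
        self.items = []  # in-memory buffer of scraped items
        self.conn = pymysql.connect(host="localhost", user="user",
                                    password="password", db="scrapy_db")
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Buffer items and flush to MySQL once 1000 have accumulated.
        self.items.append(item)
        if len(self.items) >= 1000:
            for buffered in self.items:
                self.save(buffered)
            self.items = []
        return item

    def save(self, item):
        # One round trip per item: check whether the product is already known ...
        self.cur.execute(
            "SELECT 1 FROM products WHERE product_id = %s",
            (item["product_id"],),
        )
        is_new = self.cur.fetchone() is None

        # ... write it to products (updating the price if the row already exists) ...
        self.cur.execute(
            "INSERT INTO products (product_id, price) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE price = VALUES(price)",
            (item["product_id"], item["price"]),
        )

        # ... and, if it was not seen before, also into new_products.
        if is_new:
            self.cur.execute(
                "INSERT INTO new_products (product_id, price) VALUES (%s, %s)",
                (item["product_id"], item["price"]),
            )
        self.conn.commit()

    def close_spider(self, spider):
        # Flush whatever is left in the buffer when the spider finishes.
        for buffered in self.items:
            self.save(buffered)
        self.conn.close()
```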


Answer

You can eliminate the first query of the save method by executing it once upon initialization and storing a copy of the result as an instance variable, then updating it with new entries inside the save method. Another performance boost would probably come from using the executemany feature of the MySQL cursor, by passing all the rows to the save method at once instead of one at a time.

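The answer's code block also did not survive. A minimal sketch of the two suggestions, assuming the same table and column names as above, might look like this:

```python
# Hedged sketch of the two suggestions above, not the answerer's original code.
import pymysql


class MysqlPipeline:
    def __init__(self):
        self.items = []
        self.conn = pymysql.connect(host="localhost", user="user",
                                    password="password", db="scrapy_db")
        self.cur = self.conn.cursor()
        # Run the existence query once, at start-up, and keep the result
        # as an instance variable instead of querying per item.
        self.cur.execute("SELECT product_id FROM products")
        self.known_ids = {row[0] for row in self.cur.fetchall()}

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= 1000:
            self.save(self.items)  # hand the whole batch over at once
            self.items = []
        return item

    def save(self, items):
        rows = [(item["product_id"], item["price"]) for item in items]

        # Rows whose product_id is not cached yet also go to new_products;
        # adding them to the cache here keeps it in sync and dedupes the batch.
        new_rows = []
        for row in rows:
            if row[0] not in self.known_ids:
                new_rows.append(row)
                self.known_ids.add(row[0])

        # executemany sends the whole batch in far fewer round trips than
        # calling execute() once per row.
        self.cur.executemany(
            "INSERT INTO products (product_id, price) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE price = VALUES(price)",
            rows,
        )
        if new_rows:
            self.cur.executemany(
                "INSERT INTO new_products (product_id, price) VALUES (%s, %s)",
                new_rows,
            )
        self.conn.commit()

    def close_spider(self, spider):
        if self.items:
            self.save(self.items)
        self.conn.close()
```

With pymysql (and MySQLdb/mysqlclient), executemany on an INSERT ... VALUES statement is rewritten into a single multi-row INSERT, so the batch reaches the server in one statement rather than a thousand; combined with the in-memory product_id cache, the per-item SELECT disappears entirely.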

I am actually curious how much of a performance boost this will give, so please share the difference in time.
