I have a Python program that runs in a loop, downloading 20k RSS feeds with feedparser and inserting the feed data into an RDBMS.
I have observed that it starts at 20-30 feeds a minute and gradually slows down. After a couple of hours it is down to 4-5 feeds an hour. If I kill the program and restart it from where it left off, the throughput is back to 20-30 feeds a minute.
It certainly is not MySQL which is slowing down.
What could be potential issues with the program?
Answer
In all likelihood the issue is memory. You are probably holding the parsed feeds in memory, or otherwise accumulating objects that never get garbage collected, which would also explain why a restart restores the original throughput. To diagnose:
- Watch the memory usage of your process (Task Manager on Windows, top on Unix/Linux) and check whether it keeps growing as feeds are processed.
- Then use a memory profiler to find out exactly what is consuming the memory.
- Once you have found the culprit, you can restructure the code so that it no longer holds on to that data.
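As a rough illustration of the profiling step, here is a minimal sketch using the standard library's tracemalloc module to compare two snapshots and list the source lines whose allocations grew the most. The `leaked` list and the `top_growth` helper are made up for the example; the list only stands in for whatever your real feed loop might be keeping alive.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocations changed the most between two snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Stand-in for the feed loop (an assumption for this sketch):
# every "parsed feed" is appended to a list and never released.
leaked = []
for i in range(1000):
    leaked.append(("feed data %d" % i) * 50)

current = tracemalloc.take_snapshot()
growth = top_growth(baseline, current)
for stat in growth:
    # Each entry shows file, line number, and the size/count difference.
    print(stat)

tracemalloc.stop()
```

In a real run you would take a snapshot every few hundred feeds instead of once, and look for lines whose size difference keeps increasing.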
A few tips:
- Call gc.collect() explicitly after clearing, or dropping references to, any data structures you no longer need.
- Use multiprocessing: spawn several processes that each handle a smaller number of feeds.
- If you are on a 32-bit system, consider moving to a 64-bit one.
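The first two tips could be combined roughly like this. It is only a sketch under assumptions: `process_batch` and `chunked` are hypothetical helpers, the URLs are invented, and the dictionary is a placeholder for what `feedparser.parse(url)` would return in your program. The `maxtasksperchild` argument of `multiprocessing.Pool` makes each worker process exit after a fixed number of tasks, so whatever memory it accumulated is released when it exits.

```python
import gc
from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(urls):
    """Hypothetical worker: handle each feed and keep nothing afterwards."""
    done = 0
    for url in urls:
        parsed = {"url": url, "entries": []}  # placeholder for feedparser.parse(url)
        # ... insert `parsed` into the database here ...
        parsed = None  # drop the reference so the data can be reclaimed
        done += 1
    gc.collect()  # explicit collection once the batch's data is unreferenced
    return done


if __name__ == "__main__":
    # Invented URLs, for illustration only.
    feed_urls = ["http://example.com/feed/%d" % n for n in range(200)]
    # Each worker is replaced after one batch (maxtasksperchild=1, chunksize=1).
    with Pool(processes=4, maxtasksperchild=1) as pool:
        counts = pool.map(process_batch, chunked(feed_urls, 50), chunksize=1)
    print(sum(counts), "feeds processed")
```

Batch size and process count are arbitrary here; with network-bound feeds you would tune both against your own measurements.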
Some memory profilers to try:
- https://pypi.python.org/pypi/memory_profiler This one is quite good and the decorators are helpful
- https://stackoverflow.com/a/110826/559095