I have a list of search queries, classes = [...], that I use to build a dataset. There are 100 search queries in this list.
Basically, I divide the list into 4 chunks of 25 queries.
def divide_chunks(l, n):
    # Yield successive n-sized slices of the list passed in as l
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
And below, I’ve created a function that downloads queries from each chunk iteratively:
def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset',
                                adult_filter_off=True, force_replace=False,
                                verbose=True)
        except:
            pass
However, I want to run all 4 chunks concurrently. In other words, I want to run 4 separate iterative operations at the same time. I tried both the threading and multiprocessing approaches, but neither of them works:
process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()
process_1.join()
process_2.join()
process_3.join()
process_4.join()

###########################################################

thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()
Answer
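You’re running download_chunk outside of the thread/process: target=download_chunk(0) calls the function immediately in the main program and passes its return value (None) as the target. You need to provide the function and its arguments separately in order to delay execution: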
For example:
Process(target=download_chunk, args=(0,))
Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
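The same fix applies to the threading attempt in the question. Below is a minimal sketch, assuming a hypothetical stand-in for download_chunk that only records which chunk it was given (the extra log parameter and the run_in_threads helper are illustrative and not part of the original code). It also keeps the Thread objects around, because Thread.start() returns None and so cannot be chained into the assignment.

```python
import threading

def download_chunk(n, log):
    # Hypothetical stand-in for the real download: it only records the
    # chunk index and the name of the thread that handled it.
    log.append((n, threading.current_thread().name))

def run_in_threads(num_chunks):
    log = []  # shared list; append is safe enough for this small sketch
    # Pass the function object and its arguments separately, so the call
    # happens inside each thread rather than when the Thread is created.
    threads = [
        threading.Thread(target=download_chunk, args=(i, log), name=f"chunk-{i}")
        for i in range(num_chunks)
    ]
    for t in threads:
        t.start()
    # Wait for every thread to finish before reading the log.
    for t in threads:
        t.join()
    return sorted(log)

if __name__ == "__main__":
    print(run_in_threads(4))
```

With the real download_chunk(n) from the question, the equivalent call would be threading.Thread(target=download_chunk, args=(0,)).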
For this use-case, I would suggest using multiprocessing.Pool:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))
It handles the work of creating, starting, and later joining the 4 processes. The pool calls download_chunk once for each argument in the iterable, which is range(4) in this case, and distributes those calls across its worker processes.
More info about multiprocessing.Pool can be found in the docs.
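Since downloads typically spend most of their time waiting on the network, a thread-based pool may also be worth trying. The sketch below swaps in concurrent.futures.ThreadPoolExecutor from the standard library instead of multiprocessing.Pool. The download_chunk body is a hypothetical stand-in that returns a status string, and run_all is an illustrative helper name.

```python
from concurrent.futures import ThreadPoolExecutor

def download_chunk(n):
    # Hypothetical stand-in: the real function would download every
    # query in chunk n. Here it just reports which chunk it handled.
    return f"chunk {n} done"

def run_all(num_chunks=4):
    # One worker thread per chunk. executor.map calls download_chunk once
    # per item and yields the results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=num_chunks) as executor:
        return list(executor.map(download_chunk, range(num_chunks)))

if __name__ == "__main__":
    print(run_all())
```

As with Pool, leaving the with block waits for all submitted work to finish, so no explicit join calls are needed.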