Using threading/multiprocessing in Python to download images concurrently

I have a list of search queries to build a dataset:

classes = [...]. There are 100 search queries in this list.

Basically, I divide the list into 4 chunks of 25 queries.

def divide_chunks(l, n):
    # Slice the list l into consecutive chunks of size n.
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
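
For reference, a quick sanity check (with placeholder strings standing in for the real queries, since the actual list isn't shown) confirms the split produces 4 sub-lists of 25:

queries = ['query_%d' % i for i in range(100)]  # hypothetical stand-ins for the real search queries
chunks = list(divide_chunks(queries, 25))
print(len(chunks), [len(c) for c in chunks])    # 4 [25, 25, 25, 25]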

And below, I’ve created a function that downloads the queries in a given chunk one after another:

def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset',
                                adult_filter_off=True, force_replace=False, verbose=True)
        except Exception:
            # Skip any query that fails instead of aborting the whole chunk.
            pass

However, I want to run all 4 chunks concurrently; in other words, I want to run 4 separate download loops at the same time. I tried both the threading and multiprocessing approaches, but neither of them works:

process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()

process_1.join()
process_2.join()
process_3.join()
process_4.join()

###########################################################

thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()

Answer

You’re running download_chunk outside of the thread/process: writing target=download_chunk(0) calls the function immediately in the main process and passes its return value (None) as the target. You need to provide the function and its arguments separately so the call is deferred until the thread/process starts.

For example:

Process(target=download_chunk, args=(0,))
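
Applied to your code, the corrected multiprocessing version would look roughly like this (a sketch, assuming classes and download_chunk are defined at module level as above):

from multiprocessing import Process

if __name__ == '__main__':
    # Pass the function and its argument separately; each child process
    # calls download_chunk(n) itself after it has started.
    processes = [Process(target=download_chunk, args=(n,)) for n in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

The threading version needs the same change, e.g. threading.Thread(target=download_chunk, args=(0,)). Also note that .start() returns None, so keep a reference to the Thread object itself if you want to join it later.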

Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.

For this use-case, I would suggest using multiprocessing.Pool:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))

It handles the work of creating, starting, and later joining the 4 processes. Each worker process calls download_chunk with one of the arguments from the iterable, which is range(4) in this case.

More info about multiprocessing.Pool can be found in the docs.
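
As a side note, since downloading images is mostly I/O-bound, a thread pool works just as well here. A minimal equivalent sketch using the standard-library concurrent.futures module (an alternative, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    # Threads share the interpreter state, so no __main__ guard is needed.
    executor.map(download_chunk, range(4))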
