Right way to do parallel work in Python with files and a DB?

I have a very large number of file names (from my PC) inserted into a database, each with the status New by default. For every file name I want to perform some operations that change the file. While a file is being changed, its status should be set to Processing; once the operations finish, the status should change to Processed. I decided to do this with Python's multiprocessing module. This is my current solution, but I think it is incorrect because the function runs on every file in every process instead of once per file.

...imports

myclient = pymongo.MongoClient(...)
mydb = myclient["file_list"]
mycol = mydb["file_list"]

def test_func(path_to_files):
    for file in glob.glob(path_to_files + "/*.jpg"):
        fileDB = mycol.find_one({'name': file})
        if (fileDB.get('status') == 'new'):
            query = {"name": file}
            processing = {"$set": {"status": "processing"}}
            mycol.update_one(query, processing)

            print('update', file)
            ...operations with file...

            processed = {"$set": {"status": "processed"}}
            mycol.update_one(query, processed)
        else: continue


if __name__ == '__main__':
    start = time.time()
    processes = []
    num_processes = mp.cpu_count()

    for i in range(num_processes):
        process = mp.Process(target=test_func, args=(path_to_files,))
        processes.append(process)

    for process in processes:
        process.start()

    for process in processes:
        process.join()

    end = time.time()
    print(end - start)

My print('update', file) shows the same file for every process. I want to do this work in parallel to speed up my program and to mark files that have already been processed.

Please tell me what I'm doing wrong. Is this the correct way to do what I want, or can I do it a different way?

I would be happy with any suggestion.

I am new to Python.

Answer

All your processes do the same work. If you have 10 files, every process iterates over the same 10 files. Your lock (setting the status to processing) is too slow: by the time the first process sets the status to processing, the next process has already passed the check for new.
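
One way to close the gap between the check and the update is to do both in a single database operation. This is not what the question's code does, only a minimal sketch of the idea: `claim_file` is a hypothetical helper, and it assumes `collection` is a PyMongo collection such as `mycol`. PyMongo's `find_one_and_update` returns the matched document as it was before the update, or `None` if nothing matched the filter, so only one process can get a non-`None` result for a given file.

```python
def claim_file(collection, name):
    """Try to switch one file from 'new' to 'processing' in a single step.

    Hypothetical helper: `collection` is assumed to be a PyMongo collection.
    Returns True only for the caller whose update actually matched the
    document while it still had status 'new'.
    """
    previous = collection.find_one_and_update(
        {"name": name, "status": "new"},        # match only unclaimed files
        {"$set": {"status": "processing"}},     # claim it in the same operation
    )
    # None means no document matched: the file is unknown or already claimed.
    return previous is not None
```

In `test_func`, a call like `if claim_file(mycol, file):` could then stand in for the separate `find_one` and status check, so a process that loses the race simply skips the file.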

Look into splitting up the files between the processes: if you have 100 files and 5 processes, the first process handles files 1-20, the second handles files 21-40, and so on.
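
The split described above could be sketched roughly as follows. The names `split_into_chunks` and `process_chunk` are hypothetical, the file names are made up to stand in for the `glob.glob(...)` result, and the worker body is left as a placeholder for the real file operations and status updates.

```python
import multiprocessing as mp


def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(file_names):
    """Hypothetical worker: handles only the files in its own chunk."""
    done = []
    for name in file_names:
        # ...operations with file and status updates would go here...
        done.append(name)
    return done


if __name__ == "__main__":
    # Made-up file names standing in for the real list of paths.
    files = [f"img_{i:03d}.jpg" for i in range(100)]
    chunks = split_into_chunks(files, 5)

    # Each worker process receives exactly one chunk, so no file is seen twice.
    with mp.Pool(processes=5) as pool:
        results = pool.map(process_chunk, chunks)

    print([len(r) for r in results])
```

Because every file belongs to exactly one chunk, the processes no longer compete for the same files. If the workers also update MongoDB, it is probably safer to create the `MongoClient` inside each worker rather than share one created before the processes start, since PyMongo clients are not fork-safe.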

User contributions licensed under: CC BY-SA