How can I improve processing time with threads on Spyder?

I’m trying to change the date format of a column in a CSV. I know there are easier ways to do this, but the goal here is to get the threads working properly. I work with Spyder and Python 3.8. My code works as follows:

  • I create a thread class with a function to change the date format
  • I split my dataframe into several dataframes, one per thread
  • I assign to each thread a part of the dataframe
  • each thread changes the date formats in its dataframe
  • at the end, I concatenate all the dataframes into one

“serie” is my original dataframe. Here is my code:

import threading

import pandas as pd

from datetime import datetime
from time import process_time

serie=pd.read_csv('XXX.csv')

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

class MonThread (threading.Thread):
    def __init__(self, num_thread):
        threading.Thread.__init__(self)
        self.num_thread = num_thread
    
    # Thread function: convert every date in this thread's chunk
    def run(self):
        df = dataframes[self.num_thread]
        for i in df.index:
            date_formatee = datetime.strptime(df.loc[i, 'Date'], in_format).strftime(out_format)
            df.loc[i, 'Date'] = date_formatee

nb_thread = 80
dataframes = []

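# Split the dataframe into nb_thread chunks
for j in range(nb_thread):
    a = j * (serie.shape[0] // nb_thread)
    if j != nb_thread - 1:
        b = (j + 1) * (serie.shape[0] // nb_thread)
        # .copy() so each thread writes to its own dataframe, not a view of serie
        df = serie.iloc[a:b, :].copy()
    else:
        # The last chunk takes the remaining rows
        df = serie.iloc[a:, :].copy()
        b = serie.shape[0]
    dataframes.append(df)
    print("Interval", j, ": [", a, ",", b, "]")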

tps1 = process_time()
print(tps1)

threads = []
for n in range(nb_thread):
    t = MonThread(n)
    t.start()
    threads.append(t)

for t in threads:
    t.join()
    
dataframe_finale = pd.concat(dataframes)

print("\n\n\n")
tps2 = process_time()
print(tps2)
print("execution time: ")
print(tps2 - tps1)

It works, but the execution time seems quite long. For a total of 100,000 values, processing takes about 1 min 30 s without threads and about 30 seconds with 80 threads; with 200 or 400 threads it stagnates at 30 seconds. Is my code bad, or am I limited by something?

Answer

Have you tried just letting Pandas do the work over the series?

import pandas as pd

df = pd.read_csv('XXX.csv')

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"
df['Date'] = pd.to_datetime(df['Date'], format=in_format).dt.strftime(out_format)

On my MacBook, this processes a million entries in 5 seconds.
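
To see the vectorized conversion in isolation, here is a minimal sketch on a small in-memory dataframe. The sample values and the name `sample` are made up for illustration and stand in for the 'Date' column of the CSV.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'Date' column of XXX.csv.
sample = pd.DataFrame({"Date": ["31/12/2020", "01/02/2021", "15/08/2019"]})

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

# Parse the whole column in one call, then format it back to strings.
converted = pd.to_datetime(sample["Date"], format=in_format).dt.strftime(out_format)
print(converted.tolist())
```

The whole column is handled in one call, so there is no Python-level loop over rows and no need to split the dataframe by hand.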

Another way to do the same (without date validation, though) is

df['Date'] = df['Date'].str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)

which finishes the job in about 3.3 seconds.
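
The "without date validation" caveat can be illustrated with a short sketch. The sample values are made up, and using `errors="coerce"` to flag bad dates is one possible choice, not something the approaches above require.

```python
import pandas as pd

# Hypothetical sample: the second value is not a real calendar date.
dates = pd.Series(["31/12/2020", "31/02/2021"])

# The regex only reorders the three groups of digits; it never checks the date.
swapped = dates.str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)
print(swapped.tolist())

# Parsing with to_datetime detects the invalid date; with errors="coerce"
# it becomes NaT instead of raising an exception.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
print(parsed.isna().tolist())
```

The regex version lets the impossible date through as "2021-02-31", while the parsed version marks it as missing. That is the trade-off for the extra speed.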