I’m trying to change the date format of a column in a CSV. I know there are easier ways to do this, but the goal here is to get the threads working properly. I work with Spyder and Python 3.8. My code works as follows:
- I create a thread class with a function to change the date format
- I split my dataframe into several dataframes according to the number of threads
- I assign to each thread a part of the dataframe
- each thread changes the date formats in its dataframe
- at the end, I concatenate all the dataframes into one
“serie” is my original dataframe. Here is my code:
import pandas as pd
import numpy as np
import threading
import time
from datetime import datetime
from threading import Thread
from time import process_time

serie = pd.read_csv('XXX.csv')

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

class MonThread(threading.Thread):
    def __init__(self, num_thread):
        threading.Thread.__init__(self)
        self.num_thread = num_thread

    # Thread function
    def run(self):
        for self.i in range(dataframes[self.num_thread].index[0], dataframes[self.num_thread].index[0] + dataframes[self.num_thread].shape[0]):
            date_formatee = datetime.strptime(dataframes[self.num_thread].loc[self.i, 'Date'], in_format).strftime(out_format)
            dataframes[self.num_thread].loc[self.i, 'Date'] = date_formatee

nb_thread = 80
dataframes = []

# Split the dataframe into several parts
for j in range(nb_thread):
    a = j * (serie.shape[0] // nb_thread)
    if j != nb_thread - 1:
        b = (j + 1) * (serie.shape[0] // nb_thread)
        df = serie.iloc[a:b, :]
    else:
        df = serie.iloc[a:, :]
        b = serie.shape[0]
    dataframes.append(df)
    print("Intervalle", j, ": [", a, ",", b, "]")

tps1 = process_time()
print(tps1)

threads = []
for n in range(nb_thread):
    t = MonThread(n)
    t.start()
    threads.append(t)

for t in threads:
    t.join()

dataframe_finale = pd.concat(dataframes)

print("\n\n\n")
tps2 = process_time()
print(tps2)
print("temps d'éxécution : ")
print(tps2 - tps1)
It works, but I find the execution time quite long. For a total of 100,000 values it takes about 1 min 30 s without threads and about 30 seconds with 80 threads; with 200 or 400 threads it stays at about 30 seconds. Is my code bad, or am I limited by something?
Answer
Have you tried just letting Pandas do the work over the series?
import pandas as pd

df = pd.read_csv('XXX.csv')

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

df['Date'] = pd.to_datetime(df['Date'], format=in_format).dt.strftime(out_format)
On my MacBook, this processes a million entries in 5 seconds.
Another way to do the same thing (without date validation, though) is
df['Date'] = df['Date'].str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)
which finishes the job in about 3.3 seconds.
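To make the "without date validation" caveat concrete, here is a minimal sketch that runs both approaches on a few in-memory values. The sample dates and the variable names (`dates`, `bad`, and so on) are made up for illustration and are not from the question's CSV.

```python
import pandas as pd

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

# Hypothetical sample values standing in for the 'Date' column of the CSV.
dates = pd.Series(["01/02/2021", "15/11/1999", "31/12/2020"])

# Approach 1: parse to datetimes, then format back to strings.
parsed = pd.to_datetime(dates, format=in_format).dt.strftime(out_format)

# Approach 2: reorder the three number groups with a regex, no parsing.
swapped = dates.str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)

# On valid input both approaches produce the same strings.
same_result = parsed.tolist() == swapped.tolist()

# An impossible date (31 February) shows the difference.
bad = pd.Series(["31/02/2021"])
try:
    pd.to_datetime(bad, format=in_format)
    parsing_rejected = False
except ValueError:
    # Parsing refuses a day that does not exist in that month.
    parsing_rejected = True

# The regex only moves text around, so the invalid date passes through.
bad_swapped = bad.str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)

print(same_result, parsing_rejected, bad_swapped.tolist())
```

So the regex version is only a safe shortcut if the column is already known to contain well-formed dates; otherwise the `pd.to_datetime` version is the one that will flag bad rows.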