How to improve the computation speed of subsetting a pandas dataframe?

I have a large df (14 columns × 1,000,000 rows) and I want to subset it. The calculation unsurprisingly takes a lot of time, and I wonder how to improve the speed.

What I want is, for each Name, the row with the lowest value of Total_time, ignoring zero values and keeping only the first row if more than one row shares that lowest Total_time. The results should then all be appended into a new dataframe unique.

Is there a general mistake in my code that makes it inefficient?

unique = pd.DataFrame([])
i = 0
for pair in df['Name'].unique():
    i = i + 1
    temp = df[df['Name'] == pair]
    temp2 = temp.loc[temp['Total_time'] != 0]  # note: was df['Total_time'], whose mask misaligns with temp
    lowest = temp2['Total_time'].min()
    temp3 = temp2[temp2['Total_time'] == lowest].head(1)
    unique = unique.append(temp3)
    print("finished " + pair + " " + str(i))


Answer

In general, you don’t want to iterate over each item.

If you just want each Name with its smallest time:

new_df = df[df["Total_time"] != 0].copy()  # you seem to be throwing away 0s
out = new_df.groupby("Name")["Total_time"].min()
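For example, on a small hypothetical frame with the question's `Name` and `Total_time` columns (sample values assumed), this yields a Series of per-group minima with the zeros excluded:

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    "Name": ["a", "a", "b", "b"],
    "Total_time": [0, 5, 3, 7],
})

new_df = df[df["Total_time"] != 0].copy()  # drop the zero rows first
out = new_df.groupby("Name")["Total_time"].min()
print(out)
# a -> 5 (the 0 was filtered out), b -> 3
```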

If you need the rest of the columns:

new_df.loc[new_df.groupby("Name")["Total_time"].idxmin()]
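Put together, a minimal runnable sketch (sample data assumed) that keeps the full row of the first minimum per Name, skipping zeros. `idxmin` returns the index label of the first occurrence of the minimum, which matches the question's "pick only the first one" requirement on ties:

```python
import pandas as pd

# Hypothetical data: "b" has a tied minimum (rows 1 and 3) to show
# that idxmin keeps the first occurrence.
df = pd.DataFrame({
    "Name":       ["a", "b", "a", "b", "a"],
    "Total_time": [0,   3,   5,   3,   9],
    "Other":      ["x1", "x2", "x3", "x4", "x5"],
})

new_df = df[df["Total_time"] != 0]
unique = new_df.loc[new_df.groupby("Name")["Total_time"].idxmin()]
print(unique)
# row 2 for "a" (its 0 is ignored), row 1 for "b" (first of the tie)
```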
User contributions licensed under: CC BY-SA