Skip to content
Advertisement

Fastest way to append a row to an existing data frame?

I know this question has been asked many a time, but none of the solutions already posted on this site is ideal.

I have tested various methods found here, and timed them using IPython, I will post the results below:

In [161]: %%timeit
     ...: s = Series([1] * 15, index = songs.columns)
     ...: songs.append(s, ignore_index=True)
     ...:
     ...:
4.82 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [162]: %timeit songs.loc[4464] = [1] * 15
4.5 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [163]: %%timeit
     ...: df = DataFrame([[1] * 15], columns=songs.columns)
     ...: songs.append(df, ignore_index=True)
     ...:
     ...:
4.63 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [164]: %timeit songs.append(dict(zip(songs.columns.tolist(), [1] * 15)), ignore_index=True)
4.62 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [165]: %%timeit
     ...: df = DataFrame([[1] * 15], columns=songs.columns)
     ...: concat([songs, df], ignore_index=True)
     ...:
     ...:
4.64 ms ± 465 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [166]: %timeit songs.loc[4464]
131 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [167]: songs.loc[4464, 'language']
Out[167]: 1

In [168]: songs.loc[4464, 'language'] = 'A'

In [169]: %timeit songs.loc[4464, 'language'] = 'A'
340 µs ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

songs is a DataFrame with 4464 rows (initially) and 15 columns.

I am fully aware DataFrame indexes are IMMUTABLE, so it is technically impossible to add a new row to a DataFrame without creating a complete copy of the dataframe and it is what slows the code down, the overhead is unavoidable, but, can it be reduced?

Is there any conceivable way to bring the execution time down to, say 1 millisecond?

Please don’t suggest using PyPy, true it is fast, but it isn’t exactly as extendable as CPython, because installing any C based library on PyPy takes dozens of minutes and sometimes hours, and I have installed hundreds of libs…

Advertisement

Answer

First we establish the time needed to create a dataframe:

%%timeit 
songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15))
100 loops, best of 5: 5.21 ms per loop

It takes around 5.2 ms to create this dataframe and so we can use it as a reference for the next cases (to prevent caching for e.g).

Case with append:

%%timeit
songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15))
s = pd.Series([1] * 15, index = songs.columns)
songs.append(s, ignore_index=True) 
100 loops, best of 5: 10.1 ms per loop

10 ms meaning it takes around 5ms which is similar to your answer.

With loc:

%%timeit 
songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15))
songs.loc[4464] = [1]*15
100 loops, best of 5: 10.2 ms per loop

again around 5ms.

Solution: Use iloc

%%timeit 
songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15))
songs.iloc[-1] = [1]*15
100 loops, best of 5: 5.25 ms per loop

This solution gets the answer to around ~70µs.

Tested alone, (by creating the dataframe at first):

%%timeit
songs.iloc[-1] = [1]*15
10000 loops, best of 5: 67.1 µs per loop
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement