I know this question has been asked many a time, but none of the solutions already posted on this site is ideal.
I have tested various methods found here, and timed them using IPython, I will post the results below:
In [161]: %%timeit ...: s = Series([1] * 15, index = songs.columns) ...: songs.append(s, ignore_index=True) ...: ...: 4.82 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [162]: %timeit songs.loc[4464] = [1] * 15 4.5 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [163]: %%timeit ...: df = DataFrame([[1] * 15], columns=songs.columns) ...: songs.append(df, ignore_index=True) ...: ...: 4.63 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [164]: %timeit songs.append(dict(zip(songs.columns.tolist(), [1] * 15)), ignore_index=True) 4.62 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [165]: %%timeit ...: df = DataFrame([[1] * 15], columns=songs.columns) ...: concat([songs, df], ignore_index=True) ...: ...: 4.64 ms ± 465 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [166]: %timeit songs.loc[4464] 131 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) In [167]: songs.loc[4464, 'language'] Out[167]: 1 In [168]: songs.loc[4464, 'language'] = 'A' In [169]: %timeit songs.loc[4464, 'language'] = 'A' 340 µs ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
songs
is a DataFrame with 4464 rows (initially) and 15 columns.
I am fully aware DataFrame indexes are IMMUTABLE, so it is technically impossible to add a new row to a DataFrame without creating a complete copy of the dataframe and it is what slows the code down, the overhead is unavoidable, but, can it be reduced?
Is there any conceivable way to bring the execution time down to, say 1 millisecond?
Please don’t suggest using PyPy, true it is fast, but it isn’t exactly as extendable as CPython, because installing any C based library on PyPy takes dozens of minutes and sometimes hours, and I have installed hundreds of libs…
Advertisement
Answer
First we establish the time needed to create a dataframe:
%%timeit songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15)) 100 loops, best of 5: 5.21 ms per loop
It takes around 5.2 ms to create this dataframe and so we can use it as a reference for the next cases (to prevent caching for e.g).
Case with append:
%%timeit songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15)) s = pd.Series([1] * 15, index = songs.columns) songs.append(s, ignore_index=True) 100 loops, best of 5: 10.1 ms per loop
10 ms meaning it takes around 5ms which is similar to your answer.
With loc:
%%timeit songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15)) songs.loc[4464] = [1]*15 100 loops, best of 5: 10.2 ms per loop
again around 5ms.
Solution: Use iloc
%%timeit songs = pd.DataFrame(index=np.arange(4464 ), columns=np.arange(15)) songs.iloc[-1] = [1]*15 100 loops, best of 5: 5.25 ms per loop
This solution gets the answer to around ~70µs.
Tested alone, (by creating the dataframe at first):
%%timeit songs.iloc[-1] = [1]*15 10000 loops, best of 5: 67.1 µs per loop