Skip to content
Advertisement

Pandas: efficiently inserting a large number of rows

I have a large dataframe in this format, call this df:

index val1 val2
0 0.2 0.1
1 0.5 0.7
2 0.3 0.4

I have a row I will be inserting, call this myrow:

index val1 val2
-1 0.9 0.9

I wish to insert this row 3 times after every row in the original dataframe, i.e.:

index val1 val2
0 0.2 0.1
-1 0.9 0.9
-1 0.9 0.9
-1 0.9 0.9
1 0.5 0.7
-1 0.9 0.9
-1 0.9 0.9
-1 0.9 0.9
2 0.3 0.4
-1 0.9 0.9
-1 0.9 0.9
-1 0.9 0.9

This is straightforward with a bit of looping. TLDR: how do I do this more efficiently?

Let’s make a repeat rows function, and create our set of 3 repeats:

import pandas as pd
import numpy as np

def repeat_rows(df, n):
    newdf = pd.DataFrame(np.repeat(df.values, n, axis=0))
    newdf.columns = df.columns
    return newdf

repeats = repeat_rows(myrow.to_frame().T, 3)

Now we have our 3 repeats:

index val1 val2
-1 0.9 0.9
-1 0.9 0.9
-1 0.9 0.9

Finally, we can loop over the original df‘s rows, and concat repeats to the row, and concat the result of all of those together:

blocks = []
for _, row in df.iterrows():
    blocks.append(pd.concat([row.to_frame().T, repeats]))
result = pd.concat(blocks)

We now have the desired result!

The problem is, this is very slow, and I’m looking for a faster solution.

I’m guessing a better solution would follow this pattern:

result = repeat_rows(df, 4)
result.loc[LAST_3_ROWS_IN_EACH_BLOCK_OF_4] = myrow

However, I’m not sure how to do such a loc assignment. How can I make my solution more efficient?

Advertisement

Answer

reset_index so that df has a simple RangeIndex. Then we can do math with tiling and repeats to create an Index that when sorted will place 3 of the myrow rows between each row of your DataFrame. Finally remove this Index and get back to a normal RangeIndex.

Sample Data

import pandas as pd
import numpy as np

myrow = pd.DataFrame({'index': [-1], 'val1': [0.9], 'val2': [0.9]})
df = pd.DataFrame({'index': [0,1,2],
                   'val1': [0.2, 0.5, 0.3],
                   'val2': [0.1, 0.7, 0.4]})

Code

# Ensure starting from a RangeIndex
df = df.reset_index(drop=True)

NR = 3  # Number of repeats

mr = pd.concat([myrow]*len(df)*NR, ignore_index=True)
mr.index = df.index.repeat(NR) + np.tile(np.arange(0, 1, 1/NR), len(df))

# `mr` second in the `concat` so rows go below
df = pd.concat([df, mr]).sort_index().reset_index(drop=True)

    index  val1  val2
0       0   0.2   0.1
1      -1   0.9   0.9
2      -1   0.9   0.9
3      -1   0.9   0.9
4       1   0.5   0.7
5      -1   0.9   0.9
6      -1   0.9   0.9
7      -1   0.9   0.9
8       2   0.3   0.4
9      -1   0.9   0.9
10     -1   0.9   0.9
11     -1   0.9   0.9
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement