I have a large dataframe in this format, call this df
:
index | val1 | val2 |
---|---|---|
0 | 0.2 | 0.1 |
1 | 0.5 | 0.7 |
2 | 0.3 | 0.4 |
I have a row I will be inserting, call this myrow
:
index | val1 | val2 |
---|---|---|
-1 | 0.9 | 0.9 |
I wish to insert this row 3 times after every row in the original dataframe, i.e.:
index | val1 | val2 |
---|---|---|
0 | 0.2 | 0.1 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
1 | 0.5 | 0.7 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
2 | 0.3 | 0.4 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
This is straightforward with a bit of looping. TLDR: how do I do this more efficiently?
Let’s make a repeat rows function, and create our set of 3 repeats:
import pandas as pd
import numpy as np
def repeat_rows(df, n):
newdf = pd.DataFrame(np.repeat(df.values, n, axis=0))
newdf.columns = df.columns
return newdf
repeats = repeat_rows(myrow.to_frame().T, 3)
Now we have our 3 repeats:
index | val1 | val2 |
---|---|---|
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
Finally, we can loop over the original df
‘s rows, and concat repeats
to the row, and concat the result of all of those together:
blocks = []
for _, row in df.iterrows():
blocks.append(pd.concat([row.to_frame().T, repeats]))
result = pd.concat(blocks)
We now have the desired result!
The problem is, this is very slow, and I’m looking for a faster solution.
I’m guessing a better solution would follow this pattern:
result = repeat_rows(df, 4)
result.loc[LAST_3_ROWS_IN_EACH_BLOCK_OF_4] = myrow
However, I’m not sure how to do such a loc assignment. How can I make my solution more efficient?
Advertisement
Answer
reset_index
so that df
has a simple RangeIndex. Then we can do math with tiling and repeats to create an Index that when sorted will place 3 of the myrow
rows between each row of your DataFrame. Finally remove this Index
and get back to a normal RangeIndex
.
Sample Data
import pandas as pd
import numpy as np
myrow = pd.DataFrame({'index': [-1], 'val1': [0.9], 'val2': [0.9]})
df = pd.DataFrame({'index': [0,1,2],
'val1': [0.2, 0.5, 0.3],
'val2': [0.1, 0.7, 0.4]})
Code
# Ensure starting from a RangeIndex
df = df.reset_index(drop=True)
NR = 3 # Number of repeats
mr = pd.concat([myrow]*len(df)*NR, ignore_index=True)
mr.index = df.index.repeat(NR) + np.tile(np.arange(0, 1, 1/NR), len(df))
# `mr` second in the `concat` so rows go below
df = pd.concat([df, mr]).sort_index().reset_index(drop=True)
index val1 val2
0 0 0.2 0.1
1 -1 0.9 0.9
2 -1 0.9 0.9
3 -1 0.9 0.9
4 1 0.5 0.7
5 -1 0.9 0.9
6 -1 0.9 0.9
7 -1 0.9 0.9
8 2 0.3 0.4
9 -1 0.9 0.9
10 -1 0.9 0.9
11 -1 0.9 0.9