I have a large dataframe in this format, call this df
:
index | val1 | val2 |
---|---|---|
0 | 0.2 | 0.1 |
1 | 0.5 | 0.7 |
2 | 0.3 | 0.4 |
I have a row I will be inserting, call this myrow
:
index | val1 | val2 |
---|---|---|
-1 | 0.9 | 0.9 |
I wish to insert this row 3 times after every row in the original dataframe, i.e.:
index | val1 | val2 |
---|---|---|
0 | 0.2 | 0.1 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
1 | 0.5 | 0.7 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
2 | 0.3 | 0.4 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
This is straightforward with a bit of looping. TLDR: how do I do this more efficiently?
Let’s make a repeat rows function, and create our set of 3 repeats:
import pandas as pd import numpy as np def repeat_rows(df, n): newdf = pd.DataFrame(np.repeat(df.values, n, axis=0)) newdf.columns = df.columns return newdf repeats = repeat_rows(myrow.to_frame().T, 3)
Now we have our 3 repeats:
index | val1 | val2 |
---|---|---|
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
-1 | 0.9 | 0.9 |
Finally, we can loop over the original df
‘s rows, and concat repeats
to the row, and concat the result of all of those together:
blocks = [] for _, row in df.iterrows(): blocks.append(pd.concat([row.to_frame().T, repeats])) result = pd.concat(blocks)
We now have the desired result!
The problem is, this is very slow, and I’m looking for a faster solution.
I’m guessing a better solution would follow this pattern:
result = repeat_rows(df, 4) result.loc[LAST_3_ROWS_IN_EACH_BLOCK_OF_4] = myrow
However, I’m not sure how to do such a loc assignment. How can I make my solution more efficient?
Advertisement
Answer
reset_index
so that df
has a simple RangeIndex. Then we can do math with tiling and repeats to create an Index that when sorted will place 3 of the myrow
rows between each row of your DataFrame. Finally remove this Index
and get back to a normal RangeIndex
.
Sample Data
import pandas as pd import numpy as np myrow = pd.DataFrame({'index': [-1], 'val1': [0.9], 'val2': [0.9]}) df = pd.DataFrame({'index': [0,1,2], 'val1': [0.2, 0.5, 0.3], 'val2': [0.1, 0.7, 0.4]})
Code
# Ensure starting from a RangeIndex df = df.reset_index(drop=True) NR = 3 # Number of repeats mr = pd.concat([myrow]*len(df)*NR, ignore_index=True) mr.index = df.index.repeat(NR) + np.tile(np.arange(0, 1, 1/NR), len(df)) # `mr` second in the `concat` so rows go below df = pd.concat([df, mr]).sort_index().reset_index(drop=True)
index val1 val2 0 0 0.2 0.1 1 -1 0.9 0.9 2 -1 0.9 0.9 3 -1 0.9 0.9 4 1 0.5 0.7 5 -1 0.9 0.9 6 -1 0.9 0.9 7 -1 0.9 0.9 8 2 0.3 0.4 9 -1 0.9 0.9 10 -1 0.9 0.9 11 -1 0.9 0.9