
Row-level cumulative sum with condition

I have a table that looks like this.

m1 m2 m3 m4 m5 m6 m7 m8 s
0 1 0 0 5 0 4 10 4
4 1 8 0 15 0 4 10 10

I need to know at which position or column the row-level cumulative sum for the first six columns (m1 to m6) either equals or exceeds the value at s.

For this table, at row 1, it is 0 + 1 + 0 + 0 + 5 = 6, which is greater than the value at s (4); this occurs at m5, or the fifth column from the left, so the expected output is 5.

Similarly, for the second row, it is 4 + 1 + 8 = 13, which is greater than the value at s (10); this occurs at m3, or the third column from the left, so the expected output is 3.

The output also needs to be added as a column to the table, so that the final version looks like this:

m1 m2 m3 m4 m5 m6 m7 m8 s output
0 1 0 0 5 0 4 10 4 5
4 1 8 0 15 0 4 10 10 3

The table has around 7000 rows. I couldn’t figure out a way to use cumsum() conditionally, either with pandas or numpy.

Is there a way to accomplish this without a loop? (Loops are OK too.)

Answer

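You can use cumsum with axis=1, find the first column where the running total reaches s with idxmax, and convert that column label to a position with get_indexer on df.columns: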

# drop(columns="s") replaces the positional form drop("s", 1), which pandas 2.0 removed
df['output'] = df.columns.get_indexer(df.drop(columns="s").cumsum(axis=1)
                       .ge(df['s'], axis=0).idxmax(axis=1)) + 1

print(df)

   m1  m2  m3  m4  m5  m6  m7  m8   s  output
0   0   1   0   0   5   0   4  10   4       5
1   4   1   8   0  15   0   4  10  10       3
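
The one-liner above sums across every column except s, while the question asks about m1 to m6 only; for the sample data both give the same result. As a rough sketch of the same idea restricted to chosen columns, the NumPy version below may be useful. The function name first_reach_position and its parameters are made up for illustration, and it assumes every row reaches its target somewhere in the chosen columns.

```python
import numpy as np
import pandas as pd


def first_reach_position(frame, value_cols, target_col):
    """Return the 1-based position where each row's running sum first reaches the target.

    Hypothetical helper: assumes every row reaches its target within value_cols.
    """
    # Row-wise running totals over the selected columns only
    running = frame[value_cols].to_numpy().cumsum(axis=1)
    # Compare each row's running totals against that row's target value
    reached = running >= frame[target_col].to_numpy()[:, None]
    # argmax returns the index of the first True in each row; add 1 for a 1-based position
    return reached.argmax(axis=1) + 1


df = pd.DataFrame(
    [[0, 1, 0, 0, 5, 0, 4, 10, 4],
     [4, 1, 8, 0, 15, 0, 4, 10, 10]],
    columns=["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "s"],
)
value_cols = ["m1", "m2", "m3", "m4", "m5", "m6"]
df["output"] = first_reach_position(df, value_cols, "s")
print(df)
```

Because the positions are counted within value_cols, they only match the column positions of the full table when value_cols starts at the leftmost column, as it does here.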

EDIT:

There can be rows where no column satisfies the condition. idxmax still returns the first column for such rows, so mask them with any(axis=1) first. get_indexer then returns -1 for the masked rows, which becomes 0 in output after adding 1:

# Keyword arguments (columns=, axis=) are required by pandas 2.0 and later
c = df.drop(columns="s").cumsum(axis=1).ge(df['s'], axis=0)
df['output'] = df.columns.get_indexer(c.idxmax(axis=1).where(c.any(axis=1))) + 1
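
To illustrate the no-match case, here is a small self-contained sketch using NumPy. The helper name first_reach_or_zero and the three-column demo table are invented for this example; writing 0 for rows that never reach the target is one possible convention, not the only one.

```python
import numpy as np
import pandas as pd


def first_reach_or_zero(frame, value_cols, target_col):
    """1-based position where the running sum first reaches the target, or 0 if it never does.

    Hypothetical helper; the 0 sentinel is an assumption chosen for this sketch.
    """
    running = frame[value_cols].to_numpy().cumsum(axis=1)
    reached = running >= frame[target_col].to_numpy()[:, None]
    # argmax is 0 for an all-False row, so check any() to tell "no match" from "first column"
    return np.where(reached.any(axis=1), reached.argmax(axis=1) + 1, 0)


demo = pd.DataFrame({"m1": [0, 1], "m2": [1, 1], "m3": [5, 1], "s": [4, 100]})
demo["output"] = first_reach_or_zero(demo, ["m1", "m2", "m3"], "s")
print(demo)
```

The second demo row never reaches 100, so it gets 0 rather than a misleading position of 1.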
User contributions licensed under: CC BY-SA