Skip to content
Advertisement

GroupBy Pandas with ratio

I am working on a dataset which looks something like this:

ID  Amount Type
1   50    A
1   1000  A
1   500   B
1   200   B
2   1000  A
2   500   B

I am trying to do 2 things: Find length of longest sequence of each type and find ratio of A/B and B/A for those sequences for each ID.

Ratio attribute explanation: Calculate the total amount in the longest sequence for each ID(say length n). If the sequence is that of Type A, then get ratio with immediate amount of type B(n+1 amount). If sequence is of type B, then get ratio with the immediate predecessor of type A(n-1 amount).

So in the case I have mentioned the final result looks like this:

ID Longest_Sequence_A Longest_Sequence_B Ratio_A_B  Ratio B_A
1  2                  2                  0.47        0.7
2  1                  1                  0.5         0.5

Explanation of Row 1: Longest Sequences are simply the length of the longest sequence of type A and type B.
For Ratio A_B: This is case 1 as defined earlier, so sum of all type A’s in the longest sequence is calculated(1050) , then a ratio is taken with the immediate successive B type amount(500), ratio 500/1050=0.47
For Ratio B_A: This is case 2 as defined earlier, so sum of all type B’s in the longest sequence is calculated(700), then a ratio is taken with the immediate predecessor of type A amount(1000), ratio 700/1000=0.7

This is a fairly complex issue I am unable to solve. Would appreciate it a lot if someone helps with this.

Advertisement

Answer

Here’s an attempt:

def ratios(df):
    df = df.reset_index(drop=True)
    groups = (df.Type != df.Type.shift(1)).cumsum()
    result = {}
    for t in ('A', 'B'):
        if t in df.Type.values:
            max_num = groups[df.Type.eq(t)].mode().iat[-1]
            max_group = df[groups.eq(max_num)]
            result[f'Longest_Sequence_{t}'] = len(max_group)
            amounts = max_group.Amount.sum()
            idx = max_group.index
            ratio = None
            if t == 'A':
                if idx[-1] != df.index[-1] and amounts != 0:
                    ratio = df.Amount.at[idx[-1] + 1] / amounts
            elif t == 'B':
                if idx[0] != df.index[0]:
                    denom = df.Amount.at[idx[0] - 1]
                    if denom != 0:
                        ratio = amounts / denom
            result[f'Ratio_{t}'] = ratio
        else:
            result[f'Longest_Sequence_{t}'] = 0
            result[f'Ratio_{t}'] = None
    return pd.DataFrame([result])

df = df.groupby('ID').apply(ratios).reset_index(level=1, drop=True)

The result for dataframe

   ID  Amount Type
0   1      50    A
1   1    1000    A
2   1     500    B
3   1     200    B
4   2    1000    A
5   2     500    B

is

    Longest_Sequence_A  Ratio_A  Longest_Sequence_B  Ratio_B
ID                                                          
1                    2  0.47619                   2      0.7
2                    1  0.50000                   1      0.5

The naming and ordering of the columns is a bit different, but this shouldn’t matter.


Some explanations (I’m using the whole dataframe as sample):

This groups = (df.Type != df.Type.shift(1)).cumsum() identifies the sequences:

0    1
1    1
2    2
3    2
4    3
5    4
Name: Type, dtype: int64

For groups[df.Type.eq('A')]

0    1
1    1
4    3
Name: Type, dtype: int64

.mode() identifies the 'A'-sequence number for the sequence of maximal length (in case there are maximal sequences of equal length the .iat[-1] selects the last one):

0    1
dtype: int64

Now here with max_num == 1 this max_group = df[groups.eq(max_num)] selects the respective group with the index from df (the last point is important for the rest):

   ID  Amount Type
0   1      50    A
1   1    1000    A

The rest is trying to follow your calculation instructions and thereby taking care of the edge cases. The use of an index idx relative to df allows to step back and forth in df to select the other values needed for the ratios. (At the beginnig of the function the index is transformed into the standard index, just to make sure, because I want to be able to use +/- on it.)

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement