I am working on a dataset which looks something like this:
ID Amount Type 1 50 A 1 1000 A 1 500 B 1 200 B 2 1000 A 2 500 B
I am trying to do 2 things: Find length of longest sequence of each type and find ratio of A/B and B/A for those sequences for each ID.
Ratio attribute explanation: Calculate the total amount in the longest sequence for each ID(say length n). If the sequence is that of Type A, then get ratio with immediate amount of type B(n+1 amount). If sequence is of type B, then get ratio with the immediate predecessor of type A(n-1 amount).
So in the case I have mentioned the final result looks like this:
ID Longest_Sequence_A Longest_Sequence_B Ratio_A_B Ratio B_A 1 2 2 0.47 0.7 2 1 1 0.5 0.5
Explanation of Row 1: Longest Sequences are simply the length of the longest sequence of type A and type B.
For Ratio A_B: This is case 1 as defined earlier, so sum of all type A’s in the longest sequence is calculated(1050) , then a ratio is taken with the immediate successive B type amount(500), ratio 500/1050=0.47
For Ratio B_A: This is case 2 as defined earlier, so sum of all type B’s in the longest sequence is calculated(700), then a ratio is taken with the immediate predecessor of type A amount(1000),
ratio 700/1000=0.7
This is a fairly complex issue I am unable to solve. Would appreciate it a lot if someone helps with this.
Advertisement
Answer
Here’s an attempt:
def ratios(df): df = df.reset_index(drop=True) groups = (df.Type != df.Type.shift(1)).cumsum() result = {} for t in ('A', 'B'): if t in df.Type.values: max_num = groups[df.Type.eq(t)].mode().iat[-1] max_group = df[groups.eq(max_num)] result[f'Longest_Sequence_{t}'] = len(max_group) amounts = max_group.Amount.sum() idx = max_group.index ratio = None if t == 'A': if idx[-1] != df.index[-1] and amounts != 0: ratio = df.Amount.at[idx[-1] + 1] / amounts elif t == 'B': if idx[0] != df.index[0]: denom = df.Amount.at[idx[0] - 1] if denom != 0: ratio = amounts / denom result[f'Ratio_{t}'] = ratio else: result[f'Longest_Sequence_{t}'] = 0 result[f'Ratio_{t}'] = None return pd.DataFrame([result]) df = df.groupby('ID').apply(ratios).reset_index(level=1, drop=True)
The result for dataframe
ID Amount Type 0 1 50 A 1 1 1000 A 2 1 500 B 3 1 200 B 4 2 1000 A 5 2 500 B
is
Longest_Sequence_A Ratio_A Longest_Sequence_B Ratio_B ID 1 2 0.47619 2 0.7 2 1 0.50000 1 0.5
The naming and ordering of the columns is a bit different, but this shouldn’t matter.
Some explanations (I’m using the whole dataframe as sample):
This groups = (df.Type != df.Type.shift(1)).cumsum()
identifies the sequences:
0 1 1 1 2 2 3 2 4 3 5 4 Name: Type, dtype: int64
For groups[df.Type.eq('A')]
0 1 1 1 4 3 Name: Type, dtype: int64
.mode()
identifies the 'A'
-sequence number for the sequence of maximal length (in case there are maximal sequences of equal length the .iat[-1]
selects the last one):
0 1 dtype: int64
Now here with max_num == 1
this max_group = df[groups.eq(max_num)]
selects the respective group with the index from df
(the last point is important for the rest):
ID Amount Type 0 1 50 A 1 1 1000 A
The rest is trying to follow your calculation instructions and thereby taking care of the edge cases. The use of an index idx
relative to df
allows to step back and forth in df
to select the other values needed for the ratios. (At the beginnig of the function the index is transformed into the standard index, just to make sure, because I want to be able to use +/-
on it.)