Create a dataframe based on 3 linked dataframes using a constraint on cumsum

Question

I have three dataframes (df1, df2 and df3), and I would like to create another dataframe from them. Here is the logic for C1: first, one checks the first value in column C1 in df3, which is an a. Second, one checks in df2 where one first finds the letter det…

Accepted Answer

The answer by @mitoRibo got me on the right track; pd.melt does seem to be the key to solving it. Here is my solution with a few comments:

    import pandas as pd
    import numpy as np


    def assign_group_memberships(aniterable, max_sum):
        label = 0
        total_sum = 0
        for val in aniterable:
            total_sum += val
            # once the running sum exceeds max_sum, start a new group
            # whose running sum begins with the current value
            if total_sum > max_sum:
                total_sum = val
                label += 1
            yield label


    # copy df1, df2 and df3 from the question

    desired = pd.DataFrame(
        {
            'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12],
            'position2': list(range(1, 7)) + list(range(1, 7)),
            'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0]
        }
    )

    threshold = 10

    # Convert df1 and df3 to long form
    df1_long = df1.melt(
        var_name='column'
    )

    df3_long = df3.melt(
        id_vars='position2',
        var_name='column',
        value_name='mapper',
    )

    df3_long['value'] = df1_long['value'].copy()

Now we can assign groups to the individual rows based on threshold: within each (column, mapper) group, a new label is created whenever the running sum exceeds threshold.

    df3_long['group'] = (
        df3_long.groupby(['column', 'mapper'])['value'].transform(
            lambda x: assign_group_memberships(x, threshold)
        )
    )

        position2 column mapper  value  group
    0           1     C1      a      2      0
    1           2     C1      a      7      0
    2           3     C1      b      3      0
    3           4     C1      b      6      0
    4           5     C1      a      5      1
    5           6     C1      b      3      1
    6           1     C2      a      0      0
    7           2     C2      b      8      0
    8           3     C2      b      0      0
    9           4     C2      b      1      0
    10          5     C2      a      0      0
    11          6     C2      a      0      0

Now we can also determine the respective group labels in df2:

    df2['group'] = df2.groupby(['column', 'mapper']).cumcount()

       position1 column mapper  group
    0         11     C1      a      0
    1         12     C2      a      0
    2         13     C1      b      0
    3         14     C1      a      1
    4         15     C1      b      1
    5         16     C2      b      0

The only thing left to do is to merge df2 and df3_long:

    result = df3_long.merge(df2, on=['column', 'mapper', 'group'])

        position2 column mapper  value  group  position1
    0           1     C1      a      2      0         11
    1           2     C1      a      7      0         11
    2           3     C1      b      3      0         13
    3           4     C1      b      6      0         13
    4           5     C1      a      5      1         14
    5           6     C1      b      3      1         15
    6           1     C2      a      0      0         12
    7           5     C2      a      0      0         12
    8           6     C2      a      0      0         12
    9           2     C2      b      8      0         16
    10          3     C2      b      0      0         16
    11          4     C2      b      1      0         16

Now we can check whether result is equal to desired:

    result = (
        result[
            ['position1', 'position2', 'value']
        ].sort_values(['position1', 'position2']).reset_index(drop=True)
    )

    desired = (
        desired.sort_values(
            ['position1', 'position2']
        ).reset_index(drop=True)
    )

    print(result.equals(desired))

which is indeed the case.

There might be better options, so please post them! And thanks again to mitoRibo for the inspiration!
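
To try the steps above end to end, here is a minimal self-contained sketch. The input frames are not shown in this excerpt, so df1, df2 and df3 below are reconstructed from the tables printed in the answer; treat them as an assumption about the question's data rather than the original inputs. The sketch also wraps the generator in a pd.Series inside transform, which is a small defensive variation on the answer's lambda and not part of the original code.

```python
import pandas as pd


def assign_group_memberships(values, max_sum):
    """Yield one label per value; a new label starts once the running sum exceeds max_sum."""
    label = 0
    total_sum = 0
    for val in values:
        total_sum += val
        if total_sum > max_sum:
            total_sum = val  # the current value opens the next group
            label += 1
        yield label


# ASSUMPTION: inputs reconstructed from the tables printed in the answer.
df1 = pd.DataFrame({"C1": [2, 7, 3, 6, 5, 3], "C2": [0, 8, 0, 1, 0, 0]})
df2 = pd.DataFrame(
    {
        "position1": [11, 12, 13, 14, 15, 16],
        "column": ["C1", "C2", "C1", "C1", "C1", "C2"],
        "mapper": ["a", "a", "b", "a", "b", "b"],
    }
)
df3 = pd.DataFrame(
    {
        "position2": [1, 2, 3, 4, 5, 6],
        "C1": list("aabbab"),
        "C2": list("abbbaa"),
    }
)
threshold = 10

# Long form: one row per (position2, column) pair, values aligned by row order.
df1_long = df1.melt(var_name="column")
df3_long = df3.melt(id_vars="position2", var_name="column", value_name="mapper")
df3_long["value"] = df1_long["value"].to_numpy()

# Label rows within each (column, mapper) group by the capped running sum.
df3_long["group"] = df3_long.groupby(["column", "mapper"])["value"].transform(
    lambda x: pd.Series(list(assign_group_memberships(x, threshold)), index=x.index)
)

# The n-th occurrence of a (column, mapper) pair in df2 serves the n-th group.
df2["group"] = df2.groupby(["column", "mapper"]).cumcount()

result = (
    df3_long.merge(df2, on=["column", "mapper", "group"])[
        ["position1", "position2", "value"]
    ]
    .sort_values(["position1", "position2"])
    .reset_index(drop=True)
)
print(result)
```

Note that the merge is an inner join, so if a (column, mapper) pair needs more groups than df2 has rows for it, the surplus rows of df3_long are silently dropped; passing how='left' to merge would keep them with a missing position1.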
