Skip to content
Advertisement

Create a dataframe based on 3 linked dataframes using a constraint on cumsum

I do have three dataframes like this:

import pandas as pd

df1 = pd.DataFrame(
    {
        'C1': [2, 7, 3, 6, 5, 3],
        'C2': [0, 8, 0, 1, 0, 0]
    }
)

df2 = pd.DataFrame(
    {
        'position1': range(11, 17),
        'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
        'mapper': list('aababb')
    }
)

df3 = pd.DataFrame(
    {
        'position2': range(1, 7),
        'C1': list('aabbab'),
        'C2': list('abbbaa')
    }
)

that looks as follows

   C1  C2
0   2   0
1   7   8
2   3   0
3   6   1
4   5   0
5   3   0

  position1 column mapper
0         11     C1      a
1         12     C2      a
2         13     C1      b
3         14     C1      a
4         15     C1      b
5         16     C2      b

   position2 C1 C2
0          1  a  a
1          2  a  b
2          3  b  b
3          4  b  b
4          5  a  a
5          6  b  a

and I would like to create another dataframe using these 3 dataframes that looks as follows:

   position1  position2  value
0          11          1      2
1          11          2      7
2          13          3      3
3          13          4      6
4          14          5      5
5          15          6      3
6          12          1      0
7          16          2      8
8          16          3      0
9          16          4      1
10         12          5      0
11         12          6      0

Here is the logic for C1:

  1. First, one checks the first value in column C1 in df3 which is an a.
  2. Second, one checks in df2 where one first finds the letter determined in 1) – in our case an a for the respective column (here: C1) and notes down the value of position1 (here: 11).
  3. Now one goes to df1 and notes down the respective value for C1 (here: 2)
  4. That gives us the first row of the desired outcome: position2 = 1, position1 = 11 and the value = 2.

So far, so good. The issue comes in due to a constraint:

In df2 each position1 can only be used as long as the sum of all corresponding values from df1 do not exceed 10; if that happens the next valid position in df2 should be found.

So, for the example above: In df3 if I go to the next row in C1 I again find an a, therefore I again check df2 and end up again with position1 = 11. If I check in df1 I find a value of 7, the cumulative sum would be 9 which is below 10, so all good and I have the next row of my desired dataframe: position2 = 2, position1 = 11 and the value = 7.

Now I go to the next row in df3 in column C1 and find a b, checking df2 gives me position 13 and the value from df is 3, so I get the row: position2 = 3, position1 = 13 and the value = 3.

Doing it once more gives position2 = 4, position1 = 13 and the value = 6.

Doing it again, gives me now letter a again which would point to position1 = 11 in df2. The value from df1 is 5; as the cumulative sum is already 9, I cannot use this position but have to find the next one in df2 which is position2 = 14. Therefore I can add the row: position2 = 5, position1 = 14 and the value = 5.

And so on…

I am struggling with incorporating the check for the cumsum. Does anyone see an elegant solution to create the desired dataframe from the 3 inputs? Only solutions I have contain several loops and the code is not very readable.

The example might be tricky to follow but I could not design an easier one.

Advertisement

Answer

The answer by @mitoRibo got me on the right track; pd.melt is indeed key to solve it, it seems. Here is my solution with a few comments:

import pandas as pd
import numpy as np


def assign_group_memberships(aniterable, max_sum):
    label = 0
    total_sum = 0
    for val in aniterable:
        total_sum += val
        if total_sum > max_sum:
            total_sum = val
            label += 1
        yield label

# copy df1, df2 and df3 from the question
desired = pd.DataFrame(
    {
        'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12],
        'position2': list(range(1, 7)) + list(range(1, 7)),
        'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0]
    }
)

threshold = 10

# Convert df1 and df3 to long form
df1_long = df1.melt(
    var_name='column'
)

df3_long = df3.melt(
    id_vars='position2',
    var_name='column',
    value_name='mapper',
)

df3_long['value'] = df1_long['value'].copy()

Now we can assign groups to the individual rows based on threshold: whenever threshold is exceeded, a new label is created for each column, mapper group.

df3_long['group'] = (
    df3_long.groupby(['column', 'mapper'])['value'].transform(
        lambda x: assign_group_memberships(x, threshold)
    )
)

    position2 column mapper  value  group
0           1     C1      a      2      0
1           2     C1      a      7      0
2           3     C1      b      3      0
3           4     C1      b      6      0
4           5     C1      a      5      1
5           6     C1      b      3      1
6           1     C2      a      0      0
7           2     C2      b      8      0
8           3     C2      b      0      0
9           4     C2      b      1      0
10          5     C2      a      0      0
11          6     C2      a      0      0

Now we can also determine the respective group labels in df2

df2['group'] = df2.groupby(['column', 'mapper']).cumcount()

   position1 column mapper  group
0         11     C1      a      0
1         12     C2      a      0
2         13     C1      b      0
3         14     C1      a      1
4         15     C1      b      1
5         16     C2      b      0

and the only thing left to do is to merge df2 and df3_long

result = df3_long.merge(df2, on=['column', 'mapper', 'group'])

    position2 column mapper  value  group  position1
0           1     C1      a      2      0         11
1           2     C1      a      7      0         11
2           3     C1      b      3      0         13
3           4     C1      b      6      0         13
4           5     C1      a      5      1         14
5           6     C1      b      3      1         15
6           1     C2      a      0      0         12
7           5     C2      a      0      0         12
8           6     C2      a      0      0         12
9           2     C2      b      8      0         16
10          3     C2      b      0      0         16
11          4     C2      b      1      0         16

Now we can check whether result is equal to desired

result = (
    result[
        ['position1', 'position2', 'value']
    ].sort_values(['position1', 'position2']).reset_index(drop=True)
)

desired = (
    desired.sort_values(
        ['position1', 'position2']
    ).reset_index(drop=True)
)

print(result.equals(desired))

which is indeed the case.

Might be better options, so, please post them! And thanks again to mitoRibo for the inspiration!

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement