I do have three dataframes like this:
import pandas as pd
df1 = pd.DataFrame(
{
'C1': [2, 7, 3, 6, 5, 3],
'C2': [0, 8, 0, 1, 0, 0]
}
)
df2 = pd.DataFrame(
{
'position1': range(11, 17),
'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
'mapper': list('aababb')
}
)
df3 = pd.DataFrame(
{
'position2': range(1, 7),
'C1': list('aabbab'),
'C2': list('abbbaa')
}
)
that looks as follows
C1 C2 0 2 0 1 7 8 2 3 0 3 6 1 4 5 0 5 3 0 position1 column mapper 0 11 C1 a 1 12 C2 a 2 13 C1 b 3 14 C1 a 4 15 C1 b 5 16 C2 b position2 C1 C2 0 1 a a 1 2 a b 2 3 b b 3 4 b b 4 5 a a 5 6 b a
and I would like to create another dataframe using these 3 dataframes that looks as follows:
position1 position2 value 0 11 1 2 1 11 2 7 2 13 3 3 3 13 4 6 4 14 5 5 5 15 6 3 6 12 1 0 7 16 2 8 8 16 3 0 9 16 4 1 10 12 5 0 11 12 6 0
Here is the logic for C1:
- First, one checks the first value in column
C1indf3which is ana. - Second, one checks in
df2where one first finds the letter determined in 1) – in our case anafor the respective column (here:C1) and notes down the value ofposition1(here:11). - Now one goes to
df1and notes down the respective value forC1(here:2) - That gives us the first row of the desired outcome:
position2 = 1,position1 = 11and thevalue = 2.
So far, so good. The issue comes in due to a constraint:
In df2 each position1 can only be used as long as the sum of all corresponding values from df1 do not exceed 10; if that happens the next valid position in df2 should be found.
So, for the example above:
In df3 if I go to the next row in C1 I again find an a, therefore I again check df2 and end up again with position1 = 11. If I check in df1 I find a value of 7, the cumulative sum would be 9 which is below 10, so all good and I have the next row of my desired dataframe:
position2 = 2, position1 = 11 and the value = 7.
Now I go to the next row in df3 in column C1 and find a b, checking df2 gives me position 13 and the value from df is 3, so I get the row:
position2 = 3, position1 = 13 and the value = 3.
Doing it once more gives
position2 = 4, position1 = 13 and the value = 6.
Doing it again, gives me now letter a again which would point to position1 = 11 in df2. The value from df1 is 5; as the cumulative sum is already 9, I cannot use this position but have to find the next one in df2 which is position2 = 14. Therefore I can add the row:
position2 = 5, position1 = 14 and the value = 5.
And so on…
I am struggling with incorporating the check for the cumsum. Does anyone see an elegant solution to create the desired dataframe from the 3 inputs? Only solutions I have contain several loops and the code is not very readable.
The example might be tricky to follow but I could not design an easier one.
Advertisement
Answer
The answer by @mitoRibo got me on the right track; pd.melt is indeed key to solve it, it seems. Here is my solution with a few comments:
import pandas as pd
import numpy as np
def assign_group_memberships(aniterable, max_sum):
label = 0
total_sum = 0
for val in aniterable:
total_sum += val
if total_sum > max_sum:
total_sum = val
label += 1
yield label
# copy df1, df2 and df3 from the question
desired = pd.DataFrame(
{
'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12],
'position2': list(range(1, 7)) + list(range(1, 7)),
'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0]
}
)
threshold = 10
# Convert df1 and df3 to long form
df1_long = df1.melt(
var_name='column'
)
df3_long = df3.melt(
id_vars='position2',
var_name='column',
value_name='mapper',
)
df3_long['value'] = df1_long['value'].copy()
Now we can assign groups to the individual rows based on threshold: whenever threshold is exceeded, a new label is created for each column, mapper group.
df3_long['group'] = (
df3_long.groupby(['column', 'mapper'])['value'].transform(
lambda x: assign_group_memberships(x, threshold)
)
)
position2 column mapper value group
0 1 C1 a 2 0
1 2 C1 a 7 0
2 3 C1 b 3 0
3 4 C1 b 6 0
4 5 C1 a 5 1
5 6 C1 b 3 1
6 1 C2 a 0 0
7 2 C2 b 8 0
8 3 C2 b 0 0
9 4 C2 b 1 0
10 5 C2 a 0 0
11 6 C2 a 0 0
Now we can also determine the respective group labels in df2
df2['group'] = df2.groupby(['column', 'mapper']).cumcount() position1 column mapper group 0 11 C1 a 0 1 12 C2 a 0 2 13 C1 b 0 3 14 C1 a 1 4 15 C1 b 1 5 16 C2 b 0
and the only thing left to do is to merge df2 and df3_long
result = df3_long.merge(df2, on=['column', 'mapper', 'group'])
position2 column mapper value group position1
0 1 C1 a 2 0 11
1 2 C1 a 7 0 11
2 3 C1 b 3 0 13
3 4 C1 b 6 0 13
4 5 C1 a 5 1 14
5 6 C1 b 3 1 15
6 1 C2 a 0 0 12
7 5 C2 a 0 0 12
8 6 C2 a 0 0 12
9 2 C2 b 8 0 16
10 3 C2 b 0 0 16
11 4 C2 b 1 0 16
Now we can check whether result is equal to desired
result = (
result[
['position1', 'position2', 'value']
].sort_values(['position1', 'position2']).reset_index(drop=True)
)
desired = (
desired.sort_values(
['position1', 'position2']
).reset_index(drop=True)
)
print(result.equals(desired))
which is indeed the case.
Might be better options, so, please post them! And thanks again to mitoRibo for the inspiration!