I have three dataframes like this:
import pandas as pd

df1 = pd.DataFrame(
    {
        'C1': [2, 7, 3, 6, 5, 3],
        'C2': [0, 8, 0, 1, 0, 0]
    }
)
df2 = pd.DataFrame(
    {
        'position1': range(11, 17),
        'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
        'mapper': list('aababb')
    }
)
df3 = pd.DataFrame(
    {
        'position2': range(1, 7),
        'C1': list('aabbab'),
        'C2': list('abbbaa')
    }
)
which look as follows:
C1 C2
0 2 0
1 7 8
2 3 0
3 6 1
4 5 0
5 3 0
position1 column mapper
0 11 C1 a
1 12 C2 a
2 13 C1 b
3 14 C1 a
4 15 C1 b
5 16 C2 b
position2 C1 C2
0 1 a a
1 2 a b
2 3 b b
3 4 b b
4 5 a a
5 6 b a
and I would like to create another dataframe using these 3 dataframes that looks as follows:
position1 position2 value
0 11 1 2
1 11 2 7
2 13 3 3
3 13 4 6
4 14 5 5
5 15 6 3
6 12 1 0
7 16 2 8
8 16 3 0
9 16 4 1
10 12 5 0
11 12 6 0
Here is the logic for C1:
- First, one checks the first value in column C1 in df3, which is an a.
- Second, one checks in df2 where one first finds the letter determined in 1), in our case an a, for the respective column (here: C1) and notes down the value of position1 (here: 11).
- Now one goes to df1 and notes down the respective value for C1 (here: 2).
- That gives us the first row of the desired outcome: position2 = 1, position1 = 11 and value = 2.
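The lookup for that first row can be sketched as follows. This is only an illustration of the steps listed above, ignoring the constraint described next; the names col, letter, candidates and first_row are made up for this sketch.

```python
import pandas as pd

df1 = pd.DataFrame({'C1': [2, 7, 3, 6, 5, 3], 'C2': [0, 8, 0, 1, 0, 0]})
df2 = pd.DataFrame({
    'position1': range(11, 17),
    'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
    'mapper': list('aababb'),
})
df3 = pd.DataFrame({
    'position2': range(1, 7),
    'C1': list('aabbab'),
    'C2': list('abbbaa'),
})

col = 'C1'
# Step 1: letter in the first row of df3 for this column
letter = df3.loc[0, col]
# Step 2: rows of df2 that match both the column name and the letter
candidates = df2[(df2['column'] == col) & (df2['mapper'] == letter)]
# Step 3: first matching position1, plus the value from df1
first_row = {
    'position1': int(candidates['position1'].iloc[0]),
    'position2': int(df3.loc[0, 'position2']),
    'value': int(df1.loc[0, col]),
}
print(first_row)  # {'position1': 11, 'position2': 1, 'value': 2}
```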
So far, so good. The issue comes from a constraint: in df2, each position1 can only be used as long as the sum of all corresponding values from df1 does not exceed 10; once it would, the next valid position in df2 has to be found.
So, for the example above: in df3, if I go to the next row in C1, I again find an a, so I check df2 again and end up with position1 = 11 once more. In df1 I find a value of 7; the cumulative sum would be 9, which is below 10, so all good and I have the next row of my desired dataframe: position2 = 2, position1 = 11 and value = 7.
Now I go to the next row in df3 in column C1 and find a b; checking df2 gives me position1 = 13 and the value from df1 is 3, so I get the row: position2 = 3, position1 = 13 and value = 3.
Doing it once more gives position2 = 4, position1 = 13 and value = 6.
Doing it again now gives me the letter a again, which would point to position1 = 11 in df2. The value from df1 is 5; as the cumulative sum is already 9, adding it would exceed 10, so I cannot use this position but have to find the next one in df2, which is position1 = 14. Therefore I can add the row: position2 = 5, position1 = 14 and value = 5.
And so on…
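For reference, the walkthrough above can be written as a plain loop. This is a minimal sketch of one possible reading of the rules, not the solution being asked for; build_rows, positions, slot and running are hypothetical names, and it assumes df2 always contains enough positions for each column/letter pair (otherwise it raises an IndexError).

```python
import pandas as pd

df1 = pd.DataFrame({'C1': [2, 7, 3, 6, 5, 3], 'C2': [0, 8, 0, 1, 0, 0]})
df2 = pd.DataFrame({
    'position1': range(11, 17),
    'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
    'mapper': list('aababb'),
})
df3 = pd.DataFrame({
    'position2': range(1, 7),
    'C1': list('aabbab'),
    'C2': list('abbbaa'),
})


def build_rows(df1, df2, df3, threshold=10):
    rows = []
    for col in df1.columns:
        # Positions available for each letter of this column, in df2 order
        positions = {
            letter: grp['position1'].tolist()
            for letter, grp in df2[df2['column'] == col].groupby('mapper')
        }
        slot = {letter: 0 for letter in positions}     # position currently in use
        running = {letter: 0 for letter in positions}  # sum assigned to that position
        for pos2, letter, value in zip(df3['position2'], df3[col], df1[col]):
            # Move on to the next position if this value would exceed the threshold
            if running[letter] + value > threshold:
                slot[letter] += 1
                running[letter] = 0
            running[letter] += value
            rows.append((positions[letter][slot[letter]], pos2, value))
    return pd.DataFrame(rows, columns=['position1', 'position2', 'value'])


result_loop = build_rows(df1, df2, df3)
print(result_loop)
```

On the example data this reproduces the desired table row by row, which makes it a handy baseline for checking a vectorised version.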
I am struggling with incorporating the check for the cumsum. Does anyone see an elegant solution to create the desired dataframe from the 3 inputs? The only solutions I have contain several loops and the code is not very readable.
The example might be tricky to follow but I could not design an easier one.
Answer
The answer by @mitoRibo got me on the right track; pd.melt is indeed key to solving it, it seems. Here is my solution with a few comments:
import pandas as pd


def assign_group_memberships(aniterable, max_sum):
    # Yield one label per value; the label increases whenever the
    # running sum would exceed max_sum, and the sum restarts at that value
    label = 0
    total_sum = 0
    for val in aniterable:
        total_sum += val
        if total_sum > max_sum:
            total_sum = val
            label += 1
        yield label


# copy df1, df2 and df3 from the question
desired = pd.DataFrame(
    {
        'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12],
        'position2': list(range(1, 7)) + list(range(1, 7)),
        'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0]
    }
)
threshold = 10

# Convert df1 and df3 to long form
df1_long = df1.melt(
    var_name='column'
)
df3_long = df3.melt(
    id_vars='position2',
    var_name='column',
    value_name='mapper',
)
# Both long frames have the same row order, so the values line up by index
df3_long['value'] = df1_long['value'].copy()
Now we can assign groups to the individual rows based on threshold: whenever threshold is exceeded, a new label is created within each (column, mapper) group.
df3_long['group'] = (
    df3_long.groupby(['column', 'mapper'])['value'].transform(
        # materialise the generator so transform gets one label per row
        lambda x: list(assign_group_memberships(x, threshold))
    )
)
position2 column mapper value group
0 1 C1 a 2 0
1 2 C1 a 7 0
2 3 C1 b 3 0
3 4 C1 b 6 0
4 5 C1 a 5 1
5 6 C1 b 3 1
6 1 C2 a 0 0
7 2 C2 b 8 0
8 3 C2 b 0 0
9 4 C2 b 1 0
10 5 C2 a 0 0
11 6 C2 a 0 0
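To see how the labelling behaves in isolation, here is a small sketch that runs the helper on plain lists. The function is a copy of the one above; the input lists are the C1 values for a and b from the example, plus two made-up inputs.

```python
def assign_group_memberships(aniterable, max_sum):
    # Yield one label per value; bump the label when the running sum
    # would exceed max_sum and restart the sum at the current value
    label = 0
    total_sum = 0
    for val in aniterable:
        total_sum += val
        if total_sum > max_sum:
            total_sum = val
            label += 1
        yield label


labels_a = list(assign_group_memberships([2, 7, 5], 10))        # C1 / a
labels_b = list(assign_group_memberships([3, 6, 3], 10))        # C1 / b
labels_c = list(assign_group_memberships([4, 4, 4, 4, 4], 10))  # made-up input
labels_d = list(assign_group_memberships([12], 10))             # made-up input

print(labels_a)  # [0, 0, 1]
print(labels_b)  # [0, 0, 1]
print(labels_c)  # [0, 0, 1, 1, 2]
print(labels_d)  # [1]
```

The last case is worth keeping in mind: a single value that is larger than the threshold on its own still bumps the label, so the first position of that group would be skipped.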
Now we can also determine the respective group labels in df2:
df2['group'] = df2.groupby(['column', 'mapper']).cumcount()
position1 column mapper group
0 11 C1 a 0
1 12 C2 a 0
2 13 C1 b 0
3 14 C1 a 1
4 15 C1 b 1
5 16 C2 b 0
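One way to read these labels: group k is simply the k-th position listed in df2 for that (column, mapper) pair. A small sketch of that reading, where lookup is a name introduced only for this illustration:

```python
import pandas as pd

df2 = pd.DataFrame({
    'position1': range(11, 17),
    'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
    'mapper': list('aababb'),
})

# Number the positions within each (column, mapper) pair: 0, 1, 2, ...
df2['group'] = df2.groupby(['column', 'mapper']).cumcount()

# Index by (column, mapper, group) to look up individual positions
lookup = df2.set_index(['column', 'mapper', 'group'])['position1'].sort_index()

print(lookup.loc[('C1', 'a', 0)])  # 11
print(lookup.loc[('C1', 'a', 1)])  # 14
```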
and the only thing left to do is to merge df2 and df3_long:
result = df3_long.merge(df2, on=['column', 'mapper', 'group'])
position2 column mapper value group position1
0 1 C1 a 2 0 11
1 2 C1 a 7 0 11
2 3 C1 b 3 0 13
3 4 C1 b 6 0 13
4 5 C1 a 5 1 14
5 6 C1 b 3 1 15
6 1 C2 a 0 0 12
7 5 C2 a 0 0 12
8 6 C2 a 0 0 12
9 2 C2 b 8 0 16
10 3 C2 b 0 0 16
11 4 C2 b 1 0 16
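One caveat with the default inner merge: if a group label has no matching position in df2 (more groups needed than positions available), those rows are dropped silently. A possible check, sketched on made-up toy data where rows and positions are hypothetical stand-ins for df3_long and df2, is a left merge with indicator=True:

```python
import pandas as pd

# Toy data: group 1 of ('C1', 'a') has no position available
rows = pd.DataFrame({
    'column': ['C1', 'C1', 'C1'],
    'mapper': ['a', 'a', 'a'],
    'group': [0, 0, 1],
    'value': [2, 7, 5],
})
positions = pd.DataFrame({
    'position1': [11],
    'column': ['C1'],
    'mapper': ['a'],
})
positions['group'] = positions.groupby(['column', 'mapper']).cumcount()

# A left merge keeps every row; the _merge column marks rows without a match
checked = rows.merge(
    positions, on=['column', 'mapper', 'group'], how='left', indicator=True
)
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched))  # 1
```

In the example from the question nothing is unmatched, so the inner merge and the left merge give the same rows.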
Now we can check whether result is equal to desired:
result = (
    result[
        ['position1', 'position2', 'value']
    ].sort_values(['position1', 'position2']).reset_index(drop=True)
)
desired = (
    desired.sort_values(
        ['position1', 'position2']
    ).reset_index(drop=True)
)
print(result.equals(desired))
print(result.equals(desired))
which is indeed the case.
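A side note on the comparison: DataFrame.equals also requires matching dtypes and only returns True or False. If the check ever fails, pd.testing.assert_frame_equal reports where the frames differ. A minimal sketch on made-up two-row frames (left, right and right_float are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'position1': [11, 11], 'position2': [1, 2], 'value': [2, 7]})
right = pd.DataFrame({'position1': [11, 11], 'position2': [1, 2], 'value': [2, 7]})

# Same values and dtypes: both checks pass
same = left.equals(right)
pd.testing.assert_frame_equal(left, right)

# Same values, but 'value' stored as float: equals() is now False
right_float = right.astype({'value': 'float64'})
same_float = left.equals(right_float)

# check_dtype=False compares the values only
pd.testing.assert_frame_equal(left, right_float, check_dtype=False)

print(same, same_float)  # True False
```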
There might be better options, so please post them! And thanks again to mitoRibo for the inspiration!