I do have three dataframes like this:
import pandas as pd df1 = pd.DataFrame( { 'C1': [2, 7, 3, 6, 5, 3], 'C2': [0, 8, 0, 1, 0, 0] } ) df2 = pd.DataFrame( { 'position1': range(11, 17), 'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'], 'mapper': list('aababb') } ) df3 = pd.DataFrame( { 'position2': range(1, 7), 'C1': list('aabbab'), 'C2': list('abbbaa') } )
that looks as follows
C1 C2 0 2 0 1 7 8 2 3 0 3 6 1 4 5 0 5 3 0 position1 column mapper 0 11 C1 a 1 12 C2 a 2 13 C1 b 3 14 C1 a 4 15 C1 b 5 16 C2 b position2 C1 C2 0 1 a a 1 2 a b 2 3 b b 3 4 b b 4 5 a a 5 6 b a
and I would like to create another dataframe using these 3 dataframes that looks as follows:
position1 position2 value 0 11 1 2 1 11 2 7 2 13 3 3 3 13 4 6 4 14 5 5 5 15 6 3 6 12 1 0 7 16 2 8 8 16 3 0 9 16 4 1 10 12 5 0 11 12 6 0
Here is the logic for C1
:
- First, one checks the first value in column
C1
indf3
which is ana
. - Second, one checks in
df2
where one first finds the letter determined in 1) – in our case ana
for the respective column (here:C1
) and notes down the value ofposition1
(here:11
). - Now one goes to
df1
and notes down the respective value forC1
(here:2
) - That gives us the first row of the desired outcome:
position2 = 1
,position1 = 11
and thevalue = 2
.
So far, so good. The issue comes in due to a constraint:
In df2
each position1
can only be used as long as the sum of all corresponding values from df1
do not exceed 10
; if that happens the next valid position in df2
should be found.
So, for the example above:
In df3
if I go to the next row in C1
I again find an a
, therefore I again check df2
and end up again with position1 = 11
. If I check in df1
I find a value of 7
, the cumulative sum would be 9
which is below 10
, so all good and I have the next row of my desired dataframe:
position2 = 2
, position1 = 11
and the value = 7
.
Now I go to the next row in df3
in column C1
and find a b
, checking df2
gives me position 13
and the value from df
is 3
, so I get the row:
position2 = 3
, position1 = 13
and the value = 3
.
Doing it once more gives
position2 = 4
, position1 = 13
and the value = 6
.
Doing it again, gives me now letter a
again which would point to position1 = 11
in df2
. The value from df1
is 5
; as the cumulative sum is already 9
, I cannot use this position but have to find the next one in df2
which is position2 = 14
. Therefore I can add the row:
position2 = 5
, position1 = 14
and the value = 5
.
And so on…
I am struggling with incorporating the check for the cumsum
. Does anyone see an elegant solution to create the desired dataframe from the 3 inputs? Only solutions I have contain several loops and the code is not very readable.
The example might be tricky to follow but I could not design an easier one.
Advertisement
Answer
The answer by @mitoRibo got me on the right track; pd.melt
is indeed key to solve it, it seems. Here is my solution with a few comments:
import pandas as pd import numpy as np def assign_group_memberships(aniterable, max_sum): label = 0 total_sum = 0 for val in aniterable: total_sum += val if total_sum > max_sum: total_sum = val label += 1 yield label # copy df1, df2 and df3 from the question desired = pd.DataFrame( { 'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12], 'position2': list(range(1, 7)) + list(range(1, 7)), 'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0] } ) threshold = 10 # Convert df1 and df3 to long form df1_long = df1.melt( var_name='column' ) df3_long = df3.melt( id_vars='position2', var_name='column', value_name='mapper', ) df3_long['value'] = df1_long['value'].copy()
Now we can assign groups to the individual rows based on threshold
: whenever threshold
is exceeded, a new label
is created for each column, mapper
group.
df3_long['group'] = ( df3_long.groupby(['column', 'mapper'])['value'].transform( lambda x: assign_group_memberships(x, threshold) ) ) position2 column mapper value group 0 1 C1 a 2 0 1 2 C1 a 7 0 2 3 C1 b 3 0 3 4 C1 b 6 0 4 5 C1 a 5 1 5 6 C1 b 3 1 6 1 C2 a 0 0 7 2 C2 b 8 0 8 3 C2 b 0 0 9 4 C2 b 1 0 10 5 C2 a 0 0 11 6 C2 a 0 0
Now we can also determine the respective group labels in df2
df2['group'] = df2.groupby(['column', 'mapper']).cumcount() position1 column mapper group 0 11 C1 a 0 1 12 C2 a 0 2 13 C1 b 0 3 14 C1 a 1 4 15 C1 b 1 5 16 C2 b 0
and the only thing left to do is to merge df2
and df3_long
result = df3_long.merge(df2, on=['column', 'mapper', 'group']) position2 column mapper value group position1 0 1 C1 a 2 0 11 1 2 C1 a 7 0 11 2 3 C1 b 3 0 13 3 4 C1 b 6 0 13 4 5 C1 a 5 1 14 5 6 C1 b 3 1 15 6 1 C2 a 0 0 12 7 5 C2 a 0 0 12 8 6 C2 a 0 0 12 9 2 C2 b 8 0 16 10 3 C2 b 0 0 16 11 4 C2 b 1 0 16
Now we can check whether result
is equal to desired
result = ( result[ ['position1', 'position2', 'value'] ].sort_values(['position1', 'position2']).reset_index(drop=True) ) desired = ( desired.sort_values( ['position1', 'position2'] ).reset_index(drop=True) ) print(result.equals(desired))
which is indeed the case.
Might be better options, so, please post them! And thanks again to mitoRibo for the inspiration!