I have a dataframe which looks something like this:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3 xxx 4 5 6 2 3 1 5 7 2 4 3 1 xxx 3 3 2 4 5 2 6 2 3 2 4 2 xxx 5 7 5 1 3 3 4 5 4 8 2 2 xxx 6 1 8 6 1 4 1 2 7 4 3 5
I am trying to compute a value based on a condition for every row which will apply across the column groupings of A, B, C, D, etc. and count how many of those groups passed the condition, for example, some pseudo-code:
count = 0 for each (A, B, C, D) as col: if col_1 > 3 and (col_2 > col_3): count++ dataframe['count'] = count
Expected output:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3 count xxx 4 5 6 2 3 1 5 7 2 4 3 1 2 xxx 3 3 2 4 5 2 6 2 3 2 4 2 1 xxx 5 7 5 1 3 3 4 5 4 8 2 2 2 xxx 6 1 8 6 1 4 1 2 7 4 3 5 0
This would mean the example dataframe would end up with a new column with a value of 2 because in this example column C and D pass the set of conditions.
Later I plan to do something for each row with a count value above some number.
The best way I was able to come up with looked something like this:
for col in cols: conditions.append( ( (dataframe[f'{col}_1'] > min_corr_coef) & (dataframe[f'{col}_2'] < (dataframe[f'{col_3}]) ) | (dataframe[f'{col}_1'] <= min_corr_coef) ) conditions.append( (dataframe[dataframe.loc[:,cols] > min_corr_coef].count(axis=1) >= min_corr_pair) ) if conditions: dataframe.loc[ reduce(lambda x, y: x & y, conditions), 'pass'] = 1
Not only is this incorrect, but it is also extremely slow. What I like about it, is that it is readable and adding additional conditions if there are more columns/different data is relatively straightforward.
I am not sure if this type of operation is too complex to perform on a dataframe the way I am trying to and perhaps I need to change my data structure or not, but I figured I’d see if there is some way of going about this before I start to re-write things.
Ideally the logic being applied to the column groups (A, B, C, D in this case) can be modular and whatever method I use to do this should be easy to add/remove conditions because I want to be able to add/remove data columns and logic conditions over time.
What I am trying to do is test an arbitrary set of conditions when one of the columns for a group (which is a correlation coefficient) is above a certain value. If enough of the column groups pass the conditions, do something.
Thanks in advance, I am a python and pandas novice and this has been giving me a headache for days.
Advertisement
Answer
Let’s ignore date
and handle the other columns first.
Remove date, create a MultiIndex using str.split
, and stack
to long form:
new_df = df.drop(columns='date') new_df.columns = new_df.columns.str.split('_', expand=True) new_df = new_df.stack(level=0)
1 2 3 0 A 4 5 6 B 2 3 1 C 5 7 2 D 4 3 1 1 A 3 3 2 B 4 5 2 C 6 2 3 D 2 4 2 2 A 5 7 5 B 1 3 3 C 4 5 4 D 8 2 2 3 A 6 1 8 B 6 1 4 C 1 2 7 D 4 3 5
Apply the condition row-wise:
new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])
Then sum
level 0 and assign back to the original df
:
df['count'] = new_df['condition'].sum(level=0)
(Alternatively sum the conditions directly rather than assigning to both new_df
and df
)
df['count'] = (new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])).sum(level=0)
df
:
date A_1 A_2 A_3 B_1 B_2 B_3 C_1 C_2 C_3 D_1 D_2 D_3 count 0 xxx 4 5 6 2 3 1 5 7 2 4 3 1 2 1 xxx 3 3 2 4 5 2 6 2 3 2 4 2 1 2 xxx 5 7 5 1 3 3 4 5 4 8 2 2 2 3 xxx 6 1 8 6 1 4 1 2 7 4 3 5 0
Complete Working Example:
import pandas as pd df = pd.DataFrame({ 'date': ['xxx', 'xxx', 'xxx', 'xxx'], 'A_1': [4, 3, 5, 6], 'A_2': [5, 3, 7, 1], 'A_3': [6, 2, 5, 8], 'B_1': [2, 4, 1, 6], 'B_2': [3, 5, 3, 1], 'B_3': [1, 2, 3, 4], 'C_1': [5, 6, 4, 1], 'C_2': [7, 2, 5, 2], 'C_3': [2, 3, 4, 7], 'D_1': [4, 2, 8, 4], 'D_2': [3, 4, 2, 3], 'D_3': [1, 2, 2, 5] }) new_df = df.drop(columns='date') new_df.columns = new_df.columns.str.split('_', expand=True) new_df = new_df.stack(level=0) new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3']) df['count'] = new_df['condition'].sum(level=0) print(df)