How to assign a value to a column for a subset of a DataFrame based on a condition in Pandas?

I have a data frame:

import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])

df:

index  A  class  label
    0  4      0      0
    1  5      1      0
    2  6      0      0
    3  7      1      0

I want to change label to 1 if the mean of column A for the rows with class 0 is greater than the mean of all values in column A.

How can I do this in a few lines of code?

I tried this, but it didn't work:

if df[df['class'] == 0]['A'].mean() > df['A'].mean():
   df[df['class']]['lable'] = 1

Answer

Group by 'class' with pandas.DataFrame.groupby, take the mean of 'A' in each group, and compare it with df['A'].mean(). This gives a boolean Series indexed by class value. Use pandas.Series.map to map df['class'] through that Series, convert the result with astype(int), and assign it to df['label']:

>>> df['label'] = df['class'].map(
        df.groupby('class')['A'].mean() > df['A'].mean()
    ).astype(int)

>>> df

   index  A  class  label
0      0  4      0      0
1      1  5      1      1
2      2  6      0      0
3      3  7      1      1
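
To see where those labels come from, it may help to look at the intermediate values. The following is only a sketch that rebuilds the question's df; the names class_means, overall_mean, above_overall and row_flags are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])

# Mean of A within each class: class 0 -> 5.0, class 1 -> 6.0
class_means = df.groupby('class')['A'].mean()

# Mean of A over all rows: 5.5
overall_mean = df['A'].mean()

# Boolean Series indexed by class value: 0 -> False, 1 -> True
above_overall = class_means > overall_mean

# map looks up each row's class in that Series, giving one boolean per row
row_flags = df['class'].map(above_overall)
print(row_flags.tolist())  # [False, True, False, True]
```

Casting row_flags with astype(int) gives the label column shown above.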

Since you only want to check class == 0, combine the result with a second boolean mask on df['class'] so that rows of other classes stay at 0:

>>> df['label'] = (df['class'].map(
        df.groupby('class')['A'].mean() > df['A'].mean()
        ) & (~df['class'].astype(bool))
    ).astype(int)

>>> df

   index  A  class  label
0      0  4      0      0   # because (4+6)/2 < (4+5+6+7)/4
1      1  5      1      0   # masked out, class is not 0
2      2  6      0      0   # because (4+6)/2 < (4+5+6+7)/4
3      3  7      1      0   # masked out, class is not 0

So even if your code had worked, you would not have noticed, because the condition is not met for this data: the mean of A for class 0 is 5, which is below the overall mean of 5.5.
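
For comparison, the attempt from the question can be repaired fairly directly with DataFrame.loc. The attempt indexes with the values of df['class'] rather than a boolean mask, misspells 'label' as 'lable', and assigns through chained indexing. A minimal sketch, assuming the goal is to set label only on the class 0 rows (the name is_class0 is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])

# Boolean row mask for the subset to update (assumed here: class == 0)
is_class0 = df['class'] == 0

# One scalar comparison, as in the question's if statement
if df.loc[is_class0, 'A'].mean() > df['A'].mean():
    # .loc with a row mask and a column label assigns into df itself,
    # unlike the chained df[...][...] = 1
    df.loc[is_class0, 'label'] = 1

print(df['label'].tolist())  # [0, 0, 0, 0] because 5.0 > 5.5 is False
```

With this data the if branch is never taken, so label stays 0 in every row.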

User contributions licensed under: CC BY-SA