Pandas: Remove Column Based on Threshold Criteria

I have to solve this problem:

Objective: Drop columns in which most rows are missing.

Inputs:
1. Dataframe df: Pandas dataframe
2. threshold: Determines which columns will be dropped. If threshold is .9, the columns with 90% missing values will be dropped.

Outputs:
1. Dataframe df with dropped columns (if no columns are dropped, you will return the same dataframe)

[Image: screenshot of the Excel document]

I’ve coded this:

class variableTreatment():

    def drop_nan_col(self, df, threshold): 

        self.threshold = threshold
        self.df = df
        for i in df.columns:
            if (float(df[i].isnull().sum())/df[i].shape[0]) > threshold:
                df = df.drop(i)

The method must take exactly “self, df, and threshold” as parameters; I cannot add more. The code must pass the test cases below:

import pandas as pd
import numpy as np
df = pd.read_excel('CKD.xlsx')

VT = variableTreatment()

VT

VT.drop_nan_col(df, 0.9).head()

When I run VT.drop_nan_col(df, 0.9).head() (a line I cannot change), I get:

KeyError: "['yls'] not found in axis"

If I change the shape index from 0 to 1, which I don’t think is correct for what I’m doing, I get:

IndexError: tuple index out of range

Can anyone help me understand how I can fix this?

Answer

I think you need to change

df = df.drop(i)

to

df = df.drop(i, axis=1)

That way drop looks for the label among the columns instead of the rows; rows (axis=0) are the default, which is why 'yls' was not found. See here for the same error: https://stackoverflow.com/a/44931865/5184851
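A small sketch of the difference between the two axes. The frame `demo` and its column `a` are made up for illustration; only the column name `yls` comes from the error message in the question.

```python
import pandas as pd

# Hypothetical two-column frame; "yls" is entirely missing.
demo = pd.DataFrame({"a": [1, 2], "yls": [None, None]})

# With the default axis=0, drop searches the row labels (0 and 1 here),
# so a column name is not found and a KeyError is raised.
try:
    demo.drop("yls")
except KeyError as err:
    print("KeyError:", err)

# axis=1 searches the column labels instead; columns= is an equivalent spelling.
print(demo.drop("yls", axis=1).columns.tolist())   # ['a']
print(demo.drop(columns="yls").columns.tolist())   # ['a']
```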

Also, for .head() to work, drop_nan_col(...) needs to return the dataframe, i.e. end with return df. As written it returns None, so the test line would still fail after the axis fix.
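Putting both points together, one possible version of the method is sketched below. It keeps the required signature (self, df, threshold) and the strict ">" comparison from the question's code; whether a column with exactly the threshold share of missing values should also go is an assumption you may need to adjust. The `sample` frame is hypothetical data standing in for CKD.xlsx, which is not available here.

```python
import numpy as np
import pandas as pd


class variableTreatment():

    def drop_nan_col(self, df, threshold):
        # Share of missing values per column (a Series indexed by column name).
        missing_fraction = df.isnull().mean()
        # Labels of the columns whose missing share exceeds the threshold.
        # Assumption: strict ">" as in the question; use ">=" if needed.
        to_drop = missing_fraction[missing_fraction > threshold].index
        # axis=1 drops columns; the result is returned so .head() can be chained.
        return df.drop(to_drop, axis=1)


# Hypothetical sample data: "yls" is fully missing, "bp" is 20% missing.
sample = pd.DataFrame({
    "age": [48, 7, 62, 48, 51, 60, 68, 24, 52, 53],
    "yls": [np.nan] * 10,
    "bp": [80, 50, np.nan, 70, 80, 90, 70, np.nan, 100, 90],
})

VT = variableTreatment()
result = VT.drop_nan_col(sample, 0.9)
print(result.columns.tolist())   # ['age', 'bp']
```

Computing the missing share for all columns at once also avoids reassigning df inside a loop over df.columns. If no column exceeds the threshold, drop receives an empty list of labels and returns a dataframe with the same contents.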

Answer