I have to solve this problem:
Objective: Drop columns in which most of the rows are missing.
Inputs:
1. Dataframe df: Pandas dataframe
2. threshold: Determines which columns will be dropped. If threshold is .9, the columns with 90% missing values will be dropped.
Outputs:
1. Dataframe df with dropped columns (if no columns are dropped, you will return the same dataframe)
I’ve coded this:
class variableTreatment():
    def drop_nan_col(self, df, threshold):
        self.threshold = threshold
        self.df = df
        for i in df.columns:
            if (float(df[i].isnull().sum())/df[i].shape[0]) > threshold:
                df = df.drop(i)
The method must take exactly “self, df, and threshold” as parameters, and I cannot add more. The code must pass the test case below:
import pandas as pd
import numpy as np

df = pd.read_excel('CKD.xlsx')
VT = variableTreatment()
VT
VT.drop_nan_col(df, 0.9).head()
When I run VT.drop_nan_col(df, 0.9).head() (a line of code I cannot change), I get:
KeyError: "['yls'] not found in axis"
If I change the shape index to have 0 instead of 1 (which I don’t think is correct for what I’m doing), I get:
IndexError: tuple index out of range
Can anyone help me understand how I can fix this?
Answer
I think you need to change from
df = df.drop(i)
to
df = df.drop(i, axis=1)
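That way drop looks for the label among the columns instead of the rows (axis=0, the default). With the default axis, pandas searches the row index for a label named 'yls', does not find one, and raises the KeyError. See here for the same error: https://stackoverflow.com/a/44931865/5184851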
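To see the difference between the two axes, here is a minimal sketch on a made-up frame. The name `toy` and its values are assumptions for illustration only; they are not taken from CKD.xlsx, and only the column name 'yls' comes from the error message in the question.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values);
# the "yls" column is entirely missing.
toy = pd.DataFrame({"age": [48, 53, 61], "yls": [np.nan, np.nan, np.nan]})

# Default axis=0: pandas looks for a ROW labelled "yls" and fails.
try:
    toy.drop("yls")
    raised = False
except KeyError:
    raised = True

# axis=1 tells drop to look among the column labels instead.
dropped_by_axis = toy.drop("yls", axis=1)

# The columns= keyword is an equivalent spelling of the same operation.
dropped_by_kw = toy.drop(columns="yls")
```

Both `axis=1` and `columns=` remove the column; which one to use is a matter of taste.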
Also, to use .head(), the function drop_nan_col(...) needs to return a dataframe, i.e. df.
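Putting both points together, the method could look roughly like the sketch below. It keeps the required signature (self, df, threshold) but swaps the loop for a vectorized missing-fraction check, which is my choice and not part of the original code. The strict `>` comparison is kept from the question; whether a column at exactly the threshold should be dropped is an assumption you may need to adjust. The `toy` frame is hypothetical data standing in for CKD.xlsx.

```python
import numpy as np
import pandas as pd


class variableTreatment():
    def drop_nan_col(self, df, threshold):
        self.threshold = threshold
        # Fraction of missing values in each column (mean of a boolean mask).
        missing_fraction = df.isnull().mean()
        # Labels of the columns whose missing fraction exceeds the threshold.
        to_drop = missing_fraction[missing_fraction > threshold].index
        if len(to_drop) == 0:
            # Nothing to drop: hand back the same dataframe unchanged.
            self.df = df
        else:
            # columns= makes drop operate on column labels, not row labels.
            self.df = df.drop(columns=to_drop)
        # Returning the dataframe is what makes .head() work on the result.
        return self.df


# Hypothetical data: "yls" is 100% missing, "age" is 25% missing.
toy = pd.DataFrame({
    "age": [48.0, 53.0, np.nan, 61.0],
    "yls": [np.nan, np.nan, np.nan, np.nan],
})
VT = variableTreatment()
result = VT.drop_nan_col(toy, 0.9)
unchanged = VT.drop_nan_col(toy, 1.0)
```

With this in place, the unchangeable test line VT.drop_nan_col(df, 0.9).head() has a dataframe to call .head() on.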