Remove outliers using quantiles in Python

I need to remove outliers from a regression dataset. Let's say the dataset is structured as follows:

# dataset named df
humidity     windspeed
 0.01          4.9
 4.5           20.0
 3.5           5.0
 50.0          4.0
 4.2           0.05
 3.4           3.9
 18.0          4.7

# code for outlier removal
def quantile(columns):
   for column in columns:
      lower_quantile = df[column].quantile(0.25)
      upper_quantile = df[column].quantile(0.75)
      df[column] = df[(df[column] >= lower_quantile) & (df[column] <= upper_quantile)]

columns = ['humidity', 'windspeed']
quantile(columns)

On closer inspection, the humidity column has three outliers (50.0, 18.0 and 0.01), while the windspeed column has two (20.0 and 0.05), and the outliers of the two columns are not in the same rows. If I try to remove the outliers with the code above, I get the following error:

ValueError: Columns must be same length as key

From what I understand, the number of rows in each column is no longer the same once the outliers are removed, hence the error. Is there another way to overcome this issue?

Answer

You can filter on both columns at the same time:

df[
    df['humidity'].between(df['humidity'].quantile(.25), df['humidity'].quantile(.75)) &
    df['windspeed'].between(df['windspeed'].quantile(.25), df['windspeed'].quantile(.75))
]

Here df and the two boolean conditions, one for 'humidity' and one for 'windspeed', all have the same length because both conditions are derived from the same df. They can therefore be combined with & and used to select rows in a single step.
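
To apply the same idea to any list of columns, here is a minimal sketch using the sample data from the question. The helper name `filter_between_quantiles` and its `lower`/`upper` parameters are made up for illustration and are not part of pandas; it only relies on `Series.quantile` and `Series.between`, and it returns a new frame instead of assigning back into df.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "humidity":  [0.01, 4.5, 3.5, 50.0, 4.2, 3.4, 18.0],
    "windspeed": [4.9, 20.0, 5.0, 4.0, 0.05, 3.9, 4.7],
})

def filter_between_quantiles(frame, columns, lower=0.25, upper=0.75):
    """Hypothetical helper: keep rows whose value lies between the
    `lower` and `upper` quantiles (inclusive) in every listed column."""
    # Start with an all-True mask that shares the frame's index,
    # so every per-column condition lines up row by row.
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        low = frame[column].quantile(lower)
        high = frame[column].quantile(upper)
        mask &= frame[column].between(low, high)
    # Boolean indexing drops whole rows, so all columns stay the same length.
    return frame[mask]

humidity_only = filter_between_quantiles(df, ["humidity"])
filtered = filter_between_quantiles(df, ["humidity", "windspeed"])
print(humidity_only)
print(filtered)
```

Note that with only seven rows, the 0.25 to 0.75 range of humidity and that of windspeed fall on different rows, so filtering on both columns at once leaves an empty frame for this sample. Passing wider bounds (for example `lower=0.05, upper=0.95`) is one way to keep more rows if that suits your data.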

User contributions licensed under: CC BY-SA