
How to build a custom scaler based on StandardScaler?

I am trying to build a custom scaler that scales only the continuous variables of a dataset (the US Adult Income data: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base. Here is the Python code I used:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    

    def transform(self, X, y=None, copy=None):
        
        init_col_order = X.columns
        
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)

continuous = df.iloc[:, np.r_[0,2,10:13]] 
#basically independent variables that I consider continuous

columns_to_scale = continuous

scaler = CustomScaler(columns_to_scale)

scaler.fit(X)

However, when I tried to fit the scaler, I got an error (the traceback was posted as an image and is not reproduced here).

So what is wrong with the way I built the scaler? And, more generally, how would you build a custom scaler for this dataset?

Thank you!


Answer

I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.
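For reference, here is a minimal sketch of the ColumnTransformer route. The toy frame and its column names are assumptions standing in for the real Adult data, so adapt them to your own columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Adult data; these column names are illustrative only.
X = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "hours.per.week": [40, 50, 35, 60],
})
columns_to_scale = ["age", "hours.per.week"]  # a list of names, not a dataframe

ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), columns_to_scale)],
    remainder="passthrough",  # leave every other column untouched
)
result = ct.fit_transform(X)
```

Note that the result is a NumPy array with the scaled columns first and the passed-through columns after them, so the column order differs from the input.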

In fit, you have self.scaler.fit(X[self.columns], y); this means that self.columns should be a list of column names (or one of a few other valid column selectors). But you've passed in continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe, not a list of names. Passing the names of those columns instead (for example continuous.columns) addresses that.
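As a small illustration of the difference, the sketch below uses a hypothetical 13-column frame with made-up names (c0 to c12) and the same positions as in the question. It shows one way to turn positions into a list of names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with generic column names, used only to show the indexing.
df = pd.DataFrame(np.arange(26).reshape(2, 13),
                  columns=[f"c{i}" for i in range(13)])

positions = np.r_[0, 2, 10:13]       # same positions as in the question
as_frame = df.iloc[:, positions]     # a DataFrame: not usable as a column selector
columns_to_scale = df.columns[positions].tolist()  # a plain list of column names
```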

A couple of other issues:

  1. You should only set attributes in __init__ that come from its signature, or cloning will fail. Move the creation of self.scaler into fit, and store its parameters (copy, with_mean, with_std) directly as attributes in __init__. Don't initialize mean_ or var_ there.
  2. You never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are already stored on the fitted scaler object (as its own mean_ and var_ attributes).
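Putting these points together, one possible rewrite of the class is sketched below. It is not the only way to do it. The name scaler_ for the inner estimator, the toy data, and its column names are my own assumptions. The sketch also makes two changes beyond the points above: it passes the StandardScaler parameters by keyword (they are keyword-only in recent scikit-learn releases), and it keeps the original index on the scaled frame so that concat aligns rows correctly on upsampled data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Only store the constructor arguments, unchanged, so clone() works.
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Build the inner scaler at fit time, passing parameters by keyword.
        self.scaler_ = StandardScaler(
            copy=self.copy, with_mean=self.with_mean, with_std=self.with_std
        )
        self.scaler_.fit(X[self.columns])
        return self

    def transform(self, X):
        # Keep X's index so the concat below aligns rows correctly.
        scaled = pd.DataFrame(
            self.scaler_.transform(X[self.columns]),
            index=X.index,
            columns=self.columns,
        )
        untouched = X.drop(columns=self.columns)
        # Restore the original column order.
        return pd.concat([untouched, scaled], axis=1)[X.columns]


# Toy data with a non-default index, as upsampling typically produces.
X = pd.DataFrame(
    {
        "age": [25, 38, 52, 41],
        "workclass": ["Private", "State-gov", "Private", "Self-emp"],
        "hours.per.week": [40, 50, 35, 60],
    },
    index=[10, 20, 30, 40],
)
scaler = CustomScaler(columns=["age", "hours.per.week"])
out = scaler.fit_transform(X)
cloned = clone(scaler)  # works because __init__ only stores its arguments
```

After fitting, the per-column statistics are available as scaler.scaler_.mean_ and scaler.scaler_.var_, so there is no need to recompute them with NumPy.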
User contributions licensed under: CC BY-SA