
How to build a custom scaler based on StandardScaler?

I am trying to build a custom scaler that scales only the continuous variables in a dataset (the US Adult Income dataset), using StandardScaler as a base. Here is the Python code I used:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin): 
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]


continuous = df.iloc[:, np.r_[0,2,10:13]] 
#basically independent variables that I consider continuous

columns_to_scale = continuous

scaler = CustomScaler(columns_to_scale)

However, when I tried to run the scaler, I got an error (screenshot not available).

So what is wrong with the way I built the scaler? And furthermore, how would you build a custom scaler for this dataset?

Thank you!



I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.
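For reference, here is a minimal sketch of the ColumnTransformer route. The small dataframe and its column names are made up for illustration; they are not the real Adult Income columns. Note that ColumnTransformer puts the transformed columns first and the passthrough columns after them, so the sketch restores the original order by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "hours_per_week": [40, 50, 60, 30],
})

continuous_cols = ["age", "hours_per_week"]
other_cols = [c for c in df.columns if c not in continuous_cols]

# Scale only the continuous columns and pass everything else through untouched.
ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
scaled = ct.fit_transform(df)

# Output order is: scaled columns, then passthrough columns.
# Rebuild a dataframe and put the columns back in their original order.
result = pd.DataFrame(
    scaled, columns=continuous_cols + other_cols, index=df.index
)[df.columns]
print(result)
```

Because the passthrough column holds strings, the combined array has object dtype, so you may want to cast the scaled columns back to float afterwards.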

In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or one of a few other valid indexers). But you've passed in continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe, not a list of names.
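One way to get the column names rather than the data is to index df.columns with the same positions. A small sketch, using a made-up 13-column frame in place of your data:

```python
import numpy as np
import pandas as pd

# Hypothetical 13-column frame standing in for the real dataset.
df = pd.DataFrame(
    np.arange(26).reshape(2, 13),
    columns=[f"col{i}" for i in range(13)],
)

# What the question passes in: a DataFrame slice, not column names.
continuous = df.iloc[:, np.r_[0, 2, 10:13]]
print(type(continuous).__name__)  # DataFrame

# What X[self.columns] expects: the labels of those columns.
columns_to_scale = df.columns[np.r_[0, 2, 10:13]].tolist()
print(columns_to_scale)  # ['col0', 'col2', 'col10', 'col11', 'col12']
```

Equivalently, continuous.columns.tolist() would give the same list from the slice you already have.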

A couple of other issues:

  1. You should only set attributes in __init__ that come from its signature, or cloning will fail. Create self.scaler in fit instead, and in __init__ just store its parameters (copy, with_mean, with_std) as attributes. Don't initialize mean_ or var_ there.
  2. You never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are already stored in the scaler object (as its own mean_ and var_ attributes).
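Putting those points together, a revised class might look roughly like the sketch below. This is one possible arrangement, not the only one: the attribute name scaler_ and the example dataframe are my own choices, and I pass the original index to the intermediate dataframe so that rows stay aligned when X does not have a default index.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Store the constructor arguments unchanged and nothing else,
        # so that get_params() and clone() work.
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Create the inner scaler at fit time, using keyword arguments.
        self.scaler_ = StandardScaler(
            copy=self.copy, with_mean=self.with_mean, with_std=self.with_std
        )
        self.scaler_.fit(X[self.columns])
        return self

    def transform(self, X):
        scaled = pd.DataFrame(
            self.scaler_.transform(X[self.columns]),
            columns=self.columns,
            index=X.index,  # keep the original index so rows stay aligned
        )
        X_out = X.copy()
        for col in self.columns:
            X_out[col] = scaled[col]
        return X_out


# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, 38.0, 52.0, 46.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "hours_per_week": [40.0, 50.0, 60.0, 30.0],
})

scaler = CustomScaler(columns=["age", "hours_per_week"])
cloned = clone(scaler)  # succeeds because __init__ only stores its arguments
result = cloned.fit_transform(df)
print(result)
```

The fitted statistics are available as cloned.scaler_.mean_ and cloned.scaler_.var_, so separate mean_ and var_ attributes on the wrapper are not needed.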