I am trying to build a custom scaler that scales only the continuous variables in a dataset (the US Adult Income: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base. Here is the Python code I used:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)
continuous = df.iloc[:, np.r_[0,2,10:13]]  #basically independent variables that I consider continuous
columns_to_scale = continuous
scaler = CustomScaler(columns_to_scale)
scaler.fit(X)
However, when I tried to run the scaler, I ran into this problem:
So what is the error in how I built the scaler? And furthermore, how would you build a custom scaler for this dataset?
Thank you!
Answer
I agree with @AntoineDubuis that ColumnTransformer is a better (builtin!) way to do this. That said, I'd like to address where your code goes wrong.
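For reference, here is a minimal sketch of the ColumnTransformer approach. The toy frame X_demo and its values are made up for illustration, and the column names are an assumption about how the Adult Income columns are spelled; substitute your own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature frame (hypothetical values and column names).
X_demo = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "hours.per.week": [40, 50, 60, 35],
})

continuous_cols = ["age", "hours.per.week"]

# Scale only the continuous columns; pass every other column through untouched.
ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)

# The result is an array whose columns are reordered:
# the scaled columns come first, then the passed-through remainder.
X_out = ct.fit_transform(X_demo)
```

Note that the output column order differs from the input order, which is one thing a custom transformer can avoid if you need the original layout.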
In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or a few other options). But you've defined the parameter as continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe.
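To illustrate the difference, here is a small sketch using a hypothetical 13-column frame df_demo in place of your df. The same positional selection is kept, but only the column names are passed on.

```python
import numpy as np
import pandas as pd

# Toy frame with 13 columns standing in for df (hypothetical column names).
df_demo = pd.DataFrame(np.zeros((2, 13)), columns=[f"col{i}" for i in range(13)])

# A DataFrame slice: this is what the question passes as `columns`.
continuous = df_demo.iloc[:, np.r_[0, 2, 10:13]]

# What X[self.columns] needs instead: a plain list of column names.
columns_to_scale = continuous.columns.tolist()
print(columns_to_scale)  # ['col0', 'col2', 'col10', 'col11', 'col12']
```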
A couple other issues:
- you should only set attributes in __init__ that come from its signature, or cloning will fail. Move self.scaler to fit, and save its parameters copy etc. directly at __init__. Don't initialize mean_ or var_.
- you never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are stored in the scaler object.