I am trying to build a custom scaler that scales only the continuous variables of a dataset (the US Adult Income dataset: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base. Here is the Python code I used:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)
continuous = df.iloc[:, np.r_[0,2,10:13]]
#basically independent variables that I consider continuous
columns_to_scale = continuous
scaler = CustomScaler(columns_to_scale)
scaler.fit(X)
However, when I tried to run the scaler, I ran into an error.
So what did I get wrong in building the scaler? And how would you build a custom scaler for this dataset?
Thank you!
Answer
I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.
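For reference, here is a minimal sketch of the ColumnTransformer route. The toy frame `X_demo` and its column names are assumptions for illustration only; substitute the continuous columns of your own data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real data; these column names are illustrative assumptions.
X_demo = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "hours.per.week": [40, 50, 60, 35],
})
continuous_cols = ["age", "hours.per.week"]

# Scale only the continuous columns and pass every other column through untouched.
ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
X_scaled = ct.fit_transform(X_demo)
```

Note that the result is a NumPy array in which the scaled columns come first and the passed-through columns follow, so the original column order is not kept.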
In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or a few other options). But you've defined the parameter as continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe.
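To make the difference concrete, here is a small sketch on a hypothetical four-column frame (`df_demo` and its column names are made up for illustration). The same positional selection can yield either a dataframe of values or a list of column names, and only the latter works as an index into X:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data.
df_demo = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [3.0, 4.0], "d": [5, 6]})

# iloc returns the selected columns' *values* as a DataFrame.
as_frame = df_demo.iloc[:, np.r_[0, 2:4]]

# Indexing df.columns by position returns the column *names* instead.
as_names = df_demo.columns[np.r_[0, 2:4]].tolist()

# A list of names is what X[self.columns] expects.
subset = df_demo[as_names]
```

With your data, that would mean passing something like df.columns[np.r_[0,2,10:13]].tolist() as the columns argument, assuming df and X share those column names.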
A couple other issues:
- you should only set attributes in __init__ that come from its signature, or cloning will fail. Move self.scaler to fit, and save its parameters copy etc. directly at __init__. Don't initialize mean_ or var_.
- you never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are stored in the scaler object.