I’m trying to create a pairplot of my dataset, where the variables are vastly different numbers (some are in the 0-1 range, some, like age and Monthly Income, can go way higher) and I want to scale those variables that go above 1 to 0-1 using the following code:
scale_vars=['MonthlyIncome','age','NumberOfTime30-59DaysPastDueNotWorse','DebtRatio','NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate','NumberRealEstateLoansOrLines','NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'] scaler=MinMaxScaler(copy=False) train2[scale_vars]=scaler.fit_transform(train2[scale_vars])
My problem is that after scaling the variables and creating the pairplot again, it doesn’t change at all. Do you know what might be the cause for this? Here’s the code I use to create a pairplot:
g=sns.pairplot(train2, hue='SeriousDlqin2yrs', diag_kws={'bw':0.2})
where SeriousDlqin2yrs is the Y variable.
Advertisement
Answer
The plots are expected to look the same, but not exactly – the tick labels should be different. The scaler does a linear transformation, and seaborn chooses the axis limits based on the range of values, so the arrangement of points in the scatter plots does not change.
Since I do not have your data, here is the same effect with Ronald Fisher’s classic iris dataset:
import pandas as pd import seaborn as sns; sns.set() from sklearn.datasets import load_iris from sklearn.preprocessing import MinMaxScaler iris_dict = load_iris(as_frame=True) iris = iris_dict['data'] iris['species'] = iris_dict['target'] g = sns.pairplot(iris, hue='species', diag_kws={'bw_method':0.2})
scale_vars = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] scaler = MinMaxScaler(copy=False) iris[scale_vars] = scaler.fit_transform(iris[scale_vars]) g = sns.pairplot(iris, hue='species', diag_kws={'bw_method':0.2})
Note that the column names should have been changed when the scaling was done, because these are no longer centimeters.