The following code snippet illustrates the issue:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

(nrows, ncolumns) = (1912392, 131)
X = np.random.random((nrows, ncolumns))

pca = PCA(n_components=28, random_state=0)
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print((transformed_X1 != transformed_X2).sum())
# Gives output as 53546976

scaler = StandardScaler()
scaled_X1 = scaler.fit_transform(X)
scaler2 = scaler.fit(X)
scaled_X2 = scaler2.transform(X)
print((scaled_X1 != scaled_X2).sum())
# Gives output as 0
Can someone explain why the first output is not zero while the second one is?
Answer
Using svd_solver='full' works:
pca = PCA(n_components=28, svd_solver='full')
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print(np.allclose(transformed_X1, transformed_X2))
# True
Apparently svd_solver='randomized' (which is what 'auto' falls back to for input this large) has enough process difference between .fit(X).transform(X) and fit_transform(X) to give different results even with the same seed. Also remember that floating point errors make == and != unreliable judges of whether two different processes computed "the same" result, so use np.allclose().
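As a minimal sketch of the floating point point, independent of scikit-learn, here are two mathematically equal values computed by different processes:

import numpy as np

a = 0.1 + 0.2
b = 0.3
print(a == b)             # False: a is actually 0.30000000000000004
print(np.allclose(a, b))  # True: equal within a small tolerance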
It seems like StandardScaler.fit_transform() just directly uses .fit(X).transform(X) under the hood, so there were no floating point errors there to trip you up.
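You can check that equivalence yourself; this is a quick sketch on a smaller array (size reduced here only for speed), assuming StandardScaler.fit_transform() is the generic mixin implementation that literally calls fit(X).transform(X):

from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.random((1000, 10))

# Both code paths should produce bit-identical results for StandardScaler:
out1 = StandardScaler().fit_transform(X_small)
out2 = StandardScaler().fit(X_small).transform(X_small)
print((out1 != out2).sum())  # 0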