
scikit preprocessing across entire dataframe

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

The data is the average response to the same question asked across four quarters.

I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.

How would I standardize/normalize across the entire dataframe? What is the best way to go about this?

I can do this for a single row or column using the code below, but I am struggling to do it across the entire dataframe.

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# define scaler
scaler = MinMaxScaler()  # or StandardScaler()

# take the quarterly values of one row as a 2-D column vector
# (fit_transform expects 2-D numeric input, so the Company column is excluded)
X = df.loc[[1], ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']].T
X = X.to_numpy()

# transform data
scaled = scaler.fit_transform(X)
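
The column case mentioned above works the same way; a minimal sketch, reusing the scaler and the example dataframe (the specific column name is just taken from the data above):

# scaling a single column: select it with double brackets so it stays
# a 2-D frame, which is what fit_transform expects
scaled_col = scaler.fit_transform(df[['Q1-2019']])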


Answer

If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the ColumnTransformer documentation, you need to provide, inside a tuple:

  • a name for the step
  • the chosen transformer (e.g. StandardScaler), which can also be a Pipeline (see the sketch right after this list)
  • a list of columns to which the selected transformation is applied
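
As a hypothetical illustration of that Pipeline option (the imputer and the step names below are assumptions, not something the original answer specifies), the transformer slot can hold a small pipeline that, for example, imputes and then scales the selected columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# a Pipeline can sit in the transformer slot of the tuple:
# here missing values would be imputed before scaling
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

ct_pipe = ColumnTransformer([
    ('scaled_quarters', numeric_pipeline,
     ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019'])
])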

Code example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify the columns to scale
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)

array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])

ColumnTransformer outputs a NumPy array with the transformed values, fitted on the input dataframe df. The array has no column names, and with the default remainder='drop' the Company column is left out, but the remaining columns keep the order in which they were listed, so it is easy to convert the array back to a pandas dataframe if you need to.
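
For example, a minimal sketch of that conversion, reusing columns and ct from the code above (using Company as the index is just an assumption for readability):

import pandas as pd

# wrap the transformed array back into a labelled dataframe,
# reusing the column names that were passed to the ColumnTransformer
scaled = ct.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=columns, index=df['Company'])
print(scaled_df)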
