
scikit preprocessing across entire dataframe

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

The data is the average response to the same question asked across four quarters.

I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.

How would I standardize/normalize across the entire dataframe? What is the best way to go about this?

I can do this for a single row or column using the code below, but I am struggling to do it across the entire dataframe.

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# define scaler
scaler = MinMaxScaler()  # or StandardScaler()

# take the quarterly values of one row as a 2-D column vector
# (fit_transform expects 2-D numeric input, so the Company column is excluded)
X = df.loc[[1], ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']].T
X = X.to_numpy()

# transform data
scaled = scaler.fit_transform(X)
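
The column case mentioned above works the same way; a minimal sketch, reusing the scaler and the example dataframe (the specific column name is just taken from the data above):

# scaling a single column: select it with double brackets so it stays
# a 2-D frame, which is what fit_transform expects
scaled_col = scaler.fit_transform(df[['Q1-2019']])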


Answer

If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the ColumnTransformer documentation, you need to provide, inside a tuple:

  • a name for the step
  • the chosen transformer (e.g. StandardScaler), which can also be a Pipeline (see the sketch right after this list)
  • a list of columns to which the selected transformation is applied
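
As a hypothetical illustration of that Pipeline option (the imputer and the step names below are assumptions, not something the original answer specifies), the transformer slot can hold a small pipeline that, for example, imputes and then scales the selected columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# a Pipeline can sit in the transformer slot of the tuple:
# here missing values would be imputed before scaling
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

ct_pipe = ColumnTransformer([
    ('scaled_quarters', numeric_pipeline,
     ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019'])
])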

Code example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify the columns to scale
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)

array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])

ColumnTransformer outputs a NumPy array with the transformed values, fitted on the input dataframe df. The array has no column names, and with the default remainder='drop' the Company column is left out, but the remaining columns keep the order in which they were listed, so it is easy to convert the array back to a pandas dataframe if you need to.
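
For example, a minimal sketch of that conversion, reusing columns and ct from the code above (using Company as the index is just an assumption for readability):

import pandas as pd

# wrap the transformed array back into a labelled dataframe,
# reusing the column names that were passed to the ColumnTransformer
scaled = ct.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=columns, index=df['Company'])
print(scaled_df)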
