I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'Company': ['abc', 'xyz', 'def'],
    'Q1-2019': [9.05, 8.64, 6.3],
    'Q2-2019': [8.94, 8.56, 7.09],
    'Q3-2019': [8.86, 8.45, 7.09],
    'Q4-2019': [8.34, 8.61, 7.25]
})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column with the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# define scaler
scaler = MinMaxScaler()  # or StandardScaler()

# select the numeric values of a single row and reshape into one column
X = df.loc[1, 'Q1-2019':'Q4-2019'].to_numpy().reshape(-1, 1)

# transform data
scaled = scaler.fit_transform(X)
Answer
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the ColumnTransformer documentation, each transformer is specified as a tuple containing:
- a name for the step
- the chosen transformer (e.g. StandardScaler) or a Pipeline (a sketch of the Pipeline variant follows the code example below)
- a list of columns to which the selected transformation should be applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)

array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
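For completeness, here is a minimal sketch of the Pipeline variant mentioned above; the imputation step (SimpleImputer) is an assumed example of a second preprocessing step, not something the question requires:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# chain an (assumed) imputation step with scaling; the Pipeline
# as a whole takes the transformer slot of the tuple
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

ct = ColumnTransformer([
    ('prep', pipe, columns)
])
ct.fit_transform(df)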
ColumnTransformer outputs a NumPy array with the transformed values, fitted on the input dataframe df. Even though the column names are lost, the array columns are still ordered the same way as in the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
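For instance, a minimal sketch of that conversion, assuming the ct and columns objects defined above (using Company as the index is just one illustrative choice):

import pandas as pd

# rebuild a dataframe from the scaled array, reusing the original
# column names and keeping the Company values as the index
scaled = ct.fit_transform(df)
df_scaled = pd.DataFrame(scaled, columns=columns, index=df['Company'])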