Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, Volatility, etc.). I am building an ML model to make classification predictions on a subset of this data.
Problem: For each asset and feature, I want to fit and apply a transformation. For example, for volatility, I want to fit a separate transformer for AAPL, MSFT, etc., and then apply each one only to its own partition of the data.
Current status: I currently use `compose.make_column_transformer`, but this only applies a single transformer to the entire Volatility column and does not allow partitioning the data so that individual transformers are fit and applied to each partition.
Research: I've done some research and come across `sklearn.preprocessing.FunctionTransformer`, which seems to be a building block I could use, but I haven't figured out how.
Main question: What is the best way to build a sklearn pipeline that can fit a transformer to a partition (ie. groupby) within a single column? Any code pointers would be great. TY
Example dataset:
Date | Ticker | Volatility | transformed_vol |
---|---|---|---|
01/01/18 | AAPL | X | A(X) |
01/02/18 | AAPL | X | A(X) |
… | AAPL | X | A(X) |
12/30/22 | AAPL | X | A(X) |
12/31/22 | AAPL | X | A(X) |
01/01/18 | GOOG | X | B(X) |
01/02/18 | GOOG | X | B(X) |
… | GOOG | X | B(X) |
12/30/22 | GOOG | X | B(X) |
12/31/22 | GOOG | X | B(X) |
Answer
I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because transformers are applied to the whole column. However, one could use `FunctionTransformer` (as you correctly point out) to circumvent this limitation:
I am using the following example:
    print(df)

      Ticker  Volatility  OtherCol
    0   AAPL           0         1
    1   AAPL           1         1
    2   AAPL           2         1
    3   AAPL           3         1
    4   AAPL           4         1
    5   GOOG           5         1
    6   GOOG           6         1
    7   GOOG           7         1
    8   GOOG           8         1
    9   GOOG           9         1
I added another column just to demonstrate.
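For anyone who wants to reproduce this, the example frame can be rebuilt along these lines. This is only a sketch: the values are copied from the printout above, and the construction itself is my assumption about how such a frame might be created.

```python
import pandas as pd

# Rebuild the example frame: five AAPL rows followed by five GOOG rows.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": list(range(10)),   # 0..9, as in the printout
    "OtherCol": [1] * 10,            # constant column, only for demonstration
})
```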
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer

    # The index should dictate the groups along the column.
    df = df.set_index('Ticker')

    def A(x):
        return x * x

    def B(x):
        return 2 * x

    def C(x):
        return 10 * x

    # Map groups to functions: one dict per column, keyed by the group label in the index.
    f_dict = {'Volatility': {'AAPL': A, 'GOOG': B},
              'OtherCol': {'AAPL': A, 'GOOG': C}}

    def pick_transform(df):
        # Each call receives a single-column frame. Look up the function by
        # column name and group label, then apply it to that group.
        # Note: the result comes back group by group, so this assumes the rows
        # are already sorted by ticker, as in the example.
        return df.groupby(df.index).apply(
            lambda g: f_dict[g.columns[0]][g.index[0]](g)
        )

    ct = ColumnTransformer(
        [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
         for col in f_dict]
    )

    df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)
    print(df)
Which results in:
            Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
    Ticker
    AAPL             0         1                       0                     1
    AAPL             1         1                       1                     1
    AAPL             2         1                       4                     1
    AAPL             3         1                       9                     1
    AAPL             4         1                      16                     1
    GOOG             5         1                      10                    10
    GOOG             6         1                      12                    10
    GOOG             7         1                      14                    10
    GOOG             8         1                      16                    10
    GOOG             9         1                      18                    10
To handle more columns, add them to `f_dict`; the list comprehension then creates a transformer for each one.
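Note that the functions above are stateless, so nothing is actually fit per group. If each ticker needs its own fitted transformer (say, a scaler whose mean and variance are learned per ticker), one option is a small custom estimator that keeps one fitted clone per group. The following is a minimal sketch, not a built-in: `GroupwiseTransformer` is a hypothetical name of my own, `StandardScaler` is just an assumed choice of transformer, and the group labels are assumed to live in a single-level index as in the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fit one clone of `transformer` per index label."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # X is expected to be a DataFrame whose index holds the group labels.
        self.transformers_ = {
            key: clone(self.transformer).fit(group.to_numpy())
            for key, group in X.groupby(level=0)
        }
        return self

    def transform(self, X):
        values = X.to_numpy()
        labels = X.index.get_level_values(0).to_numpy()
        # Rows whose group was not seen during fit stay NaN.
        out = np.full(values.shape, np.nan, dtype=float)
        for key, fitted in self.transformers_.items():
            mask = labels == key
            if mask.any():
                # Boolean masks write results back in the original row order.
                out[mask] = fitted.transform(values[mask])
        return out


# Usage on the example data, with Ticker as the index.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": list(range(10)),
    "OtherCol": [1] * 10,
}).set_index("Ticker")

ct = ColumnTransformer(
    [("vol", GroupwiseTransformer(StandardScaler()), ["Volatility"])]
)
scaled = ct.fit_transform(df)
print(scaled.ravel())
```

Because rows are written back through boolean masks, this version does not depend on the frame being sorted by ticker. The fitted per-group transformers are stored on the estimator, so calling `ct.transform` on new data reuses the parameters learned for each ticker during `fit`.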