Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, volatility). I am building an ML model to make classification predictions on a subset of this data.
Problem: For each asset and feature, I want to fit and apply a separate transformation. For example, for volatility I want to fit one transformer on the AAPL rows, another on the MSFT rows, and so on, and then apply each transformer to its own partition of the data.
Current status: I currently use compose.make_column_transformer, but this applies a single transformer to the entire volatility column. It does not let me partition the data and fit/apply individual transformers to those partitions.
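To make the limitation concrete, here is a minimal sketch. The toy numbers, the variable names and the choice of StandardScaler are assumptions for illustration only. It fits a column transformer on two tickers with very different scales and shows that only one set of statistics is learned for the whole column:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two tickers whose volatility lives on different scales.
toy = pd.DataFrame({
    "Ticker": ["AAPL"] * 3 + ["MSFT"] * 3,
    "Volatility": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# One StandardScaler is fitted on the entire Volatility column.
ct_global = make_column_transformer((StandardScaler(), ["Volatility"]))
ct_global.fit(toy)

# The fitted scaler holds a single mean for all rows ...
global_mean = ct_global.named_transformers_["standardscaler"].mean_[0]
# ... whereas per-ticker fitting would need one mean per group.
per_ticker_mean = toy.groupby("Ticker")["Volatility"].mean()

print(global_mean)
print(per_ticker_mean)
```

The single mean sits between the two groups, so neither ticker ends up centred on its own history.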
Research: I’ve done some research and come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven’t figured out how.
Main question: What is the best way to build a sklearn pipeline that can fit a transformer to each partition (i.e. a groupby) within a single column? Any code pointers would be great. TY
Example dataset:
| Date | Ticker | Volatility | transformed_vol |
|---|---|---|---|
| 01/01/18 | AAPL | X | A(X) |
| 01/02/18 | AAPL | X | A(X) |
| … | AAPL | X | A(X) |
| 12/30/22 | AAPL | X | A(X) |
| 12/31/22 | AAPL | X | A(X) |
| 01/01/18 | GOOG | X | B(X) |
| 01/02/18 | GOOG | X | B(X) |
| … | GOOG | X | B(X) |
| 12/30/22 | GOOG | X | B(X) |
| 12/31/22 | GOOG | X | B(X) |
Answer
I don’t think this is doable in an “elegant” way using scikit-learn’s built-in functionality, simply because transformers are applied to the whole column. However, one could use FunctionTransformer (as you correctly point out) to work around this limitation.
I am using the following example:
print(df)
  Ticker  Volatility  OtherCol
0   AAPL           0         1
1   AAPL           1         1
2   AAPL           2         1
3   AAPL           3         1
4   AAPL           4         1
5   GOOG           5         1
6   GOOG           6         1
7   GOOG           7         1
8   GOOG           8         1
9   GOOG           9         1
I added another column just to demonstrate.
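import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Rebuild the example frame shown above.
df = pd.DataFrame({'Ticker': ['AAPL'] * 5 + ['GOOG'] * 5,
                   'Volatility': range(10),
                   'OtherCol': 1})

# The index should dictate the groups along the column.
df = df.set_index('Ticker')

def A(x):
    return x * x

def B(x):
    return 2 * x

def C(x):
    return 10 * x

# Map groups to functions: one dict per column, keyed by the group label in the index.
f_dict = {'Volatility': {'AAPL': A, 'GOOG': B}, 'OtherCol': {'AAPL': A, 'GOOG': C}}

def pick_transform(X):
    # X is the single-column frame handed over by ColumnTransformer.
    # Look up the function for (column name, group label) and apply it per group.
    # group_keys=False keeps the group label from being added to the index again.
    # Rows are assumed to be grouped by ticker as in the example; groupby may reorder them otherwise.
    return X.groupby(X.index, group_keys=False).apply(
        lambda g: f_dict[g.columns[0]][g.index[0]](g)
    )

ct = ColumnTransformer(
    [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
     for col in f_dict]
)
df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)
print(df)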
Which results in:
        Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
Ticker
AAPL             0         1                       0                     1
AAPL             1         1                       1                     1
AAPL             2         1                       4                     1
AAPL             3         1                       9                     1
AAPL             4         1                      16                     1
GOOG             5         1                      10                    10
GOOG             6         1                      12                    10
GOOG             7         1                      14                    10
GOOG             8         1                      16                    10
GOOG             9         1                      18                    10
To handle more columns, add them to f_dict; the list comprehension then creates a transformer for each one.
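Note that the functions in f_dict are stateless: nothing is learned during fit. If each group needs its own fitted transformer (for example one scaler per ticker, fitted on training data and reused on new data), one option is a small custom estimator. The following is only a sketch under assumptions: GroupwiseTransformer is a hypothetical helper, not part of scikit-learn, the group label is assumed to be an ordinary column rather than the index, and StandardScaler plus the toy numbers are just placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fits one clone of `transformer` per group label."""

    def __init__(self, transformer, group_col):
        self.transformer = transformer
        self.group_col = group_col

    def fit(self, X, y=None):
        features = X.drop(columns=self.group_col)
        groups = X[self.group_col].to_numpy()
        # One independently fitted clone per group, keyed by the group label.
        self.transformers_ = {
            label: clone(self.transformer).fit(features[groups == label])
            for label in pd.unique(groups)
        }
        return self

    def transform(self, X):
        features = X.drop(columns=self.group_col)
        groups = X[self.group_col].to_numpy()
        out = np.empty(features.shape, dtype=float)
        for label in pd.unique(groups):
            mask = groups == label
            # A label that was not seen during fit raises KeyError here.
            out[mask] = self.transformers_[label].transform(features[mask])
        return out  # same row order as X


# Toy data (made up for illustration).
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 3 + ["GOOG"] * 3,
    "Volatility": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Pass the group column together with the feature column(s) to be transformed.
ct = ColumnTransformer([
    ("vol", GroupwiseTransformer(StandardScaler(), group_col="Ticker"),
     ["Ticker", "Volatility"]),
])
scaled = ct.fit_transform(df)
print(scaled.ravel())
```

Because each ticker is scaled with its own mean and standard deviation, both groups end up with the same standardized values here even though their raw scales differ by a factor of ten. The fitted per-group scalers are kept in transformers_, so calling transform on later data reuses the statistics learned during fit.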