I am trying to figure out how Featuretools works and I am testing it on the Housing Prices dataset on Kaggle. Because the dataset is huge, I'll work here with only a subset of it.
The dataframe is:
train=pd.DataFrame({ 'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260} })
I set the dataframe properties:
dataframes = {'train': (train, 'Id')}
Then I call the dfs method:
train_feature_matrix, train_feature_names = ft.dfs(dataframes=dataframes, target_dataframe_name='train', max_depth=10, agg_primitives=["mean", "sum", "mode"])
I get the following warning:
UnusedPrimitiveWarning: Some specified primitives were not used during DFS: agg_primitives: [‘mean’, ‘mode’, ‘sum’] This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used. warnings.warn(warning_msg, UnusedPrimitiveWarning)
And train_feature_matrix is exactly the same as the original train dataframe.
At first, I assumed this was because I have a small dataframe and nothing useful can be extracted. But I get the same behavior with the entire dataframe (80 columns and 1460 rows).
Every example I saw on the Featuretools page had 2+ dataframes, but I only have one.
Can you shed some light here? What am I doing wrong?
Answer
Aggregation primitives cannot create features on an EntitySet with a single DataFrame.
This is because the aggregation they perform occurs over the one-to-many relationship that exists when you have a parent-child relationship between DataFrames in an EntitySet. The Featuretools guide on primitives has a section that explains the difference. With your data, that might look like a child DataFrame that has a non-unique house_id column. Then, running dfs on your train DataFrame would aggregate the desired information for each Id, using every row in which that Id appears in the child DataFrame.
To get automated feature generation with a single DataFrame, you should use Transform features. The available Transform Primitives can be found here.