Featuretools: custom Rank primitive returns wrong rank results

Using Featuretools, I want to convert the values of a feature to their ranks.

My exact question is below. Any help would be appreciated.

First, the following code uses the rank function of pandas and displays the result. I believe this result is correct.

import pandas as pd
df = pd.DataFrame({'col1': [50, 80, 100, 80,90,100,150],
                   'col2': [0.3, 0.05, 0.1, 0.1,0.4,0.7,0.9]})
print(df.rank(method="dense",ascending=True))

However, when I create a custom primitive and run the following code, the results are different. Why does this happen? Please correct my code if it is wrong. Thank you very much for your help.

from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric
import pandas as pd

class Rank(TransformPrimitive):
    name = 'rank'
    input_types = [Numeric]
    return_type = Numeric

    def get_function(self):
        def rank(column):
            return column.rank(method="dense",ascending=True)     
        return rank

df = pd.DataFrame({'col1': [50, 80, 100, 80,90,100,150],
                   'col2': [0.3, 0.05, 0.1, 0.1,0.4,0.7,0.9]})

import featuretools as ft
es = ft.EntitySet(id="test_es",     
                  entities=None,
                  relationships=None)

es.entity_from_dataframe(entity_id="data",
                         dataframe=df,
                         index="index",
                         variable_types=None,
                         make_index=True,
                         time_index=None,
                         secondary_time_index=None,
                         already_sorted=False)

feature_matrix, feature_defs = ft.dfs(entities=None,
                                      relationships=None,
                                      entityset=es,  
                                      target_entity="data", 
                                      cutoff_time=None,
                                      instance_ids=None,
                                      agg_primitives=None, 
                                      trans_primitives=[Rank], 
                                      groupby_trans_primitives=None, 
                                      allowed_paths=None,
                                      max_depth=2,
                                      ignore_entities=None, 
                                      ignore_variables=None, 
                                      primitive_options=None, 
                                      seed_features=None, 
                                      drop_contains=None,
                                      drop_exact=None,
                                      where_primitives=None,
                                      max_features=-1,
                                      cutoff_time_in_index=False,
                                      save_progress=None,
                                      features_only=False,
                                      training_window=None,
                                      approximate=None,
                                      chunk_size=None,
                                      n_jobs=-1,
                                      dask_kwargs=None,
                                      verbose=False,
                                      return_variable_types=None,
                                      progress_callback=None,     
                                      include_cutoff_time=False)
feature_matrix

Here is the result.

[Image: output feature matrix]

However, when I tried the following code, I was able to get the correct data. Why are the answers different?

import pandas as pd
df = pd.DataFrame({'col1': [50, 80, 100, 80,90,100,150],
                   'col2': [0.3, 0.05, 0.1, 0.1,0.4,0.7,0.9]})
print(df.rank(method="dense",ascending=True))


pd.set_option('display.max_columns', 2000)

  
import featuretools as ft
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data',
                         dataframe=df,
                         index='index')

fm, fd = ft.dfs(entityset=es,
            target_entity='data',
            trans_primitives=[Rank])
fm

Answer

NEW ANSWER: Based on your updated code, the problem arises because you are setting n_jobs=-1. When you do this, Featuretools distributes the calculation of the feature matrix across multiple workers behind the scenes. To do so, it breaks up the dataframe and sends a piece to each worker, so each worker computes the transform feature values on only part of the data.

This creates a problem for the Rank primitive you have defined, because the primitive needs all of the data to be present to produce a correct answer. In situations like this, set uses_full_entity=True when defining the primitive. This forces Featuretools to include all of the data when the primitive function is called to compute the feature values.
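
To see why splitting the data matters, here is a minimal pandas-only sketch. It assumes, purely for illustration, that col1 is split into two chunks of four and three rows; the way Featuretools actually partitions the data across workers may differ. Each chunk is ranked on its own and compared with ranking the full column:

```python
import pandas as pd

col1 = pd.Series([50, 80, 100, 80, 90, 100, 150], name="col1")

# Rank with the whole column visible (what the question expects).
full_rank = col1.rank(method="dense", ascending=True)

# Hypothetical split into two chunks, each ranked independently,
# as would happen if every worker only saw its own piece.
chunks = [col1.iloc[:4], col1.iloc[4:]]
chunked_rank = pd.concat(
    [chunk.rank(method="dense", ascending=True) for chunk in chunks]
)

# Side-by-side comparison, aligned on the original row index.
print(pd.DataFrame({"col1": col1, "full": full_rank, "chunked": chunked_rank}))
```

With this hypothetical split, the ranks restart inside each chunk, so the value 150 gets rank 3 instead of 5.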

If you update the Rank primitive definition as follows, you will get the correct answer:

class Rank(TransformPrimitive):
    name = 'rank'
    input_types = [Numeric]
    return_type = Numeric
    uses_full_entity = True

    def get_function(self):
        def rank(column):
            return column.rank(method="dense",ascending=True)     
        return rank

OLD ANSWER: In the custom primitive function you define, the parameters you are passing to rank are different than the parameters you are using when you call rank directly on the DataFrame.

When calling directly on the DataFrame you are using the following parameters:

.rank(method="min", ascending=False, numeric_only=True)

In the custom primitive function you are using different values:

.rank(method="dense", ascending=True)

If you update the primitive function to use the same parameters, the results you get from Featuretools should match what you get when calling rank directly on the DataFrame.
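
As a quick illustration of how much these parameters matter, this pandas-only sketch ranks the col1 values from the question with both parameter sets. The names min_desc and dense_asc are only for this example:

```python
import pandas as pd

df = pd.DataFrame({'col1': [50, 80, 100, 80, 90, 100, 150]})

# Parameters from the direct DataFrame call: ties share the lowest rank,
# and the largest value is ranked first.
min_desc = df.rank(method="min", ascending=False, numeric_only=True)['col1']

# Parameters from the custom primitive: no gaps between ranks,
# and the smallest value is ranked first.
dense_asc = df.rank(method="dense", ascending=True)['col1']

print(pd.DataFrame({'col1': df['col1'],
                    'min_desc': min_desc,
                    'dense_asc': dense_asc}))
```

The two columns differ on every row, so a mismatch in parameters alone is enough to make the outputs disagree.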
