A function to create interaction variables: what is wrong with the code?



I’ve written a function below that takes a dataframe (df) and two of its column names (var1, var2) as arguments. It creates interaction variables for the two columns and adds them to the original dataframe. The code works when I hard-code it, but when I call the function like this:

create_interactions(my_dataframe, 'variable1', 'variable2')
my_dataframe

I receive no errors, but the new columns are not added to the dataframe: my_dataframe still contains only the original columns. What am I doing wrong? Thank you.

def create_interactions(df,var1,var2):
    variables = df[[var1,var2]] 
    for i in range(0, variables.columns.size):
        for j in range(0, variables.columns.size):
            col1 = str(variables.columns[i])
            col2 = str(variables.columns[j])
            if i <= j:
                name = col1 + "*" + col2
                df = pd.concat([df, pd.Series(variables[col1] * variables[col2], name=name)], axis=1)

Answer

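Assigning df = ... inside the function doesn’t modify the original DataFrame. It rebinds the local name df to a new object, and since your function never returns that object, it is discarded when the function exits.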
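Here is a minimal sketch of the difference between rebinding and modifying in place. The function names rebind and mutate, the sample frame, and the column c are made up for illustration:

```python
import pandas as pd

def rebind(df):
    # Assignment points the local name `df` at a brand-new DataFrame;
    # the object the caller passed in is left untouched.
    df = pd.concat([df, pd.Series([5, 6], name="c")], axis=1)

def mutate(df):
    # Item assignment changes the object the caller passed in.
    df["c"] = [5, 6]

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

rebind(frame)
cols_after_rebind = list(frame.columns)  # 'c' was not added

mutate(frame)
cols_after_mutate = list(frame.columns)  # 'c' is now present
```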

You could return df from your function, and then use it like df = create_interactions(df, 'var1', 'var2').
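As a sketch, this is roughly what the original loop looks like with a return added. The sample frame and its column names x and y are assumptions for the example, not part of the question:

```python
import pandas as pd

def create_interactions(df, var1, var2):
    variables = df[[var1, var2]]
    for i in range(0, variables.columns.size):
        for j in range(0, variables.columns.size):
            col1 = str(variables.columns[i])
            col2 = str(variables.columns[j])
            if i <= j:
                name = col1 + "*" + col2
                product = (variables[col1] * variables[col2]).rename(name)
                df = pd.concat([df, product], axis=1)
    # Hand the new DataFrame back to the caller.
    return df

my_dataframe = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# The caller must capture the return value, otherwise the result is lost.
my_dataframe = create_interactions(my_dataframe, "x", "y")
```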

But if you do want your function to modify the original df, it might be better to change your last line to this:

df[name] = variables[col1] * variables[col2]

This will insert the new column into the existing DataFrame.

There are a couple of other odd things about your code. You create a new variable called variables that just contains two columns of the original df. Then you loop over range(0, variables.columns.size). But since you defined variables to have only two columns, variables.columns.size will always be 2. Later, you grab columns from variables, but these same columns are already present in df, so you could just grab them from df instead.

Also, your code creates “interactions” of each variable with itself, which seems a bit odd. I think your code could be simplified to this:

def create_interaction(df, var1, var2):
    # Add the product of the two columns to df in place.
    name = var1 + "*" + var2
    df[name] = df[var1] * df[var2]

Since you only accept exactly two variables, there will be exactly one interaction, so you don’t need any loops at all: just grab the two specified variables and multiply them. (And I renamed it create_interaction to indicate this! :-)
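A quick usage sketch of the in-place version, assuming a small made-up frame that reuses the column names from the question:

```python
import pandas as pd

def create_interaction(df, var1, var2):
    # Adds one product column to `df` in place and returns nothing.
    name = var1 + "*" + var2
    df[name] = df[var1] * df[var2]

my_dataframe = pd.DataFrame({"variable1": [1, 2, 3], "variable2": [4, 5, 6]})
create_interaction(my_dataframe, "variable1", "variable2")

# The new column is now part of the caller's DataFrame.
product = my_dataframe["variable1*variable2"].tolist()
```

Because this version returns None, call it on its own line rather than writing my_dataframe = create_interaction(...), which would replace your DataFrame with None.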



Source: stackoverflow