I’m writing a class that does one-hot encoding, but it doesn’t work as I expected.
In my main code I have this:
for col in train_x_categorical.columns:
    dataCleaner.addFeatureToBeOneHotEncoded(col)
dataCleaner.applyOneHotEncoding(train_x_categorical)
train_x_categorical.head()
The class methods are the following:
def addFeatureToBeOneHotEncoded(self, featureName):
    self._featuresToBeOneHotEncoded.append(featureName)

def applyOneHotEncoding(self, data):
    for feature in self._featuresToBeOneHotEncoded:
        dummies = pd.get_dummies(data[feature])
        dummies.drop(dummies.columns[-1], axis=1, inplace=True)
        data.drop(feature, axis=1, inplace=True)
        data = pd.concat([data, dummies], axis=1)
    print(data.columns)
Now, with print(data.columns) I can see that the method works correctly, but when train_x_categorical.head() runs I can’t see the effect of the method applyOneHotEncoding.
I don’t understand why this is happening and how to fix it.
I thought that since Python passes values by reference, the variable data points to the same object as the variable train_x_categorical, so in the method applyOneHotEncoding I was working on the same object, but clearly I am wrong.
Can someone explain to me why my reasoning is wrong and how I can solve the problem?
Answer
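It is because applyOneHotEncoding rebinds its local variable data. Python passes object references by value: when the method starts, data and train_x_categorical refer to the same DataFrame, but the line data = pd.concat([data, dummies], axis=1) creates a new DataFrame and makes only the local name data point at it. The caller's train_x_categorical still refers to the original object. That object loses the first encoded column, because the first data.drop(..., inplace=True) runs before the rebinding, but it never receives the dummy columns. This is well-known Python behavior. There are a couple of ways around it that I know of. One is to have your method return the value: add return data after the loop and assign the result back to train_x_categorical in the calling code. The other is to put the variable to be updated in a wrapper class and pass that to the method; updating the attribute on the wrapper is then visible to the caller.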
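To make the difference between rebinding a name and mutating an object concrete, here is a minimal sketch using plain lists instead of DataFrames. The function names rebind and mutate are made up for illustration and do not come from the question.

```python
def rebind(items):
    # Assignment makes the local name point at a brand-new list;
    # the list the caller passed in is not touched.
    items = items + [99]
    return items


def mutate(items):
    # append() changes the very list object the caller also refers to.
    items.append(99)


original = [1, 2, 3]
result = rebind(original)
print(original)  # [1, 2, 3]
print(result)    # [1, 2, 3, 99]

mutate(original)
print(original)  # [1, 2, 3, 99]
```

The line data = pd.concat(...) in the question behaves like rebind, while data.drop(..., inplace=True) behaves like mutate.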
See this for an exhaustive discussion: How do I pass a variable by reference?
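The return-based fix mentioned in the answer could look roughly like the sketch below. The DataCleaner class body and the sample data are assumptions made for illustration; only the two method names and the _featuresToBeOneHotEncoded attribute come from the question.

```python
import pandas as pd


class DataCleaner:
    # Hypothetical reconstruction of the asker's class.
    def __init__(self):
        self._featuresToBeOneHotEncoded = []

    def addFeatureToBeOneHotEncoded(self, featureName):
        self._featuresToBeOneHotEncoded.append(featureName)

    def applyOneHotEncoding(self, data):
        for feature in self._featuresToBeOneHotEncoded:
            dummies = pd.get_dummies(data[feature])
            # Drop the last dummy column, as in the question.
            dummies = dummies.drop(columns=dummies.columns[-1])
            # Non-in-place drop, so the caller's frame is left untouched.
            data = pd.concat([data.drop(columns=feature), dummies], axis=1)
        # Hand the new DataFrame back instead of relying on side effects.
        return data


# Made-up sample data, not from the question.
train_x_categorical = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size": ["S", "M", "S", "L"],
})

dataCleaner = DataCleaner()
for col in train_x_categorical.columns:
    dataCleaner.addFeatureToBeOneHotEncoded(col)

# Rebind the caller's name to the returned frame.
train_x_categorical = dataCleaner.applyOneHotEncoding(train_x_categorical)
print(list(train_x_categorical.columns))  # ['blue', 'green', 'L', 'M']
```

Returning the frame keeps the method free of hidden side effects: the caller decides whether to overwrite train_x_categorical or keep both versions.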