I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder.
In active_features_ attribute in OneHotEncoder one can see a very good explanation how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() was executed.
My question is:
For e.g. DataFrame based input data:
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
How does the code look like to get from the original feature names a, b and c to a list of the transformed feature names 
(like e.g:
a-0,a-1, a-2, b-0, b-1, b-2, b-3, c-0, c-1, c-2, c-3 
or
a-0,a-1, a-2, b-0, b-1, b-2, b-3, b-4, b-5, b-6, b-7, b-8 
or anything that helps to see the assignment of encoded columns to the original columns).
Background: I would like to see the feature importances of some of the algorithms to get a feeling for which feature have the most effect on the algorithm used.
Advertisement
Answer
You can use pd.get_dummies():
pd.get_dummies(data["a"],prefix="a")
will give you:
a_0 a_1 a_2 0 1 0 0 1 0 1 0 2 0 0 1 3 1 0 0
which can automatically generates the column names. You can apply this to all your columns and then get the columns names. No need to convert them to a numpy matrix.
So with:
df = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]})
data = df.as_matrix()
the solution looks like:
columns = df.columns
my_result = pd.DataFrame()
temp = pd.DataFrame()
for runner in columns:
    temp = pd.get_dummies(df[runner], prefix=runner)
    my_result[temp.columns] = temp
print(my_result.columns)
>>Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5', 'c_0', 'c_1', 'c_4',
       'c_5'],
      dtype='object')
