I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder.
In "active_features_ attribute in OneHotEncoder" one can find a very good explanation of how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() has been executed.
My question is:
For DataFrame-based input data such as:
    data = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]}).values
What does the code look like to get from the original feature names a, b and c to a list of the transformed feature names (e.g. a-0, a-1, a-2, b-0, b-1, b-2, b-3, c-0, c-1, c-2, c-3 or a-0, a-1, a-2, b-0, b-1, b-2, b-3, b-4, b-5, b-6, b-7, b-8, or anything else that helps to see the assignment of encoded columns to the original columns)?
Background: I would like to see the feature importances of some of the algorithms to get a feeling for which features have the most effect on the algorithm used.
Answer
You can use pd.get_dummies(). Assuming data is still the DataFrame (before any conversion to a matrix),
    pd.get_dummies(data["a"], prefix="a")
will give you:
       a_0  a_1  a_2
    0    1    0    0
    1    0    1    0
    2    0    0    1
    3    1    0    0
which automatically generates the column names. You can apply this to all your columns and then read off the column names. There is no need to convert the data to a numpy matrix first.
So with:
    df = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]})
    data = df.values
the solution looks like:
    columns = df.columns
    my_result = pd.DataFrame()
    temp = pd.DataFrame()
    for runner in columns:
        temp = pd.get_dummies(df[runner], prefix=runner)
        my_result[temp.columns] = temp
    print(my_result.columns)
    >>Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5', 'c_0', 'c_1', 'c_4', 'c_5'], dtype='object')
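As a possible shortcut, the same column names can probably be obtained without the loop by passing the column list to pd.get_dummies() directly. The sketch below assumes the same example DataFrame; the names encoded, feature_names and source_of are only illustrative, and the mapping back to the original column assumes the original column names contain no "_" that would confuse the split.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]})

# Encode every column in one call. The columns are integer-typed, so they
# must be listed explicitly; each new column is named <original>_<value>.
encoded = pd.get_dummies(df, columns=list(df.columns))
feature_names = list(encoded.columns)

# Map each encoded column back to the column it came from by splitting
# on the last underscore (illustrative helper, not part of pandas).
source_of = {name: name.rsplit("_", 1)[0] for name in feature_names}

print(feature_names)
print(source_of)
```

With such a mapping, the importance reported for an encoded column like b_4 can be traced back to the original column b.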