I have a pandas dataframe that I want to group by and create columns for each value of col1 and they should contain the value of col2. And example dataframe:
JavaScript
x
7
1
data = {'item_id': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3},
2
'feature_category_id': {0: 56, 1: 62, 2: 68, 3: 56, 4: 72},
3
'feature_value_id': {0: 365, 1: 801, 2: 351, 3: 802, 4: 75}}
4
5
df = pd.DataFrame(data)
6
7
I want to groupby item_id, create as many columns as feature_category_id and fill them with the feature_value_id.
The resultant df for the example would look like this:
JavaScript
1
8
1
data = {'item_id': {0: 2, 1: 3},
2
'feature_56': {0: 801, 1: 802},
3
'feature_62': {0: 365, 1: None},
4
'feature_68': {0: 351, 1: None},
5
'feature_72': {0: None, 1: 75},}
6
7
df = pd.DataFrame(data)
8
Where features not present for a certain item_id (but present for at least one item_id) are NaN.
Which would be the most optimal operation to do this?
Advertisement
Answer
What you are searching for is pandas pivot()
function. It does exactly what you want:
JavaScript
1
7
1
# Change df shape
2
result = df.pivot(index="item_id", columns="feature_category_id")
3
4
# Change the axis labels
5
result.columns = ["feature_" + str(x[1]) for x in result.columns]
6
result = result.reset_index()
7
Output:
JavaScript
1
4
1
item_id feature_56 feature_62 feature_68 feature_72
2
0 2 365.0 801.0 351.0 NaN
3
1 3 802.0 NaN NaN 75.0
4