I have a pandas dataframe that I want to group by and create columns for each value of col1 and they should contain the value of col2. And example dataframe:
data = {'item_id': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3}, 'feature_category_id': {0: 56, 1: 62, 2: 68, 3: 56, 4: 72}, 'feature_value_id': {0: 365, 1: 801, 2: 351, 3: 802, 4: 75}} df = pd.DataFrame(data)
I want to groupby item_id, create as many columns as feature_category_id and fill them with the feature_value_id.
The resultant df for the example would look like this:
data = {'item_id': {0: 2, 1: 3}, 'feature_56': {0: 801, 1: 802}, 'feature_62': {0: 365, 1: None}, 'feature_68': {0: 351, 1: None}, 'feature_72': {0: None, 1: 75},} df = pd.DataFrame(data)
Where features not present for a certain item_id (but present for at least one item_id) are NaN.
Which would be the most optimal operation to do this?
Advertisement
Answer
What you are searching for is pandas pivot()
function. It does exactly what you want:
# Change df shape result = df.pivot(index="item_id", columns="feature_category_id") # Change the axis labels result.columns = ["feature_" + str(x[1]) for x in result.columns] result = result.reset_index()
Output:
item_id feature_56 feature_62 feature_68 feature_72 0 2 365.0 801.0 351.0 NaN 1 3 802.0 NaN NaN 75.0