pandas groupby create new columns based on col1 containing value of col2

I have a pandas DataFrame that I want to group by one column, creating a new column for each distinct value of col1 that holds the corresponding value of col2. An example DataFrame:

import pandas as pd

data = {'item_id': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3},
 'feature_category_id': {0: 56, 1: 62, 2: 68, 3: 56, 4: 72},
 'feature_value_id': {0: 365, 1: 801, 2: 351, 3: 802, 4: 75}}

df = pd.DataFrame(data)

I want to group by item_id, create one column per distinct feature_category_id, and fill each column with the corresponding feature_value_id.

The resultant df for the example would look like this:

data = {'item_id': {0: 2, 1: 3},
 'feature_56': {0: 365, 1: 802},
 'feature_62': {0: 801, 1: None},
 'feature_68': {0: 351, 1: None},
 'feature_72': {0: None, 1: 75},}

df = pd.DataFrame(data)

Where features not present for a certain item_id (but present for at least one item_id) are NaN.

What would be the most efficient way to do this?

Answer

What you are looking for is the pandas pivot() method. It does exactly what you want:

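# Reshape from long to wide: one row per item_id, one column per feature_category_id
result = df.pivot(index="item_id", columns="feature_category_id")

# Without values=, the columns are (value column, category id) tuples,
# so x[1] is the category id used to build the new label
result.columns = ["feature_" + str(x[1]) for x in result.columns]
result = result.reset_index()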

Output:

   item_id  feature_56  feature_62  feature_68  feature_72
0        2       365.0       801.0       351.0         NaN
1        3       802.0         NaN         NaN        75.0
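
The values come back as floats because NaN forces a float dtype. As a possible variant (a sketch, not part of the original answer), passing values= gives flat column labels, and casting to the nullable "Int64" dtype keeps the ids as integers. The name wide is only illustrative:

```python
import pandas as pd

data = {
    "item_id": [2, 2, 2, 3, 3],
    "feature_category_id": [56, 62, 68, 56, 72],
    "feature_value_id": [365, 801, 351, 802, 75],
}
df = pd.DataFrame(data)

# Passing values= keeps the columns as a flat index of category ids
# instead of a MultiIndex, so no tuple unpacking is needed.
wide = df.pivot(
    index="item_id",
    columns="feature_category_id",
    values="feature_value_id",
)

# Prefix the column labels, then turn item_id back into a regular column.
wide = wide.add_prefix("feature_").reset_index()
wide.columns.name = None  # drop the leftover "feature_category_id" axis label

# Optional: nullable integers keep the ids as ints while still allowing
# missing values (shown as <NA> instead of NaN).
wide = wide.astype("Int64")

print(wide)
```

Note that pivot() raises a ValueError if the same (item_id, feature_category_id) pair appears more than once. In that case pivot_table() with an explicit aggfunc (for example "first") may be a better fit, depending on how duplicates should be resolved.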
