I have a table containing price data for a set of products over 6 months. Each product has a unique id (sku_id) and can be from size 6-12. We measured the price each day, and generated a table similar to the example below. Source indicates what website the price was on (can be 1-4).
| sku_id | size | price | timestamp | source | | 1 | 6.0 | 115.0 | 2021-01-10| 1 | | 2 | 8.0 | 149.0 | 2021-01-10| 1 | | 1 | 6.0 | 168.0 | 2021-01-10| 2 | | 1 | 9.0 | 152.0 | 2021-01-10| 1 | ... | 1 | 9.0 | 152.0 | 2021-07-10| 1 |
Now, I want to perform some analytics/modelling on the above dataset, but I’m having trouble formatting it as needed. I want one row for each product from each site (i.e. the key will be [sku_id, size, source]), where we get the price for each day from January 10th, to July 1st in a new column. This is represented in the table below.
| sku_id | size | source |price on 2021-01-10|price on 2021-01-11|price on 2021-01-12| ... | | 1 | 6.0 | 1 | 149.0 | 151.2 | 158.2 | ... | | 2 | 6.0 | 1 | 142.0 | 221.8 | 312.9 | ... | ...
I’m trying to do this in Pandas using a group by (grouping by sku_id, size, and source), but I’m not sure how to instruct Pandas to tack a new column on for each day. Does anyone have instructions/advice on how to build this second table from the first one? I am not quite sure how to get the group by to work, and I don’t know what other strategies could work either.
Advertisement
Answer
Use pivot
to transform your dataframe:
# pivot args: 1st -> index, 2nd -> columns, 3rd -> values >>> df.pivot(['sku_id', 'size', 'source'], 'timestamp', 'price') .rename_axis(columns=None).add_prefix('price on ').reset_index() sku_id size source price on 2021-01-10 price on 2021-07-10 0 1 6.0 1 115.0 NaN 1 1 6.0 2 168.0 NaN 2 1 9.0 1 152.0 152.0 3 2 8.0 1 149.0 NaN