Apologies for the vague title, I’m not entirely sure how to word it more correctly. I have a DataFrame like this:
date customerID saved purchased savedProduct purchasedProduct 0 2021-01-01 456789 1 0 11223344 [0] 1 2021-01-01 456789 1 0 55667788 [0] 2 2021-01-03 456789 0 1 0 [11223344, 28373827]
Which is created with this:
d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03'], 'customerID': ['456789', '456789', '456789'], 'saved':[1, 1, 0], 'purchased': [0, 0, 1], 'savedProduct': [11223344, 55667788, 0], 'purchasedProduct': [[0], [0], [11223344, 28373827]]} df = pd.DataFrame(data=d)
And the logic behind it is that each line is a customer record: they can only ever save one product at a time (which is why savedProduct has one product code) but they can purchase multiple products, which is why purchasedProduct contains a list. What I want to do is:
- By customerID, get unique productIDs in savedProduct
- By unique productID in this column, see if they appear in purchasedProduct
- If they appear, pull the date column from the line in which purchasedProduct appears so I can calculate the amount of days between savedProduct and purchasedProduct
So e.g., the product in line 1 appears in line 3 so preferably there’d be a way to have both the first line’s date (2021-01-01) and third line’s date (2021-01-03) in the same row so we can calculate difference between the dates.
I thought a nested loop would do the job but I can’t get it to work (and there must be a more efficient way..):
dateDF = pd.DataFrame({'customerID': ['0'], 'savedDate': ['0'], 'purchasedDate': ['0']}) dateDF_t = pd.DataFrame() sp = [] for x in df['customerID'].unique(): customerID = x sp = df[df['customerID'] == x]['savedProduct'].unique() for i in sp: for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']): if i in n and i != 0: print(df[df['customerID'] == x].iloc[idx, 1]) dateDF_t['customerID'] = df[df['customerID'] == x].iloc[idx, 1] dateDF_t['savedDate'] = df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date'] dateDF_t['purchasedDate'] = df[df['customerID'] == x].iloc[idx, 0] dateDF = pd.concat([dateDF, dateDF_t])
But the output is like this:
customerID savedDate purchasedDate 0 0 0 0 0 NaN 2021-01-01 2021-01-03
Is there any way to do this better and also, why is customerID producing NaNs? When I have the output (the print in the loop) it works fine
Thanks for any help!
EDIT – may have just figured it out using lists instead but if someone has a more efficient way, would still be appreciated!
sp = [] customerIDs = [] savedDates = [] purchasedDates = [] for x in df['customerID'].unique(): sp = df[df['customerID'] == x]['savedProduct'].unique() for i in sp: for idx, n in enumerate(df[df['customerID'] == x]['purchasedProduct']): if i in n and i != 0: customerIDs.append(df[df['customerID'] == x].iloc[idx, 1]) savedDates.append(df[(df['customerID'] == x) & (df['savedProduct'] == i)]['date'].values[0]) purchasedDates.append(df[df['customerID'] == x].iloc[idx, 0]) savedDF = pd.DataFrame({'customerID': customerIDs, 'savedDates': savedDates, 'purchasedDates': purchasedDates})
which has the following output:
customerID savedDates purchasedDates 456789 2021-01-01 2021-01-03 2727228 2021-02-05 2021-02-09
Advertisement
Answer
Try:
df=df.explode('purchasedProduct').reset_index(drop=True) df['purchase_date'] = df.groupby('customerID').apply( lambda df: df.apply( lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))
This will first explode the rows with lists in purchasedProducts, so it creates a seperate row for each item in the list. Then it adds a purchase date column, so you can determine on row level if and when the product is bought.
date customerID saved purchased savedProduct purchasedProduct purchase_date 2021-01-01 456789 1 0 11223344 0 2021-01-03 2021-01-01 456789 1 0 55667788 0 NaN 2021-01-03 456789 0 1 0 11223344 NaN 2021-01-03 456789 0 1 0 28373827 NaN
Of course you can filter the df to only have rows with saved products:
df.loc[df.saved==1]
date customerID saved purchased savedProduct purchasedProduct purchase_date 2021-01-01 456789 1 0 11223344 0 2021-01-03 2021-01-01 456789 1 0 55667788 0 NaN
Or with only certain columns:
df.loc[df.saved==1, ['customerID', 'savedProduct', 'date',`'purchase_date']]
customerID savedProduct date purchase_date 456789 11223344 2021-01-01 2021-01-03 456789 55667788 2021-01-01 NaN