I am trying to come up with a dynamic way to check for the existence of a string and report back a few different results: gone_client, boomerang, new_client.
If I groupby address_id and my_date, and the pattern is Verizon, Verizon, Comcast, Comcast, the client left Verizon and went to another company.
If the client went from Verizon to Comcast and then back to Verizon, this is a boomerang.
If the client is new to Verizon last month, this is a new client, but if the client was with Verizon at some point in the past and new last month, this is a boomerang.
import pandas as pd import numpy as np
# data stored in dictionary details = { 'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777], 'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'], 'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28'] } df = pd.DataFrame(details) df
I can’t get the logic quite right, but I think it’s something along these lines…but something seems to be off.
def f(ser): if 'Verizon' not in ser.unique(): return False if 'Verizon' in ser.unique(): return True return False df['gone_client'] = df.groupby('address_id')['my_company'].transform(f) df.head()
Maybe something like this…
# substring to be searched sub ='Verizon' # creating and passing series to new column df["Indexes"]= df["my_company"].str.find(sub) df
I’m not sure how to handle the dates though, and only by checking the dates can you know whether a client is gone, boomeranging, new, or maybe some weird patterns which doesn’t fit into a bucket, like ‘Other’. Expected result would be this (last scenario is kind of a boomerang that left yet again). I wouldn’t expect many scenarios like this, but there certainly will be some very unique cases, so ‘Other’ is fine for these records.
address_id my_company my_date status 0 111 Comcast 1/24/2022 Left 1 111 Verizon 2/21/2022 Left 2 111 Other 3/28/2022 Left 3 111 Other 4/25/2022 Left 4 111 Comcast 5/23/2022 Left 5 111 Comcast 6/27/2022 Left 6 222 Spectrum 1/24/2022 Never Client 7 222 Spectrum 2/21/2022 Never Client 8 222 Spectrum 3/28/2022 Never Client 9 222 Spectrum 4/25/2022 Never Client 10 222 Spectrum 5/23/2022 Never Client 11 222 Spectrum 6/27/2022 Never Client 12 333 Verizon 1/24/2022 Never Left 13 333 Verizon 2/21/2022 Never Left 14 333 Verizon 3/28/2022 Never Left 15 333 Verizon 4/25/2022 Never Left 16 333 Verizon 5/23/2022 Never Left 17 333 Verizon 6/27/2022 Never Left 18 444 Spectrum 1/24/2022 Left 19 444 Spectrum 2/21/2022 Left 20 444 Spectrum 3/28/2022 Left 21 444 Spectrum 4/25/2022 Left 22 444 Verizon 5/23/2022 Left 23 444 Spectrum 6/27/2022 Left 24 555 Verizon 1/24/2022 Boomerang 25 555 Spectrum 2/21/2022 Boomerang 26 555 Spectrum 3/28/2022 Boomerang 27 555 Spectrum 4/25/2022 Boomerang 28 555 Verizon 5/23/2022 Boomerang 29 555 Verizon 6/27/2022 Boomerang 30 777 Comcast 1/24/2022 New Client 31 777 Comcast 2/21/2022 New Client 32 777 Verizon 1/24/2022 New Client 33 888 Verizon 2/21/2022 Other 34 888 Comcast 3/28/2022 Other 35 888 Verizon 4/25/2022 Other 36 888 Comcast 5/23/2022 Other 37 888 Comcast 6/27/2022 Other
Advertisement
Answer
Here’s a solution with some verbose logic that you can play around with. It doesn’t sound like you’re quite certain of your final logic but hopefully this gives you enough to play around with.
This solution groups the dataframe based on the address_id
. Then, for each individual group, we can examine which my_company
feature. We can use this, along with a timedelta
to have some logic around whether the address is with Verizon, not with Verizon, never with Verizon, is currently with Verizon and came back in the last 30 days, etc.
This answer is not sponsored by Verizon. Other cellular providers exist.
import pandas as pd import datetime # data stored in dictionary details = { 'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777], 'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'], 'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28'] } df = pd.DataFrame(details) df['my_date'] = pd.to_datetime(df['my_date']) address_groups = df.groupby(['address_id']) frame_list = [] current_date = datetime.datetime.now() for group, frame in address_groups: # Create a list and set of each company used by a given address-id: company_list = frame['my_company'].values.tolist() company_set = set(company_list) # Exclusively Verizon if ('Verizon' in company_set) and (len(company_set) == 1): frame['status'] = 'Verizon Diehard' # Never Verizon if ('Verizon' not in company_set): frame['status'] = 'Verizon Never' # Verizon at some point but not currently if ('Verizon' in company_set) and (company_list[-1] != 'Verizon'): v_frame = frame[frame['my_company'] == 'Verizon'] last_verizon_date = v_frame['my_date'].iloc[-1] last_verizon_date = datetime.datetime.strptime(last_verizon_date, '%Y-%m-%d') if (current_date - last_verizon_date) < pd.Timedelta("30 days"): frame['status'] = 'Not curretly Verizon, but was in last 30 days' else: frame['status'] = 'Not curretly Verizon, but was so more than 30 days ago' # Verizon currently but was a boomerang if (company_list[-1] == 'Verizon') and (len(company_set) >= 2): non_v_frame = frame[frame['my_company'] != 'Verizon'] last_non_v_date = non_v_frame['my_date'].iloc[-1] last_non_v_date = datetime.datetime.strptime(last_non_v_date, '%Y-%m-%d') if (current_date - last_non_v_date) < pd.Timedelta("30 days"): frame['status'] = 'Boomerang back to Verizon in last 30 days' else: frame['status'] = 'Boomerang back more than 30 days ago' frame_list.append(frame) final_df = pd.concat(frame_list)