I have a large customer dataset, it has things like Customer ID, Service ID, Product, etc. So the two ways we can measure churn are at a Customer-ID level, if the entire customer leaves and at a Service-ID level where maybe they cancel 2 out of 5 services.
The data looks like this, and as we can see
- Alligators stops being a customer at the end of Jan as they dont have any rows in Feb (CustomerChurn)
- Aunties stops being a customer at the end of Jan as they dont have any rows in Feb (CustomerChurn)
- Bricks continues with Apples and Oranges in Jan and Feb (ServiceContinue)
- Bricks continues being a customer but cancels two services at the end of Jan (ServiceChurn)
I am trying to write some code that creates the ‘Churn’ column.. I have tried
- To manually just grab lists of CustomerIDs and ServiceIDs using Set from Oct 2019, and then comparing that to Nov 2019, to find the ones that churned. This is not too slow but doesn’t seem very Pythonic.
Thank you!
data = {'CustomerName': ['Alligators','Aunties', 'Bricks', 'Bricks','Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks'], 'ServiceID': [1009, 1008, 1001, 1002, 1003, 1004, 1001, 1002, 1001, 1002], 'Product': ['Apples', 'Apples', 'Apples', 'Bananas', 'Oranges', 'Watermelon', 'Apples', 'Bananas', 'Apples', 'Bananas'], 'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'], 'Year': [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021], 'Churn': ['CustomerChurn', 'CustomerChurn', 'ServiceContinue', 'ServiceContinue', 'ServiceChurn', 'ServiceChurn','ServiceContinue', 'ServiceContinue', 'NA', 'NA']} df = pd.DataFrame(data) df
Advertisement
Answer
I think this gets close to what you want, except for the NA in the last two rows, but if you really need those NA, then you can filter by date and change the values.
Because you are really testing two different groupings, I send the first customername grouping through a function and depending what I see, I send a more refined grouping through a second function. For this data set it seems to work.
I create an actual date column and make sure everything is sorted before grouping. The logic inside the functions is testing the max date of the group to see if it’s less than a certain date. Looks like you are testing March as the current month
You should be able to adapt it for your needs
df['testdate'] = df.apply(lambda x: datetime.datetime.strptime('-'.join((x['Month'], str(x['Year']))),'%b-%Y'), axis=1) df = df.sort_values('testdate') df1 = df.drop('Churn',axis=1) def get_customerchurn(x, tdate): # print(x) # print(tdate) if x.testdate.max() < tdate: x.loc[:, 'Churn'] = 'CustomerChurn' return x else: x = x.groupby(['CustomerName', 'Product']).apply(lambda x: get_servicechurn(x, datetime.datetime(2021,3,1))) return x def get_servicechurn(x, tdate): print(x) # print(tdate) if x.testdate.max() < tdate: x.loc[:, 'Churn'] = 'ServiceChurn' return x else: x.loc[:, 'Churn'] = 'ServiceContinue' return x df2 = df1.groupby(['CustomerName']).apply(lambda x: get_customerchurn(x, datetime.datetime(2021,3,1))) df2
Output:
CustomerName ServiceID Product Month Year testdate Churn 0 Alligators 1009 Apples Jan 2021 2021-01-01 CustomerChurn 1 Aunties 1008 Apples Jan 2021 2021-01-01 CustomerChurn 2 Bricks 1001 Apples Jan 2021 2021-01-01 ServiceContinue 3 Bricks 1002 Bananas Jan 2021 2021-01-01 ServiceContinue 4 Bricks 1003 Oranges Jan 2021 2021-01-01 ServiceChurn 5 Bricks 1004 Watermelon Jan 2021 2021-01-01 ServiceChurn 6 Bricks 1001 Apples Feb 2021 2021-02-01 ServiceContinue 7 Bricks 1002 Bananas Feb 2021 2021-02-01 ServiceContinue 8 Bricks 1001 Apples Mar 2021 2021-03-01 ServiceContinue 9 Bricks 1002 Bananas Mar 2021 2021-03-01 ServiceContinue