Causal Inference in observational data [closed]

I am using the python package DoWhy to see if I have a causal relationship between tenure and churn based on this site.

# TREATMENT = TENURE
causal_df = df.causal.do('tenure',
                         method='weighting',
                         variable_types={'Churn': 'd', 'tenure': 'c',
                                         'nr_login': 'c', 'avg_movies': 'c'},
                         outcome='Churn',
                         common_causes=['nr_login', 'avg_movies'])

I have a number of other variables as well.

  1. Is this the right way to do the analysis?

  2. What do common causes mean, and how do I choose them?

  3. How can I interpret the results, and with what certainty?


Answer

Let’s take your questions one by one.

1. Is this the right way?

Yes, your code snippet is correct, assuming that you want to estimate the causal effect of tenure on Churn by conditioning on nr_login and avg_movies.

However, this method will output a dataframe containing the interventional values of the outcome Churn: that is, the values of the Churn variable as if tenure had been changed independently of the specified common causes. If the treatment tenure were discrete, you could do a simple plot to visualize the effect of its different values. Something like:

causal_df = df.causal.do('tenure',
                         method='weighting',
                         variable_types={'Churn': 'd', 'tenure': 'd',
                                         'nr_login': 'c', 'avg_movies': 'c'},
                         outcome='Churn',
                         common_causes=['nr_login', 'avg_movies']).groupby('tenure').mean()

However, to compute the average causal effect, a more direct procedure is to run the do method twice for the two treatment values over which the effect is to be computed (the typical convention is comparing treatment=1 versus treatment=0). The resulting code, as described in the example notebook (also see the docs for the do method), looks like this:

df_treatment1 = df.causal.do({'tenure': 1},
                             method='weighting',
                             variable_types={'Churn': 'd', 'tenure': 'd',
                                             'nr_login': 'c', 'avg_movies': 'c'},
                             outcome='Churn',
                             common_causes=['nr_login', 'avg_movies'])

df_treatment0 = df.causal.do({'tenure': 0},
                             method='weighting',
                             variable_types={'Churn': 'd', 'tenure': 'd',
                                             'nr_login': 'c', 'avg_movies': 'c'},
                             outcome='Churn',
                             common_causes=['nr_login', 'avg_movies'])

# Average causal effect: mean difference in Churn between the two interventions
causal_effect = (df_treatment1['Churn'] - df_treatment0['Churn']).mean()

There’s also an equivalent way of achieving the same result using the main DoWhy API.

from dowhy import CausalModel

model = CausalModel(
    data=df,
    treatment='tenure',
    outcome='Churn',
    common_causes=['nr_login', 'avg_movies'])
identified_estimand = model.identify_effect()
model.estimate_effect(identified_estimand,
                      method_name="backdoor.propensity_score_weighting")

That said, based on your dataset, there may be other estimation methods that are better suited. For example, the “weighting” method is expected to have high variance if one of the treatment values is unlikely given the possible values of common causes. Also, if you have limited data, this method may not work well for continuous treatments since it is a non-parametric method that will have high variance in general. In those cases, you can use other estimator methods like double-ML that use parametric assumptions to reduce variance in the estimation (at the cost of possible bias). You can call double-ML or other advanced EconML estimators like this (full example in this notebook):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures

model.estimate_effect(identified_estimand,
                      method_name="backdoor.econml.dml.DMLCateEstimator",
                      control_value=0,
                      treatment_value=1,
                      confidence_intervals=False,
                      method_params={"init_params":
                                         {'model_y': GradientBoostingRegressor(),
                                          'model_t': GradientBoostingRegressor(),
                                          'model_final': LassoCV(),
                                          'featurizer': PolynomialFeatures(degree=1,
                                                                           include_bias=True)},
                                     "fit_params": {}})

2. How to choose common_causes?

Common causes are the variables that cause both the treatment and the outcome. Therefore, a correlation between treatment and outcome can be due to the causal effect of the treatment, or simply due to the effect of common causes (the classic example is that ice-cream sales are correlated with swimming pool memberships, but one does not cause the other; hot weather is the common cause). The goal of causal inference is to disentangle the effect of common causes and return only the effect of the treatment. Formally, the causal effect is the effect of treatment on outcome when all common causes are held constant. For more, check out this tutorial on causal inference.

So, in your example, you’d want to include all variables that both lead to a customer having high tenure and reduce their chances of churn (e.g., their monthly usage, trust in the platform, etc.). These are the common causes or confounders that need to be included in the model.
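If you already know (or assume) the causal structure, an alternative to listing common_causes is to pass a graph to CausalModel. Below is a minimal sketch under the assumption that nr_login and avg_movies confound the tenure-to-Churn relationship; note that parsing a DOT string requires pydot or pygraphviz to be installed:

from dowhy import CausalModel

# Assumed structure: both confounders cause the treatment and the outcome
graph = """digraph {
    nr_login -> tenure; nr_login -> Churn;
    avg_movies -> tenure; avg_movies -> Churn;
    tenure -> Churn;
}"""

model = CausalModel(data=df, treatment='tenure', outcome='Churn', graph=graph)
identified_estimand = model.identify_effect()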

3. How to interpret the results and their uncertainty?

As mentioned above, the standard interpretation of a causal effect is the change in outcome (Churn) when the treatment is changed by 1 unit. For a continuous treatment, though, this is simply a convention: you can define the causal effect as the change in outcome over any two values of the treatment.
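For instance, to compare two specific tenure values rather than the default 0-versus-1, you can pass control_value and treatment_value to estimate_effect. A minimal sketch, assuming a parametric estimator that handles continuous treatments (the values 12 and 24 are arbitrary, chosen only for illustration):

# Effect on Churn of increasing tenure from 12 to 24 (illustrative values)
est_12_vs_24 = model.estimate_effect(identified_estimand,
                                     method_name="backdoor.linear_regression",
                                     control_value=12,
                                     treatment_value=24)
print(est_12_vs_24.value)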

For estimating uncertainty, you can compute confidence intervals and/or run refutation tests. Confidence intervals tell you about the statistical uncertainty (roughly, how much would your estimate change if you were given a fresh i.i.d. sample of the data?). Refutation tests quantify the uncertainty due to causal assumptions (if you missed specifying an important common cause, how much would the estimate change?).

Here’s an example. You can find more on refutation methods here.

# Confidence intervals
est = model.estimate_effect(identified_estimand,
                            method_name="backdoor.propensity_score_weighting",
                            confidence_intervals=True)
# Refutation test by adding a random common cause
refutation = model.refute_estimate(identified_estimand, est,
                                   method_name="random_common_cause")
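A short sketch of reading both outputs; est.value holds the point estimate, and get_confidence_intervals() is an assumed API available in recent DoWhy releases:

# Point estimate and statistical uncertainty
print(est.value)                       # average causal effect of tenure on Churn
print(est.get_confidence_intervals())  # assumed API; present in recent DoWhy versions

# The refuter reports the original effect next to the effect after adding a
# random common cause; a robust estimate should change little between the two.
print(refutation)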