Skip to content
Advertisement

Missing observations and clustered standard errors in Python statsmodels?

What’s the cleanest, most pythonic way to run a regression only on non-missing data and use clustered standard errors?

Imagine I have a Pandas dataframe all_data.

Clunky method that works (make a dataframe without missing data):

I can make a new dataframe without the missing data, make the model, and fit the model:

import statsmodels.formula.api as smf

available_data = all_data.loc[:,['y', 'x', 'groupid']].dropna(how='any')
model  = smf.ols('y ~ x', data = available_data)
result = model.fit(cov_type = 'cluster', cov_kwds={'groups': available_data['groupid']})

This feels a bit clunky (esp. when I’m doing it all over the place with different right hand side variables.) And I have to make sure that my stats formula matches the dataframe variables.

But is there a way to make it work using the missing argument?

I can make the model by setting the missing argument and fit the model.

m = smf.ols('y ~ x', data = all_data, missing = 'drop')
result_nocluster = m.fit()`

That works great for the default, homoeskedastic standard errors, but I don’t know how to make this work with clustered standard errors? If I run:

result = m.fit(cov_type = 'cluster', cov_kwds = {'groups': all_data['groupid']})

I get the error ValueError: The weights and list don't have the same length. Presumably the rows with missing observations aren’t getting removed from all_data['groupid'], so it’s throwing an error.

Advertisement

Answer

(A bit too late but for the use of other users) In short, if you only want to use the missing argument in the smf.ols function, there is no way to make it work and, I think, there should not be one, given the current state of the package. The reason is exactly as you mentioned: “the rows with missing observations aren’t getting removed” and they shouldn’t. Because the missing argument creates a (lazy) copy of the input data without missing values and uses that as the input (input data: $X$, the lazy copy: $hat{X}$). This process really should not remove the missing values from the original data ($X$)! At the same time, the groups array should refer to the same data that model uses, i.e., $hat{X}$, however in you code the group variable is coming from the original data ($X$), which is different than the model data ($hat{X}$).

One might argue that groups should accept just keywords. I guess that’s something to be discussed more in depth over GitHub page of the package.

For now, one quick fix for your problem is to add a dropna to the second line, which defies the purpose. So it would look like this:

result = m.fit(cov_type = 'cluster',
               cov_kwds = {'groups': alldata[['y', 'x', 'groupid']].dropna()['groupid']})

Very ugly, inefficient, and mistake-prone! So, possibly your original chunky method would work better.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement