Missing observations and clustered standard errors in Python statsmodels?

Question

What's the cleanest, most pythonic way to run a regression only on non-missing data and use clustered standard errors? Imagine I have a Pandas dataframe all_data. Clunky method that works (make a dataframe without missing data): I can make a new dataframe without the missing data, make the model, and fi…

Accepted Answer

(A bit late, but for the benefit of other users.)

In short, if you only want to use the missing argument of smf.ols, there is no way to make it work and, I think, there should not be one, given the current state of the package. The reason is exactly what you mentioned: "the rows with missing observations aren't getting removed", and they shouldn't be. The missing argument creates a (lazy) copy of the input data without the missing values and uses that copy as the model input (input data: $X$, the lazy copy: $\hat{X}$). This process really should not remove the missing values from the original data ($X$)!

At the same time, the groups array should refer to the same data that the model uses, i.e., $\hat{X}$. In your code, however, the groups variable comes from the original data ($X$), which differs from the model data ($\hat{X}$). One might argue that groups should accept just keywords. I guess that is something to discuss in more depth on the package's GitHub page.

For now, one quick fix for your problem is to add a dropna to the second line, which defeats the purpose. It would look like this:

    result = m.fit(cov_type='cluster',
                   cov_kwds={'groups': all_data[['y', 'x', 'groupid']].dropna()['groupid']})

Very ugly, inefficient, and mistake-prone! So your original clunky method would possibly work better.
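For reference, here is a minimal sketch of the "clunky" approach from the question: drop the missing rows once, then take both the model data and the cluster labels from that same cleaned dataframe. The toy data, the seed, and the name clean are assumptions made up for illustration; only all_data, y, x and groupid come from the question. It also assumes groupid holds integer labels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data standing in for the question's all_data.
rng = np.random.default_rng(0)
n = 60
all_data = pd.DataFrame({
    "x": rng.normal(size=n),
    "groupid": np.repeat(np.arange(12), 5),  # 12 clusters of 5 rows each
})
all_data["y"] = 1.0 + 2.0 * all_data["x"] + rng.normal(size=n)

# Introduce missing values in four distinct rows.
all_data.loc[[3, 17], "y"] = np.nan
all_data.loc[[8, 40], "x"] = np.nan

# Drop incomplete rows once, over every column the model AND the
# clustering need, so both refer to exactly the same observations.
cols = ["y", "x", "groupid"]
clean = all_data[cols].dropna()

model = smf.ols("y ~ x", data=clean)
result = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": clean["groupid"].to_numpy()},
)

print(result.params)
print(result.bse)  # cluster-robust standard errors
```

Because the groups array is built from clean rather than from all_data, its length always matches the number of observations the model actually uses, and all_data itself is left unmodified.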

But is there a way to make it work using the missing argument?