I'm using ols in statsmodels to run a regression. Once I run the regressions on each row of my dataframe, I want to retrieve the X variables from patsy that are used in those regressions. But I get an error that I just can't seem to understand.
Edit: I am trying to run a regression as presented in the answer here, but want to run the regression across each row of a grouped version of my dataframe df, where it is grouped by Date, bal, dist, pay_hist, inc, bckts. So I first group this data as described above and then try to run the regression on each row where df is grouped by Date: df.groupby(['Date']).apply(ols_coef,'bal ~ C(dist) + C(pay_hist) + C(inc) + C(bckts)')
My code is as follows:
import patsy
from statsmodels.formula.api import ols
df = df.groupby([['Date','bal', 'dist', 'pay_hist', 'inc', 'bckts']])
######run regression
def ols_coef(x,formula):
    return ols(formula,data=x).fit().params
gamma = df.groupby(['Date']).apply(ols_coef,'bal ~ C(dist) + C(pay_hist) + C(inc) + C(bckts)')
print('gamma is {}'.format(gamma))
########################
#####Now trying to retrieve the X variables in the regressions above
formula = 'bal ~ C(dist) + C(pay_hist) + C(inc) + C(bckts)'
data = df.groupby(['Date'])[['bckts', 'wac_dist', 'pay_hist', 'inc', 'bal']]
y,X = patsy.dmatrices(formula,data,return_type='dataframe')
################
I get the following error and am unsure how to solve it:
patsy.PatsyError: Error evaluating factor: Exception: Column(s) ['bckts', 'dist', 'pay_hist', 'inc', 'bal'] already selected
bal ~ C(dist) + C(pay_hist) + C(inc) + C(bckts)
^^^^^^^^^^^
Answer
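The problem is that you're passing a grouped dataframe (a pandas GroupBy object rather than a DataFrame) into the patsy.dmatrices function. Since the grouped dataframe is iterable and yields one (name, group) pair per group, you can call patsy.dmatrices on each group in a loop like this, and store all of your X dataframes (one for each group) in a dictionary: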
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import patsy
# Loading data
df = sm.datasets.get_rdataset("Guerry", "HistData").data
# Extracting Independent variables
formula = 'Suicides ~ Crime_parents + Infanticide'
data = df.groupby(['Region'])[['Suicides', 'Crime_parents', 'Infanticide', 'Region']]
X = {}
for name, group in data:
    Y, X[name] = patsy.dmatrices(formula, group, return_type='dataframe')
print(X)
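The same loop can be applied to the formula from the question. The following is only a minimal sketch: the data is synthetic and invented for illustration (the real df is not shown in the question), the categorical columns are coded as integers for simplicity, and the helper name design_matrices is hypothetical rather than part of patsy or pandas.

```python
import numpy as np
import pandas as pd
import patsy

# Synthetic stand-in for the asker's dataframe (assumed column names, made-up values)
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "Date": np.repeat(["2020-01", "2020-02", "2020-03"], n // 3),
    "dist": rng.integers(0, 2, size=n),
    "pay_hist": rng.integers(0, 2, size=n),
    "inc": rng.integers(0, 2, size=n),
    "bckts": rng.integers(0, 2, size=n),
    "bal": rng.normal(size=n),
})

formula = "bal ~ C(dist) + C(pay_hist) + C(inc) + C(bckts)"


def design_matrices(frame, formula, key):
    """Hypothetical helper: build one X design matrix per group of `key`."""
    X = {}
    # Iterating a GroupBy yields (group name, sub-DataFrame) pairs;
    # each sub-DataFrame is a plain DataFrame that patsy accepts.
    for name, group in frame.groupby(key):
        _, X[name] = patsy.dmatrices(formula, group, return_type="dataframe")
    return X


X_by_date = design_matrices(df, formula, "Date")
for date, X in X_by_date.items():
    print(date, X.shape)
```

If a single table is more convenient than a dictionary, pd.concat(X_by_date) should stack the per-group matrices with the group name as the outer index level, assuming the groups produce compatible columns.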