I have a dataset like
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
data = pd.DataFrame({'a': [4, 3, 4, 6, 6, 3, 2], 'b': [12, 14, 11, 15, 14, 15, 10]})
test = data.iloc[:4]
train = data.iloc[4:]
and I built the linear model for the train data
model = smf.ols("a ~ b", data=data)
print(model.fit().summary())
Now what I want to do is get the adjusted R^2 value based on the test data. Is there a simple command for this? I’ve been trying to build it from scratch and keep getting an error.
What I’ve been trying:
model.predict(test.b)
but it complains about the shape. Based on this: https://www.statsmodels.org/stable/examples/notebooks/generated/predict.html
I tried the following
X = sm.add_constant(test.b)
model.predict(X)
Now the error is
ValueError: shapes (200,2) and (200,2) not aligned: 2 (dim 1) != 200 (dim 0)
The shapes match, but I don’t understand the part about the “dim”. I thought I had followed the example in the link as closely as I could, so I’m not sure what’s wrong.
Answer
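You should first run the .fit() method and save the returned object, and then run the .predict() method on that object. On the unfitted model, .predict() takes the parameter vector as its first argument, so the data you pass in is treated as coefficients, which is where the shape error comes from.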
results = model.fit()
Running results.params will produce this pandas Series:
Intercept   -0.875
b            0.375
dtype: float64
Then, running results.predict(test.b) will produce this Series:
0    3.625
1    4.375
2    3.250
3    4.750
dtype: float64
You can also retrieve model fit summary values by calling individual attributes of the results class (https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.html):
>>> results.rsquared_adj
0.08928571428571419
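But those values describe the data the model was fitted on (the full/train set), so yes, you will probably need to compute the residual and total sums of squares manually from your test predictions and the true test values, and get the adjusted R-squared from those: 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of test rows and p is the number of predictors.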
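A minimal sketch of that manual computation, using the four test rows and the predictions shown above. The helper names r_squared and adjusted_r_squared are hypothetical and not part of statsmodels, and measuring the total sum of squares around the mean of the test values is an assumption (one common convention for out-of-sample R-squared).

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(list(y_true))
    dof = n - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared(y_true, y_pred)) * (n - 1) / dof


# Test rows from the question (data.iloc[:4]) and the predictions
# that results.predict(test.b) returned above.
test_a = [4, 3, 4, 6]
test_pred = [3.625, 4.375, 3.250, 4.750]

test_r2 = r_squared(test_a, test_pred)
test_adj_r2 = adjusted_r_squared(test_a, test_pred, n_predictors=1)  # one predictor: b
print(test_r2, test_adj_r2)
```

With the objects above, the same helpers could be called on test.a and results.predict(test.b) directly. On a small test set the adjusted value can come out negative; that is not an error, it only means the predictions do worse than the test mean once the penalty for the number of predictors is applied.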