I have a multilinear regression problem where I have the prior information about the range of the output (dependent variable y) – The prediction must always lie in that range.
I want to find bounds (upper and lower) on the coefficient of each feature (independent variable) so that the linear regression model is restricted to the desired output range.
For example, I want to apply this solution on sklearn.datasets.load_boston where I know that the price of the house will be in the range of [10, 40] (the actual min and max values of y are [5,50]).
In the following example, my objective is 10 < yp < 40, and based on this I want to find the min and max bounds of all the coefficients.
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
from gekko import GEKKO

m = GEKKO()

# Loading the dataset
x, y = load_boston(return_X_y=True)

c = m.Array(m.FV, x.shape[1]+1)  # array of parameters and intercept
for ci in c:
    ci.STATUS = 1  # calculate fixed parameter
    ci.LOWER = 1   # constraint: lower limit
    ci.UPPER = 0   # constraint: lower limit

# Define variables
xd = m.Array(m.Param, x.shape[1])
for i in range(x.shape[1]):
    xd[i].value = x[i]
yd = m.Param(y); yp = m.Var()

# Equation of Linear Functions
y_pred = c[0]*xd[0] + c[1]*xd[1] + c[2]*xd[2] + c[3]*xd[3] + c[4]*xd[4] \
       + c[5]*xd[5] + c[6]*xd[6] + c[7]*xd[7] + c[8]*xd[8] + c[9]*xd[9] \
       + c[10]*xd[10] + c[11]*xd[11] + c[12]*xd[12] + c[13]

# Inequality Constraints
yp = m.Var(lb=5, ub=40)
m.Equation(yp == y_pred)

# Minimize difference between actual and predicted y
m.Minimize((yd-yp)**2)

# Solve
m.solve(disp=True)

# Retrieve parameter values
a = [i.value[0] for i in c]
print(a)
The listed code gives me a "solution not found" error. Can you point out what I'm doing wrong, or am I missing something important? Also, how can I get both the upper and lower bound of c for all independent variables?
Answer
Try IMODE=2 for regression mode. There are a few other modifications, such as x[:,i] to load the data, ci.LOWER=0 as the lower bound, and ci.UPPER=1 as the upper bound.
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
from gekko import GEKKO

m = GEKKO(remote=False)

# Loading the dataset
x, y = load_boston(return_X_y=True)
n = x.shape[1]

c = m.Array(m.FV, n+1)  # array of parameters and intercept
for ci in c:
    ci.STATUS = 1  # calculate fixed parameter
    ci.LOWER = 0   # constraint: lower limit
    ci.UPPER = 1   # constraint: upper limit

# Load data
xd = m.Array(m.Param, n)
for i in range(n):
    xd[i].value = x[:,i]
yd = m.Param(y)

# Equation of Linear Functions
yp = m.Var(lb=5, ub=40)
m.Equation(yp == m.sum([c[i]*xd[i] for i in range(n)]) + c[13])

# Minimize difference between actual and predicted y
m.Minimize((yd-yp)**2)

# Regression mode
m.options.IMODE = 2

# APOPT solver
m.options.SOLVER = 1

# Solve
m.solve(disp=True)

# Retrieve parameter values
a = [i.value[0] for i in c]
print(a)
The solution to this constrained problem is obtained with either the IPOPT or the APOPT solver (APOPT is slightly faster).
 Number of state variables:   7604
 Number of total equations: - 7590
 Number of slack variables: - 0
 ---------------------------------------
 Degrees of freedom       :   14

 ----------------------------------------------
 Model Parameter Estimation with APOPT Solver
 ----------------------------------------------

 Iter    Objective    Convergence
    0  5.60363E+04  1.25000E+00
    1  1.87753E+05  6.66134E-16
    2  3.06630E+04  9.99201E-16
    3  3.06630E+04  3.84187E-16
    5  3.06630E+04  1.35308E-16
 Successful solution

 ---------------------------------------------------
 Solver         :  APOPT (v1.0)
 Solution time  :  0.48429999999999995 sec
 Objective      :  30662.976306748842
 Successful solution
 ---------------------------------------------------

[0.0, 0.11336851243, 0.0, 1.0, 0.43491543768, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.037613878067, 0.0, 1.0]
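As a sanity check, the returned coefficient vector can be applied outside of GEKKO to confirm that every training prediction falls inside the bounds imposed on yp. This is a minimal NumPy sketch under the assumption that the Boston features are available as x; `predict` is a hypothetical helper, and the coefficient values are the ones printed above.

```python
import numpy as np

# Coefficients returned by the solver; the last entry is the intercept c[13].
a = np.array([0.0, 0.11336851243, 0.0, 1.0, 0.43491543768, 1.0, 0.0,
              0.0, 0.0, 0.0, 0.0, 0.037613878067, 0.0, 1.0])

def predict(x, a):
    # Linear model: features times coefficients, plus the intercept.
    return x @ a[:-1] + a[-1]

# With the Boston features loaded as x (shape (506, 13)), the constraint
# on yp implies the following should both print True:
# yp = predict(x, a)
# print(yp.min() >= 5, yp.max() <= 40)
```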