
Scipy minimize with pandas dataframe with group by

I have a data frame (a sample df below) and I am trying to minimize a cost function on it.

GrpId = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
col1 = [69.1,70.5,71.4,72.8,73.2,74.2,208.0,209.2,210.2,211.0,211.2,211.7,212.5]
col2 = [2,3.1,1.1,2.1,6.0,1.1,1.2,1.3,3.1,2.9,5.0,6.1,3.2]
d = {'GrpId':GrpId,'col1':col1,'col2':col2}

df1 = pd.DataFrame(d)

Below are the minimize call and the cost function.

col1_const=[0,0,0,0,60.0,0,0,0]
col2_const=[0,0,0,0,0,100.0,0,0]

def main(type1,type2,type3,df):
    vall0=[type1,type2,type3]
    res=minimize(cost_fun, vall0, args=(df), method = 'SLSQP', tol=0.01)

    [type1,type2,type3]=res.x

    return type1,type2,type3

def cost_fun(v, df):

    df['col1_res'][i] = np.where((df['col1'][i]!=np.nan), ((1/0.095)*(np.sqrt(df['col1'][i])-np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2 ,0)
    df['col2_res'][i] = np.where((df['col2'][i]!=np.nan), ((1/0.12)*(np.sqrt(df['col2'][i])-np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2 ,0)   
    
    res=0.5*np.sqrt(df['col1_res'][i]+df['col2_res'][i])

    return res

Then I’m iterating this function in a loop as below. It works, but it takes a lot of time and memory:

df1['type1']=np.nan
df1['type2']=np.nan
df1['type3']=np.nan
df1['col3']=np.nan
df1['col1_res']=np.nan
df1['col2_res']=np.nan

for i in range(len(df1.GrpId)):
    if i==0:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(0.125, 0.125, 0.125,df1)
    else:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(df1['type1'][i-1], df1['type2'][i-1], df1['type3'][i-1],df1)
    df1['col3'][i]=df1['type1'][i]+df1['type2'][i]

Please note that I have a bigger dataframe with more rows and columns; for this question I just created a sample code/case.

My questions are:

  1. How can I do the same without iteration?
  2. The col1_const[4] value changes per group (group by GrpId). I have another function that calculates col1_const[4] per group. How can I pass this value to cost_fun by group in that case?


Answer

Firstly, I don’t think it’s necessary to check for != np.nan inside the objective function; in fact, such a comparison is always True, since NaN compares unequal to everything, including itself. Instead, you could clean up your dataframe and replace all np.nan with zero. The objective function is called several times during the optimization routine, so it should be written to be as efficient and fast as possible. Consequently, we remove the call to np.where. Note also that relying on the index variable i being known in the outer scope is bad practice and makes the code hard to read. I’d recommend something like this:

col1 = df1.col1.values[~np.isnan(df1.col1.values)]
col2 = df1.col2.values[~np.isnan(df1.col2.values)]
col1_const = np.array([0,0,0,0,60.0,0,0,0])
col2_const = np.array([0,0,0,0,0,100.0,0,0])

def cost_fun1(v, *args):
    i, col1, col2, col1_const, col2_const = args
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)
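If you prefer replacing the missing values with zero, as suggested above, rather than filtering them out, that cleanup is a one-time operation before the solver runs (a minimal sketch on a toy frame):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (a stand-in for the real df1).
df1 = pd.DataFrame({'col1': [69.1, np.nan, 71.4],
                    'col2': [2.0, 3.1, np.nan]})

# Replace every NaN with zero once, up front, so the objective
# function never has to test for missing values on each evaluation.
df1 = df1.fillna(0)
```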

Next, and more importantly, you are solving multiple small optimization problems instead of one large-scale optimization problem. Mathematically, because your objective function is guaranteed to be positive, you can reformulate your problem in the same vein as this answer. Then, cost_fun2 simply returns the sum of cost_fun1 over all indices i. Using a bit of reshaping magic, the function looks nearly the same:

def cost_fun2(vv, *args):
    col1, col2, col1_const, col2_const = args
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))
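As a quick sanity check (on synthetic data, since this is only meant to illustrate the equivalence), the batched cost_fun2 should agree with summing cost_fun1 over all rows:

```python
import numpy as np

col1_const = np.array([0, 0, 0, 0, 60.0, 0, 0, 0])
col2_const = np.array([0, 0, 0, 0, 0, 100.0, 0, 0])

def cost_fun1(v, *args):
    i, col1, col2, col1_const, col2_const = args
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)

def cost_fun2(vv, *args):
    col1, col2, col1_const, col2_const = args
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

# Random positive test data: 5 rows, a flattened variable vector of size 3*5.
rng = np.random.default_rng(0)
col1 = rng.uniform(60, 80, size=5)
col2 = rng.uniform(1, 7, size=5)
vv = rng.uniform(0.1, 1.0, size=3*col1.size)
v = vv.reshape(3, col1.size)

# Row-by-row sum of cost_fun1 vs. a single batched cost_fun2 evaluation.
total = sum(cost_fun1(v[:, i], i, col1, col2, col1_const, col2_const)
            for i in range(col1.size))
batch = cost_fun2(vv, col1, col2, col1_const, col2_const)
# total and batch agree up to floating-point rounding.
```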

Then, we simply solve the problem and write the solution values into the dataframe afterwards:

from scipy.optimize import minimize

# initial guess
x0 = np.ones(3*col1.size)

# solve the problem
res = minimize(lambda vv: cost_fun2(vv, col1, col2, col1_const, col2_const), x0=x0, method="trust-constr")

# write to dataframe
type1_vals, type2_vals, type3_vals = np.split(res.x, 3)
df1['type1'] = type1_vals
df1['type2'] = type2_vals
df1['type3'] = type3_vals

If you need col1_res and col2_res in the dataframe, it’s straightforward to modify the objective function accordingly.
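Alternatively, the residual columns can be recomputed from the solution in a single vectorized pass (a sketch with synthetic stand-ins for col1, col2, and the solver output res.x; v[0], v[1], v[2] map to type1, type2, type3 as in cost_fun1):

```python
import numpy as np
import pandas as pd

# Stand-ins for the objects defined above (synthetic values, not solver output).
col1 = np.array([69.1, 70.5, 71.4])
col2 = np.array([2.0, 3.1, 1.1])
col1_const = np.array([0, 0, 0, 0, 60.0, 0, 0, 0])
col2_const = np.array([0, 0, 0, 0, 0, 100.0, 0, 0])
x = np.full(3*col1.size, 0.125)        # would be res.x after minimize
df1 = pd.DataFrame({'col1': col1, 'col2': col2})

# Same residual formulas as in the objective, evaluated over all rows at once.
type1_vals, type2_vals, type3_vals = np.split(x, 3)
df1['col1_res'] = ((1/0.095)*(np.sqrt(col1)
                   - np.sqrt(col1_const[4]*(0.1*type2_vals + type3_vals)**2)))**2
df1['col2_res'] = ((1/0.12)*(np.sqrt(col2)
                   - np.sqrt(col2_const[5]*(0.1*type1_vals + type3_vals)**2)))**2
```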

Last but not least, depending on the size of your dataframe, it’s highly recommended to pass the exact objective gradient to scipy.optimize.minimize in order to obtain good convergence. At the moment, the gradient is approximated by finite differences, which is slow and prone to rounding errors.
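Regarding the second question: one simple approach is to solve the batched problem once per group and pass that group’s constant through args. Here get_col1_const is a hypothetical placeholder for your own function that computes col1_const[4] per group:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def get_col1_const(group):
    # Placeholder for your own per-group calculation of col1_const[4].
    return group['col1'].mean() / 2.0

def cost_fun_group(vv, col1, col2, c1, c2):
    # Same batched objective as cost_fun2, but the constants are scalars
    # passed in per group instead of being indexed from global arrays.
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(c1*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(c2*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

df1 = pd.DataFrame({
    'GrpId': ['A', 'A', 'A', 'B', 'B'],
    'col1': [69.1, 70.5, 71.4, 208.0, 209.2],
    'col2': [2.0, 3.1, 1.1, 1.2, 1.3],
})

results = []
for grp_id, group in df1.groupby('GrpId', sort=False):
    c1 = get_col1_const(group)          # group-specific col1_const[4]
    c2 = 100.0                          # col2_const[5], fixed here
    col1 = group['col1'].to_numpy()
    col2 = group['col2'].to_numpy()
    x0 = np.full(3*col1.size, 0.125)
    res = minimize(cost_fun_group, x0, args=(col1, col2, c1, c2),
                   method='trust-constr')
    t1, t2, t3 = np.split(res.x, 3)
    results.append(pd.DataFrame({'type1': t1, 'type2': t2, 'type3': t3},
                                index=group.index))

# One solve per group, then write all solution values back by row index.
df1 = df1.join(pd.concat(results))
```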

User contributions licensed under: CC BY-SA