
Scipy minimize with pandas dataframe with group by

I have a data frame (a sample df below) and I am trying to minimize a cost function on it.

GrpId = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
col1 = [69.1,70.5,71.4,72.8,73.2,74.2,208.0,209.2,210.2,211.0,211.2,211.7,212.5]
col2 = [2,3.1,1.1,2.1,6.0,1.1,1.2,1.3,3.1,2.9,5.0,6.1,3.2]
d = {'GrpId':GrpId,'col1':col1,'col2':col2}

df1 = pd.DataFrame(d)

Below are the minimize call and the cost function.

col1_const=[0,0,0,0,60.0,0,0,0]
col2_const=[0,0,0,0,0,100.0,0,0]

def main(type1,type2,type3,df):
    vall0=[type1,type2,type3]
    res=minimize(cost_fun, vall0, args=(df), method = 'SLSQP', tol=0.01)

    [type1,type2,type3]=res.x

    return type1,type2,type3

def cost_fun(v, df):

    df['col1_res'][i] = np.where((df['col1'][i]!=np.nan), ((1/0.095)*(np.sqrt(df['col1'][i])-np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2 ,0)
    df['col2_res'][i] = np.where((df['col2'][i]!=np.nan), ((1/0.12)*(np.sqrt(df['col2'][i])-np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2 ,0)   
    
    res=0.5*np.sqrt(df['col1_res'][i]+df['col2_res'][i])

    return res

Then I’m iterating this function in a loop as below. It works, but it takes a lot of time and memory:

df1['type1']=np.nan
df1['type2']=np.nan
df1['type3']=np.nan
df1['col3']=np.nan
df1['col1_res']=np.nan
df1['col2_res']=np.nan

for i in range(len(df1.GrpId)):
    if i==0:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(0.125, 0.125, 0.125,df1)
    else:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(df1['type1'][i-1], df1['type2'][i-1], df1['type3'][i-1],df1)
    df1['col3'][i]=df1['type1'][i]+df1['type2'][i]

Please note that I have a bigger dataframe with more rows and columns; for this question I just created a sample code/case.

My questions are:

  1. How can I do the same without iteration?
  2. The col1_const[4] value changes per group (group by GrpId). I have another function that calculates col1_const[4] per group. How can I pass this value to cost_fun by group in that case?


Answer

Firstly, I don’t think it’s necessary to check for != np.nan inside the objective function; in fact, such a comparison is always True, since NaN compares unequal to everything, including itself. Instead, you could clean up your dataframe and replace all np.nan with zero. The objective function is called several times during the optimization routine, so it should be written to be as efficient and fast as possible. Consequently, we remove the call to np.where. Note also that relying on the index variable i being known in the outer scope is bad practice and makes the code hard to read. I’d recommend something like this:

col1 = df1.col1.values[~np.isnan(df1.col1.values)]
col2 = df1.col2.values[~np.isnan(df1.col2.values)]
col1_const = np.array([0,0,0,0,60.0,0,0,0])
col2_const = np.array([0,0,0,0,0,100.0,0,0])

def cost_fun1(v, *args):
    i, col1, col2, col1_const, col2_const = args
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)
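If you prefer replacing the missing values with zero, as suggested above, rather than filtering them out, that cleanup is a one-time operation before the solver runs (a minimal sketch on a toy frame):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (a stand-in for the real df1).
df1 = pd.DataFrame({'col1': [69.1, np.nan, 71.4],
                    'col2': [2.0, 3.1, np.nan]})

# Replace every NaN with zero once, up front, so the objective
# function never has to test for missing values on each evaluation.
df1 = df1.fillna(0)
```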

Next, and more importantly, you are solving multiple small optimization problems instead of one large-scale optimization problem. Mathematically, because your objective function is guaranteed to be positive, you can reformulate your problem in the same vein as this answer. Then, cost_fun2 simply returns the sum of cost_fun1 over all indices i. Using a bit of reshaping magic, the function looks nearly the same:

def cost_fun2(vv, *args):
    col1, col2, col1_const, col2_const = args
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))
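As a quick sanity check (on synthetic data, since this is only meant to illustrate the equivalence), the batched cost_fun2 should agree with summing cost_fun1 over all rows:

```python
import numpy as np

col1_const = np.array([0, 0, 0, 0, 60.0, 0, 0, 0])
col2_const = np.array([0, 0, 0, 0, 0, 100.0, 0, 0])

def cost_fun1(v, *args):
    i, col1, col2, col1_const, col2_const = args
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)

def cost_fun2(vv, *args):
    col1, col2, col1_const, col2_const = args
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

# Random positive test data: 5 rows, a flattened variable vector of size 3*5.
rng = np.random.default_rng(0)
col1 = rng.uniform(60, 80, size=5)
col2 = rng.uniform(1, 7, size=5)
vv = rng.uniform(0.1, 1.0, size=3*col1.size)
v = vv.reshape(3, col1.size)

# Row-by-row sum of cost_fun1 vs. a single batched cost_fun2 evaluation.
total = sum(cost_fun1(v[:, i], i, col1, col2, col1_const, col2_const)
            for i in range(col1.size))
batch = cost_fun2(vv, col1, col2, col1_const, col2_const)
# total and batch agree up to floating-point rounding.
```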

Then, we simply solve the problem and write the solution values into the dataframe afterwards:

from scipy.optimize import minimize

# initial guess
x0 = np.ones(3*col1.size)

# solve the problem
res = minimize(lambda vv: cost_fun2(vv, col1, col2, col1_const, col2_const), x0=x0, method="trust-constr")

# write to dataframe
type1_vals, type2_vals, type3_vals = np.split(res.x, 3)
df1['type1'] = type1_vals
df1['type2'] = type2_vals
df1['type3'] = type3_vals

If you need col1_res and col2_res in the dataframe, it’s straightforward to modify the objective function accordingly.
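Alternatively, the residual columns can be recomputed from the solution in a single vectorized pass (a sketch with synthetic stand-ins for col1, col2, and the solver output res.x; v[0], v[1], v[2] map to type1, type2, type3 as in cost_fun1):

```python
import numpy as np
import pandas as pd

# Stand-ins for the objects defined above (synthetic values, not solver output).
col1 = np.array([69.1, 70.5, 71.4])
col2 = np.array([2.0, 3.1, 1.1])
col1_const = np.array([0, 0, 0, 0, 60.0, 0, 0, 0])
col2_const = np.array([0, 0, 0, 0, 0, 100.0, 0, 0])
x = np.full(3*col1.size, 0.125)        # would be res.x after minimize
df1 = pd.DataFrame({'col1': col1, 'col2': col2})

# Same residual formulas as in the objective, evaluated over all rows at once.
type1_vals, type2_vals, type3_vals = np.split(x, 3)
df1['col1_res'] = ((1/0.095)*(np.sqrt(col1)
                   - np.sqrt(col1_const[4]*(0.1*type2_vals + type3_vals)**2)))**2
df1['col2_res'] = ((1/0.12)*(np.sqrt(col2)
                   - np.sqrt(col2_const[5]*(0.1*type1_vals + type3_vals)**2)))**2
```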

Last but not least, depending on the size of your dataframe, it’s highly recommended to pass the exact objective gradient to scipy.optimize.minimize in order to obtain good convergence. At the moment, the gradient is approximated by finite differences, which is slow and prone to rounding errors.
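Regarding the second question: one simple approach is to solve the batched problem once per group and pass that group’s constant through args. Here get_col1_const is a hypothetical placeholder for your own function that computes col1_const[4] per group:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def get_col1_const(group):
    # Placeholder for your own per-group calculation of col1_const[4].
    return group['col1'].mean() / 2.0

def cost_fun_group(vv, col1, col2, c1, c2):
    # Same batched objective as cost_fun2, but the constants are scalars
    # passed in per group instead of being indexed from global arrays.
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(c1*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(c2*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

df1 = pd.DataFrame({
    'GrpId': ['A', 'A', 'A', 'B', 'B'],
    'col1': [69.1, 70.5, 71.4, 208.0, 209.2],
    'col2': [2.0, 3.1, 1.1, 1.2, 1.3],
})

results = []
for grp_id, group in df1.groupby('GrpId', sort=False):
    c1 = get_col1_const(group)          # group-specific col1_const[4]
    c2 = 100.0                          # col2_const[5], fixed here
    col1 = group['col1'].to_numpy()
    col2 = group['col2'].to_numpy()
    x0 = np.full(3*col1.size, 0.125)
    res = minimize(cost_fun_group, x0, args=(col1, col2, c1, c2),
                   method='trust-constr')
    t1, t2, t3 = np.split(res.x, 3)
    results.append(pd.DataFrame({'type1': t1, 'type2': t2, 'type3': t3},
                                index=group.index))

# One solve per group, then write all solution values back by row index.
df1 = df1.join(pd.concat(results))
```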

User contributions licensed under: CC BY-SA