Python data science: How to select three houses in a dataset with a budget constraint, optimizing for the highest residual between predicted and actual price

Question

I have been given the assignment to analyze a dataset of 1,000+ houses, build a multiple regression model to predict prices, and then select the three houses that are cheapest compared to their predicted price. Besides having to select exactly three houses, there is also the constraint of a "budget"…

Accepted Answer

With the clarification that you are looking for the best combination, your problem is more complicated ;)

I have tried a "brute-force" approach, but at least on my laptop it takes forever with the full dataset. Find my thoughts below.

Obviously we have to evaluate the combinations of many houses, so my first step was to reduce the dataset as far as possible:

- If Price + 2*min(Price) > budget, there is no combination of three houses containing this house that stays within the budget.
- If risidual is negative, we do not consider the house during optimization.

In pandas this looks like this:

    budget = 7000000
    df = df[df['Price'] < (budget - 2 * df['Price'].min())].copy()
    df = df[df['risidual'] > 0].copy()

This reduces the number of houses from 1395 to 550.

Unfortunately, 550 houses still give many combinations of three (27578100), as calculated with itertools:

    import itertools
    idx = list(itertools.combinations(df.index, 3))

You can evaluate these combinations with:

    result = {comb: df.loc[[*comb], 'risidual'].sum() for comb in idx[:10000] if df.loc[[*comb], 'Price'].sum() < budget}

Note: I have limited the evaluation to the first 10000 combinations due to the calculation time.

    # pick the combination with the largest summed residual, not the largest index tuple
    best = max(result, key=result.get)
    print("Combination: {}\nPrice: {}\nResidual: {}".format(best, df.loc[[*best], 'Price'].sum(), result[best]))

Maybe it is advisable to calculate the combinations of just two houses first to further reduce the possible combinations.

I think you should have a look at the Knapsack problem.

I think you are almost there. Given that df["risidual"] holds the difference between predicted and real price, you have to select the subset that fits your limit, e.g.

    df_budget = df[df['Price'] <= budget].copy()

Using pandas nlargest() you can then retrieve the three biggest differences:

    df_budget.nlargest(3, 'risidual')

Note: Code was not tested due to missing sample data.
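One way to speed up the exhaustive search described above might be to loop only over pairs of houses and let NumPy pick the best affordable third house in one vectorized step, instead of calling df.loc for every triple. The following is only a sketch under assumptions: the function name best_three and the five sample houses are made up for illustration, the column names Price and risidual are taken from the answer, and the budget is treated as inclusive (total price <= budget).

```python
import numpy as np
import pandas as pd


def best_three(df, budget, price_col="Price", resid_col="risidual"):
    """Return the three rows with the largest summed residual whose
    summed price is at most `budget`, or None if no triple fits.
    (Hypothetical helper, not part of pandas.)"""
    prices = df[price_col].to_numpy(dtype=float)
    resid = df[resid_col].to_numpy(dtype=float)
    n = len(df)
    best_total, best_pos = -np.inf, None

    for i in range(n - 2):
        for j in range(i + 1, n - 1):
            remaining = budget - prices[i] - prices[j]
            # Only look at houses after j, so every triple is visited once.
            affordable = prices[j + 1:] <= remaining
            if not affordable.any():
                continue
            # Hide unaffordable houses, then take the largest residual.
            candidates = np.where(affordable, resid[j + 1:], -np.inf)
            k = int(candidates.argmax())
            total = resid[i] + resid[j] + candidates[k]
            if total > best_total:
                best_total, best_pos = total, (i, j, j + 1 + k)

    if best_pos is None:
        return None
    return df.iloc[list(best_pos)]


# Made-up sample data, purely for illustration.
houses = pd.DataFrame(
    {
        "Price": [3_000_000, 2_500_000, 1_500_000, 3_400_000, 2_000_000],
        "risidual": [400_000, 300_000, 100_000, 900_000, 250_000],
    },
    index=["A", "B", "C", "D", "E"],
)

picked = best_three(houses, budget=7_000_000)
print(picked)
print("Total price:", picked["Price"].sum())
print("Total residual:", picked["risidual"].sum())
```

In this toy data the three largest residuals (D, A, B) cost 8,900,000 together, so simply taking nlargest(3) would break a total budget of 7,000,000; the search returns C, D, E instead. The work is still cubic in the number of houses, so filtering the dataset first remains worthwhile.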

Answer