I am trying to make a Python program that will calculate a result based on a formula, given factors and an input dataframe.
I have a number of cars (N_cars
) on a given length of the road (l
) and their average speed (v
):
input_columns = ['l', 'N_cars', 'v'] input_data = [[3.5, 1000, 100], [5.7, 500, 110], [10, 367, 110], [11.1, 1800, 95], [2.8, 960, 105], [4.7, 800, 120], [10.4, 103, 111], [20.1, 1950, 115]] input_df = pd.DataFrame(input_data, columns=input_columns) input_df l N_cars v 0 3.5 1000 100 1 5.7 500 110 2 10.0 367 110 3 11.1 1800 95 4 2.8 960 105 5 4.7 800 120 6 10.4 103 111 7 20.1 1950 115
I also know the factors needed for the formula for each category of car, and I know the percentage of each category. I also have different options for each category (3 options that I have here are just an example, there are many more options).
factors_columns = ['category', 'category %', 'option', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'] factors_data = [['A', 58, 'opt_1', 0.000011, 0.23521, 0.93847, 0.39458, 0.00817, 0.24566, 0.0010, 0], ['A', 58, 'opt_2', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0], ['A', 58, 'opt_3', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0], ['B', 22, 'opt_1', 0.002452, 0.48327, 0.83773, 0.92852, 0.00871, 0.29568, 0.0009, 0.02], ['B', 22, 'opt_2', 0.002899, 0.49327, 0.83773, 0.92852, 0.00871, 0.30468, 0.0009, 0.02], ['B', 22, 'opt_3', 0.002452, 0.48327, 0.83773, 0.92852, 0.00771, 0.29568, 0.0119, 0.01], ['C', 17, 'opt_1', 0.082583, 0.39493, 0.02462, 0.82714, 0.00918, 0.28572, 0.0012, 0], ['C', 17, 'opt_2', 0.072587, 0.35493, 0.02852, 0.82723, 0.00912, 0.29572, 0.0018, 0], ['C', 17, 'opt_3', 0.082583, 0.39493, 0.02852, 0.82714, 0.00962, 0.28572, 0.0012, 0.01], ['D', 3, 'opt_1', 0.018327, 0.32342, 0.82529, 0.92752, 0.00988, 0.21958, 0.0016, 0], ['D', 3, 'opt_2', 0.014427, 0.32342, 0.82729, 0.92752, 0.00968, 0.22558, 0.0026, 0], ['D', 3, 'opt_3', 0.018327, 0.32342, 0.82729, 0.94452, 0.00988, 0.21258, 0.0016, 0]] factors_df = pd.DataFrame(factors_data, columns=factors_columns) factors_df category category % option a b c d e f g h 0 A 58 opt_1 0.000011 0.23521 0.93847 0.39458 0.00817 0.24566 0.0010 0.00 1 A 58 opt_2 0.000011 0.23521 0.93145 0.39458 0.00467 0.24566 0.0010 0.00 2 A 58 opt_3 0.000011 0.23521 0.93145 0.39458 0.00467 0.24566 0.0010 0.00 3 B 22 opt_1 0.002452 0.48327 0.83773 0.92852 0.00871 0.29568 0.0009 0.02 4 B 22 opt_2 0.002899 0.49327 0.83773 0.92852 0.00871 0.30468 0.0009 0.02 5 B 22 opt_3 0.002452 0.48327 0.83773 0.92852 0.00771 0.29568 0.0119 0.01 6 C 17 opt_1 0.082583 0.39493 0.02462 0.82714 0.00918 0.28572 0.0012 0.00 7 C 17 opt_2 0.072587 0.35493 0.02852 0.82723 0.00912 0.29572 0.0018 0.00 8 C 17 opt_3 0.082583 0.39493 0.02852 0.82714 0.00962 0.28572 0.0012 0.01 9 D 3 opt_1 0.018327 0.32342 0.82529 0.92752 0.00988 0.21958 0.0016 0.00 10 D 3 opt_2 0.014427 0.32342 0.82729 0.92752 0.00968 0.22558 0.0026 0.00 11 D 3 opt_3 0.018327 0.32342 0.82729 0.94452 0.00988 0.21258 0.0016 0.00
For each option (opt_1, opt_2, opt_3), I have to calculate the result based on this formula (factors are taken from the factors table, but v is coming from the input table):
formula = ( (a*v*v) + (b*v) + c + (d/v) ) / ( (e*v*v) + (f*v) + g) * (1 - h) result = l * N_cars * formula
However, I have to take into account the percentage of each category of car. For each row of the input_df
I have to perform the calculations three times, once for each of the three options. For example, for the index 0 of input_df
, I have N_cars=1000
, v=100
and l=3.5
, the output should be something like this:
# for opt_1: result = 3.5 * 1000 * ((58% of category A {formula for index 0 of factors_df}) + (22% of category B {formula for index 3 of factors_df) + (17% of category C {formula for index 6 of factors_df}) + (3% of category D {formula for index 9 of factors_df}) ) # for opt_2: result = 3.5 * 1000 * ((58% of category A {formula for index 1 of factors_df}) + (22% of category B {formula for index 4 of factors_df) + (17% of category C {formula for index 7 of factors_df}) + (3% of category D {formula for index 10 of factors_df}) ) # for opt_3: result = 3.5 * 1000 * ((58% of category A {formula for index 2 of factors_df}) + (22% of category B {formula for index 5 of factors_df) + (17% of category C {formula for index 8 of factors_df}) + (3% of category D {formula for index 11 of factors_df}) )
So, as an output, for each of the rows in input_df
, I should have three results, one for each of the three options.
I can do the calculation manually for each step, but what I am having troubles with is to make a loop that does it automatically for each input row and all 3 options and then passes to the next input row and so on until the last input row.
Advertisement
Answer
Solution
Not sure what your expected results are, but I believe this does what you’re asking for:
def formula(g, *, l, N_cars, v): x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g) return N_cars * l * (x * g.pct / 100).sum() groups = factors_df.rename(columns={"category %": "pct"}).groupby("option") result = input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)
Output:
In [5]: input_df.join(result) Out[5]: l N_cars v opt_1 opt_2 opt_3 0 3.5 1000 100 5411.685077 5115.048256 5500.985916 1 5.7 500 110 4425.339734 4169.893681 4483.595803 2 10.0 367 110 5698.595376 5369.652565 5773.612841 3 11.1 1800 95 30820.717985 29180.106606 31384.785443 4 2.8 960 105 4165.270216 3930.726187 4226.877893 5 4.7 800 120 5860.057879 5506.509637 5919.496692 6 10.4 103 111 1663.960420 1567.455541 1685.339848 7 20.1 1950 115 60976.735053 57375.300546 61685.075902
Explanation
The first step is to group factors_df
by option
. Just to show what that looks like:
In [6]: groups.apply(print) category pct option a b ... d e f g h 0 A 58 opt_1 0.000011 0.23521 ... 0.39458 0.00817 0.24566 0.0010 0.00 3 B 22 opt_1 0.002452 0.48327 ... 0.92852 0.00871 0.29568 0.0009 0.02 6 C 17 opt_1 0.082583 0.39493 ... 0.82714 0.00918 0.28572 0.0012 0.00 9 D 3 opt_1 0.018327 0.32342 ... 0.92752 0.00988 0.21958 0.0016 0.00 [4 rows x 11 columns] category pct option a b ... d e f g h 1 A 58 opt_2 0.000011 0.23521 ... 0.39458 0.00467 0.24566 0.0010 0.00 4 B 22 opt_2 0.002899 0.49327 ... 0.92852 0.00871 0.30468 0.0009 0.02 7 C 17 opt_2 0.072587 0.35493 ... 0.82723 0.00912 0.29572 0.0018 0.00 10 D 3 opt_2 0.014427 0.32342 ... 0.92752 0.00968 0.22558 0.0026 0.00 [4 rows x 11 columns] category pct option a b ... d e f g h 2 A 58 opt_3 0.000011 0.23521 ... 0.39458 0.00467 0.24566 0.0010 0.00 5 B 22 opt_3 0.002452 0.48327 ... 0.92852 0.00771 0.29568 0.0119 0.01 8 C 17 opt_3 0.082583 0.39493 ... 0.82714 0.00962 0.28572 0.0012 0.01 11 D 3 opt_3 0.018327 0.32342 ... 0.94452 0.00988 0.21258 0.0016 0.00
Note that I renamed the category %
to pct
. This isn’t necessary, but made accessing that column in the formula()
function a bit cleaner (g.pct
vs g["category %"]
).
The next step was to implement formula()
in such a way as to accept a group from factors_df
as an argument:
def formula(g, *, l, N_cars, v): x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g) return N_cars * l * (x * g.pct / 100).sum()
In the function signature, g
is a group from factors_df
, then the keyword-only arguments l
, N_cars
, and v
, which will come from a single row of input_df
at a time.
Each of the three groups shown above will be entered into the formula()
function one at a time, in their entirety. For example, during one call to formula()
, the g
argument will hold all of this data:
category pct option a b ... d e f g h 0 A 58 opt_1 0.000011 0.23521 ... 0.39458 0.00817 0.24566 0.0010 0.00 3 B 22 opt_1 0.002452 0.48327 ... 0.92852 0.00871 0.29568 0.0009 0.02 6 C 17 opt_1 0.082583 0.39493 ... 0.82714 0.00918 0.28572 0.0012 0.00 9 D 3 opt_1 0.018327 0.32342 ... 0.92752 0.00988 0.21958 0.0016 0.00
When the formula uses something like g.e
, it’s accessing the entire e
column, and is taking advantage of vectorization to perform the arithmetic calculations on the entire column at the same time. When the dust settles, x
will be a Series
where each item in the series will be the result of the formula for each of the four categories of car. Here’s an example:
0 0.231242 3 0.619018 6 7.188941 9 1.792376
Notice the indices? Those correspond to category A
, B
, C
, and D
from factors_df
, respectively.
From there, we need to call formula()
on each row of input_df
, using the axis
argument of pd.DataFrame.apply()
:
input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)
The lambda r
is an anonymous function object being passed to apply
, being applied over axis 1, meaning that r
will be a single row from input_df
at a time, for example:
In [13]: input_df.apply(print, axis=1) l 3.5 N_cars 1000.0 v 100.0 Name: 0, dtype: float64 . . .
Now, on each row-wise apply
, we’re also applying the formula()
function on the groups
groupby object with lambda g: formula(g, **r)
. The **r
unpacks the row from input_df
as keyword arguments, which helps to ensure that the values for v
, l
, and N_cars
aren’t misused in the formula (no need to worry about which order they’re passed into the formula()
function).