I am running into a problem using the operator * with numpy scalars, and it would be great if someone can explain what is going on.
Basically, I needed to multiply the sums of columns and rows from various dataframes, and the easiest way to do that was to assign each aggregate to a variable, and then multiply those variables together.
The following block of code demonstrates the problem:
import pandas as pd

# define a list of dicts: four columns a-d, five rows with progressively larger values
mydict = [{"a": 10, "b": 20, "c": 30, "d": 40},
          {"a": 100, "b": 200, "c": 300, "d": 400},
          {"a": 1000, "b": 2000, "c": 3000, "d": 4000},
          {"a": 10000, "b": 20000, "c": 30000, "d": 40000},
          {"a": 100000, "b": 200000, "c": 300000, "d": 400000}]

# create dataframe
df = pd.DataFrame(mydict)

# assign sum of each column to a variable
a_sum = df.iloc[:, 0].sum()
b_sum = df.iloc[:, 1].sum()
c_sum = df.iloc[:, 2].sum()
d_sum = df.iloc[:, 3].sum()
print(a_sum, b_sum, c_sum, d_sum)
print(type(a_sum))

# output is:
# 111110 222220 333330 444440
# <class 'numpy.int64'>
Then, I multiply the resulting sums using both hardcoded and variable approaches and receive two different results:
# copy-pasted column sums from output above, multiply together
no_vars = 111110 * 222220 * 333330 * 444440

# multiply variables together (should be identical to line above)
with_vars = a_sum * b_sum * c_sum * d_sum

# compare the outputs, expect the result to be 1 here
print(no_vars/with_vars)

# output is:
# 680.233
I’m guessing this has something to do with how numpy treats the * operator, but I have not been able to find a definitive explanation about what is going on and how to avoid this problem.
Note that the following workaround, which takes numpy out of the calculation, gives 1 as expected:
no_vars = 111110 * 222220 * 333330 * 444440
with_vars = int(a_sum) * int(b_sum) * int(c_sum) * int(d_sum)
print(no_vars/with_vars)
Thanks in advance!
Answer
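The problem is that you are using fixed-width integers (int64), which are capped in the minimum and maximum values they can hold, and you are trying to represent a number larger than what can be represented (integer overflow). The largest int64 value is 2**63 - 1 (about 9.22e18), while the product here is about 3.66e21, so the result wraps around instead of raising an error.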
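To make the overflow concrete, here is a minimal sketch that emulates signed 64-bit wraparound in plain Python. The helper name `wrap_int64` is hypothetical and not part of numpy; it only mimics what two's-complement arithmetic does to a value that does not fit in an int64.

```python
# Largest value a signed 64-bit integer can hold.
INT64_MAX = 2**63 - 1


def wrap_int64(value: int) -> int:
    """Emulate signed 64-bit two's-complement wraparound (illustrative helper)."""
    return (value + 2**63) % 2**64 - 2**63


# Exact product, computed with Python's arbitrary-precision ints.
exact = 111110 * 222220 * 333330 * 444440
print(exact > INT64_MAX)  # the product does not fit in an int64

# What is left of the product after it wraps around.
wrapped = wrap_int64(exact)
print(wrapped)

# Ratio between the exact and the wrapped product, about 680.233 as in the question.
ratio = exact / wrapped
print(round(ratio, 3))
```

The ratio matches the 680.233 reported in the question, which is consistent with the int64 product having silently wrapped around.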
You could either use variable-size integers (like the arbitrary-precision int that Python uses) or switch to floats, which trade off some precision for a larger range of representable values.
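The two options can be compared side by side. This is only a sketch: the `sums` list stands in for the four column sums from the question, and `math.prod` (Python 3.8+) is one convenient way to multiply them, not the only one.

```python
import math

import numpy as np

# Stand-ins for the column sums from the question, as numpy int64 scalars.
sums = [np.int64(111110), np.int64(222220), np.int64(333330), np.int64(444440)]

# Option 1: convert to Python ints first; the product is exact.
exact = math.prod(int(s) for s in sums)
print(exact)

# Option 2: convert to floats first; no overflow, but only ~15-17 significant digits.
approx = math.prod(float(s) for s in sums)
print(approx)

# The float result equals the exact one rounded to the nearest float,
# but it is not equal to the exact integer itself.
print(approx == float(exact))
print(int(approx) == exact)
```

Which option fits depends on whether you need the exact integer (use Python ints) or only its magnitude (floats are enough and stay inside numpy).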
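Practically, you can just force the _sum variables to be treated as float before the multiplication overflows (converting the first one is enough, since the others are promoted to float when multiplied with it):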
a_sum = a_sum.astype(np.float64)
With this you can observe that the following:
import numpy as np

no_vars = 111110 * 222220 * 333330 * 444440
a_sum = a_sum.astype(np.float64)
with_vars = a_sum * b_sum * c_sum * d_sum
print(no_vars/with_vars)
will print a value of 1.0.
Note that this apparently exact result is specific to this calculation and to how the numbers get converted.
In general, results obtained with float arithmetic and big-int arithmetic will differ, e.g.:
print(no_vars)
# 3657832649657049840000
print(with_vars)
# 3.6578326496570497e+21
print(float(no_vars))
# 3.6578326496570497e+21
print(int(with_vars))
# 3657832649657049677824
print(no_vars == with_vars)
# False
print(float(no_vars) == with_vars)
# True
print(no_vars == int(with_vars))
# False