How is the dtype of a numpy array calculated internally?

I was just messing around with numpy arrays when I noticed a lesser-known behavior of the dtype parameter.

It seems to change as the input changes. For example,

import numpy as np

t = np.array([2, 2])
t.dtype

gives dtype('int32')

However,

t = np.array([2, 22222222222])
t.dtype

gives dtype('int64')

So, my first question is: how is this calculated? Does it pick a datatype that can hold the largest element and use it for all the elements? If so, doesn't that waste space, since the 2 in the second array is unnecessarily stored as a 64-bit integer?
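
For reference, here is a small snippet showing the per-element cost (just an illustration; the exact default dtype and the outputs shown in the comments depend on the platform):

t = np.array([2, 22222222222])
t.dtype, t.itemsize, t.nbytes          # e.g. (dtype('int64'), 8, 16) -- every element takes 8 bytes

# an explicit dtype avoids the extra space, at the cost of possible overflow
s = np.array([2, 2], dtype=np.int8)
s.dtype, s.itemsize, s.nbytes          # (dtype('int8'), 1, 2)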

Now, if I try to change the zeroth element of array([2, 2]), like so:

t = np.array([2, 2])
t[0] = 222222222222222

I get OverflowError: Python int too large to convert to C long.

My second question is: why doesn't it apply the same logic it used when creating the array if you change a particular value? Why doesn't it recompute and re-evaluate the dtype?

Any help is appreciated. Thanks in advance.


Answer

Let us try and find the relevant bits in the docs.

From the np.array doc string:

array(…)

[…]

Parameters

[…]

dtype : data-type, optional
    The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to 'upcast' the array. For downcasting, use the .astype(t) method.

[…]

(my emphasis on "the minimum type required to hold the objects in the sequence")

It should be noted that this is not entirely accurate: for integer arrays, for example, the system (C) default integer is preferred over smaller integer types, as is evident from your example.
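
You can see the difference between the truly minimal type and what np.array actually picks with a quick check (a small sketch; the default integer, and hence the second output, depends on the platform):

>>> import numpy as np
>>> np.min_scalar_type(2)       # the smallest dtype that could hold the value 2
dtype('uint8')
>>> np.array([2, 2]).dtype      # but array construction prefers the default (C long) integer
dtype('int32')                  # int64 on most 64-bit Linux/macOS builds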

Note that for numpy to be fast it is essential that all elements of an array be of the same size. Otherwise, how would you quickly locate the 1000th element, say? Also, mixing types wouldn’t save all that much space since you would have to store the types of every single element on top of the raw data.
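
To make that concrete: because every element has the same size, element i starts at byte offset i * itemsize, so locating it is a single multiplication (a small illustration, nothing more):

>>> a = np.arange(2000, dtype=np.int64)
>>> a.itemsize                  # every element occupies exactly 8 bytes
8
>>> a.strides                   # bytes to step from one element to the next
(8,)
>>> 1000 * a.itemsize           # byte offset of the 1000th element in the buffer
8000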

Regarding your second question: first of all, there are type promotion rules in numpy. The best documentation I could find for them is the np.result_type doc string:

result_type(…)
    result_type(*arrays_and_dtypes)

Returns the type that results from applying the NumPy type promotion rules to the arguments.

Type promotion in NumPy works similarly to the rules in languages like C++, with some slight differences. When both scalars and arrays are used, the array’s type takes precedence and the actual value of the scalar is taken into account.

For example, calculating 3*a, where a is an array of 32-bit floats, intuitively should result in a 32-bit float output. If the 3 is a 32-bit integer, the NumPy rules indicate it can’t convert losslessly into a 32-bit float, so a 64-bit float should be the result type. By examining the value of the constant, ‘3’, we see that it fits in an 8-bit integer, which can be cast losslessly into the 32-bit float.

[…]

I'm not quoting the entire thing here; refer to the doc string for more detail.
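
The dtype-to-dtype part of these rules can be probed directly with np.result_type (a small sketch; the value-based handling of scalars quoted above is a separate, version-dependent layer on top of this):

>>> np.result_type(np.int16, np.int32)      # larger of two signed integer types
dtype('int32')
>>> np.result_type(np.int16, np.uint16)     # needs a type that can hold both ranges
dtype('int32')
>>> np.result_type(np.float32, np.int64)    # int64 values don't fit losslessly in float32
dtype('float64')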

The exact way these rules apply is complicated and appears to represent a compromise between intuitiveness and efficiency.

For example, the choice of result dtype is based on the inputs, not on the result:

>>> A = np.full((2, 2), 30000, 'i2')
>>> A
array([[30000, 30000],
       [30000, 30000]], dtype=int16)
# 1
>>> A + 30000
array([[-5536, -5536],
       [-5536, -5536]], dtype=int16)
# 2
>>> A + 60000
array([[90000, 90000],
       [90000, 90000]], dtype=int32)

Here efficiency wins: it would arguably be more intuitive for #1 to behave like #2, but that would be expensive.
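
On the same (pre-2.0) numpy that produced the outputs above, you can confirm that the decision is made from the inputs alone using np.result_type (a sketch; numpy 2.x changed how Python scalars participate in promotion):

>>> np.result_type(A, 30000)    # 30000 fits in int16's range, so int16 is kept
dtype('int16')
>>> np.result_type(A, 60000)    # 60000 does not fit, so a larger type is chosen
dtype('int32')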

Also, and more directly related to your question, type promotion only applies out-of-place, not in-place:

# out-of-place
>>> A_new = A + 60000
>>> A_new
array([[90000, 90000],
       [90000, 90000]], dtype=int32)
# in-place
>>> A += 60000
>>> A
array([[24464, 24464],
       [24464, 24464]], dtype=int16)

or

# out-of-place
>>> A = np.full((2, 2), 30000, 'i2')   # reset A, since it was modified in-place above
>>> A_new = np.where([[0, 0], [0, 1]], 60000, A)
>>> A_new
array([[30000, 30000],
       [30000, 60000]], dtype=int32)
# in-place
>>> A[1, 1] = 60000
>>> A
array([[30000, 30000],
       [30000, -5536]], dtype=int16)
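
One way to look at the asymmetry is through numpy's casting categories: writing into the existing int16 buffer implies a narrowing cast, which is not 'safe' but is permitted under the 'same_kind' rule that in-place ufuncs default to (a small sketch using np.can_cast):

>>> np.can_cast(np.int16, np.int32, casting='safe')       # upcasting loses nothing
True
>>> np.can_cast(np.int32, np.int16, casting='safe')       # downcasting may overflow
False
>>> np.can_cast(np.int32, np.int16, casting='same_kind')  # same kind (integers), so allowed
True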

Again, this may seem rather non-intuitive. There are, however, compelling reasons for this choice.

And these should answer your second question:

Changing to a larger dtype would require allocating a larger buffer and copying over all the data. That alone would be expensive for large arrays, but it is not the only problem.

Many idioms in numpy rely on views and on the fact that writing to a view directly modifies the base array (and any other overlapping views). An array is therefore not free to change its data buffer whenever it feels like it. To avoid breaking the link between views, an array would have to be aware of every view into its data buffer, which would add a lot of administrative overhead, and all those views would have to update their data pointers and metadata as well. And if the first array is itself a view (a slice, say) into another array, things get even worse.
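
A minimal illustration of that link (just a sketch):

>>> a = np.array([30000, 30000], dtype=np.int16)
>>> v = a[1:]                   # v is a view into a's buffer
>>> v.base is a
True
>>> a[1] = 1                    # writing to the base is visible through the view
>>> v
array([1], dtype=int16)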

I suppose we can agree that this would not be worth it, and that is why types are not promoted in-place.
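
If you do want the larger type, the docstring quoted at the top already points at the explicit route: upcast the array yourself with .astype, which allocates a new buffer (so any existing views keep referring to the old one). A small sketch:

>>> t = np.array([2, 2])
>>> t = t.astype(np.int64)      # explicit upcast into a freshly allocated array
>>> t[0] = 222222222222222      # no OverflowError now
>>> t.dtype
dtype('int64')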
