How is the dtype of a numpy array calculated internally?

I was just messing around with numpy arrays when I noticed a lesser-known behavior of the dtype parameter.

It seems to change as the input changes. For example,

import numpy as np

t = np.array([2, 2])
t.dtype

gives dtype('int32')

However,

t = np.array([2, 22222222222])
t.dtype

gives dtype('int64')

So, my first question is: how is this calculated? Does it pick a datatype that can hold the largest element and use it for all the elements? If so, doesn't that waste space, since the 2 in the second array is unnecessarily stored as a 64-bit integer?
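
For reference, here is a small snippet showing the per-element cost (just an illustration; the exact default dtype and the outputs shown in the comments depend on the platform):

t = np.array([2, 22222222222])
t.dtype, t.itemsize, t.nbytes          # e.g. (dtype('int64'), 8, 16) -- every element takes 8 bytes

# an explicit dtype avoids the extra space, at the cost of possible overflow
s = np.array([2, 2], dtype=np.int8)
s.dtype, s.itemsize, s.nbytes          # (dtype('int8'), 1, 2)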

Now, if I try to change the zeroth element of array([2, 2]), like so:

t = np.array([2, 2])
t[0] = 222222222222222

I get OverflowError: Python int too large to convert to C long.

My second question is: why doesn't it apply the same logic it used when creating the array if you change a particular value? Why doesn't it recompute and re-evaluate the dtype?

Any help is appreciated. Thanks in advance.


Answer

Let us try and find the relevant bits in the docs.

From the np.array doc string:

array(…)

[…]

Parameters

[…]

dtype : data-type, optional
    The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to 'upcast' the array. For downcasting, use the .astype(t) method.

[…]

(my emphasis on "the minimum type required to hold the objects in the sequence")

It should be noted that this is not entirely accurate: for integer arrays, for example, the system (C) default integer is preferred over smaller integer types, as is evident from your example.
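
You can see the difference between the truly minimal type and what np.array actually picks with a quick check (a small sketch; the default integer, and hence the second output, depends on the platform):

>>> import numpy as np
>>> np.min_scalar_type(2)       # the smallest dtype that could hold the value 2
dtype('uint8')
>>> np.array([2, 2]).dtype      # but array construction prefers the default (C long) integer
dtype('int32')                  # int64 on most 64-bit Linux/macOS builds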

Note that for numpy to be fast it is essential that all elements of an array be of the same size. Otherwise, how would you quickly locate the 1000th element, say? Also, mixing types wouldn’t save all that much space since you would have to store the types of every single element on top of the raw data.
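
To make that concrete: because every element has the same size, element i starts at byte offset i * itemsize, so locating it is a single multiplication (a small illustration, nothing more):

>>> a = np.arange(2000, dtype=np.int64)
>>> a.itemsize                  # every element occupies exactly 8 bytes
8
>>> a.strides                   # bytes to step from one element to the next
(8,)
>>> 1000 * a.itemsize           # byte offset of the 1000th element in the buffer
8000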

Regarding your second question: first of all, there are type promotion rules in numpy. The best documentation I could find for them is the np.result_type doc string:

result_type(…)
    result_type(*arrays_and_dtypes)

Returns the type that results from applying the NumPy type promotion rules to the arguments.

Type promotion in NumPy works similarly to the rules in languages like C++, with some slight differences. When both scalars and arrays are used, the array’s type takes precedence and the actual value of the scalar is taken into account.

For example, calculating 3*a, where a is an array of 32-bit floats, intuitively should result in a 32-bit float output. If the 3 is a 32-bit integer, the NumPy rules indicate it can’t convert losslessly into a 32-bit float, so a 64-bit float should be the result type. By examining the value of the constant, ‘3’, we see that it fits in an 8-bit integer, which can be cast losslessly into the 32-bit float.

[…]

I'm not quoting the entire thing here; refer to the doc string for more detail.
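
The dtype-to-dtype part of these rules can be probed directly with np.result_type (a small sketch; the value-based handling of scalars quoted above is a separate, version-dependent layer on top of this):

>>> np.result_type(np.int16, np.int32)      # larger of two signed integer types
dtype('int32')
>>> np.result_type(np.int16, np.uint16)     # needs a type that can hold both ranges
dtype('int32')
>>> np.result_type(np.float32, np.int64)    # int64 values don't fit losslessly in float32
dtype('float64')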

The exact way these rules apply is complicated and appears to represent a compromise between intuitiveness and efficiency.

For example, the choice of result dtype is based on the inputs, not on the result:

>>> A = np.full((2, 2), 30000, 'i2')
>>> A
array([[30000, 30000],
       [30000, 30000]], dtype=int16)
# 1
>>> A + 30000
array([[-5536, -5536],
       [-5536, -5536]], dtype=int16)
# 2
>>> A + 60000
array([[90000, 90000],
       [90000, 90000]], dtype=int32)

Here efficiency wins: it would arguably be more intuitive for #1 to behave like #2, but that would be expensive.
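
On the same (pre-2.0) numpy that produced the outputs above, you can confirm that the decision is made from the inputs alone using np.result_type (a sketch; numpy 2.x changed how Python scalars participate in promotion):

>>> np.result_type(A, 30000)    # 30000 fits in int16's range, so int16 is kept
dtype('int16')
>>> np.result_type(A, 60000)    # 60000 does not fit, so a larger type is chosen
dtype('int32')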

Also, and more directly related to your question, type promotion only applies out-of-place, not in-place:

# out-of-place
>>> A_new = A + 60000
>>> A_new
array([[90000, 90000],
       [90000, 90000]], dtype=int32)
# in-place
>>> A += 60000
>>> A
array([[24464, 24464],
       [24464, 24464]], dtype=int16)

or

# out-of-place
>>> A = np.full((2, 2), 30000, 'i2')   # reset A, since it was modified in-place above
>>> A_new = np.where([[0, 0], [0, 1]], 60000, A)
>>> A_new
array([[30000, 30000],
       [30000, 60000]], dtype=int32)
# in-place
>>> A[1, 1] = 60000
>>> A
array([[30000, 30000],
       [30000, -5536]], dtype=int16)
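
One way to look at the asymmetry is through numpy's casting categories: writing into the existing int16 buffer implies a narrowing cast, which is not 'safe' but is permitted under the 'same_kind' rule that in-place ufuncs default to (a small sketch using np.can_cast):

>>> np.can_cast(np.int16, np.int32, casting='safe')       # upcasting loses nothing
True
>>> np.can_cast(np.int32, np.int16, casting='safe')       # downcasting may overflow
False
>>> np.can_cast(np.int32, np.int16, casting='same_kind')  # same kind (integers), so allowed
True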

Again, this may seem rather non-intuitive. There are, however, compelling reasons for this choice.

And these should answer your second question:

Changing to a larger dtype would require allocating a larger buffer and copying over all the data. That alone would be expensive for large arrays, but it is not the only problem.

Many idioms in numpy rely on views and on the fact that writing to a view directly modifies the base array (and any other overlapping views). An array is therefore not free to change its data buffer whenever it feels like it. To avoid breaking the link between views, an array would have to be aware of every view into its data buffer, which would add a lot of administrative overhead, and all those views would have to update their data pointers and metadata as well. And if the first array is itself a view (a slice, say) into another array, things get even worse.
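
A minimal illustration of that link (just a sketch):

>>> a = np.array([30000, 30000], dtype=np.int16)
>>> v = a[1:]                   # v is a view into a's buffer
>>> v.base is a
True
>>> a[1] = 1                    # writing to the base is visible through the view
>>> v
array([1], dtype=int16)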

I suppose we can agree that this would not be worth it, and that is why types are not promoted in-place.
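
If you do want the larger type, the docstring quoted at the top already points at the explicit route: upcast the array yourself with .astype, which allocates a new buffer (so any existing views keep referring to the old one). A small sketch:

>>> t = np.array([2, 2])
>>> t = t.astype(np.int64)      # explicit upcast into a freshly allocated array
>>> t[0] = 222222222222222      # no OverflowError now
>>> t.dtype
dtype('int64')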
