Context
Since NumPy version 1.16, if you access multiple fields of a structured array, the dtype of the resulting array will have the same item size as the original one, leading to extra “padding”:
The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.
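A minimal sketch of what that padding looks like in practice, assuming NumPy 1.16 or later. The small zero-filled array and the name packed_size are only for illustration and are not part of the original code:

```python
import numpy as np

a = np.zeros(3, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
b = a[['x', 'i']]  # multi-field index: a view onto a in NumPy >= 1.16

# Bytes actually occupied by the selected fields (illustrative helper value)
packed_size = sum(b.dtype.fields[name][0].itemsize for name in b.dtype.names)

print(a.dtype.itemsize)  # size of one full record of a
print(b.dtype.itemsize)  # the view keeps the record size of a
print(packed_size)       # what a "packed" layout would need
```

The difference between b.dtype.itemsize and packed_size is the padding the quote refers to.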
This can lead to issues, e.g. if you want to add fields to the array in question later on:
    import numpy as np
    import numpy.lib.recfunctions

    a = np.array(
        [
            (10.0, 13.5, 1248, -2),
            (20.0, 0.0, 0, 0),
            (30.0, 0.0, 0, 0),
            (40.0, 0.0, 0, 0),
            (50.0, 0.0, 0, 999)
        ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
    )  # some array stolen from here: https://stackoverflow.com/a/37081693/5472354

    print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)  # all good so far

    b = a[['x', 'i']]  # for further processing I only need certain fields
    print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)  # you will only notice the extra padding in the descr

    # b = np.lib.recfunctions.repack_fields(b)  # workaround

    # now when I add fields, this becomes an issue
    c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
    c[list(b.dtype.names)] = b
    c['c'] = 1

    print(c.dtype.names)
    print(c['f1'])  # the void fields are filled with raw data and were given proper names
    # that can be accessed
A workaround is to use numpy.lib.recfunctions.repack_fields, which removes the padding, and I will use it in the future, but I need a fix for my existing code. (There can be issues with recfunctions, though: the module may not be found, as is the case for me, hence the additional import numpy.lib.recfunctions statement.)
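For reference, a small sketch of the repack_fields workaround with the explicit submodule import. The alias rf and the zero-filled test array are assumptions made for this example:

```python
import numpy as np
from numpy.lib import recfunctions as rf  # explicit import of the submodule

a = np.zeros(3, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
b = a[['x', 'i']]            # padded view
packed = rf.repack_fields(b)  # copy without the padding bytes

print(b.dtype.descr)       # contains unnamed void entries
print(packed.dtype.descr)  # only the selected fields remain
```

Importing the submodule explicitly avoids relying on np.lib.recfunctions being loaded as a side effect of import numpy.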
Question
This part of the code is what I used to add fields to an array (based on this):
    c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
    c[list(b.dtype.names)] = b
    c['c'] = 1
Though (now that I know of it) numpy.lib.recfunctions.require_fields may be a more appropriate way to add the fields. However, I would still need a way to remove the empty fields from b.dtype.descr:
[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
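As a rough sketch of the require_fields route mentioned above: the target dtype is written out by hand here (an assumption for the example), so the void entries of b.dtype.descr never enter the picture. Fields missing from the source are created and zeroed by require_fields, so 'c' is set afterwards:

```python
import numpy as np
from numpy.lib import recfunctions as rf

a = np.array(
    [(10.0, 13.5, 1248, -2), (20.0, 0.0, 0, 0)],
    dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')],
)
b = a[['x', 'i']]

# Hand-written target dtype (illustrative); assignment happens by field name
target = np.dtype([('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
c = rf.require_fields(b, target)
c['c'] = 1

print(c.dtype)
print(c)
```

This only helps when the wanted field list is known up front; deriving it from b.dtype.descr still needs the void entries filtered out.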
This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this, but I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields, e.g. from V8 to V16 if there are two in a row and so on (instead of a new void for each left-out field). So the code would become real clunky real fast.
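To illustrate the merging of adjacent gaps described here, a short sketch (assuming NumPy 1.16 or later; the variable names one_gap and merged_gap are made up for the example):

```python
import numpy as np

a = np.zeros(2, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])

# 'y' and 'j' are skipped separately: two 8-byte voids
one_gap = a[['x', 'i']].dtype.descr
# 'y' and 'i' are skipped in a row: one merged 16-byte void
merged_gap = a[['x', 'j']].dtype.descr

print(one_gap)
print(merged_gap)
```

Because the void size varies like this, removing a fixed ('', '|V8') tuple would not cover every case.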
Answer
    In [237]: a = np.array(
         ...:     [
         ...:         (10.0, 13.5, 1248, -2),
         ...:         (20.0, 0.0, 0, 0),
         ...:         (30.0, 0.0, 0, 0),
         ...:         (40.0, 0.0, 0, 0),
         ...:         (50.0, 0.0, 0, 999)
         ...:     ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
         ...: )

    In [238]: a
    Out[238]:
    array([(10., 13.5, 1248,  -2), (20.,  0. ,    0,   0), (30.,  0. ,    0,   0),
           (40.,  0. ,    0,   0), (50.,  0. ,    0, 999)],
          dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
the b view:
    In [240]: b = a[['x','i']]

    In [241]: b
    Out[241]:
    array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
          dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
the repacked copy (rf here is numpy.lib.recfunctions imported under that alias):
    In [243]: c = rf.repack_fields(b)

    In [244]: c
    Out[244]:
    array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
          dtype=[('x', '<f8'), ('i', '<i8')])

    In [245]: c.dtype
    Out[245]: dtype([('x', '<f8'), ('i', '<i8')])
your overly padded attempt at adding a field:
    In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
         ...: d[list(b.dtype.names)] = b
         ...: d['c'] = 1

    In [248]: d
    Out[248]:
    array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
           (20., b'\x00\x00\x00\x00\x00\x00\x00\x00',    0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
           ...],
          dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])
My first attempt at making a dtype that does not include the Void fields. I don’t know if simply testing for V is robust enough:
    In [253]: [des for des in b.dtype.descr if not 'V' in des[1]]
    Out[253]: [('x', '<f8'), ('i', '<i8')]
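On the robustness question, a small sketch of one case where the V test may drop too much: a structured array that has a genuine void field. The field name 'raw' and the variable names are invented for this example. Filtering on the empty name that padding entries carry keeps it:

```python
import numpy as np

# 'raw' is a made-up field holding real raw bytes (a named void field)
a = np.zeros(2, dtype=[('x', '<f8'), ('raw', 'V4'), ('i', '<i8')])
b = a[['x', 'raw']]  # leaves padding where 'i' used to be

# Test on the format string: also removes the real 'raw' field
by_kind = [des for des in b.dtype.descr if 'V' not in des[1]]
# Test on the name: padding entries are the ones named ''
by_name = [des for des in b.dtype.descr if des[0] != '']

print(by_kind)
print(by_name)
```

So testing the name is probably the safer of the two, at least for dtypes without unnamed fields of their own.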
And make a new dtype from that:
In [254]: d_dtype = _ + [('c','i4')]
All of this is normal Python list and tuple manipulation. I’ve seen that in other recfunctions. I suspect repack_fields does something like this.
Now we make a new array with the simpler dtype:
    In [255]: d = np.empty(b.shape, dtype=d_dtype)

    In [256]: d[list(b.dtype.names)] = b
         ...: d['c'] = 1

    In [257]: d
    Out[257]:
    array([(10., 1248, 1), (20.,    0, 1), (30.,    0, 1), (40.,    0, 1), (50.,    0, 1)],
          dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
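The steps so far could be wrapped in a small function for reuse in existing code. This is only a sketch: add_field and its parameters are hypothetical names, not a NumPy API, and it filters on the empty name of padding entries rather than on V:

```python
import numpy as np

def add_field(arr, name, fmt, fill):
    """Hypothetical helper: copy arr into a packed array with one extra field."""
    # Keep only the named entries; padding shows up in descr with the name ''
    kept = [des for des in arr.dtype.descr if des[0] != '']
    out = np.empty(arr.shape, dtype=kept + [(name, fmt)])
    out[list(arr.dtype.names)] = arr  # copy the existing fields
    out[name] = fill                  # fill the new field
    return out

a = np.array(
    [(10.0, 13.5, 1248, -2), (20.0, 0.0, 0, 0)],
    dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')],
)
d = add_field(a[['x', 'i']], 'c', 'i4', 1)
print(d.dtype.names)
print(d)
```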
I’ve extracted from repack_fields the code that constructs a new, un-padded dtype:
    In [262]: def foo(a):
         ...:     fieldinfo = []
         ...:     for name in a.names:
         ...:         tup = a.fields[name]
         ...:         fmt = tup[0]
         ...:         if len(tup) == 3:
         ...:             name = (tup[2], name)
         ...:         fieldinfo.append((name, fmt))
         ...:     print(fieldinfo)
         ...:     dt = np.dtype(fieldinfo)
         ...:     return dt
         ...:

    In [263]: foo(b.dtype)
    [('x', dtype('float64')), ('i', dtype('int64'))]
    Out[263]: dtype([('x', '<f8'), ('i', '<i8')])
This works from dtype.fields rather than dtype.descr. One’s a dict, the other a list.
    In [274]: b.dtype
    Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})

    In [275]: b.dtype.descr
    Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]

    In [276]: b.dtype.fields
    Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})

    In [277]: b.dtype.fields['x']
    Out[277]: (dtype('float64'), 0)
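Since fields only lists the named entries, a packed dtype can also be sketched as a one-liner from it. This is a simplified variant that ignores field titles (unlike the extracted function); the names packed and offsets are made up for the example:

```python
import numpy as np

a = np.zeros(2, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
b = a[['x', 'i']]

# fields maps name -> (dtype, offset[, title]); take the dtype, drop the offset
packed = np.dtype([(name, b.dtype.fields[name][0]) for name in b.dtype.names])
offsets = {name: b.dtype.fields[name][1] for name in b.dtype.names}

print(packed)   # field dtypes only, so no padding entries
print(offsets)  # where the fields sit inside the padded view
```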
another way of getting just the valid descr tuples from b.dtype:
    In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
    Out[278]: [('x', '<f8'), ('i', '<i8')]