I’m currently trying to manually implement a function that represents the KNN graph of a set of points as an incidence matrix. My idea was to take the rows of an affinity matrix (an n x n matrix representing the distances between the n points), enumerate and sort them, then return the indices of the first K elements:
    for node in range(node_count):
        # neighbor_indices[:, node] =
        print(
            np.fromiter(
                np.ndenumerate(affinity_matrix[node, :]),
                dtype=(np.intp, np.float64),
                count=node_count,
            )  # .sort(
            #     reverse=True, key=lambda x: x[1]
            # )[1 :: k + 1][0]
        )
The errors I get depend on the value of `dtype`. The obvious choice, I thought, was `dtype=(np.intp, np.float64)` or `dtype=(int, np.float64)`, but this raises `ValueError: setting an array element with a sequence.`, which I take to mean that I’m trying to put multiple values into a single slot.
When inspecting the output of `ndenumerate` in a loop, the first value appears to be a single value inside a tuple:
    for x in np.ndenumerate(affinity_matrix[node, :]):
        print(x)
        print(type(x), " ", type(x[0]), " ", type(x[0][0]))
Output:
    ((990,), 0.9958856990164133)
    <class 'tuple'> <class 'tuple'> <class 'int'>
But setting `dtype=((int,), np.float64)` throws the error `TypeError: Tuple must have size 2, but has size 1`.
Is there a way to use `fromiter` and `ndenumerate` together, or are they somehow incompatible?
Answer
`ndenumerate` produces, for each element, an indexing tuple and the value:
    In [163]: x = np.arange(6)
    In [164]: list(np.ndenumerate(x))
    Out[164]: [((0,), 0), ((1,), 1), ((2,), 2), ((3,), 3), ((4,), 4), ((5,), 5)]
That makes more sense when the array is 2d or more. The indexing tuples will have 2 or more values:
    In [165]: list(np.ndenumerate(x.reshape(3,2)))
    Out[165]: [((0, 0), 0), ((0, 1), 1), ((1, 0), 2), ((1, 1), 3), ((2, 0), 4), ((2, 1), 5)]
With ‘plain’ enumerate, you get a 2 element tuple:
    In [166]: list(enumerate(x))
    Out[166]: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
With `fromiter` and the compound dtype:
    In [167]: np.fromiter(enumerate(x), dtype=np.dtype("i,f"))
    Out[167]:
    array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
          dtype=[('f0', '<i4'), ('f1', '<f4')])
The `dtype` shows the full specification that the shorthand `"i,f"` produces. With that dtype, you get a structured array, which can be accessed field by field:
    In [169]: _['f0'], _['f1']
    Out[169]:
    (array([0, 1, 2, 3, 4, 5], dtype=int32),
     array([0., 1., 2., 3., 4., 5.], dtype=float32))
I've never seen `fromiter` used with `enumerate`. Admittedly `enumerate/ndenumerate` are generators, and `fromiter` is supposed to be the better way of creating an array from generators. Let's try some timings:
    In [170]: y = np.random.rand(10000)
    In [171]: timeit np.fromiter(enumerate(y), dtype=np.dtype("i,f"))
    2.39 ms ± 68.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    In [172]: timeit list(enumerate(y))
    1.37 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Just 'listing' the generator is faster. `ndenumerate` is slower:
    In [173]: timeit list(np.ndenumerate(y))
    4.58 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But if your goal is an array, not just a list, then `fromiter` is faster than building the list first:
    In [174]: timeit np.array(list(enumerate(y)))
    9.99 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I can't find the source code for `ndenumerate` (it's buried in some file redirections), but I suspect it uses `ndindex` to create the indexing tuples, and then makes a new tuple from that plus the value:
    In [179]: list(np.ndindex(x.shape))
    Out[179]: [(0,), (1,), (2,), (3,), (4,), (5,)]
    In [180]: list(np.ndindex(3,2))
    Out[180]: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
For a 1d array, it's easy to create the index: `np.arange(x.shape[0])`. For higher dimensions, `meshgrid`, `mgrid` etc. can generate all the indexing arrays.
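Coming back to the question itself: `fromiter` and `ndenumerate` can be combined for a 1d row if the one-element index tuple is unpacked first, so that each item is a flat pair like the ones `enumerate` produces. Here is a minimal sketch, assuming a made-up `row` in place of `affinity_matrix[node, :]` and field names `idx`/`val` that I picked for illustration (they are not from the original code). As far as I know, a plain tuple such as `(np.intp, np.float64)` is not one of the ways to spell a two-field dtype, which is why the sketch uses a list of `(name, type)` pairs:

```python
import numpy as np

# Hypothetical stand-in for one row of the affinity matrix.
row = np.array([0.3, 0.9, 0.1, 0.5])

# Structured dtype written as a list of (name, type) pairs.
# The field names "idx" and "val" are chosen for this example only.
pair_dtype = np.dtype([("idx", np.intp), ("val", np.float64)])

# ndenumerate yields ((i,), value); unpack the 1-tuple so that
# fromiter receives flat (i, value) pairs matching the two fields.
pairs = np.fromiter(
    ((index[0], value) for index, value in np.ndenumerate(row)),
    dtype=pair_dtype,
    count=row.size,
)

print(pairs["idx"])
print(pairs["val"])
```

Given the timings above, this is unlikely to be the fastest route for large rows; it mainly shows that the two functions are not incompatible.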
edit
For a 1d array, this function produces the same structured array as the `fromiter` call above:
    def foo(x):
        n = x.shape[0]
        res = np.empty(n, 'i,f')
        res['f0'] = np.arange(n)
        res['f1'] = x
        return res

    In [216]: foo(x)
    Out[216]:
    array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
          dtype=[('f0', '<i4'), ('f1', '<f4')])
    In [217]: foo(y)
    Out[217]:
    array([( 0, 0.08351453), ( 1, 0.86144197), ( 2, 0.6635565 ), ...,
           (9997, 0.52427566), (9998, 0.7808558 ), (9999, 0.5060718 )],
          dtype=[('f0', '<i4'), ('f1', '<f4')])
    In [218]: timeit foo(y)
    51.8 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
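If the end goal is the one from the question, the indices of the K nearest neighbours per row, the (index, value) pairs may not be needed at all, because `np.argsort` returns the sorted indices directly. This is a rough sketch under a few assumptions that are mine, not the question's: the `points` array is made up, the matrix holds Euclidean distances (smaller means closer; for a similarity matrix, sort the negated matrix instead), and there are no duplicate points, so each point is its own nearest entry:

```python
import numpy as np

# Hypothetical 4-point data set; the "affinity" here is pairwise
# Euclidean distance, so smaller means closer.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
diff = points[:, None, :] - points[None, :, :]
affinity_matrix = np.sqrt((diff ** 2).sum(axis=-1))

k = 2

# argsort returns, for every row, the column indices in order of
# increasing distance. Column 0 is the point itself (distance 0,
# assuming no duplicate points), so skip it and keep the next k.
order = np.argsort(affinity_matrix, axis=1)
neighbor_indices = order[:, 1 : k + 1]

print(neighbor_indices)
```

Each row of `neighbor_indices` holds the k nearest neighbours of that point. The commented-out code in the question assigns to `neighbor_indices[:, node]`, so take the transpose if the neighbours should run down the columns instead.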