Suppose I have:
x1 = [1, 3, 2, 4]
and:
x2 = [0, 1, 1, 0]
with the same shape.
Now I want to "put x2 on top of x1" and sum up the numbers of x1 that share the same number in x2,
so the end result is:
end = [1+4, 3+2]  # end[0] is the sum of all numbers of x1 where a 0 was in x2
This is a naive implementation using lists to further clarify the question:
store_0 = 0
store_1 = 0
x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
for value_x1, value_x2 in zip(x1, x2):
    if value_x2 == 0:
        store_0 += value_x1
    elif value_x2 == 1:
        store_1 += value_x1
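Running this on the example input confirms the expected result:

print([store_0, store_1])  # [5, 5], i.e. end = [1+4, 3+2]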
So my question: is there a way to implement this in numpy without using loops, or in general just faster?
Answer
In this particular example (and, in general, for unique, duplicated, and groupby kinds of operations), pandas is faster than a pure numpy solution:
A pandas way, using Series (credit: very similar to @mcsoini’s answer):
import numpy as np
import pandas as pd

def pd_group_sum(x1, x2):
    return pd.Series(x1, index=x2).groupby(x2).sum()
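To see what this does, here is a minimal sketch on the question's input: the values of the Series come from x1 and the group labels from x2 become the index, so groupby(x2).sum() collapses each label to a single total:

import pandas as pd

x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]

s = pd.Series(x1, index=x2)  # values [1, 3, 2, 4] labelled [0, 1, 1, 0]
print(s.groupby(x2).sum())
# 0    5
# 1    5
# dtype: int64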
A pure numpy way, using np.unique and some fancy indexing:
def np_group_sum(a, groups):
    # rix maps each element to its group id; compare against all group ids at once
    _, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
    return np.where(np.arange(len(ix))[:, None] == rix, a, 0).sum(axis=1)
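As a sketch of the mechanics on the question's small input: rix holds the group id of each element, and the broadcast comparison builds one boolean row per group, so masking a and summing along axis=1 yields one total per group:

import numpy as np

a = np.array([1, 3, 2, 4])
groups = np.array([0, 1, 1, 0])

_, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
# rix == array([0, 1, 1, 0]): the group id of each element

mask = np.arange(len(ix))[:, None] == rix
# mask == [[ True, False, False,  True],   row 0 selects group 0
#          [False,  True,  True, False]]   row 1 selects group 1

print(np.where(mask, a, 0).sum(axis=1))  # [5 5]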
Note: a better pure numpy way is inspired by @Woodford’s answer:
def selsum(a, g, e):
    return a[g == e].sum()

vselsum = np.vectorize(selsum, signature='(n),(n),()->()')

def np_group_sum2(a, groups):
    return vselsum(a, groups, np.unique(groups))
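The signature '(n),(n),()->()' tells np.vectorize to pass the two arrays through whole and loop only over the scalar third argument, so handing it the unique labels calls selsum once per group. A quick self-contained check:

import numpy as np

def selsum(a, g, e):
    return a[g == e].sum()

# loop over the scalar third argument only; a and g are passed whole
vselsum = np.vectorize(selsum, signature='(n),(n),()->()')

a = np.array([1, 3, 2, 4])
groups = np.array([0, 1, 1, 0])
print(vselsum(a, groups, np.unique(groups)))  # [5 5]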
Yet another pure numpy way is inspired by a comment from @mapf about using argsort(). That by itself already takes 45 ms, but we can try something based on np.argpartition(x2, len(x2)-1) instead, since that takes only 7.5 ms by itself on the benchmark below:
def np_group_sum3(a, groups):
    ix = np.argpartition(groups, len(groups)-1)
    ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
    return np.diff(np.r_[0, a[ix].cumsum()[ends]])
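As a sketch of the cumsum trick (using np.argsort here instead of np.argpartition so the trace is deterministic on a small input): after permuting by ix, equal labels are adjacent; ends marks the last index of each run, and differencing the cumulative sum at those positions recovers the per-group totals. The appended groups.max() + 1 sentinel ensures the final run is closed.

import numpy as np

a = np.array([1, 3, 2, 4])
groups = np.array([0, 1, 1, 0])

ix = np.argsort(groups, kind='stable')  # [0, 3, 1, 2] -> labels [0, 0, 1, 1]
ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
# ends == [1, 3]: last position of each run of equal labels
print(np.diff(np.r_[0, a[ix].cumsum()[ends]]))  # [5 5]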
(Slightly modified) example
x1 = np.array([1, 3, 2, 4, 8])  # I added a group for the sake of generality
x2 = np.array([0, 1, 1, 0, 7])

>>> pd_group_sum(x1, x2)
0    5
1    5
7    8

>>> np_group_sum(x1, x2)  # and all the np_group_sum() variants
array([5, 5, 8])
Speed
n = 1_000_000
x1 = np.random.randint(0, 20, n)
x2 = np.random.randint(0, 20, n)

%timeit pd_group_sum(x1, x2)
# 13.9 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np_group_sum(x1, x2)
# 171 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum2(x1, x2)
# 66.7 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum3(x1, x2)
# 25.6 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Going via pandas is faster, in part because of numpy issue 11136.