I have a set of large arrays (about 6 million elements each) that I want to basically perform a np.digitize but over multiple axes. I am looking for some suggestions on both how to effectively do this but also on how to store the results.
I need all the indices (or all the values, or a mask) of array A where the values of array B are in a range and the values of array C are in another range and D in yet another. I want either the values, indices, or mask so that I can do some as of yet undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin but len()
can do that.
Here is one example I worked up that seems reasonable:
import itertools import numpy as np A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb), (C_Cidx==cc), (D_Didx==dd))]]])
This however makes me nervous that I will run out of memory on large arrays.
I could also do it this way:
b_inds = np.empty((len(A), 10), dtype=np.bool) c_inds = np.empty((len(A), 12), dtype=np.bool) d_inds = np.empty((len(A), 24), dtype=np.bool) for i in range(10): b_inds[:,i] = B_Bidx = i for i in range(12): c_inds[:,i] = C_Cidx = i for i in range(24): d_inds[:,i] = D_Didx = i # get the A data for the 1,2,3 B,C,D bin print A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]]
At least here the output is of known and constant size.
Does anyone have any better thoughts on how to do this smarter? Or clarification that is needed?
Based on the answer by HYRY this is the path I decided to take.
import numpy as np import pandas as pd np.random.seed(42) A = np.random.random_sample(1e7) B = (np.random.random_sample(1e7) + 10)*20 C = (np.random.random_sample(1e7) + 20)*40 D = (np.random.random_sample(1e7) + 80)*80 # make the edges of the bins we want Bbins = np.linspace(B.min(), B.max(), 9) Cbins = np.linspace(C.min(), C.max(), 10) # note different number Dbins = np.linspace(D.min(), D.max(), 11) # note different number sA = pd.Series(A) cB = pd.cut(B, Bbins, include_lowest=True) cC = pd.cut(C, Cbins, include_lowest=True) cD = pd.cut(D, Dbins, include_lowest=True) dat = pd.DataFrame({'A':A, 'cB':cB.labels, 'cC':cC.labels, 'cD':cD.labels}) g = sA.groupby([cB.labels, cC.labels, cD.labels]).indices # this then gives all the indices that match the group print g[0,1,2] # this is all the array A data for that B,C,D bin print sA[g[0,1,2]]
This method seems lightning fast even for huge arrays.
Advertisement
Answer
How about use groupby
in Pandas. Fix some problem in your code first:
import itertools import numpy as np np.random.seed(42) A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]]) a_bins[1000]
output:
[(4, 6, 17), array([ 0.70723863, 0.907611 , 0.46214047])]
Here is the code that return the same result by Pandas:
import pandas as pd cB = pd.cut(B, 9) cC = pd.cut(C, 11) cD = pd.cut(D, 23) sA = pd.Series(A) g = sA.groupby([cB.labels, cC.labels, cD.labels]) g.get_group((3, 5, 16))
output:
800 0.707239 2320 0.907611 9388 0.462140 dtype: float64
If you want to calculate some statistics of every group, you can call the method of g
, for example:
g.mean()
returns:
0 0 0 0.343566 1 0.410979 2 0.700007 3 0.189936 4 0.452566 5 0.565330 6 0.539565 7 0.530867 8 0.568120 9 0.587762 11 0.352453 12 0.484903 13 0.477969 14 0.484328 15 0.467357 ... 8 10 8 0.559859 9 0.570652 10 0.656718 11 0.353938 12 0.628980 13 0.372350 14 0.404543 15 0.387920 16 0.742292 17 0.530866 18 0.389236 19 0.628461 20 0.387384 21 0.541831 22 0.573023 Length: 2250, dtype: float64