I want to speed up the code below – namely the for loop. Is there a way to do it in numpy?
import numpy as np # define seend and random state rng = np.random.default_rng(0) # num of throws N = 10**1 # max number of trials total_n_trials = 10 # generate the throws' distributions of "performace" - overall scores # mu_throws = rng.normal(0.7, 0.1, N) mu_throws = np.linspace(0,1,N) all_trials = np.zeros(N*total_n_trials) for i in range(N): # generate trials per throw as Bernoulli trials with given mean all_trials[i*total_n_trials:(i+1)*total_n_trials] = rng.choice([0,1], size=total_n_trials, p=[1-mu_throws[i],mu_throws[i]])
More explanations – I want to generate N
sequences of Bernoulli trials (ie. 0s and 1s, called throws
) where each sequence has a mean (probability p
) given by values in another array (mu_throws
). This can be sampled from a normal distribution or in this case, I took it for simplicity to be a sequence of N=10
numbers from 0 to 1. The approach above works but is slow. I expect N
to be at least $10^4$ and total_n_trials
then can be anything in the order of hundreds to (tens of) thousands (and this is run several times). I have checked the following post but didn’t find an answer. I also know that numpy random choice
can generate multidimensional arrays but I didn’t find a way to set a different set of p
s for different rows. Basically getting the same as what I do above just reshaped:
all_trials.reshape(N,-1)
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.], [1., 0., 0., 1., 0., 0., 0., 1., 1., 0.], [1., 0., 1., 0., 0., 1., 0., 1., 0., 1.], [1., 0., 1., 0., 0., 0., 1., 1., 0., 0.], [1., 0., 0., 1., 0., 1., 0., 1., 1., 0.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
I also thought about this trick but haven’t figured how to use it for Bernoulli trials. Thanks for any tips.
Advertisement
Answer
N = 11 mu_throws = np.linspace(0,1,N) total_n_trials = 10_000 rng = np.random.default_rng(0) all_trials = (rng.random((N, total_n_trials)).T<mu_throws).T.astype(int) all_trials # shape (N, total_n_trials)
output:
array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [1, 0, 0, ..., 0, 0, 0], ..., [1, 0, 1, ..., 1, 1, 1], [1, 1, 1, ..., 0, 1, 1], [1, 1, 1, ..., 1, 1, 1]])
Basically, what I’m doing is generate random real numbers in the interval [0, 1)
and convert them in boolean results in function of a given probability (in mu_throws
).
if you compare the mu_throws
(actual probability mean) and the probability estimated in all_trials
you have:
np.c_[mu_throws, all_trials.mean(1)]
output:
array([[0. , 0. ], [0.1 , 0.1003], [0.2 , 0.1963], [0.3 , 0.305 ], [0.4 , 0.4006], [0.5 , 0.5056], [0.6 , 0.5992], [0.7 , 0.6962], [0.8 , 0.7906], [0.9 , 0.8953], [1. , 1. ]])
For N
and total_n_trials
values from your example the time required on my machine is 0.00014019012451171875
sec, vs 0.0012738704681396484
sec of your loop.