I was trying to simulate multivariate normal data in Python the following way:
n = 100
p = 10
mu = np.random.normal(loc=0, scale=3, size=p)
A = np.random.uniform(low=-4, high=4, size=(p, 1))
A = pd.DataFrame(A)
Sigma = np.dot(A, np.transpose(A))
Sigma = pd.DataFrame(Sigma)
X = np.random.multivariate_normal(mean=mu, cov=Sigma, size=n)
However, the resulting variables are always perfectly linearly correlated. I then checked np.linalg.eigvals(Sigma) and realized that Sigma is not positive semi-definite. This surprised me because 1) I did not get an error from multivariate_normal, and 2) Sigma was generated from an outer product, which is supposed to be positive semi-definite. Can anyone tell me what's going on and how to fix this?
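(For context on the symptoms described in the question: a quick diagnostic sketch, using the same shapes as the original code, shows that the outer product of a (p, 1) column with itself is still positive semi-definite mathematically, but has rank 1. The near-zero eigenvalues are only negative because of floating-point rounding.)

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10
A = rng.uniform(low=-4, high=4, size=(p, 1))  # a single column, as in the question
Sigma = A @ A.T                               # outer product of one vector: rank 1

print(np.linalg.matrix_rank(Sigma))           # prints 1 -- Sigma is singular
eigs = np.linalg.eigvalsh(Sigma)              # eigenvalues in ascending order
print(eigs[0])                                # near zero (may be slightly negative from rounding)
print(eigs[-1])                               # the single nonzero eigenvalue equals trace(Sigma)
```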
Answer
Here’s one idea. First, the cause: in the original code, A has shape (p, 1), so Sigma = A Aᵀ has rank 1. Mathematically it is still positive semi-definite (the tiny negative eigenvalues reported by eigvals are floating-point error), but it is singular, and a rank-1 covariance makes every variable a scalar multiple of a single underlying normal variable, hence the perfect correlation. The principle of the fix is to generate p random vectors (rather than the p scalars in the original code), find the covariance between each pair of vectors, then use that covariance matrix to construct the multivariate distribution. This removes the predictable interrelationships between the elements of the original Sigma (see comments).
import numpy as np

n = 100
p = 10
mu = np.random.normal(loc=0, scale=3, size=p)
# p random vectors of length p instead of p scalars
A_vec = np.random.uniform(low=-4, high=4, size=(p, p))
# p x p covariance matrix of the rows of A_vec; full rank with probability 1
Sigma_vec = np.cov(A_vec)
X = np.random.multivariate_normal(mean=mu, cov=Sigma_vec, size=n)
Inspecting the results with np.corrcoef(X.T) shows that the generated variables are no longer perfectly correlated.
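As a side-by-side sanity check (a sketch with a fixed seed, not part of the original answer), comparing the off-diagonal correlations from the two constructions makes the difference concrete: the rank-1 covariance produces correlations of exactly ±1 for every pair of columns, while the np.cov-based covariance does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 10
mu = rng.normal(loc=0, scale=3, size=p)

# Original construction: rank-1 covariance -> perfectly correlated columns
A = rng.uniform(-4, 4, size=(p, 1))
X_bad = rng.multivariate_normal(mean=mu, cov=A @ A.T, size=n, check_valid="ignore")

# Fixed construction: covariance of p random vectors -> full-rank covariance
Sigma = np.cov(rng.uniform(-4, 4, size=(p, p)))
X_good = rng.multivariate_normal(mean=mu, cov=Sigma, size=n)

off_diag = ~np.eye(p, dtype=bool)                     # mask selecting off-diagonal entries
print(np.abs(np.corrcoef(X_bad.T))[off_diag].min())   # ~1.0: every pair perfectly correlated
print(np.abs(np.corrcoef(X_good.T))[off_diag].max())  # below 1: correlations now vary
```

(check_valid="ignore" just suppresses the warning that the rank-1 matrix triggers due to its tiny negative rounding-error eigenvalues.)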