Skip to content
Advertisement

Tag: statistics

Is numpy.random.choice with replacement equivalent to multinomial sampling for a single trial?

I understand that strictly on concept, they are different. But in a single trial (or experiment) for numpy.random.multinomial, is it sampling the same way as numpy.random.choice though giving a different view of the output? For example: Output gives the identity of what was picked in the array [0,1,2,3,4,5] and Output gives the number of times each choice was picked, but

How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)

My understanding of “an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters” is that the number of clusters is determined by the data as they converge to a certain amount of clusters. This R Implementation https://github.com/jacobian1980/ecostates decides on the number of clusters in this way. Although, the R implementation uses a Gibbs

Pandas – Compute z-score for all columns

I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here’s a subsection of it: Some of my columns contain NaN values which I do not want to include into the z-score calculations so I intend to use a solution offered to this question: how to

Constructing a co-occurrence matrix in python pandas

I know how to do this in R. But, is there any function in pandas that transforms a dataframe to an nxn co-occurrence matrix containing the counts of two aspects co-occurring. For example a matrix df: would yield: Since the matrix is mirrored on the diagonal I guess there would be a way to optimize code. Answer It’s a simple

Boxplots in matplotlib: Markers and outliers

I have some questions about boxplots in matplotlib: Question A. What do the markers that I highlighted below with Q1, Q2, and Q3 represent? I believe Q1 is maximum and Q3 are outliers, but what is Q2?                        Question B How does matplotlib identify outliers? (i.e. how does it know that they are not the true max and min values?)

How to perform two-sample one-tailed t-test with numpy/scipy

In R, it is possible to perform two-sample one-tailed t-test simply by using In Python world, scipy provides similar function ttest_ind, but which can only do two-tailed t-tests. Closest information on the topic I found is this link, but it seems to be rather a discussion of the policy of implementing one-tailed vs two-tailed in scipy. Therefore, my question is

T-test in Pandas

If I want to calculate the mean of two categories in Pandas, I can do it like this: I have a lot of data formatted this way, and now I need to do a T-test to see if the mean of cat1 and cat2 are statistically different. How can I do that? Answer it depends what sort of t-test you

Advertisement