Tag: statistics

Unable to fix “ValueError: DataFrame constructor not properly called!”

linear-regression numpy pandas python statistics

I was asked to write a program for linear regression with the following steps: Load the R data set mtcars as a pandas DataFrame. Build another linear regression model using the log of the independent variable wt and the log of the dependent variable mpg. Fit the model with the data and display the R-squared value. I am a beginner at statistics with
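As a rough illustration of the log-log fit described in this question, here is a minimal sketch. It makes several assumptions: a small made-up frame stands in for mtcars (these are not the real values), the column names wt and mpg are taken from the excerpt, and numpy.polyfit is used in place of a dedicated regression library. The error in the title typically appears when pd.DataFrame is handed something it cannot read as tabular data (a plain string, for instance), so the sketch builds the frame from a dict of columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for mtcars (NOT the real data set).
cars = pd.DataFrame({
    "wt": [1.8, 2.3, 2.9, 3.2, 3.6, 4.1, 5.2],
    "mpg": [31.0, 25.5, 21.0, 19.5, 17.0, 15.0, 11.5],
})

# Take logs of both the independent and the dependent variable.
log_wt = np.log(cars["wt"].to_numpy())
log_mpg = np.log(cars["mpg"].to_numpy())

# Ordinary least squares for log(mpg) = intercept + slope * log(wt).
slope, intercept = np.polyfit(log_wt, log_mpg, deg=1)

# R-squared = 1 - (residual sum of squares / total sum of squares).
fitted = intercept + slope * log_wt
ss_res = np.sum((log_mpg - fitted) ** 2)
ss_tot = np.sum((log_mpg - log_mpg.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```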

Is there a way to get the error in fitting parameters from scipy.stats.norm.fit?

curve-fitting data-fitting gaussian python statistics

I have some data to which I have fitted a normal distribution using the scipy.stats.norm object's fit function, like so: I would now like to extract the uncertainty/error in the fitted mu and sigma values. How can I go about this? Answer You can use scipy.optimize.curve_fit: This method returns not only the estimated optimal values of the parameters, but
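The curve_fit approach mentioned in this answer could look roughly like the sketch below. It assumes synthetic data, a 40-bin density histogram as the curve to fit, and a helper named gaussian_pdf (hypothetical, not from the original post). The reported errors depend on the binning, so treat them as indicative only.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=2000)  # synthetic data

# Bin the data into a density histogram so there is a curve to fit.
density, edges = np.histogram(samples, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])


def gaussian_pdf(x, mu, sigma):
    """Normal density, the model handed to curve_fit."""
    return norm.pdf(x, loc=mu, scale=sigma)


# Use the maximum-likelihood estimates from norm.fit as starting values.
mu0, sigma0 = norm.fit(samples)
popt, pcov = curve_fit(gaussian_pdf, centers, density, p0=[mu0, sigma0])

# One-standard-deviation uncertainties are the square roots of the
# diagonal of the estimated covariance matrix.
mu_err, sigma_err = np.sqrt(np.diag(pcov))
print(f"mu    = {popt[0]:.3f} +/- {mu_err:.3f}")
print(f"sigma = {popt[1]:.3f} +/- {sigma_err:.3f}")
```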

Is numpy.random.choice with replacement equivalent to multinomial sampling for a single trial?

multinomial numpy python random statistics

I understand that, strictly in concept, they are different. But for a single trial (or experiment) of numpy.random.multinomial, does it sample the same way as numpy.random.choice while giving a different view of the output? For example: Output gives the identity of what was picked in the array [0,1,2,3,4,5] and Output gives the number of times each choice was picked, but
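To make the two views of the output concrete, here is a small sketch assuming a fair six-sided choice. Tallying the picks from choice gives counts of the same form as a multinomial draw. The two calls need not produce identical numbers from the same generator state, so only the shape and meaning of the outputs are compared.

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.full(6, 1 / 6)  # six equally likely categories (assumption)
n_draws = 10

# choice reports WHICH category came up on each draw.
picks = rng.choice(6, size=n_draws, replace=True, p=probs)

# Tallying the picks turns them into per-category counts.
counts_from_choice = np.bincount(picks, minlength=6)

# multinomial reports HOW MANY times each category came up.
counts_multinomial = rng.multinomial(n_draws, probs)

# Single trial: multinomial returns a one-hot vector, and the position
# of the 1 plays the role of the index that choice would return.
one_hot = rng.multinomial(1, probs)
picked_index = int(np.argmax(one_hot))

print(picks, counts_from_choice, counts_multinomial, one_hot, picked_index)
```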

How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)

bayesian machine-learning python scikit-learn statistics

My understanding of “an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters” is that the number of clusters is determined by the data as they converge to a certain number of clusters. This R Implementation https://github.com/jacobian1980/ecostates decides on the number of clusters in this way. However, the R implementation uses a Gibbs
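In current scikit-learn releases, one way to get this behaviour is BayesianGaussianMixture with a Dirichlet process weight prior, where n_components acts as an upper bound on the number of clusters rather than a fixed count. The sketch below assumes synthetic two-blob data and an arbitrary upper bound of 6. Components the data do not support tend to end up with weights near zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two well-separated 2-D blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=8.0, scale=0.5, size=(200, 2)),
])

model = BayesianGaussianMixture(
    n_components=6,  # upper bound on the number of clusters
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
labels = model.fit_predict(X)

# Inspect how much weight each of the 6 candidate components received.
print(np.round(model.weights_, 3))
print("clusters actually used:", np.unique(labels).size)
```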

Pandas – Compute z-score for all columns

dataframe indexing pandas python statistics

I have a dataframe containing a single column of IDs, and all other columns are numerical values for which I want to compute z-scores. Here’s a subsection of it: Some of my columns contain NaN values which I do not want to include in the z-score calculations, so I intend to use a solution offered to this question: how to
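A minimal sketch of column-wise z-scores that leave NaN values out of the calculation, assuming a made-up frame with an ID column and two numeric columns named x and y. pandas skips NaN in mean() and std() by default; ddof=0 is chosen here to get the population standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up example frame: one ID column plus numeric columns with NaNs.
df = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, np.nan],
    "y": [10.0, np.nan, 30.0, 50.0],
})

numeric_cols = df.columns.drop("ID")
values = df[numeric_cols]

# mean() and std() ignore NaN by default, so missing entries do not
# affect the statistics and simply stay NaN in the result.
zscores = (values - values.mean()) / values.std(ddof=0)

result = pd.concat([df[["ID"]], zscores], axis=1)
print(result)
```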

Constructing a co-occurrence matrix in python pandas

pandas python statistics

I know how to do this in R, but is there any function in pandas that transforms a dataframe into an nxn co-occurrence matrix containing the counts of two aspects co-occurring? For example, a matrix df: would yield: Since the matrix is mirrored on the diagonal, I guess there would be a way to optimize the code. Answer It’s a simple
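One common way to build such a matrix, sketched here under the assumption that the dataframe holds 0/1 indicators (one row per observation, one column per aspect), is to multiply the transposed indicator matrix by itself and zero the diagonal. The example data and column names are made up.

```python
import numpy as np
import pandas as pd

# Made-up indicator data: 1 means the aspect is present in that row.
df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=["A", "B", "C"],
)

indicators = df.to_numpy()

# Entry (i, j) of X^T X counts the rows where both aspects are present.
counts = indicators.T @ indicators

# The diagonal holds each aspect's own total, not a co-occurrence.
np.fill_diagonal(counts, 0)

cooc = pd.DataFrame(counts, index=df.columns, columns=df.columns)
print(cooc)
```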

Boxplots in matplotlib: Markers and outliers

boxplot matplotlib python statistics

I have some questions about boxplots in matplotlib: Question A. What do the markers that I highlighted below with Q1, Q2, and Q3 represent? I believe Q1 is the maximum and Q3 are outliers, but what is Q2? Question B. How does matplotlib identify outliers? (i.e. how does it know that they are not the true max and min values?)
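On Question B: with matplotlib's default setting (whis=1.5), whiskers reach to the most extreme data points that lie within 1.5 times the interquartile range of the box edges, and anything beyond is drawn as an outlier. The sketch below reproduces that rule with numpy on made-up data. It illustrates the rule and is not matplotlib's own code.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
data = np.array([2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Box edges (first and third quartile) and the median line.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is plotted as an individual outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```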

How to perform two-sample one-tailed t-test with numpy/scipy

python scipy statistics

In R, it is possible to perform a two-sample one-tailed t-test simply by using: In the Python world, scipy provides a similar function, ttest_ind, but it can only do two-tailed t-tests. The closest information on the topic I found is this link, but it seems to be more a discussion of the policy of implementing one-tailed vs. two-tailed tests in scipy. Therefore, my question is
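A common workaround is to derive the one-tailed p-value from the two-tailed result, as sketched below on synthetic samples for the alternative "mean of a is greater than mean of b". Newer SciPy releases (1.6 and later) also accept an alternative argument on ttest_ind, which does this directly. The manual version is shown because it works on older versions too.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=1.0, scale=1.0, size=50)  # synthetic sample a
b = rng.normal(loc=0.0, scale=1.0, size=50)  # synthetic sample b

# Standard two-tailed test (equal variances assumed by default).
t_stat, p_two_sided = stats.ttest_ind(a, b)

# One-tailed test of H1: mean(a) > mean(b).
# Halve the two-tailed p-value when t points in the hypothesised
# direction; otherwise take the complementary tail.
if t_stat > 0:
    p_greater = p_two_sided / 2
else:
    p_greater = 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}, one-tailed p = {p_greater:.4g}")
```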

T-test in Pandas

hypothesis-test pandas python scipy statistics

If I want to calculate the mean of two categories in Pandas, I can do it like this: I have a lot of data formatted this way, and now I need to do a T-test to see whether the means of cat1 and cat2 are statistically different. How can I do that? Answer It depends on what sort of t-test you
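As one possible reading of this question, the sketch below splits a long-format frame by category and runs an independent two-sample test. The column names Category and value and the data are assumptions. Welch's variant (equal_var=False) is chosen here, and a paired test would need scipy.stats.ttest_rel instead.

```python
import pandas as pd
from scipy import stats

# Made-up long-format data: one row per observation.
df = pd.DataFrame({
    "Category": ["cat1"] * 5 + ["cat2"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# Group means, as in the question.
print(df.groupby("Category")["value"].mean())

# Pull out the two samples and compare their means.
cat1 = df.loc[df["Category"] == "cat1", "value"]
cat2 = df.loc[df["Category"] == "cat2", "value"]

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(cat1, cat2, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```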

Convert Z-score (Z-value, standard score) to p-value for normal distribution in Python

python scipy statistics

How does one convert a Z-score from the Z-distribution (standard normal distribution, Gaussian distribution) to a p-value? I have yet to find the magical function in Scipy’s stats module to do this, but one must be there. Answer I like the survival function (upper tail probability) of the normal distribution a bit better, because the function name is more informative:
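The survival function mentioned in this answer is scipy.stats.norm.sf. A short sketch, assuming an example z-score of 1.96:

```python
from scipy.stats import norm

z = 1.96  # example z-score (assumption)

# Upper-tail (one-sided) p-value: P(Z > z).
p_one_sided = norm.sf(z)

# Two-sided p-value: probability of a value at least this extreme
# in either direction.
p_two_sided = 2 * norm.sf(abs(z))

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```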