Tag: statistics

Micro metrics vs macro metrics

classification python scikit-learn statistics

To test the results of my multi-label classification model, I measured the Precision, Recall and F1 scores. I wanted to compare two different results, Micro and Macro. I have a dataset with few rows, but my label count is around 1700. Why is the macro so low even though I get a high result in micro, which one would be
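A minimal sketch of why the two averages can diverge, using a tiny made-up multi-label example (the arrays below are hypothetical, not the asker's data). Micro averaging pools true/false positives over all labels before computing F1, while macro averaging computes F1 per label and then takes the unweighted mean, so many rare, poorly predicted labels pull the macro score down. This mirrors what average="micro" and average="macro" do in scikit-learn's f1_score, with undefined per-label scores counted as 0.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), scored as 0 where the denominator is 0."""
    tp, fp, fn = (np.asarray(v, dtype=float) for v in (tp, fp, fn))
    denom = 2 * tp + fp + fn
    safe_denom = np.where(denom > 0, denom, 1.0)  # avoid dividing by zero
    return np.where(denom > 0, 2 * tp / safe_denom, 0.0)

# Hypothetical data: label 0 is frequent and always predicted correctly,
# labels 1 and 2 are rare and never predicted.
y_true = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]])

# Per-label confusion counts (one entry per label column).
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

macro_f1 = float(f1_from_counts(tp, fp, fn).mean())              # mean of per-label F1
micro_f1 = float(f1_from_counts(tp.sum(), fp.sum(), fn.sum()))   # F1 of pooled counts

print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

Here the micro F1 is 0.8 while the macro F1 is about 0.33, because two of the three labels contribute a per-label F1 of 0. With around 1700 labels and few rows, a similar effect seems plausible.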

Cramer V correlation in python but instead of using frequency using weights?

python statistics

So the dataset for Cramer V correlation has multiple categorical variables in columns, but there is also a column telling us how often these values appear, similar to the table below: I want to figure out how to calculate the Cramer V correlation between season/Age/Weather, with sales as the weight. If doable, how would one write something to
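One possible approach, sketched under the assumption that the weights can be treated like frequencies: build the contingency table by summing the weight column per cell instead of counting rows, then apply the usual Cramér's V formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The function name and the sample values are hypothetical, since the asker's table is not shown. The sketch assumes non-negative weights and at least two levels per variable.

```python
import numpy as np

def weighted_cramers_v(x, y, weights):
    """Cramér's V from a contingency table whose cells are summed weights."""
    weights = np.asarray(weights, dtype=float)
    x_levels, x_idx = np.unique(np.asarray(x), return_inverse=True)
    y_levels, y_idx = np.unique(np.asarray(y), return_inverse=True)

    # Accumulate weights into the (x level, y level) cells.
    table = np.zeros((len(x_levels), len(y_levels)))
    np.add.at(table, (x_idx, y_idx), weights)

    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # assumes both variables have >= 2 levels
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical data: two categorical columns plus a "sales" weight column.
season = ["summer", "summer", "winter", "winter"]
weather = ["sun", "rain", "sun", "rain"]
sales = [30, 10, 10, 30]

v = weighted_cramers_v(season, weather, sales)
print(v)
```

For these made-up numbers V comes out at 0.5. Because the weights are sales rather than independent observations, the value is best read as a descriptive association measure, not as the basis for a significance test.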

Two parameter non-linear function for modeling a 3-D surface

curve-fitting pandas python statistics statsmodels

I’m interested in modeling this surface with a simple equation that takes in two parameters (x,y) values and produces a z value. Ideally an equation that has a simple form. I have tried Monkey Saddle, polynomial regression (3rd and 4th order) and also multi-linear and log-linear OLS with some success (R^2 0.99), but none that are perfect especially for the
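For reference, the polynomial regression mentioned above can be sketched as an ordinary least-squares fit over all terms x**i * y**j with i + j <= degree. The data here is synthetic and the helper names are hypothetical; this only illustrates the technique and does not reproduce the asker's surface.

```python
import numpy as np

def fit_poly_surface(x, y, z, degree=3):
    """Least-squares fit of z = sum of c_ij * x**i * y**j over i + j <= degree."""
    powers = [(i, j) for i in range(degree + 1) for j in range(degree + 1 - i)]
    design = np.column_stack([x**i * y**j for i, j in powers])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return powers, coeffs

def eval_poly_surface(powers, coeffs, x, y):
    """Evaluate the fitted polynomial at the given (x, y) points."""
    return sum(c * x**i * y**j for (i, j), c in zip(powers, coeffs))

# Synthetic surface (an assumption for the demo): an exact cubic in x and y.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = rng.uniform(-1.0, 1.0, 200)
z = 1.0 + 2.0 * x - 3.0 * x * y + 0.5 * y**3

powers, coeffs = fit_poly_surface(x, y, z, degree=3)
z_hat = eval_poly_surface(powers, coeffs, x, y)

# Coefficient of determination of the fit.
r2 = 1.0 - np.sum((z - z_hat) ** 2) / np.sum((z - z.mean()) ** 2)
print(f"R^2 = {r2:.6f}")
```

A high R² on the fitted points does not guarantee a good fit near the edges of the surface, so inspecting the residuals z - z_hat by region may be more informative than the single score.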

How to compare two columns and get the mean value of the 3rd column for all matching items in the two in a python pandas dataframe?

dataframe pandas python python-3.x statistics

I have the following table named Rides:

start_id  end_id  eta
A         B       5
B         C       4
A         C       6
A         B       5
B         A       3
C         A       3
B         C       6
C         A       5
A         B       8

From the Rides table, I want to create a new table which should look something like the one below: start_id end_id
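The desired output is cut off in the excerpt, so the following is a sketch under the assumption (taken from the question title) that the goal is the mean eta for every distinct (start_id, end_id) pair. It uses the Rides values listed in the question.

```python
import pandas as pd

# The Rides table from the question.
rides = pd.DataFrame({
    "start_id": list("ABAABCBCA"),
    "end_id":   list("BCCBAACAB"),
    "eta":      [5, 4, 6, 5, 3, 3, 6, 5, 8],
})

# Group rows sharing the same start and end, then average their eta.
mean_eta = (
    rides.groupby(["start_id", "end_id"], as_index=False)["eta"]
         .mean()
)
print(mean_eta)
```

Note that this treats A to B and B to A as different routes. If direction should be ignored, the pair would need to be normalized (for example, sorted) before grouping.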

Is it necessary to discard outliers before applying LSTM on time series

jupyter-notebook outliers pandas python statistics

I am trying to detect anomalies on a time series that controls battery voltage output. I find that my original dataset has some outliers. In this case, do I need to remove those points using the interquartile range (IQR) or z-score before using the Keras LSTM model? Answer: Removing or not removing outliers all depends on what you are
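For context, the IQR rule mentioned in the question can be sketched as below. The voltage readings and the function name are hypothetical. The sketch only flags points and deliberately does not drop them, since for anomaly detection the flagged points may be exactly what the model is supposed to find.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical battery voltage readings with one obvious spike.
voltage = np.array([12.1, 12.0, 12.2, 11.9, 12.1, 15.0, 12.0, 12.1])

mask = iqr_outlier_mask(voltage)
print(voltage[mask])  # the readings flagged by the IQR rule
```

The multiplier k=1.5 is the conventional default; a larger value flags fewer points.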

Why doesn’t Johnson-SU distribution give positive skewness in scipy.stats?

probability-distribution python scipy.stats skew statistics

The code below maps the statistical moments (mean, variance, skewness, excess kurtosis) generated by corresponding parameters (a, b, loc, scale) of the Johnson-SU distribution (johnsonsu). For the range of loop values specified in my code below, no parameter configuration results in positive skewness, only negative skewness, even though it should be possible to parameterize the Johnson-SU distribution to be positively-skewed.
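The asker's loop ranges are not shown in this excerpt, so the following is only one possible explanation. Assuming SciPy's documented density for johnsonsu, f(x; a, b) = b / sqrt(x^2 + 1) * phi(a + b * asinh(x)), a variate can be written as x = sinh((z - a) / b) with z standard normal. Under that parameterization the sign of the skewness is opposite to the sign of a, so a loop that only visits a > 0 would produce only negative skewness. The sketch checks this by simulation with NumPy alone; the function name is hypothetical.

```python
import numpy as np

def johnsonsu_sample_skewness(a, b, size=200_000, seed=0):
    """Sample skewness of x = sinh((z - a) / b) with z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(size)
    x = np.sinh((z - a) / b)  # inverts z = a + b * arcsinh(x)
    centered = x - x.mean()
    return float((centered**3).mean() / (centered**2).mean() ** 1.5)

# Negative a should give positive skewness, positive a negative skewness.
for a in (-1.0, 0.0, 1.0):
    print(f"a = {a:+.1f}  skewness = {johnsonsu_sample_skewness(a, b=2.0):+.3f}")
```

The loc and scale parameters shift and stretch the distribution and do not change the sign of the skewness as long as scale is positive.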

Python/Pandas time series correlation on values vs differences

correlation numpy pandas python statistics

I am familiar with the Pandas Series corr function to compute the correlation between two Series, for example: This will compute the correlation in the VALUES of the two series, but if I’m working with a time series, I might want to compute the correlation on changes (absolute changes or percentage changes, and over 1d, 1w, 1m, etc.). Some of the
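A minimal sketch of correlating changes instead of levels, using Series.diff for absolute changes and Series.pct_change for percentage changes. The helper name change_corr and the sample values are hypothetical. The periods argument counts rows, so it matches "1d, 1w, 1m" only if the index is regularly spaced at the corresponding frequency.

```python
import pandas as pd

def change_corr(a, b, periods=1, pct=False):
    """Correlation between the changes of two Series over `periods` rows."""
    if pct:
        da, db = a.pct_change(periods=periods), b.pct_change(periods=periods)
    else:
        da, db = a.diff(periods=periods), b.diff(periods=periods)
    # Series.corr ignores the leading NaN values produced by diff/pct_change.
    return da.corr(db)

# Hypothetical daily values for two series.
s1 = pd.Series([100.0, 101.0, 103.0, 102.0, 106.0, 110.0])
s2 = pd.Series([50.0, 52.0, 51.0, 55.0, 54.0, 58.0])

print("levels:        ", s1.corr(s2))
print("1-step changes:", change_corr(s1, s2))
print("1-step returns:", change_corr(s1, s2, pct=True))
print("2-step changes:", change_corr(s1, s2, periods=2))
```

For irregularly spaced timestamps, resampling to a fixed frequency before calling diff or pct_change is one way to make the horizons comparable.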

Creating vector with intervals drawn from Poisson process

poisson python random statistics

I’m looking for some advice on how to implement some statistical models in Python. I’m interested in constructing a sequence of z values (z_1,z_2,z_3,…,z_n) where the number of jumps in an interval (z_1,z_2] is distributed according to the Poisson distribution with parameter lambda(z_2-z_1) and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise
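One standard way to simulate such jump locations, sketched here for a homogeneous process with constant rate lambda: draw the total number of jumps in the interval from a Poisson distribution with mean lambda * (t_end - t_start), then place that many points uniformly in the interval and sort them. Counts over disjoint sub-intervals are then independent Poisson variables. The function name and parameter values are hypothetical.

```python
import numpy as np

def poisson_jump_times(rate, t_start, t_end, rng):
    """Sorted jump locations of a homogeneous Poisson process on [t_start, t_end)."""
    n_jumps = rng.poisson(rate * (t_end - t_start))
    return np.sort(rng.uniform(t_start, t_end, size=n_jumps))

rng = np.random.default_rng(42)
jumps = poisson_jump_times(rate=2.0, t_start=0.0, t_end=10.0, rng=rng)

# Number of jumps in each unit-length interval; each count has mean `rate`.
counts_per_unit, _ = np.histogram(jumps, bins=np.arange(0, 11))
print(jumps)
print(counts_per_unit)
```

An equivalent construction accumulates exponential gaps with mean 1 / rate until the end of the interval is passed.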

Why do coefficient of determination, R², implementations produce different results?

coefficient-of-determination numpy python statistics

When attempting to implement a Python function for calculating the coefficient of determination, R², I noticed I got wildly different results depending on whose calculation sequence I used. The Wikipedia page on R² gives a seemingly very clear explanation as to how R² should be calculated. My numpy interpretation of what is being said on the wiki page is the
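The asker's code is not shown in this excerpt, so the following illustrates only one common source of such discrepancies: R² defined as 1 - SS_res / SS_tot and R² defined as the squared Pearson correlation agree for a least-squares fit with an intercept, but can differ sharply for arbitrary predictions. The function names and numbers are hypothetical.

```python
import numpy as np

def r2_residual(y, f):
    """R^2 = 1 - SS_res / SS_tot; can be negative for poor predictions."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def r2_squared_corr(y, f):
    """Squared Pearson correlation between observations and predictions."""
    return float(np.corrcoef(y, f)[0, 1] ** 2)

# Predictions that are perfectly correlated with y but badly scaled.
y = [1.0, 2.0, 3.0, 4.0]
f = [2.0, 4.0, 6.0, 8.0]

print(r2_residual(y, f))      # penalizes the scale error
print(r2_squared_corr(y, f))  # ignores scale and offset
```

For this example the residual-based definition gives -5 while the squared correlation gives 1, which shows how far apart the two conventions can be.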

How to Generate a dataset based on mean, median, 1st & 9th decile values?

data-science numpy pandas python statistics

I have the following values that describe a dataset: I need to generate any datasets that will fit these values. All the examples I found require you to have the standard deviation, which I don’t. How can this be done? Thanks! Answer Interesting question! Based on Scott’s suggestions I gave it a quick try. Inputs: The Function: Comparison: Output: Getting
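The answer's code is missing from this excerpt, so what follows is not the answerer's method. It is a separate sketch of one way to do it: use a piecewise-linear quantile function through (0, lo), (0.1, d1), (0.5, median), (0.9, d9), (1, hi), pick the minimum lo freely, and solve for the maximum hi so that the area under the quantile function equals the target mean. All names and input values are hypothetical.

```python
import numpy as np

def dataset_from_summary(mean, median, d1, d9, lo, n=1000):
    """Generate n values whose mean, median, 1st and 9th deciles match the targets.

    The mean of a piecewise-linear quantile function is
    0.05*(lo + d1) + 0.2*(d1 + median) + 0.2*(median + d9) + 0.05*(d9 + hi),
    which is solved here for the unknown maximum `hi`.
    """
    body = 0.05 * (lo + d1) + 0.2 * (d1 + median) + 0.2 * (median + d9)
    hi = (mean - body) / 0.05 - d9
    if lo > d1 or hi < d9:
        raise ValueError("targets are not reachable with this choice of lo")
    # Evaluate the quantile function at evenly spaced mid-point probabilities.
    p = (np.arange(n) + 0.5) / n
    return np.interp(p, [0.0, 0.1, 0.5, 0.9, 1.0], [lo, d1, median, d9, hi])

# Hypothetical targets; the asker's actual values are not shown.
data = dataset_from_summary(mean=52.0, median=50.0, d1=10.0, d9=90.0, lo=0.0)

print("mean  ", data.mean())
print("median", np.median(data))
print("d1, d9", np.percentile(data, [10, 90]))
```

The deciles of the generated sample match the targets only approximately because of how percentiles are interpolated on a finite sample, and the result is just one of infinitely many datasets consistent with the four summary values.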