Tag: statistics

Pandas sum of count per percentile of rows

Here is a link to a working example on Google Colaboratory. I have a dataset that represents the reviews (between 0.0 and 10.0) that users have left on various books. It looks like this: The first rows have 1 review while the last ones have thousands. I want to see the distribution of the reviews across the user population. I
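One possible reading of the question is "how many reviews does each percentile band of users contribute?". The sketch below follows that reading only. The column names `user_id` and `review_count` and the generated data are hypothetical stand-ins, since the original table is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per user with the number of reviews they left.
rng = np.random.default_rng(0)
users = pd.DataFrame({
    "user_id": range(1000),
    "review_count": rng.zipf(2.0, size=1000),  # heavy-tailed, like the description
})

# Rank users by review count; method="first" breaks ties so bin edges stay unique.
ranks = users["review_count"].rank(method="first")

# Split users into 10 equal-sized bands (0 = least active 10% of users).
users["percentile_bin"] = pd.qcut(ranks, q=10, labels=False)

# Total reviews contributed by each band, and each band's share of all reviews.
reviews_per_bin = users.groupby("percentile_bin")["review_count"].sum()
share_per_bin = reviews_per_bin / reviews_per_bin.sum()
```

Changing `q=10` to `q=100` would give one band per percentile instead of per decile.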

How to sample data points for two variables that have the highest (close to +1) or lowest (close to zero) correlation coefficient?

matlab python r random statistics

Let’s assume that we have N data points (N=212 in this case) for both variables A and B. I have to sample n data points (n=50 in this case) from A and B such that, within that sample, A and B have either the highest possible positive correlation coefficient or the lowest one (close to zero).
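Checking every 50-point subset of 212 points is not feasible, so a heuristic is one way to approach this. Below is a minimal sketch of greedy backward elimination for the "highest correlation" case. The function name and the generated data are assumptions for illustration, and the result is not guaranteed to be the optimal subset.

```python
import numpy as np

def greedy_high_corr_subset(a, b, n):
    """Greedy heuristic: repeatedly drop the point whose removal leaves the
    remaining points with the highest Pearson correlation."""
    keep = list(range(len(a)))
    while len(keep) > n:
        best_r, best_pos = -np.inf, None
        for pos in range(len(keep)):
            trial = keep[:pos] + keep[pos + 1:]
            r = np.corrcoef(a[trial], b[trial])[0, 1]
            if r > best_r:
                best_r, best_pos = r, pos
        keep.pop(best_pos)
    return np.array(keep)

# Hypothetical data with a moderate positive correlation.
rng = np.random.default_rng(42)
A = rng.normal(size=212)
B = 0.5 * A + rng.normal(size=212)

idx = greedy_high_corr_subset(A, B, 50)
r_full = np.corrcoef(A, B)[0, 1]
r_sub = np.corrcoef(A[idx], B[idx])[0, 1]
```

For the "close to zero" case, the same loop could instead keep the removal that minimizes `abs(r)`.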

Show Cancer Specific Survival at exact time (Kaplan Meier in Lifelines)

lifelines python statistics survival survival-analysis

shows me Cancer Specific Survival (CSS) of my cohort at different times (0, 4, 6…128 months). How can CSS be shown at exactly 120 months? Answer The survival_function_at_times() method will get you that value. Here is an example with a sample dataset:
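To illustrate what "survival at exactly 120 months" means, here is a small NumPy-only sketch of the Kaplan-Meier estimate evaluated at one time point. It is not the lifelines implementation and does not call `survival_function_at_times()`. The helper name and the follow-up data are hypothetical.

```python
import numpy as np

def km_survival_at(durations, events, t):
    """Kaplan-Meier estimate of S(t): the product of (1 - deaths / at_risk)
    over all observed event times that are <= t."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    for ti in np.unique(durations[events]):  # sorted distinct event times
        if ti > t:
            break
        at_risk = np.sum(durations >= ti)
        deaths = np.sum((durations == ti) & events)
        surv *= 1.0 - deaths / at_risk
    return surv

# Hypothetical follow-up in months; 1 = cancer-specific death, 0 = censored.
months = [4, 6, 6, 30, 55, 90, 118, 124, 128, 128]
died = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

css_120 = km_survival_at(months, died, 120)
```

Because the estimate is a step function, the value at 120 months equals the value at the last event time before it (90 months in this made-up data).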

How can I find a date with incorrect syntax and fix it?

pandas python statistics

I am new to Python. I have a dataset that I converted to a DataFrame. All my dates are objects now. I need to convert them into dates in order to find the ages of the patients. The DataFrame's dimensions are 3400×14. Some of the date values have incorrect syntax, and I cannot find them. Is there a way to find them?
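One common way to locate malformed dates is to parse with `errors="coerce"` and then look for rows that became `NaT` although the original cell was not empty. The sketch below assumes a hypothetical column `birth_date` and an ISO-style `YYYY-MM-DD` format; both would need adapting to the real data.

```python
import pandas as pd

# Hypothetical column of dates stored as strings (dtype object).
df = pd.DataFrame({
    "birth_date": ["2020-01-15", "2020-13-45", "not a date", "2019-07-04"],
})

# Values that cannot be parsed with the given format become NaT instead of raising.
parsed = pd.to_datetime(df["birth_date"], format="%Y-%m-%d", errors="coerce")

# Rows where parsing failed even though the original value was present.
bad_mask = parsed.isna() & df["birth_date"].notna()
bad_rows = df[bad_mask]
bad_index = bad_rows.index.tolist()
```

Once the rows in `bad_rows` have been inspected and corrected, the parsed column can replace the original one.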

How can I find the mode (a number) of a kde histogram in python

kernel-density matplotlib python seaborn statistics

I want to determine the X value that has the highest peak in the histogram. The code to print the histogram: Histogram and value wanted (in fact, I would like all 4): Answer You will need to retrieve the underlying x and y data for your lines using matplotlib methods. If you are using displot, as in your excerpt, then
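As an alternative to reading the curve back from the plot, the peak can be estimated directly from the data. The sketch below uses `scipy.stats.gaussian_kde`, which is a swapped-in technique rather than the answer's matplotlib approach, and the sample data is hypothetical. The estimated peak may differ slightly from the one seaborn draws if the bandwidth settings differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample standing in for one of the plotted series.
rng = np.random.default_rng(1)
values = rng.normal(loc=5.0, scale=1.0, size=2000)

# Fit a Gaussian KDE and evaluate it on a dense grid over the data range.
kde = gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 1000)
density = kde(grid)

# The mode is the grid point where the estimated density is highest.
mode_x = grid[np.argmax(density)]
```

The same `argmax` step applies to the x and y arrays returned by a matplotlib line's `get_xdata()` and `get_ydata()`, repeated once per curve.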

Simulating expectation of continuous random variable

distribution numpy python simulation statistics

I want to generate some samples to estimate the expectation and variance of a random variable. Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise} I already found that E(X) = 2/3, Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable But here is what I have when simulating using Python: What am I doing
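One standard way to sample from this density is inverse transform sampling: the CDF is F(x) = x² on [0, 1], so X = √U with U uniform on [0, 1]. The sketch below is not the asker's code, which is not shown; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw U ~ Uniform(0, 1) and apply the inverse CDF: F(x) = x**2, so x = sqrt(u).
u = rng.uniform(0.0, 1.0, size=200_000)
x = np.sqrt(u)

# Monte Carlo estimates to compare against E(X) = 2/3 and Var(X) = 1/18.
mean_est = x.mean()
var_est = x.var()
```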

Processing multiple modes in pandas

dataframe pandas python statistics

I’m obviously dealing with slightly more complex and realistic data, but to showcase my trouble, let’s assume we have these data: I want to find modal values of purchases by date: agg_mode will show that for user_id 100 we have two modal values: [cookies, jam]. This is totally fine with me, when it comes to real data we’ve come up
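For reference, here is a minimal sketch of how a group can end up with several modal values and two ways to handle them. The column names `user_id` and `item` and the data are hypothetical, and the lambda only stands in for the question's `agg_mode`, whose code is not shown.

```python
import pandas as pd

# Hypothetical purchases: user 100 has a tie between "cookies" and "jam".
purchases = pd.DataFrame({
    "user_id": [100, 100, 100, 100, 200, 200, 200],
    "item": ["cookies", "jam", "cookies", "jam", "tea", "tea", "milk"],
})

# Series.mode() returns every tied value, so collect them in a list per group.
modes = purchases.groupby("user_id")["item"].agg(lambda s: s.mode().tolist())

# Option 1: one row per modal value.
one_per_row = modes.explode()

# Option 2: keep only the first modal value (modes are returned sorted).
first_mode = modes.apply(lambda m: m[0])
```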

How to generate random values for a predefined function?

matplotlib numpy python scipy statistics

I have a predefined function, for example this: How can I generate random values against it so I can plot the results of the function using matplotlib? Answer If you want to plot, don’t use random x values but rather a range. Also you should use numpy.exp that can take a vector as input and your y in the lambda
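The advice about using a range and `numpy.exp` can be sketched as follows. The function `f` here is a hypothetical stand-in, since the original function is not shown in the excerpt.

```python
import numpy as np

# Hypothetical stand-in for the predefined function; np.exp accepts whole arrays.
def f(x):
    return np.exp(-x ** 2 / 2.0)

# Evenly spaced x values instead of random ones, so the curve plots in order.
x = np.linspace(-3.0, 3.0, 200)
y = f(x)
```

The arrays `x` and `y` can then be passed to matplotlib, for example with `plt.plot(x, y)`.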

Create a for loop of Wilcoxon rank sum tests in Python to generate a list of p-values?

for-loop python statistics

I have a dataframe that follows this format: It is much larger (it has about 1000 genes, i.e., columns). Each number corresponds to an mRNA abundance value. I need to compare AC and SCC subtypes for each gene using the Wilcoxon rank sum test. I need to do this for every gene in my dataset, so I essentially need to
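A per-gene loop of this kind could look like the sketch below, which uses `scipy.stats.ranksums`. The DataFrame layout (a `subtype` column plus one column per gene), the gene names, and the generated values are all assumptions, since the original table is not shown.

```python
import numpy as np
import pandas as pd
from scipy.stats import ranksums

# Hypothetical data: one row per sample, one column per gene, plus the subtype.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "subtype": ["AC"] * 30 + ["SCC"] * 30,
    "GENE_A": np.concatenate([rng.normal(5, 1, 30), rng.normal(8, 1, 30)]),
    "GENE_B": rng.normal(5, 1, 60),
})

# Split the samples once, then test every gene column.
ac = df[df["subtype"] == "AC"]
scc = df[df["subtype"] == "SCC"]
gene_cols = [c for c in df.columns if c != "subtype"]

p_values = {gene: ranksums(ac[gene], scc[gene]).pvalue for gene in gene_cols}

# Sorted Series for easier inspection of the smallest p-values.
p_series = pd.Series(p_values, name="p_value").sort_values()
```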