I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:
dataset = np.genfromtxt("data.csv", delimiter=",", dtype=('|S1', float, float, float, float, float, float, float, int))
Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn’t there be an easy way to do this?
I tried this:
from scipy import stats
stats.describe(dataset)
but this returns an error: TypeError: cannot perform reduce with flexible type
How can I get descriptive statistics of the created NumPy array?
Answer
This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D array of tuples (actually np.void records), which stats cannot describe because it mixes several different types, including strings.
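To see why the reduction fails, it can help to inspect what genfromtxt returns when given several dtypes. The snippet below is a minimal sketch, assuming a small made-up three-column sample in place of data.csv (the sample values and the variable name col are illustrative only). It shows that the result has a single dimension with one record per row, and that an individual numeric field can still be reduced once it is selected by its auto-generated name.

```python
import io
import numpy as np

# Hypothetical three-column sample standing in for data.csv
sample = io.StringIO("a,0.35,1\nb,0.71,2\nc,0.55,3\n")

dataset = np.genfromtxt(sample, delimiter=",", dtype=("|S1", float, int))

print(dataset.shape)        # a single dimension: one record per row
print(dataset.dtype.names)  # field names generated by genfromtxt

# A single numeric field is an ordinary float array and can be reduced
col = dataset["f1"]
print(col.min(), col.max(), col.mean())
```

Selecting fields one at a time works, but it gets tedious with many columns, which is why reading the numeric columns separately is usually more convenient.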
This could be resolved either by reading the file in two rounds, or by using pandas with read_csv.
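For the pandas route, a rough sketch might look like the following. It assumes the file has no header row (hence header=None) and uses a small made-up sample instead of the real file. By default, describe() summarises only the numeric columns, and its 50% row is the median.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data.csv (no header row assumed)
sample = io.StringIO("a,0.35,1\nb,0.71,2\nc,0.55,3\n")

# pandas infers a dtype per column, so mixed types are not a problem
df = pd.read_csv(sample, header=None)

# Summary of the numeric columns: count, mean, std, min, quartiles, max
summary = df.describe()
print(summary)
```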
If you decide to stick to numpy:
import numpy as np
from scipy import stats

a = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=range(1, 9))
s = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=0, dtype='|S1')

for arr in a:  # the loop is not strictly needed at this point, but it looks prettier
    print(stats.describe(arr))

# Output per print:
# DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)
Note that in this example the final array has dtype float, not int, but it can easily be converted to int (if necessary) using arr.astype(int).
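Since stats.describe does not report the median or the standard deviation directly, those can be computed with plain NumPy along the same axis. This is only a sketch under stated assumptions: the sample data is made up, it has three numeric columns rather than eight, and the names medians, stdevs and last are illustrative.

```python
import io
import numpy as np

# Hypothetical sample: one label column followed by three numeric columns
sample = io.StringIO("a,0.35,2.0,1\nb,0.71,4.0,2\nc,0.55,9.0,3\n")

# unpack=True transposes the result, so each row of `a` is one file column
a = np.genfromtxt(sample, delimiter=",", unpack=True, usecols=range(1, 4))

medians = np.median(a, axis=1)      # one median per file column
stdevs = np.std(a, axis=1, ddof=1)  # sample standard deviation per column
print(medians)
print(stdevs)

# The last column was read as float; convert it back to int if needed
last = a[-1].astype(int)
print(last.dtype, last)
```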