I have a dataframe with a single ID column; all other columns hold numerical values for which I want to compute z-scores. Here’s a subset of it:

ID    Age  BMI   Risk Factor
PT 6  48   19.3  4
PT 8  43   20.9  NaN
PT 2  39   18.1  3
PT 9  41   19.5  NaN

Some of my columns contain NaN values, which I do not want to include in the z-score calculations, so I intend to use the solution offered in this question: how to zscore normalize pandas column with nans?
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
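This formula copes with missing values because pandas' `mean()` and `std()` skip NaN by default (`skipna=True`). A minimal sketch of that behaviour, using a made-up column `a` that is not part of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical column 'a' with one missing value (illustration only).
df = pd.DataFrame({"a": [4.0, np.nan, 3.0]})

# mean() and std() skip NaN by default, so both statistics
# are computed from 4.0 and 3.0 only.
mean = df["a"].mean()
std = df["a"].std(ddof=0)  # ddof=0 gives the population standard deviation

# Rows that were NaN stay NaN; they do not affect the other z-scores.
df["zscore"] = (df["a"] - mean) / std
print(df)
```

The NaN row is left as NaN in the result rather than being dropped, so the output keeps the same number of rows as the input.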
I’m interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using
df2.to_excel("Z-Scores.xlsx")
So basically: how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?
SIDENOTE: there is a concept in pandas called “indexing” which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.
Answer
Build a list of the columns and remove the column you don’t want to calculate the z-score for:
In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]

Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN

In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df

Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315

   Factor_zscore
0              1
1            NaN
2             -1
3            NaN
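The loop above adds the z-score columns to the original dataframe. If a separate dataframe is wanted, as the question asks, the same formula can be applied to all numeric columns at once. The following is only a sketch: it assumes the question's layout, where the IDs are `PT 6`, `PT 8` and so on and the last column is named `Risk Factor`, and the names `values`, `zscores` and `df2` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data, assuming the layout shown in the question.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Every column except ID.
values = df.drop(columns="ID")

# DataFrame.mean() and DataFrame.std() work column by column and skip NaN,
# so this applies the per-column formula to all columns in one step.
zscores = (values - values.mean()) / values.std(ddof=0)

# New dataframe: the ID column followed by the renamed z-score columns.
df2 = pd.concat([df[["ID"]], zscores.add_suffix("_zscore")], axis=1)
print(df2)

# Saving needs an Excel engine such as openpyxl to be installed.
# df2.to_excel("Z-Scores.xlsx", index=False)
```

On indexing: the numbers 0 to 3 printed on the left of a dataframe are its index, which is just a set of row labels. Nothing here depends on it beyond `pd.concat` lining the rows up by those labels, and `index=False` keeps the labels out of the Excel file.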