I need to determine the percentage of missing and not available values.
I have approx. 30 columns and data for missing are NA, ‘Information not found’ (string), and ‘Data not available’ (string). For determining the pct of missing (NA) values, I am using the following:
percent_missing = df.isnull().sum() * 100 / len(df) missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing}) missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
How can I include also the other two cases (‘Information not found’ and ‘Data not available’)?
For example:
A B C D NA ex1 Data not available ex1 ex2 Information not found ex2 ex2
ex1 and ex2 are just dummies.
Expected output:
NA ...% Information not found ...% Data not available ....%
for each column:
NA Information not found Data not available A B C
Advertisement
Answer
You can use value_counts
:
filtered = df.apply(pd.Series.value_counts).fillna(0) #transpose to match your required format and keep only the columns you need filtered = filtered.T[["Data not available", "Information not found"]] filtered["NaN"] = df.isnull().sum() #change to percentages filtered = filtered.mul(100).divide(df.shape[0]) >>> filtered Data not available Information not found NaN A 0.0 0.0 50.0 B 0.0 50.0 0.0 C 50.0 0.0 0.0 D 0.0 0.0 0.0