I need to determine the percentage of missing and not available values.
I have approximately 30 columns, and missing data appears in three forms: NA, ‘Information not found’ (string), and ‘Data not available’ (string). To determine the percentage of missing (NA) values, I am using the following:
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_missing': percent_missing})
    missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
How can I also include the other two cases (‘Information not found’ and ‘Data not available’)?
For example:
    A    B                      C                   D
    NA   ex1                    Data not available  ex1
    ex2  Information not found  ex2                 ex2
ex1 and ex2 are just placeholder values.
Expected output:
    NA %
    Information not found %
    Data not available %
for each column:
       NA   Information not found   Data not available
    A
    B
    C
Answer
You can use value_counts:
    filtered = df.apply(pd.Series.value_counts).fillna(0)

    # transpose to match your required format and keep only the columns you need
    filtered = filtered.T[["Data not available", "Information not found"]]
    filtered["NaN"] = df.isnull().sum()

    # change to percentages
    filtered = filtered.mul(100).divide(df.shape[0])

    >>> filtered
       Data not available  Information not found   NaN
    A                 0.0                    0.0  50.0
    B                 0.0                   50.0   0.0
    C                50.0                    0.0   0.0
    D                 0.0                    0.0   0.0
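One caveat with the approach above: selecting the two marker columns by name raises a KeyError if one of the strings never occurs anywhere in the data. As a minimal sketch of an alternative, assuming the markers are exact string matches, you could compare each cell against every marker directly. The sample DataFrame and the helper name missing_breakdown are hypothetical and only mirror the example in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the example in the question
df = pd.DataFrame({
    "A": [np.nan, "ex2"],
    "B": ["ex1", "Information not found"],
    "C": ["Data not available", "ex2"],
    "D": ["ex1", "ex2"],
})

# Marker strings that should be treated as "missing" (exact matches assumed)
markers = ["Information not found", "Data not available"]


def missing_breakdown(frame, markers):
    """Return, per column, the percentage of NaN cells and of each marker string."""
    # isna().mean() is the fraction of NaN cells in each column
    result = pd.DataFrame({"NA": frame.isna().mean() * 100})
    for marker in markers:
        # eq() compares element-wise; NaN never equals a string, so it counts as False
        result[marker] = frame.eq(marker).mean() * 100
    return result


breakdown = missing_breakdown(df, markers)
print(breakdown)
```

Because every marker gets its own column whether or not it occurs, this version should not fail on data where a marker is absent. If you also want one combined figure per column, breakdown.sum(axis=1) adds the three percentages together.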