Skip to content
Advertisement

Including specific strings in missing values count

I need to determine the percentage of missing and not available values.

I have approx. 30 columns and data for missing are NA, ‘Information not found’ (string), and ‘Data not available’ (string). For determining the pct of missing (NA) values, I am using the following:

percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)

How can I include also the other two cases (‘Information not found’ and ‘Data not available’)?

For example:

A     B              C           D
NA    ex1 Data not available   ex1
ex2   Information not found ex2 ex2

ex1 and ex2 are just dummies.

Expected output:

NA    ...%
Information not found ...%
Data not available  ....%

for each column:

         NA        Information not found     Data not available
A  
B
C

Advertisement

Answer

You can use value_counts:

filtered = df.apply(pd.Series.value_counts).fillna(0)

#transpose to match your required format and keep only the columns you need
filtered = filtered.T[["Data not available", "Information not found"]]
filtered["NaN"] = df.isnull().sum()

#change to percentages
filtered = filtered.mul(100).divide(df.shape[0])

>>> filtered
   Data not available  Information not found   NaN
A                 0.0                    0.0  50.0
B                 0.0                   50.0   0.0
C                50.0                    0.0   0.0
D                 0.0                    0.0   0.0
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement