I am dealing with a dataset that uses “..” as a placeholder for null values. These placeholders appear in all of my columns. My dataset looks as follows:
Country Code | Year | GDP growth (%) | GDP (constant) |
---|---|---|---|
AFG | 2010 | 3.5 | .. |
AFG | 2011 | .. | 2345 |
AFG | 2012 | 1.4 | 3372 |
ALB | 2010 | .. | 4567 |
ALB | 2011 | .. | 5678 |
ALB | 2012 | 4.2 | .. |
DZA | 2010 | 2.0 | 4321 |
DZA | 2011 | .. | 5432 |
DZA | 2012 | 3.8 | 6543 |
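For reference, the frame can be rebuilt like this (values transcribed from the table above, with ".." kept as plain strings):

```python
import pandas as pd

df_GDP_1 = pd.DataFrame({
    "Country Code": ["AFG", "AFG", "AFG", "ALB", "ALB", "ALB", "DZA", "DZA", "DZA"],
    "Year": [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012],
    "GDP growth (%)": ["3.5", "..", "1.4", "..", "..", "4.2", "2.0", "..", "3.8"],
    "GDP (constant)": ["..", "2345", "3372", "4567", "5678", "..", "4321", "5432", "6543"],
})
```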
I want to remove the rows containing missing data from my DataFrame; however, my solutions are not very clean.
I have tried:
```python
df_GDP_1[df_GDP_1.str.contains("..") == False]
```
I had hoped this would deal with all columns at once; however, it returns an error.
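(As I understand it, the error comes from `.str` being a Series accessor that does not exist on a DataFrame; on top of that, `contains("..")` would be read as a regex in which `.` matches any character. A column-wise workaround sketch, reusing the frame built above:)

```python
# .str only exists on Series, so apply the check per column; regex=False
# makes ".." a literal match instead of two regex wildcards.
mask = df_GDP_1.apply(lambda col: col.astype(str).str.contains("..", regex=False))
df_clean = df_GDP_1[~mask.any(axis=1)]
```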
Alternatively, I have tried:
```python
df_GDP_1[(df_GDP_1.col1 != '..') & (df_GDP_1.col2 != '..')]
```
However, this solution requires me to rename the columns to remove spaces and then reverse the renaming afterwards, which seems unnecessarily long for the task at hand.
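A version of that attempt that at least runs without renaming, since bracket access tolerates the spaces (the parentheses are needed because `&` binds tighter than `!=`, and `&` rather than `|` is what I actually want so that *both* columns must be valid):

```python
# Bracket access copes with the spaces in the column names, so no renaming.
df_clean = df_GDP_1[
    (df_GDP_1["GDP growth (%)"] != "..") & (df_GDP_1["GDP (constant)"] != "..")
]
```

Spelling every column out by hand still doesn't scale, though.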
Any ideas which enable me to perform this in a cleaner manner would be greatly appreciated!
Answer
Use a combination of the pandas.DataFrame.eq and pandas.DataFrame.any functions: `eq("..")` builds a Boolean frame marking every placeholder cell, `.any(axis=1)` flags the rows with a match in any column, and the negation `~` omits the flagged records. (Recent pandas versions require the axis as a keyword argument, so prefer `.any(axis=1)` over the older `.any(1)`.)
```
In [269]: df[~df.eq("..").any(axis=1)]
Out[269]:
  Country Code  Year GDP growth (%) GDP (constant)
2          AFG  2012            1.4           3372
6          DZA  2010            2.0           4321
8          DZA  2012            3.8           6543
```
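For completeness, here is the same idea as a plain script, together with an equivalent route that first converts the placeholder into real missing values (a sketch assuming the `df_GDP_1` frame from the question; numpy is used only to supply NaN):

```python
import numpy as np

# Boolean frame: True wherever a cell equals ".."; any(axis=1) marks rows
# containing at least one placeholder; ~ keeps only fully populated rows.
clean = df_GDP_1[~df_GDP_1.eq("..").any(axis=1)]

# Equivalent alternative: turn ".." into NaN, then drop incomplete rows.
clean_alt = df_GDP_1.replace("..", np.nan).dropna()
```

The `replace`/`dropna` route also leaves you with genuine NaN values if you would rather keep the incomplete rows and let numeric operations skip the gaps.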