I generate a multi-index dataframe that has some NAN values using this:
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],["one", "two", "one", "two", "one", "two", "one", "two"],] tuples = list(zip(*arrays)) index = pd.MultiIndex.from_tuples(tuples) a = np.random.randn(3, 8) mask = np.random.choice([1, 0], a.shape, p=[.3, .7]).astype(bool) a[mask] = np.nan df = pd.DataFrame(a, columns=index) df
Which will create something like this:
I’d like to get rows of a specific subset of top-level columns (eg df[['baz','qux']]
) that have no nulls. For example in df[['baz','qux']]
I’d like to get rows 0 and 1 since they both have all nulls in 3.
Hoping things would just work like a normal df I tried:
cols = ['bar','baz'] df[cols].loc[df[cols].notnull()]
But I obviously am missing something:
ValueError: Cannot index with multidimensional key
The pandas documentation for multiindex/advanced indexing illustrates how to index and slic this sort of dataframe but doesn’t seem to have anything regarding .loc/lookups/filtering. So I assume I’m looking in the wrong place. But I am having trouble finding results or resources on this.
Advertisement
Answer
df[cols].notna()
is not a 1D boolean mask. You have to reduce the dimension using all
or any
on axis.
>>> df[df[cols].notna().all(1)] bar baz foo qux one two one two one two one two 0 1.799680 -0.901705 -1.575930 0.185863 -0.793007 1.485423 NaN NaN 2 1.379878 -0.748599 0.661697 -1.015311 -0.858144 NaN -1.623013 0.340043