PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance. How to get rid of it?

I have the following line of code

 end_df['Soma Internet'] = end_df.iloc[:,end_df.columns.get_level_values(1) == 'Internet'].drop('site',axis=1).sum(axis=1)

It basically filters my MultiIndex DataFrame by a specific level-1 column label, drops a few unwanted columns, and sums all the remaining ones.

I glanced at the documentation and at a few related questions, but I didn't quite understand what causes the warning. I would also love to rewrite this code so that I get rid of it.

Answer

Let’s try with an example (without data for simplicity):

import pandas as pd

# Column MultiIndex.
idx = pd.MultiIndex(levels=[['Col1', 'Col2', 'Col3'], ['subcol1', 'subcol2']], 
                    codes=[[2, 1, 0], [0, 1, 1]])

df = pd.DataFrame(columns=range(len(idx)))
df.columns = idx
print(df)
     Col3    Col2    Col1
  subcol1 subcol2 subcol2

Clearly, the column MultiIndex is not sorted. We can check it with:

print(df.columns.is_monotonic_increasing)
False

This matters because pandas performs index lookups and other operations much faster on a sorted index, since it can use algorithms that rely on the sorted order. Indeed, if we try to drop a column:

df.drop('Col1', axis=1)
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  df.drop('Col1', axis=1)

Instead, if we sort the index before dropping, the warning disappears:

print(df.sort_index(axis=1))

# Index is now sorted in lexicographical order.
     Col1    Col2    Col3
  subcol2 subcol2 subcol1
# No warning here.
df.sort_index(axis=1).drop('Col1', axis=1)
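Note that sort_index returns a new DataFrame rather than sorting in place, so the result has to be assigned back if later operations should also benefit from the sorted columns. A minimal sketch, rebuilding the same empty frame as above and recording warnings to check that none is a PerformanceWarning:

```python
import warnings

import pandas as pd

# Same unsorted column MultiIndex as in the example above.
idx = pd.MultiIndex(levels=[['Col1', 'Col2', 'Col3'], ['subcol1', 'subcol2']],
                    codes=[[2, 1, 0], [0, 1, 1]])
df = pd.DataFrame(columns=idx)

# Keep the sorted frame instead of sorting only for a single call.
df = df.sort_index(axis=1)
is_sorted = df.columns.is_monotonic_increasing

# Record any warnings raised while dropping without a level.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = df.drop('Col1', axis=1)

perf_warnings = [w for w in caught
                 if issubclass(w.category, pd.errors.PerformanceWarning)]
print(is_sorted, len(perf_warnings))
```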

EDIT (see comments): As the warning suggests, this happens when we do not specify the level from which we want to drop the column. Without a level, pandas has to traverse the whole index to find the column to drop. If we specify the level, no such traversal is needed:

# Also no warning.
df.drop('Col1', axis=1, level=0)
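Applied to the line from the question, this could look roughly like the sketch below. The data is made up for illustration: it assumes that 'site' is a level-0 label and that 'Internet' is a level-1 label, which is what the original line suggests but is not confirmed by the question.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-in for end_df: level 0 = item, level 1 = category.
columns = pd.MultiIndex.from_tuples(
    [("site", "Internet"), ("b", "Internet"), ("a", "Internet"), ("a", "TV")]
)
end_df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

# Keep only the columns whose level-1 label is 'Internet'.
internet = end_df.loc[:, end_df.columns.get_level_values(1) == "Internet"]

# Drop 'site' with an explicit level, then sum the remaining columns per row.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soma = internet.drop("site", axis=1, level=0).sum(axis=1)

perf_warnings = [w for w in caught
                 if issubclass(w.category, pd.errors.PerformanceWarning)]
print(soma.tolist(), len(perf_warnings))
```

The resulting Series can then be assigned to a new column as in the question. Sorting the columns once with sort_index(axis=1) would be the alternative if many such drops are needed.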

In general, however, this problem is more relevant to row indices, as column multi-indices are usually much smaller. It is still worth keeping in mind for larger indices and DataFrames. It is particularly relevant for slicing by index and for lookups: in those cases, you want your index to be sorted for better performance.
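As a small illustration for the row-index case, the sketch below sorts a row MultiIndex once and then slices by label. The Series and its values are invented for the example. On the unsorted version, label slicing like this can fail with an UnsortedIndexError rather than just being slow, which is one more reason to sort first.

```python
import pandas as pd

# Hypothetical Series with an unsorted two-level row index.
idx = pd.MultiIndex.from_tuples([("b", 2), ("a", 1), ("b", 1), ("a", 2)])
s = pd.Series([10, 20, 30, 40], index=idx)

unsorted_flag = s.index.is_monotonic_increasing

# Sort once, then reuse the sorted object for slicing and lookups.
s_sorted = s.sort_index()
sorted_flag = s_sorted.index.is_monotonic_increasing

# Label slice on the first level: all rows whose level-0 label is 'a'.
subset = s_sorted.loc["a":"a"]
print(unsorted_flag, sorted_flag, subset.tolist())
```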

User contributions licensed under: CC BY-SA