Skip to content
Advertisement

Python Pandas how to test equality between pandas columns that are category data types

I have large datasets that I cross-join with python pandas. Both datasets load in pandas and I convert all ‘object’ columns to ‘category’. The issue is I need to pd.query() against various ‘category’ dtype columns. When doing so with ‘category’ columns it returns an error (I expect this because not all columns have the same values (e.g. subsets and supersets that exist in both, one, or none). However, in the pd.query() method I can convert each via df[“col1”].astype(“object”) and test against another object column. When comparing “object” types my datasets grow in memory size to Gigabytes and I run into MemoryError. Is there something I’m unaware of that may allow me to test equality between two or more ‘category’ dtype columns that have varying range of values? Example code below:

import pandas as pd

df = pd.DataFrame({"c1":["a","b","c","d"],
                   "c2":["d","e","f","d"]})

print(df)
print()

df["c1"] = df["c1"].astype("category")
df["c2"] = df["c2"].astype("category")

### testing equality of category column dtypes
try : df.query("c1 == c2")
except: print("ERROR: we know this returns 'ValueError: unkown type object' bc comparing 'category' column dtypes ")
print()

dfObj = df.query("c1.astype('object') == c2.astype('object')")
print("NO ERROR because during comparions converted 'category' to 'object'!")
print(dfObj)

Result:

  c1 c2
0  a  d
1  b  e
2  c  f
3  d  d

ERROR: we know this returns 'ValueError: unkown type object' bc comparing 'category' column dtypes 

NO ERROR because during comparions converted 'category' to 'object'!
  c1 c2
3  d  d

Advertisement

Answer

You can use union_categoricals to do what you want.

Then to compare your two columns you code will look like this:

from pandas.api.types import union_categoricals

union = union_categoricals([df.c1, df.c2]).categories

df['c1'] = df.c1.cat.set_categories(union)
df['c2'] = df.c2.cat.set_categories(union)

print (df.c1 == df.c2)
0    False
1    False
2    False
3     True

Here is the doc if you need more ample details:

https://pandas.pydata.org/docs/reference/api/pandas.api.types.union_categoricals.html

This post is a duplicate of this question:

Update categories in two Series / Columns for comparison

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement