
Python Pandas: how to test equality between pandas columns that are category data types

I have large datasets that I cross-join with pandas. Both datasets load into pandas, and I convert all 'object' columns to 'category'. The issue is that I need to run DataFrame.query() against various 'category' dtype columns. Doing so on 'category' columns returns an error. I expect this, because the columns do not all share the same values (e.g. subsets and supersets, with values that exist in both, one, or none). However, inside the query() call I can convert each column via df["col1"].astype("object") and test it against another object column. When comparing 'object' types, my datasets grow to gigabytes in memory and I run into a MemoryError. Is there something I'm unaware of that would let me test equality between two or more 'category' dtype columns that have varying ranges of values? Example code below:
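A minimal reproduction of the comparison problem, using hypothetical data rather than the asker's original datasets. The names `left` and `right` are assumptions chosen for illustration; the point is only that two categorical Series with different category sets cannot be compared directly.

```python
import pandas as pd

# Hypothetical columns: both are 'category' dtype, but their category sets differ.
left = pd.Series(["a", "b", "c"], dtype="category")   # categories: a, b, c
right = pd.Series(["a", "x", "c"], dtype="category")  # categories: a, c, x

try:
    left == right  # element-wise comparison of mismatched categoricals
    raised = False
except TypeError as exc:
    # pandas refuses to compare categoricals whose categories differ
    raised = True
    print(exc)
```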


Result:



Answer

You can use union_categoricals from pandas.api.types to build a combined set of categories that covers both columns.

Then, to compare your two columns, your code will look like this:
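A minimal sketch of that approach, assuming a hypothetical DataFrame `df` with two categorical columns `col1` and `col2` whose category sets only partly overlap. The column names and data are illustrative, and this is one way to apply union_categoricals, not necessarily the exact code the answer had in mind.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical data: two categorical columns with different category sets.
df = pd.DataFrame({
    "col1": pd.Categorical(["a", "b", "c", "a"]),
    "col2": pd.Categorical(["a", "x", "c", "y"]),
})

# Combine both columns' categories into one shared set.
shared = union_categoricals([df["col1"], df["col2"]]).categories

# Re-declare each column over the shared categories.
# The values are unchanged and the dtype stays 'category'.
df["col1"] = df["col1"].cat.set_categories(shared)
df["col2"] = df["col2"].cat.set_categories(shared)

# With identical categories, element-wise equality no longer raises.
mask = df["col1"] == df["col2"]
matches = df[mask]
```

Only the category lists are rewritten here, so the columns remain 'category' dtype and are never cast to 'object'.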


Here is the documentation if you need more details:

https://pandas.pydata.org/docs/reference/api/pandas.api.types.union_categoricals.html

This post is a duplicate of this question:

Update categories in two Series / Columns for comparison

User contributions licensed under: CC BY-SA