I have large datasets that I cross-join with python pandas. Both datasets load in pandas and I convert all ‘object’ columns to ‘category’. The issue is I need to pd.query() against various ‘category’ dtype columns. When doing so with ‘category’ columns it returns an error (I expect this because not all columns have the same values (e.g. subsets and supersets that exist in both, one, or none). However, in the pd.query() method I can convert each via df[“col1”].astype(“object”) and test against another object column. When comparing “object” types my datasets grow in memory size to Gigabytes and I run into MemoryError. Is there something I’m unaware of that may allow me to test equality between two or more ‘category’ dtype columns that have varying range of values? Example code below:
import pandas as pd df = pd.DataFrame({"c1":["a","b","c","d"], "c2":["d","e","f","d"]}) print(df) print() df["c1"] = df["c1"].astype("category") df["c2"] = df["c2"].astype("category") ### testing equality of category column dtypes try : df.query("c1 == c2") except: print("ERROR: we know this returns 'ValueError: unkown type object' bc comparing 'category' column dtypes ") print() dfObj = df.query("c1.astype('object') == c2.astype('object')") print("NO ERROR because during comparions converted 'category' to 'object'!") print(dfObj)
Result:
c1 c2 0 a d 1 b e 2 c f 3 d d ERROR: we know this returns 'ValueError: unkown type object' bc comparing 'category' column dtypes NO ERROR because during comparions converted 'category' to 'object'! c1 c2 3 d d
Advertisement
Answer
You can use union_categoricals
to do what you want.
Then to compare your two columns you code will look like this:
from pandas.api.types import union_categoricals union = union_categoricals([df.c1, df.c2]).categories df['c1'] = df.c1.cat.set_categories(union) df['c2'] = df.c2.cat.set_categories(union) print (df.c1 == df.c2) 0 False 1 False 2 False 3 True
Here is the doc if you need more ample details:
https://pandas.pydata.org/docs/reference/api/pandas.api.types.union_categoricals.html
This post is a duplicate of this question: