Get values from dataframe with MultiIndex index containing NaNs

Question

I cannot access the values at an index position that contains a NaN, and I wonder how I could solve this. (In my project this index has a very special meaning and I really need to keep it; otherwise I would need to make some dirty manual modifications: "there is always a solution" even if it is a very

Accepted Answer

Update

I am able to reproduce your error after grouping and aggregating a data frame.

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...     "temp_playlist": [0] * 15,
    ...     "objId": ['o1'] * 2 + ['o2'] * 2 + ['o3'] * 2 + ['o4'] * 3 + ['o5'] * 2 + ['o6'] * 2 + [pd.NA] * 2,
    ...     "vals": [0, 6, 1, 4, 2, 5, 8, 9, 12, 10, 13, 11, 14, 3, 7]
    ... })
    >>> df = data.groupby(["temp_playlist", "objId"], dropna=False).agg(list)
    >>> df.loc[(0, pd.NA)]
    Traceback (most recent call last):
      File "/home/ec2-user/miniconda3/envs/so-pandas-nan-index/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError:

Passing in an explicit MultiIndex works, though.

    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)], names=["temp_playlist", "objId"])]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)])]
             vals
    0 NaN  [3, 7]

So does passing a list that contains a single tuple. Note that using [[]] returns a DataFrame.

    >>> df.loc[[(0, pd.NA)]]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]

As does DataFrame.reindex (see also the user guide on reindexing).

    >>> df.reindex([(0, pd.NA)])
                           vals
    temp_playlist objId
    0             NaN    [3, 7]

Original Attempt to Reproduce Error

I am not able to reproduce your error. You can see below that using df.loc[(0, np.nan)] works.

    Python 3.8.5 (default, Sep 4 2020, 07:30:14)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pandas as pd
    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'), (0, 'o2'), (0, 'o3'), (0, 'o4'), (0, 'o5'), (0, 'o6'), (0, np.nan)])
    >>> print(nan_index)
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> rng = np.random.default_rng(42)
    >>> vals = [rng.choice(20, 2) for i in range(nan_index.shape[0])]
    >>> print(vals)
    [array([ 1, 15]), array([13,  8]), array([ 8, 17]), array([ 1, 13]), array([4, 1]), array([10, 19]), array([14, 15])]
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> print(df)
                vals
    0 o1     [1, 15]
      o2     [13, 8]
      o3     [8, 17]
      o4     [1, 13]
      o5      [4, 1]
      o6    [10, 19]
      NaN   [14, 15]
    >>> print(df.loc[(0, 'o1')])
    vals    [1, 15]
    Name: (0, o1), dtype: object
    >>> print(df.loc[(0, np.nan)])
    vals    [14, 15]
    Name: (0, nan), dtype: object
    >>> print(pd.__version__)
    1.3.1

Then I noticed that your index was printed as (0, nan) but mine was (0, np.nan). The difference was that I used np.nan, and I suspect yours is pd.NA.

    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'), (0, 'o2'), (0, 'o3'), (0, 'o4'), (0, 'o5'), (0, 'o6'), (0, pd.NA)])
    >>> nan_index
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> df
                vals
    0 o1     [1, 15]
      o2     [13, 8]
      o3     [8, 17]
      o4     [1, 13]
      o5      [4, 1]
      o6    [10, 19]
      NaN   [14, 15]

However, that did not resolve the difference. I was still able to use df.loc[(0, np.nan)].

    >>> df.loc[(0, pd.NA)]
    vals    [14, 15]
    Name: (0, nan), dtype: object
    >>> df.loc[(0, np.nan)]
    vals    [14, 15]
    Name: (0, nan), dtype: object

Moreover, I was also able to use df.loc[(0, None)].

    >>> df.loc[(0, None)]
    vals    [14, 15]
    Name: (0, nan), dtype: object

Just to confirm, np.nan, pd.NA, and None are all different objects. Pandas must treat them the same when used with DataFrame.loc.

    >>> pd.NA is np.nan
    False
    >>> pd.NA is None
    False
    >>> np.nan is None
    False
    >>> type(pd.NA)
    <class 'pandas._libs.missing.NAType'>
    >>> type(np.nan)
    <class 'float'>
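
One more option, which is not part of the answer above, is to avoid the tuple lookup entirely and select rows with a boolean mask built from the index levels. The following is a minimal sketch under a few assumptions: it reuses the column names from the grouped example, uses a reduced data set of my own, and relies only on MultiIndex.get_level_values and Index.isna. Since it never passes a missing value as a lookup key, it should not depend on how a particular pandas version hashes pd.NA, though I have not checked it against every release.

```python
import pandas as pd

# Reduced version of the grouped frame from the answer (fewer rows, same column names).
data = pd.DataFrame({
    "temp_playlist": [0] * 6,
    "objId": ["o1", "o1", "o2", "o2", pd.NA, pd.NA],
    "vals": [0, 6, 1, 4, 3, 7],
})
df = data.groupby(["temp_playlist", "objId"], dropna=False).agg(list)

# Build a boolean mask from the index levels instead of looking up a tuple key.
playlist_level = df.index.get_level_values("temp_playlist")
obj_level = df.index.get_level_values("objId")
mask = (playlist_level == 0) & obj_level.isna()

# DataFrame holding only the rows whose objId is missing.
missing_rows = df[mask]
print(missing_rows)
```

Like df.loc[[(0, pd.NA)]], this returns a DataFrame rather than a Series, so take the first row (for example with .iloc[0]) if you need a single record.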
