Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer

Question

The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0", and its deprecation has invalidated a lot of previous answers. This post attempts to function as a canonical resource for looking up corresponding row/column pairs in pandas versions 1.2.0 and newer. Standard LookUp Values With Default …

Accepted Answer

Standard LookUp Values With Any Index

The documentation on "Looking up values by index/column labels" recommends NumPy indexing via `factorize` and `reindex` as the replacement for the deprecated `DataFrame.lookup`.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]},
                      index=[0, 2, 8, 9])

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

      Col  A  B  Val
    0   B  1  5    5
    2   A  2  6    2
    8   A  3  7    3
    9   B  4  8    8

`factorize` encodes the values of the column as an "enumerated type":

    idx, col = pd.factorize(df['Col'])
    # idx = array([0, 1, 1, 0], dtype=int64)
    # col = Index(['B', 'A'], dtype='object')

Notice that B corresponds to 0 and A corresponds to 1. `reindex` is used to ensure that the columns appear in the same order as the enumeration, so B comes first (location 0) and A comes second (location 1):

    df.reindex(columns=col)

       B  A
    0  5  1
    2  6  2
    8  7  3
    9  8  4

We also need a row indexer that is compatible with NumPy indexing. The standard approach is to use `np.arange` based on the length of the DataFrame:

    np.arange(len(df))
    # [0 1 2 3]

Now NumPy indexing can select one value per row from the reindexed array:

    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
    # [5 2 3 8]

Note: this approach works regardless of the type of index, because the row indexer is positional.

MultiIndex

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]},
                      index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

        Col  A  B  Val
    C E   B  1  5    5
      F   A  2  6    2
    D E   A  3  7    3
      F   B  4  8    8

Why use `np.arange` and not `df.index` directly?

Standard Contiguous Range Index

    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]})

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

In this case only, there is no error, because the result of `np.arange` is the same as `df.index`.

df:

      Col  A  B  Val
    0   B  1  5    5
    1   A  2  6    2
    2   A  3  7    3
    3   B  4  8    8

Non-Contiguous Range Index Error

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]},
                      index=[0, 2, 8, 9])

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError, because the index labels are used as positions:

    df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
    IndexError: index 8 is out of bounds for axis 0 with size 4

MultiIndex Error

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]},
                      index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError:

    df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
    IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

LookUp with Default For Unmatched/Not-Found Values

There are a few approaches. First, let's look at what happens by default when a value in Col has no corresponding column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]})
    #   Col  A  B
    # 0   B  1  5
    # 1   A  2  6
    # 2   A  3  7
    # 3   C  4  8

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

      Col  A  B  Val
    0   B  1  5  5.0
    1   A  2  6  2.0
    2   A  3  7  3.0
    3   C  4  8  NaN  # NaN represents the missing column C

The NaN is introduced because `factorize` enumerates every distinct value present in the column, regardless of whether it corresponds to a column header. When we then reindex the DataFrame, we end up with the following result:

    idx, col = pd.factorize(df['Col'])
    # idx = array([0, 1, 1, 2], dtype=int64)
    # col = Index(['B', 'A', 'C'], dtype='object')

    df.reindex(columns=col)

       B  A   C
    0  5  1 NaN
    1  6  2 NaN
    2  7  3 NaN
    3  8  4 NaN  # reindex adds the missing column with the default `NaN`

To specify a default value, pass the `fill_value` argument of `reindex`, which controls what is placed in columns that did not exist:

    idx, col = pd.factorize(df['Col'])
    # idx = array([0, 1, 1, 2], dtype=int64)
    # col = Index(['B', 'A', 'C'], dtype='object')

    df.reindex(columns=col, fill_value=0)

       B  A  C
    0  5  1  0
    1  6  2  0
    2  7  3  0
    3  8  4  0  # reindex adds the missing column with the specified value `0`

This means that we can do:

    idx, col = pd.factorize(df['Col'])
    df['Val'] = df.reindex(
        columns=col,
        fill_value=0  # Default value for missing columns
    ).to_numpy()[np.arange(len(df)), idx]

df:

      Col  A  B  Val
    0   B  1  5    5
    1   A  2  6    2
    2   A  3  7    3
    3   C  4  8    0

Note: the dtype of the column is int. NaN was never introduced, so the column type was not changed.

LookUp with Missing Values in the lookup Col

By default, `factorize` marks missing values with the sentinel -1 (the `na_sentinel=-1` default in older versions; `use_na_sentinel=True` in pandas 1.5 and later), meaning that when NaN values appear in the column being factorized, the resulting idx value is -1:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]})
    #    Col  A  B
    # 0    B  1  5
    # 1    A  2  6
    # 2    A  3  7
    # 3  NaN  4  8  # <- Missing lookup key

    idx, col = pd.factorize(df['Col'])
    # idx = array([ 0,  1,  1, -1], dtype=int64)
    # col = Index(['B', 'A'], dtype='object')

    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
    #    Col  A  B  Val
    # 0    B  1  5    5
    # 1    A  2  6    2
    # 2    A  3  7    3
    # 3  NaN  4  8    4  # <- Value from A

Because NumPy treats -1 as "the last position", this row pulls its value from the last column of the reindexed frame. Notice that col still only contains the values B and A, so the last row ends up with the value from A in Val.

The easiest way to handle this is to `fillna` Col with some value that cannot be found in the column headers. Here I use the empty string '':

    idx, col = pd.factorize(df['Col'].fillna(''))
    # idx = array([0, 1, 1, 2], dtype=int64)
    # col = Index(['B', 'A', ''], dtype='object')

Now when I reindex, the '' column will contain NaN values, meaning that the lookup produces the desired result:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                       'A': [1, 2, 3, 4],
                       'B': [5, 6, 7, 8]})

    idx, col = pd.factorize(df['Col'].fillna(''))
    df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

       Col  A  B  Val
    0    B  1  5  5.0
    1    A  2  6  2.0
    2    A  3  7  3.0
    3  NaN  4  8  NaN  # Missing as expected
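The pieces above (positional row indexer, `fill_value` for unmatched headers, and special handling for NaN keys) can be combined into one function. The following is only a minimal sketch, not part of the original answer: the name `lookup_by_header` and its `default` parameter are hypothetical, and instead of `fillna('')` it appends one extra column filled with the default so that the -1 sentinel from `factorize` lands on that column rather than on a real one. It assumes the column labels of the DataFrame are unique.

```python
import numpy as np
import pandas as pd


def lookup_by_header(df, key_col, default=np.nan):
    """Hypothetical helper: for each row, pick the value from the column
    named in df[key_col]; use `default` when the key is missing or unknown."""
    idx, col = pd.factorize(df[key_col])  # NaN keys are encoded as -1
    # Headers that do not exist in df become columns filled with `default`.
    values = df.reindex(columns=col, fill_value=default).to_numpy()
    # Assumption of this sketch: append one padding column of `default`,
    # so that idx == -1 (NumPy's "last column") selects the padding.
    pad = np.full((len(df), 1), default)
    values = np.hstack([values, pad])
    # Positional row indexer, so any kind of index works.
    return values[np.arange(len(df)), idx]


df = pd.DataFrame({'Col': ['B', 'A', np.nan, 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

with_zero = lookup_by_header(df, 'Col', default=0)
with_nan = lookup_by_header(df, 'Col')
print(with_zero.tolist())  # [5, 2, 0, 0]
```

With the default `np.nan` the result is upcast to float, just as in the examples above; with an integer default such as 0 the integer dtype is kept.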
