The operation pandas.DataFrame.lookup
is “Deprecated since version 1.2.0”, and has since invalidated a lot of previous answers.
This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.
Standard LookUp Values With Default Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) Col A B 0 B 1 5 1 A 2 6 2 A 3 7 3 B 4 8
I would like to be able to lookup the corresponding value in the column specified in Col
:
I would like my result to look like:
Col A B Val 0 B 1 5 5 1 A 2 6 2 2 A 3 7 3 3 B 4 8 8
Standard LookUp Values With a Non-Default Index
Non-Contiguous Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=[0, 2, 8, 9]) Col A B 0 B 1 5 2 A 2 6 8 A 3 7 9 B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val 0 B 1 5 5 2 A 2 6 2 8 A 3 7 3 9 B 4 8 8
MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']])) Col A B C E B 1 5 F A 2 6 D E A 3 7 F B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val C E B 1 5 5 F A 2 6 2 D E A 3 7 3 F B 4 8 8
LookUp with Default For Unmatched/Not-Found Values
Given the following DataFrame
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) Col A B 0 B 1 5 1 A 2 6 2 A 3 7 3 C 4 8 # Column C does not correspond with any column
I would like to look up the corresponding values if one exists otherwise I’d like to have it default to 0
Col A B Val 0 B 1 5 5 1 A 2 6 2 2 A 3 7 3 3 C 4 8 0 # Default value 0 since C does not correspond
LookUp with Missing Values in the lookup Col
Given the following DataFrame:
Col A B 0 B 1 5 1 A 2 6 2 A 3 7 3 NaN 4 8 # <- Missing Lookup Key
I would like any NaN
values in Col
to result in a NaN
value in Val
Col A B Val 0 B 1 5 5.0 1 A 2 6 2.0 2 A 3 7 3.0 3 NaN 4 8 NaN # NaN to indicate missing
Advertisement
Answer
Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize
and reindex
as the replacement for the deprecated DataFrame.lookup
.
import numpy as np import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=[0, 2, 8, 9]) idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val 0 B 1 5 5 1 A 2 6 2 2 A 3 7 3 3 B 4 8 8
factorize
is used to convert the column encode the values as an “enumerated type”.
idx, col = pd.factorize(df['Col']) # idx = array([0, 1, 1, 0], dtype=int64) # col = Index(['B', 'A'], dtype='object')
Notice that B
corresponds to 0
and A
corresponds to 1
. reindex
is used to ensure that columns appear in the same order as the enumeration:
df.reindex(columns=col) B A # B appears First (location 0) A appers second (location 1) 0 5 1 1 6 2 2 7 3 3 8 4
We need to create an appropriate range indexer compatible with NumPy indexing.
The standard approach is to use np.arange
based on the length of the DataFrame:
np.arange(len(df)) [0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx] [5 2 3 8]
*Note: This approach will always work regardless of type of index.
MultiIndex
import numpy as np import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']])) idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val C E B 1 5 5 F A 2 6 2 D E A 3 7 3 F B 4 8 8
Why use np.arange
and not df.index
directly?
Standard Contiguous Range Index
import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error as the result from np.arange
is the same as the df.index
.
df
Col A B Val 0 B 1 5 5 1 A 2 6 2 2 A 3 7 3 3 B 4 8 8
Non-Contiguous Range Index Error
Raises IndexError:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=[0, 2, 8, 9]) idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx] IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']])) idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx] IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches.
First let’s look at what happens by default if there is a non-corresponding value:
import numpy as np import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) # Col A B # 0 B 1 5 # 1 A 2 6 # 2 A 3 7 # 3 C 4 8 idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val 0 B 1 5 5.0 1 A 2 6 2.0 2 A 3 7 3.0 3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN
values are introduced, we will find that when factorize
goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.
For this reason, when we reindex
the DataFrame we will end up with the following result:
idx, col = pd.factorize(df['Col']) df.reindex(columns=col)
idx = array([0, 1, 1, 2], dtype=int64) col = Index(['B', 'A', 'C'], dtype='object') df.reindex(columns=col) B A C 0 5 1 NaN 1 6 2 NaN 2 7 3 NaN 3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can specify the fill_value
argument of reindex
which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col']) df.reindex(columns=col, fill_value=0)
idx = array([0, 1, 1, 2], dtype=int64) col = Index(['B', 'A', 'C'], dtype='object') df.reindex(columns=col, fill_value=0) B A C 0 5 1 0 1 6 2 0 2 7 3 0 3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col']) df['Val'] = df.reindex( columns=col, fill_value=0 # Default value for Missing column values ).to_numpy()[np.arange(len(df)), idx]
df
:
Col A B Val 0 B 1 5 5 1 A 2 6 2 2 A 3 7 3 3 C 4 8 0
*Notice the dtype
of the column is int
, since NaN
was never introduced, and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
factorize
has a default na_sentinel=-1
, meaning that when NaN
values appear in the column being factorized the resulting idx
value is -1
import numpy as np import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) # Col A B # 0 B 1 5 # 1 A 2 6 # 2 A 3 7 # 3 NaN 4 8 # <- Missing Lookup Key idx, col = pd.factorize(df['Col']) # idx = array([ 0, 1, 1, -1], dtype=int64) # col = Index(['B', 'A'], dtype='object') df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx] # Col A B Val # 0 B 1 5 5 # 1 A 2 6 2 # 2 A 3 7 3 # 3 NaN 4 8 4 <- Value From A
This -1
means that, by default, we’ll be pulling from the last column when we reindex. Notice the col
still only contains the values B
and A
. Meaning, that we will end up with the value from A
in Val
for the last row.
The easiest way to handle this is to fillna
Col
with some value that cannot be found in the column headers.
Here I use the empty string ''
:
idx, col = pd.factorize(df['Col'].fillna('')) # idx = array([0, 1, 1, 2], dtype=int64) # col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the ''
column will contain NaN
values meaning that the lookup produces the desired result:
import numpy as np import pandas as pd df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}) idx, col = pd.factorize(df['Col'].fillna('')) df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
:
Col A B Val 0 B 1 5 5.0 1 A 2 6 2.0 2 A 3 7 3.0 3 NaN 4 8 NaN # Missing as expected