Skip to content
Advertisement

Reference DataFrame value corresponding to column header

I am trying to append a column to my DataFrame that is based on the values that are referenced by indicated column names.

I have the following DataFrame:

     Area      1         2         3         4      Select     
-----------------------------------------------------------
0      22     54        33        46        23           4       
1      45     36        54        32        14           1        
2      67     34        29        11        14           3       
3      54     35        19        22        45           2        
4      21     27        39        43        22           3   

The values under “Select” are referencing the values that are under the column number that “Select” shows. For example, for row 0, “Select” shows 4, which refers to the value under column “4” in row 0, which is 23. And then for row 1, “Select” shows 1, which refers to the value under column “1” in row 1, which is 36.

I want to append a new column to my DataFrame that has the values that “Select” is referencing.

And so I need to take my DataFrame and create the following DataFrame:

     Area      1         2         3         4      Select      Value
----------------------------------------------------------------------
0      22     54        33        46        23           4         23
1      45     36        54        32        14           1         36 
2      67     34        29        11        14           3         11
3      54     35        19        22        45           2         19
4      21     27        39        43        22           3         43  

I am not sure how to pull the values from under the numbered columns referenced by the “Select” column, since the column headers are just titles and not actual values to be indexed. How might this be accomplished in python?

Advertisement

Answer

We can use numpy indexing as recommended by the documentation on Looking up values by index/column labels as the replacement for the deprecated DataFrame.lookup.

With factorize Select and reindex:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
)
  • Notice 1: If there is a value in the factorized column that does not correspond to a column header, the resulting value will be NaN (which indicates missing Data).

  • Notice 2: Both indexers need to be 0 based range indexes (compatible with numpy indexing). np.arange(len(df)) creates the range index based on the length of the DataFrame and therefore works in all cases.

However, if the DataFrame already has a compatible index (like in this example) df.index can be used directly.

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43

Another option is Index.get_indexer:

df['value'] = df.to_numpy()[
    df.index.get_indexer(df.index),
    df.columns.get_indexer(df['Select'])
]
  • Notice: The same condition as above applies, if df.index is already a contiguous 0-Based index (compatible with numpy indexing) we can use df.index directly, instead of processing it with Index.get_indexer:
df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43

Warning For get_indexer: if there is a value in Select that does not correspond to a column header the return value is -1 which will return the value from the last column in the DataFrame (since python supports negative indexing relative to the end). This is not as safe as NaN since it will return a value from the Select Column which is numeric and it may be difficult to tell the Data is invalid immediately.

Sample Program:

import pandas as pd

df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})

df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]

print(df)

Notice in the last row the Select column is D but it pulls the value from C which is the last column in the DataFrame (-1). This is not immediately apparent that the lookup failed/is incorrect.

  Select   A   B   C value
0      B  47  56  70    56
1      A   2  88  73     2
2      C  51  10  59    59
3      D  95  56  56    56  # <- Value from C

Compared with factorize:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

Notice in the last row the Select column is D and the corresponding value is NaN which is used in pandas to indicate missing data.

  Select   A   B   C  value
0      B  47  56  70   56.0
1      A   2  88  73    2.0
2      C  51  10  59   59.0
3      D  95  56  56    NaN  # <- Missing Data

Setup and imports:

import numpy as np  # (Only needed is using np.arange)
import pandas as pd

df = pd.DataFrame({
    'Area': [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    'Select': [4, 1, 3, 2, 3]
})
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement