
New column based on values from other columns AND respecting pre-established rules

I’m looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here’s an example:

Artificial data:

df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)

The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:

  • If the value ‘Yes’ is present in any of the columns, the value of the new_column will be ‘Yes’;
  • If the value ‘Yes’ is not present in any of the columns, but the value ‘No’ is present, then the value of the new_column will be ‘No’;
  • If the values ‘Yes’ and ‘No’ are both absent, then the value of the new_column will be ‘Unknown’.
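
The three rules amount to a priority order (‘Yes’ over ‘No’ over ‘Unknown’), so they can be written once for a row of any length. A minimal Python sketch, where the function name combine_row is hypothetical and introduced only for illustration:

```python
def combine_row(values):
    """Apply the rules to one row of values, whatever its length."""
    # Rule 1: any 'Yes' wins.
    if 'Yes' in values:
        return 'Yes'
    # Rule 2: no 'Yes', but at least one 'No'.
    if 'No' in values:
        return 'No'
    # Rule 3: neither 'Yes' nor 'No' is present.
    return 'Unknown'


print(combine_row(['No', 'Yes', 'Unknown']))
print(combine_row(['No', 'Unknown', 'No']))
print(combine_row(['Unknown', 'Unknown', 'Unknown']))
```

Because the function only checks membership, it does not care how many columns feed into it.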

I managed to implement this either with case_when(), describing every possible combination, or with nested ifelse() calls. But these solutions do not scale to N variables.

Current solution:

library(dplyr)
df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No'
        )
      )
    )

I’m looking for some algorithm capable of operationalizing this faster and capable of being expanded to N variables.

After searching Stack Overflow, I couldn’t find a solution to my problem (I know there are several posts about creating a new column based on values obtained from other columns, but none of them covers this case). Perhaps my search strategy was not the best. If anyone finds a relevant post, please provide the link.

I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.
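
For reference, here is a sketch of how the nested ifelse logic might look in Python with np.where, written so that the list of columns is the only thing that changes when more variables are added. The cols list and the any_yes / all_unknown names are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Assumption: the columns to combine are listed explicitly; extend this list for N columns.
cols = ['col_1', 'col_2', 'col_3']

# Row-wise boolean masks computed over all selected columns at once.
any_yes = df[cols].eq('Yes').any(axis=1)
all_unknown = df[cols].eq('Unknown').all(axis=1)

# Same nesting as the ifelse version: 'Yes' first, then 'Unknown', otherwise 'No'.
df['new_column'] = np.where(any_yes, 'Yes', np.where(all_unknown, 'Unknown', 'No'))

print(df)
```

Like the R version, this treats any value other than ‘Yes’ or ‘Unknown’ as ‘No’, which is only safe when the columns contain exactly those three values.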


Answer

A solution using Python:

import pandas as pd

df = pd.DataFrame({
  'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'],
  'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'],
  'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown']
})

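# For each row, apply the rules in priority order: 'Yes' first, then 'No', otherwise 'Unknown'.
df['col_4'] = [
    'Yes' if 'Yes' in row else ('No' if 'No' in row else 'Unknown')
    for row in zip(df['col_1'], df['col_2'], df['col_3'])
]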

print(df)

Output:

     col_1    col_2    col_3    col_4
0       No      Yes  Unknown      Yes
1      Yes      Yes      Yes      Yes
2      Yes  Unknown      Yes      Yes
3      Yes      Yes  Unknown      Yes
4      Yes  Unknown  Unknown      Yes
5      Yes       No       No      Yes
6       No  Unknown       No       No
7       No       No  Unknown       No
8       No  Unknown  Unknown       No
9  Unknown  Unknown  Unknown  Unknown
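
The list comprehension names each column explicitly. One way it might be extended to N columns is to select the columns by a naming pattern and iterate over rows as plain tuples. This is only a sketch: the 'col_' prefix filter and the new_column name are assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Assumption: every column whose name starts with 'col_' takes part in the rule.
cols = [c for c in df.columns if c.startswith('col_')]

# itertuples(index=False, name=None) yields one plain tuple per row,
# so the membership tests work for any number of columns.
df['new_column'] = [
    'Yes' if 'Yes' in row else ('No' if 'No' in row else 'Unknown')
    for row in df[cols].itertuples(index=False, name=None)
]

print(df)
```

Adding a col_4, col_5, and so on to the data frame requires no change to the comprehension itself.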
User contributions licensed under: CC BY-SA