New column based on values from other columns AND respecting pre-established rules

I’m looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here’s an example:

Artificial data:

df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)

The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:

If the value 'Yes' is present in any of the columns, the value of new_column will be 'Yes';
If 'Yes' is not present in any of the columns but 'No' is, the value of new_column will be 'No';
If both 'Yes' and 'No' are absent, the value of new_column will be 'Unknown'.
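
Put differently, the rules form a priority order: 'Yes' wins over 'No', and 'No' wins over 'Unknown'. A minimal sketch of that reading in plain Python follows; the function name `resolve` and its parameters are hypothetical, introduced only for illustration.

```python
def resolve(values, priority=("Yes", "No"), default="Unknown"):
    """Return the first label in `priority` found in `values`, else `default`."""
    present = set(values)
    for label in priority:
        if label in present:
            return label
    return default


# Three sample rows taken from the artificial data (rows 0, 6 and 9).
rows = [
    ("No", "Yes", "Unknown"),
    ("No", "Unknown", "No"),
    ("Unknown", "Unknown", "Unknown"),
]
results = [resolve(row) for row in rows]
print(results)  # ['Yes', 'No', 'Unknown']
```

Because the function only tests membership, the number of values per row does not matter.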

I managed to implement this either with case_when(), describing all possible combinations, or with nested ifelse() calls, but neither solution scales to N variables.

Current solution:

library(dplyr)
df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No'
        )
      )
    )

I’m looking for an algorithm that does this faster and can be extended to N variables.

After searching StackOverflow, I couldn’t find a solution to my problem (I know there are several posts about creating a new column based on values from other columns, but none that covers this case). Perhaps my search strategy was not the best. If anyone finds a relevant post, please provide the link.

I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.
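
For reference, a rough Python translation of the nested ifelse() approach could look like the sketch below. It assumes the same artificial data in a pandas DataFrame; the helper names `any_yes` and `all_unknown` are mine. It has the same limitation: every column is written out by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Each condition lists every column explicitly, which is what does not scale.
any_yes = (df['col_1'] == 'Yes') | (df['col_2'] == 'Yes') | (df['col_3'] == 'Yes')
all_unknown = (
    (df['col_1'] == 'Unknown')
    & (df['col_2'] == 'Unknown')
    & (df['col_3'] == 'Unknown')
)

# Nested np.where mirrors the nested ifelse() in the R code.
df['new_column'] = np.where(any_yes, 'Yes', np.where(all_unknown, 'Unknown', 'No'))
print(df)
```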

Answer

A solution using Python:

import pandas as pd

df = pd.DataFrame({
  'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'],
  'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'],
  'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown']
})

df['col_4'] = [('Yes' if 'Yes' in x else ('No' if 'No' in x else 'Unknown')) for x in zip(df['col_1'], df['col_2'], df['col_3'])]

print(df)

Output:

     col_1    col_2    col_3    col_4
0       No      Yes  Unknown      Yes
1      Yes      Yes      Yes      Yes
2      Yes  Unknown      Yes      Yes
3      Yes      Yes  Unknown      Yes
4      Yes  Unknown  Unknown      Yes
5      Yes       No       No      Yes
6       No  Unknown       No       No
7       No       No  Unknown       No
8       No  Unknown  Unknown       No
9  Unknown  Unknown  Unknown  Unknown
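
The same rule can be applied to any number of columns without naming each one. Below is a vectorized sketch of one possible way to do that, using `DataFrame.eq`, `DataFrame.any` and `np.select`. It assumes the input columns can be picked out by a `col_` prefix; that selection rule and the names `cols`, `has_yes` and `has_no` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Assumption: every input column starts with 'col_'; adjust the selection as needed.
cols = [c for c in df.columns if c.startswith('col_')]

# Row-wise flags: does the row contain at least one 'Yes' / at least one 'No'?
has_yes = df[cols].eq('Yes').any(axis=1)
has_no = df[cols].eq('No').any(axis=1)

# np.select picks the first matching condition, so 'Yes' takes priority over 'No'.
df['new_column'] = np.select([has_yes, has_no], ['Yes', 'No'], default='Unknown')
print(df)
```

Adding a `col_4`, `col_5`, and so on to the data requires no change to the logic, since the conditions are computed over whatever `cols` contains.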
