I’m looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here’s an example:
artificial data
df = data.frame( col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'), col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'), col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown') )
The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:
- If the value ‘Yes’ is present in any of the columns, the value of the new_column will be ‘Yes’;
- If the value ‘Yes’ is not present in any of the columns, but the value ‘No’ is present, then the value of the new_column will be ‘No’;
- If the values ’Yes’ and ‘No’ are absent, then the value of new_columns will be ‘Unknown’.
I managed to operationalize this using case_when() describing all possible combinations; or ifelse sequential. But these solutions are not scalable to N variables.
Current solution:
library(dplyr) df_1 <- df %>% mutate( new_column = ifelse( (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes', ifelse( (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No' ) ) )
I’m looking for some algorithm capable of operationalizing this faster and capable of being expanded to N variables.
After searching for StackOverflow, I couldn’t find a way to my problem (I know there are several posts about creating a new column based on values obtained from different columns, but none). Perhaps the search strategy was not the best. If anyone finds it, please provide the link.
I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.
Advertisement
Answer
A solution using Python:
import pandas as pd df = pd.DataFrame({ 'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'], 'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'], 'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown'] }) df['col_4'] = [('Yes' if 'Yes' in x else ('No' if 'No' in x else 'Unknown')) for x in zip(df['col_1'], df['col_2'], df['col_3'])] print(df)
Output:
col_1 col_2 col_3 col_4 0 No Yes Unknown Yes 1 Yes Yes Yes Yes 2 Yes Unknown Yes Yes 3 Yes Yes Unknown Yes 4 Yes Unknown Unknown Yes 5 Yes No No Yes 6 No Unknown No No 7 No No Unknown No 8 No Unknown Unknown No 9 Unknown Unknown Unknown Unknown