Skip to content
Advertisement

Find string in data frame and store new values in a new column

I am creating a script that takes a csv file which columns organisation and columns name are unknown. However I know that only one of the column contains some values in which the str ‘rs’ and ‘del’ appears.

I need to create an extra column (called ‘Type’) and store ‘dbsnp’ in the row where ‘rs’ was found and ‘deletion’ in the row where ‘del’ was found. If not str is found, leave this row in column type empty.

As example I provide this df:

Data = {'Number': ['Mukul', 'Rohan', 'Mayank', 
                  'Shubham', 'Aakash'], 
          
        'Location': ['Saharsanpur', 'MERrs', 'rsAdela', 
                     'aaaadelaa', 'aaa'], 
          
        'Pay': [25000, 30000, 35000, 40000, 45000]} 
  
df = pd.DataFrame(Data)
print(df)

 Name     Location    Pay
0    Mukul  Saharsanpur  25000
1    Rohan        MERrs  30000
2   Mayank      rsAdela  35000
3  Shubham    aaaadelaa  40000
4   Aakash          aaa  45000

I have been trying things like that

df["type"] = df["Name"].str.extract("rs")[0] 
# and then do some replace

But one of my problems is that I dont know non the name of the column neither the position.

Desire output

 Name     Location    Pay       type
0    Mukul  Saharsanpur  25000 dbsnp
1    Rohan        MERrs  30000 dbsnp
2   Mayank      rsAdela  35000 dbsnp
3  Shubham    aaaadelaa  40000 deletion
4   Aakash          aaa  450

The next for loop solve the problem of the unknown column but now I need to solve the issue of identify my str in the value.

How can I use str.contains(“rs”) in the if condition?

for index, row in df[:3].iterrows():
    for i in range(len(df.columns)): 
        if row[i] == 5:
            print(row.index[i])

Advertisement

Answer

You can do it without the loop. Here’s an approach. You can use applymap and search all the columns.

import pandas as pd
data = {'Number': ['Mukul', 'Rohan', 'Mayank', 
                  'Shubham', 'Aakash'], 
          
        'Location': ['Saharsanpur', 'MERrs', 'rsAdela', 
                     'aaaadelaa', 'aaa'], 
          
        'Pay': [25000, 30000, 35000, 40000, 45000]} 
  
df = pd.DataFrame(data)

df['rs'] = df.astype(str).applymap(lambda x: 'rs' in x).any(1)
df['del'] = df.astype(str).applymap(lambda x: 'del' in x).any(1)

df['type']=''
df.loc[df['rs'] == True, 'type'] = 'dbsnp'
df.loc[df['del'] == True, 'type'] = 'deletion'

df = df.drop(columns=['rs','del'])
print (df)

Based on the data in the table, rsAdela has both rs and del. Since I am applying rs first and del second, the row is flagged for deletion. You can choose to swap the order to decide if you want to retain value as dbsnp or deletion.

The code processes all the columns irrespective of dtype.

The output of the above data is:

    Number     Location    Pay      type
0    Mukul  Saharsanpur  25000     dbsnp
1    Rohan        MERrs  30000     dbsnp
2   Mayank      rsAdela  35000  deletion
3  Shubham    aaaadelaa  40000  deletion
4   Aakash          aaa  45000          
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement