I am creating a script that takes a csv file which columns organisation and columns name are unknown. However I know that only one of the column contains some values in which the str ‘rs’ and ‘del’ appears.
I need to create an extra column (called ‘Type’) and store ‘dbsnp’ in the row where ‘rs’ was found and ‘deletion’ in the row where ‘del’ was found. If not str is found, leave this row in column type empty.
As example I provide this df:
Data = {'Number': ['Mukul', 'Rohan', 'Mayank', 'Shubham', 'Aakash'], 'Location': ['Saharsanpur', 'MERrs', 'rsAdela', 'aaaadelaa', 'aaa'], 'Pay': [25000, 30000, 35000, 40000, 45000]} df = pd.DataFrame(Data) print(df) Name Location Pay 0 Mukul Saharsanpur 25000 1 Rohan MERrs 30000 2 Mayank rsAdela 35000 3 Shubham aaaadelaa 40000 4 Aakash aaa 45000
I have been trying things like that
df["type"] = df["Name"].str.extract("rs")[0] # and then do some replace
But one of my problems is that I dont know non the name of the column neither the position.
Desire output
Name Location Pay type 0 Mukul Saharsanpur 25000 dbsnp 1 Rohan MERrs 30000 dbsnp 2 Mayank rsAdela 35000 dbsnp 3 Shubham aaaadelaa 40000 deletion 4 Aakash aaa 450
The next for loop solve the problem of the unknown column but now I need to solve the issue of identify my str in the value.
How can I use str.contains(“rs”) in the if condition?
for index, row in df[:3].iterrows(): for i in range(len(df.columns)): if row[i] == 5: print(row.index[i])
Advertisement
Answer
You can do it without the loop. Here’s an approach. You can use applymap and search all the columns.
import pandas as pd data = {'Number': ['Mukul', 'Rohan', 'Mayank', 'Shubham', 'Aakash'], 'Location': ['Saharsanpur', 'MERrs', 'rsAdela', 'aaaadelaa', 'aaa'], 'Pay': [25000, 30000, 35000, 40000, 45000]} df = pd.DataFrame(data) df['rs'] = df.astype(str).applymap(lambda x: 'rs' in x).any(1) df['del'] = df.astype(str).applymap(lambda x: 'del' in x).any(1) df['type']='' df.loc[df['rs'] == True, 'type'] = 'dbsnp' df.loc[df['del'] == True, 'type'] = 'deletion' df = df.drop(columns=['rs','del']) print (df)
Based on the data in the table, rsAdela
has both rs
and del
. Since I am applying rs
first and del
second, the row is flagged for deletion
. You can choose to swap the order to decide if you want to retain value as dbsnp
or deletion
.
The code processes all the columns irrespective of dtype.
The output of the above data is:
Number Location Pay type 0 Mukul Saharsanpur 25000 dbsnp 1 Rohan MERrs 30000 dbsnp 2 Mayank rsAdela 35000 deletion 3 Shubham aaaadelaa 40000 deletion 4 Aakash aaa 45000