I'm fairly new to NumPy and pandas. Let's say I have a 2D NumPy array and I need to delete every row whose second value contains any character other than the letters 'A', 'C', 'T', 'G' and 'N', so that only rows made up entirely of those letters remain.
file = [['id', 'genome'], ['0', 'ATGTTTGTTTTT'], ['1', 'ATGTTTGTXXXX'], ['2', 'ATGDD2GTTTTT']]
After filtering, I want to get this:
[['id', 'genome'], ['0', 'ATGTTTGTTTTT']]
I thought about writing 3 nested for loops that check each character one by one, but that is really slow when I have 500 rows.
Answer
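Use Series.str.contains with a regular expression: the character class [ACTGN] lists the allowed letters, + requires one or more of them, and ^ and $ anchor the match to the start and end of the string, so the whole value has to consist of those letters: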
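import pandas as pd

file = [['id', 'genome'],
        ['0', 'ATGTTTGTTTTT'],
        ['1', 'ATGTTTGTXXXX'],
        ['2', 'ATGDD2GTTTTT']]

# The first row is the header, the remaining rows are the data
df = pd.DataFrame(file[1:], columns=file[0])
print(df)

# Keep only rows whose genome consists solely of A, C, T, G and N
df = df[df['genome'].str.contains('^[ACTGN]+$')]
print(df)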
  id        genome
0  0  ATGTTTGTTTTT
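Since the question starts from a 2D NumPy array rather than a DataFrame, here is a minimal sketch of the same filter that goes from array to array. It assumes the header is the first row and the genome is the second column, as in the sample data. The names arr, header, rows, mask_pd, mask_re and filtered are made up for this illustration. It also assumes pandas 1.1 or later, where Series.str.fullmatch is available, so the ^ and $ anchors are not needed.

```python
import re

import numpy as np
import pandas as pd

# Sample data from the question as a 2D NumPy array of strings
arr = np.array([
    ['id', 'genome'],
    ['0', 'ATGTTTGTTTTT'],
    ['1', 'ATGTTTGTXXXX'],
    ['2', 'ATGDD2GTTTTT'],
])

# Assumption: row 0 is the header, column 1 holds the genome
header, rows = arr[:1], arr[1:]

# Option 1: pandas; fullmatch requires the whole string to match the pattern
mask_pd = pd.Series(rows[:, 1]).str.fullmatch('[ACTGN]+').to_numpy()

# Option 2: plain re with one compiled pattern, no pandas needed for the mask
pattern = re.compile('[ACTGN]+')
mask_re = np.array([pattern.fullmatch(s) is not None for s in rows[:, 1]])

# Put the header back on top of the rows that passed the check
filtered = np.vstack([header, rows[mask_pd]])
print(filtered)
```

Both masks select the same rows here. For a few hundred rows either option should be fast, because the regex engine does the per-character work instead of nested Python loops.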