I'm fairly new to numpy and pandas. Let's say that I have a 2D numpy array and I need to delete all rows in which the second value contains anything other than the letters 'A', 'C', 'T', 'G' and 'N'.
file = [['id', 'genome'], ['0', 'ATGTTTGTTTTT'], ['1', 'ATGTTTGTXXXX'], ['2', 'ATGDD2GTTTTT']]
so that after filtering I get this:
[['id', 'genome'], ['0', 'ATGTTTGTTTTT']]
I wanted to write 3 nested for loops that check each character one by one, but this is very slow even when I have only 500 rows.
Answer
Use Series.str.contains with a character class of the allowed letters, [ACTGN]+, anchored with ^ for the start and $ for the end of the string:
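import pandas as pd

file = [['id', 'genome'],
        ['0', 'ATGTTTGTTTTT'],
        ['1', 'ATGTTTGTXXXX'],
        ['2', 'ATGDD2GTTTTT']]

# The first row holds the column names, the rest are data rows
df = pd.DataFrame(file[1:], columns=file[0])
print(df)

# Keep only rows whose genome consists entirely of A, C, T, G or N
df = df[df['genome'].str.contains('^[ACTGN]+$')]
print(df)

Output after filtering:

  id        genome
0  0  ATGTTTGTTTTT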
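Since the question starts from a 2D numpy array rather than a list, here is a minimal sketch of the same filter that returns a numpy array again. The names arr, header, rows, mask and filtered are made up for illustration, and it assumes a pandas version that has Series.str.fullmatch (1.1 or later). It swaps str.contains with ^ and $ for str.fullmatch, which requires the whole string to match, so the anchors are not needed.

```python
import numpy as np
import pandas as pd

# Sample data from the question, held as a 2D NumPy array of strings
# (the variable names here are illustrative, not from the original post).
arr = np.array([
    ['id', 'genome'],
    ['0', 'ATGTTTGTTTTT'],
    ['1', 'ATGTTTGTXXXX'],
    ['2', 'ATGDD2GTTTTT'],
])

# Split off the header row so it is not tested against the pattern.
header, rows = arr[0], arr[1:]

# Wrap the second column in a Series to get the vectorized .str methods.
# fullmatch requires the entire string to match, so no ^ or $ is needed.
mask = pd.Series(rows[:, 1]).str.fullmatch('[ACTGN]+').to_numpy()

# Keep the header row plus every data row whose genome passed the check.
filtered = np.vstack([header, rows[mask]])
print(filtered)
```

The boolean mask is computed for the whole column at once, so no explicit Python loops over rows or characters are needed.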