I’m fairly new to numpy and pandas. Let’s say that I have a 2D numpy array and I need to keep only the rows in which the second value contains only the letters 'A', 'C', 'T', 'G' and 'N', and delete all the others.

    file =
    [['id' 'genome'],
     ['0' 'ATGTTTGTTTTT'],
     ['1' 'ATGTTTGTXXXX'],
     ['2' 'ATGDD2GTTTTT']]

so that after filtering I get this:

    [['id' 'genome'],
     ['0' 'ATGTTTGTTTTT']]

I wanted to write 3 nested for loops that check each character one by one, but this is very slow when I have 500 rows.
Answer
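Use Series.str.contains with a character class of the allowed values, anchored with ^ for the start and $ for the end of the string, so that the whole value must consist of those letters: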

    import pandas as pd

    file = [['id', 'genome'],
            ['0', 'ATGTTTGTTTTT'],
            ['1', 'ATGTTTGTXXXX'],
            ['2', 'ATGDD2GTTTTT']]

    # First row holds the column names, the rest is data
    df = pd.DataFrame(file[1:], columns=file[0])
    print(df)

    # Keep only rows whose genome consists entirely of A, C, T, G or N
    df = df[df['genome'].str.contains('^[ACTGN]+$')]
    print(df)
    #   id        genome
    # 0  0  ATGTTTGTTTTT
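
If the data has to stay a 2D numpy array, as in the question, the same check can probably be done without building a DataFrame. The following is only a minimal sketch and not part of the answer above: it assumes the array holds strings with a header in the first row, and the names `allowed`, `mask` and `filtered` are illustrative. It uses `re.fullmatch` from the standard library, which requires the whole string to match, so no `^` and `$` anchors are needed.

```python
import re

import numpy as np

# Same sample data as in the question, held as a 2D array of strings.
file = np.array([['id', 'genome'],
                 ['0', 'ATGTTTGTTTTT'],
                 ['1', 'ATGTTTGTXXXX'],
                 ['2', 'ATGDD2GTTTTT']])

# Compile the pattern once; fullmatch must consume the entire string.
allowed = re.compile(r'[ACTGN]+')

# Boolean mask over the data rows only (the header row is skipped).
mask = np.array([allowed.fullmatch(g) is not None for g in file[1:, 1]])

# Keep the header row, then append only the data rows that passed.
filtered = np.vstack([file[:1], file[1:][mask]])
print(filtered)
```

This still loops over the rows in Python, but only once per row rather than once per character, since the regex engine does the per-character work.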