I have a tabular file that looks like this:
query_name KEGG_KOs
PROKKA_00013 NaN
PROKKA_00015 bactNOG[38]
PROKKA_00017 NA|NA|NA
PROKKA_00019 K00240
PROKKA_00020 K00246
PROKKA_00022 K02887
I’m trying to write a script that deletes the entire row if column 2 (‘KEGG_KOs’) does not begin with ‘K0’. The desired output is:
query_name KEGG_KOs
PROKKA_00019 K00240
PROKKA_00020 K00246
PROKKA_00022 K02887
Previous responses have referred people to pandas DataFrames, but I’ve had no luck adapting those answers. Any help would be greatly appreciated, cheers.
I tried the following, but it only isolates one specific K0 line:
import pandas as pd
df = pd.read_csv("eggnog.txt", delimiter="\t", names=["#query_name", "KEGG_KOs"])
print(df.loc[df['KEGG_KOs'] == 'K00240'])
Answer
Use boolean indexing with startswith, or with contains and the regex anchor ^ for the start of the string. Pass the parameter na=False in either case, because the column contains missing values:
df1 = df[df['KEGG_KOs'].str.startswith('K0', na=False)]
print(df1)
     query_name KEGG_KOs
3  PROKKA_00019   K00240
4  PROKKA_00020   K00246
5  PROKKA_00022   K02887
Or:
df1 = df[df['KEGG_KOs'].str.contains('^K0', na=False)]
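For anyone who wants to try both variants without the original file, here is a minimal self-contained sketch. It assumes the sample rows from the question are whitespace-separated and reads them from an in-memory string rather than from eggnog.txt; the names SAMPLE, by_prefix, by_regex and out are only illustrative. It also writes the filtered rows back out as tab-separated text, which is one possible way to produce the requested output.

```python
import io

import pandas as pd

# Sample rows copied from the question. Whitespace separation is an
# assumption made for this sketch; the real file may be tab-delimited.
SAMPLE = """query_name KEGG_KOs
PROKKA_00013 NaN
PROKKA_00015 bactNOG[38]
PROKKA_00017 NA|NA|NA
PROKKA_00019 K00240
PROKKA_00020 K00246
PROKKA_00022 K02887
"""

# The first line is used as the header, so no `names` argument is passed.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# na=False makes missing values count as "no match" instead of
# propagating NaN into the boolean mask.
by_prefix = df[df["KEGG_KOs"].str.startswith("K0", na=False)]
by_regex = df[df["KEGG_KOs"].str.contains(r"^K0", na=False)]

# Both masks select the same rows.
assert by_prefix.equals(by_regex)

# With no path given, to_csv returns the text instead of writing a file.
out = by_prefix.to_csv(sep="\t", index=False)
print(out)
```

To write a file instead, pass a path as the first argument to to_csv (for example a hypothetical "eggnog_filtered.txt").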