Skip to content
Advertisement

Delete rows that do not contain specific text

I have a tabular file that looks like this:

query_name      KEGG_KOs
PROKKA_00013    NaN
PROKKA_00015    bactNOG[38]
PROKKA_00017    NA|NA|NA
PROKKA_00019    K00240
PROKKA_00020    K00246
PROKKA_00022    K02887

I’m trying to create a script to go through and delete the entire row if column 2 (‘KEGG_KOs’) does not begin with ‘K0’. I’m trying to create an output of:

query_name     KEGG_KOs
PROKKA_00019    K00240
PROKKA_00020    K00246
PROKKA_00022    K02887

Previous responses have referred people to pandas DataFrame but I’ve had no luck using those responses to help. Any would be greatly appreciated, cheers.

I had tried (but this only isolates a specific K0 line.

df = pd.read_csv("eggnog.txt", delimiter="t", names=["#query_name", "KEGG_KOs"])
print(df.loc[df['KEGG_KOs'] == 'K00240'])

Advertisement

Answer

Use boolean indexing with startswith or contains with regex for start of string ^ and parameter na=False, because missing values:

df1 = df[df['KEGG_KOs'].str.startswith('K0', na=False)]
print (df1)
     query_name KEGG_KOs
3  PROKKA_00019   K00240
4  PROKKA_00020   K00246
5  PROKKA_00022   K02887

Or:

df1 = df[df['KEGG_KOs'].str.contains('^K0', na=False)]
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement