regex match not working on simple string with Pyteomics parser

I am performing an in silico digestion of the human proteome, meaning that I am trying to cleave the amino acid sequence of every protein at certain positions. I am using the Pyteomics parser module (pyteomics.parser) within a bigger function that I have created.

I am getting this error: PyteomicsError: Pyteomics error, message: "Not a valid modX sequence: {'sequence': 'AKDEVQKN'}"

However, I am unsure why AKDEVQKN doesn't match the compiled _modX_sequence pattern:

_modX_sequence = re.compile(r'^([^-]+-)?((?:[^A-Z-]*[A-Z])+)(-[^-]+)?$')

From my understanding of this regex, it should match any string that doesn't start with a hyphen (-) and consists of a series of alphabetical characters.

This is the function I am trying to use it on.

import re
import pyteomics
from pyteomics import fasta, parser
def ButcherShop(df, target, rule,min_length=7,exception=None,max_legnth=100, pH=2.0):
    raw = df[target]
    unique_peptides = set()
    for peptide in raw:
        new_peptides = parser.cleave(peptide, rule=rule,min_length=min_length,exception=exception)
        unique_peptides.update(new_peptides)
    print(f'Done,{len(unique_peptides)} sequences of >= 7 amino acids!')
    pep_dic = [{'sequence': i} for i in unique_peptides]
    for peptides in pep_dic:
        pep_dic['parsed_sequence'] = parser.parse(peptides,show_unmodified_termini=False)
        pep_dic['xlength'] = len(peptides)
        pep_dic['charge'] = int(round(electrochem.charge(peptides, pH=pH)))
        pep_dic['mass']=int(round(Peptide_mass(peptides)))
    pep_dic = [peptide for peptide in pep_dic if peptide['length'] <= int(max_length)]
    pep_df = pd.DataFrame.from_dict(pep_dic)
    return unique_peptides,pep_dic,pep_df

Thank you for any insight on how to address this.

Update: If I run this on a different data set, I get the same error, which may suggest that the problem is in the library itself.

Answer

Pyteomics maintainer here.

The error message actually tells you the source of the problem: PyteomicsError: Pyteomics error, message: "Not a valid modX sequence: {'sequence': 'AKDEVQKN'}"

It means that instead of a string 'AKDEVQKN' you passed a dictionary {'sequence': 'AKDEVQKN'}. This actually happens here:

pep_dic = [{'sequence': i} for i in unique_peptides]
for peptides in pep_dic:
    pep_dic['parsed_sequence'] = parser.parse(peptides,show_unmodified_termini=False)
    ...

You should pass the sequence itself to parse, not the dict:

pep_dic['parsed_sequence'] = parser.parse(peptides['sequence'], show_unmodified_termini=False)


Source: stackoverflow