
Python Pandas Extract word from column that contains String with Regex

I have this data frame (columns are strings):

        ORF                                             ORFDesc
3     b1731              succinate-semialdehyde dehydrogenase
4      b234              succinate-semialdehyde dehydrogenase
24    b2780                             L-alanine dehydrogenase
27     b753          methylmalmonate semialdehyde dehydrogenase
29    b1187               pyrroline-5-carboxylate dehydrogenase
...............................................................                                               
1922  b1124                         probable epoxide hydrolase 
1923  b2214                         probable epoxide hydrolase 
1924  b3670                          probable epoxide hydrolase
1925   b134                          probable epoxide hydrolase
2382  b2579    1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase

I need to get the 'ORF' values for rows whose 'ORFDesc' contains a word that includes “hydro” and is exactly 13 characters long. To be clear, it is the word that must be 13 characters long, not the whole description.

I’m using

df['ORF'][df['ORFDesc'].str.contains("hydro", na=False)]

to match the rows that contain “hydro”, but I also need to reject the ones where the matching word's length is not 13.

I would like to use a regex so I can create a new column ‘word’, like this:

        ORF                                             ORFDesc                word
3     b1731              succinate-semialdehyde dehydrogenase          dehydrogenase
4      b234              succinate-semialdehyde dehydrogenase          dehydrogenase
24    b2780                             L-alanine dehydrogenase        dehydrogenase
27     b753          methylmalmonate semialdehyde dehydrogenase           .
29    b1187               pyrroline-5-carboxylate dehydrogenase             .
...............................................................                                               
1922  b1124                         probable epoxide hydrolase         hydrolase 
1923  b2214                         probable epoxide hydrolase         hydrolase 
1924  b3670                          probable epoxide hydrolase        ....
1925   b134                          probable epoxide hydrolase         ..
2382  b2579    1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase        .

Then I could discard rows based on the length of the value in the ‘word’ column.

What pattern should I use?

EDIT:

I have tried this, but it still doesn't work:

pattern = '\b(?=\w*hydro)\w+\b'


Answer

You can use

\b(?=\w{13}\b)\w*hydro

See the regex demo

Details

  • \b – a word boundary
  • (?=\w{13}\b) – a positive lookahead that requires 13 word chars to be present immediately to the right of the current location, followed by a word boundary
  • \w*hydro – zero or more word chars and then hydro.
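
To try the pattern outside pandas, here is a minimal sketch using Python's built-in re module. The sample strings are taken from the question. The helper name find_hydro_words is made up for illustration, and the trailing \w* is an addition to the pattern above so that the match covers the whole word rather than stopping at hydro.

```python
import re

# \b            : start at a word boundary
# (?=\w{13}\b)  : lookahead, exactly 13 word chars before the next boundary
# \w*hydro      : any word chars, then "hydro"
# \w*           : (added here) consume the rest of the word
PATTERN = re.compile(r"\b(?=\w{13}\b)\w*hydro\w*")

def find_hydro_words(text):
    """Return every 13-character word in text that contains 'hydro'."""
    return PATTERN.findall(text)

samples = [
    "succinate-semialdehyde dehydrogenase",
    "probable epoxide hydrolase",
    "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
]
for s in samples:
    print(s, "->", find_hydro_words(s))
```

Only the first sample should produce a match, because "dehydrogenase" has 13 characters while "hydrolase" has 9.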

Python code:

df['ORF'][df['ORFDesc'].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)]
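
If you also want the ‘word’ column from the question, one possible approach is Series.str.extract with a capture group. This is a sketch on a small made-up subset of the rows shown in the question, not the full data frame. The capture group and the trailing \w* are additions, so that the whole word is captured.

```python
import pandas as pd

# Small illustrative subset of the rows from the question
df = pd.DataFrame({
    "ORF": ["b1731", "b2780", "b1124", "b2579"],
    "ORFDesc": [
        "succinate-semialdehyde dehydrogenase",
        "L-alanine dehydrogenase",
        "probable epoxide hydrolase",
        "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
    ],
})

# Same lookahead as above, with a capture group around the whole word
pattern = r"\b(?=\w{13}\b)(\w*hydro\w*)"

# expand=False returns a Series; rows without a match get NaN
df["word"] = df["ORFDesc"].str.extract(pattern, expand=False)

# Keep only the ORF values whose description had a matching word
orf_values = df.loc[df["word"].notna(), "ORF"].tolist()
print(df)
print(orf_values)
```

Because the length check is already in the lookahead, rows that fail it end up with NaN in ‘word’, so no separate length filter is needed afterwards.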