I have this data frame (columns are strings):
      ORF    ORFDesc
3     b1731  succinate-semialdehyde dehydrogenase
4     b234   succinate-semialdehyde dehydrogenase
24    b2780  L-alanine dehydrogenase
27    b753   methylmalmonate semialdehyde dehydrogenase
29    b1187  pyrroline-5-carboxylate dehydrogenase
...............................................................
1922  b1124  probable epoxide hydrolase
1923  b2214  probable epoxide hydrolase
1924  b3670  probable epoxide hydrolase
1925  b134   probable epoxide hydrolase
2382  b2579  1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase
I need to get the 'ORF' values for rows whose 'ORFDesc' contains a word that includes “hydro” and is exactly 13 characters long. To be clear, the word itself must be 13 characters long, not the whole description.
I’m using
df['IDClass'][df['ORFDesc'].str.contains("hydro", na=False)]
to match the rows that contain “hydro”, but I also need to reject the ones where the matching word's length is not 13.
I would like to use a regex so I can make a new Column ‘word’ like:
      ORF    ORFDesc                                           word
3     b1731  succinate-semialdehyde dehydrogenase              dehydrogenase
4     b234   succinate-semialdehyde dehydrogenase              dehydrogenase
24    b2780  L-alanine dehydrogenase                           dehydrogenase
27    b753   methylmalmonate semialdehyde dehydrogenase        .
29    b1187  pyrroline-5-carboxylate dehydrogenase             .
...............................................................
1922  b1124  probable epoxide hydrolase                        hydrolase
1923  b2214  probable epoxide hydrolase                        hydrolase
1924  b3670  probable epoxide hydrolase                        ....
1925  b134   probable epoxide hydrolase                        ..
2382  b2579  1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase  .
Then I could discard rows based on the length of the value in the ‘word’ column.
What pattern would do this?
EDIT:
I have tried this, but it still doesn't work:
pattern = '\b(?=\w*hydro)\w+\b'
Answer
You can use
\b(?=\w{13}\b)\w*hydro
Details
\b – a word boundary
(?=\w{13}\b) – a positive lookahead that requires 13 word chars to be present immediately to the right of the current location, followed by a word boundary
\w*hydro – zero or more word chars and then hydro.
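To see how the three parts work together outside pandas, here is a minimal sketch using only the standard `re` module. The sample descriptions are copied from the question; the names `PATTERN`, `descriptions` and `matches` are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer: start at a word boundary, require that the word
# is exactly 13 word characters long, then require "hydro" inside that word.
PATTERN = re.compile(r"\b(?=\w{13}\b)\w*hydro")

# Sample descriptions taken from the question.
descriptions = [
    "succinate-semialdehyde dehydrogenase",
    "L-alanine dehydrogenase",
    "probable epoxide hydrolase",
    "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
]

# True when some word in the description is 13 characters long and contains "hydro".
matches = {text: bool(PATTERN.search(text)) for text in descriptions}

for text, hit in matches.items():
    print(f"{hit!s:5}  {text}")
```

Note that `\w` does not match a hyphen, so "succinate-semialdehyde" is treated as two separate words. Only the rows containing "dehydrogenase" (13 characters) match; "hydrolase" (9 characters) is rejected.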
Python code:
df['ORF'][df['ORFDesc'].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)]
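For a runnable check, the sketch below rebuilds a few rows from the question (the full frame is not shown there, so this subset is an assumption) and applies the filter. It also shows the ‘word’ column idea from the question using `Series.str.extract`. That second approach and its pattern are an illustration, not part of the original answer.

```python
import pandas as pd

# A few rows copied from the question; the real frame has more rows.
df = pd.DataFrame(
    {
        "ORF": ["b1731", "b2780", "b1124", "b2579"],
        "ORFDesc": [
            "succinate-semialdehyde dehydrogenase",
            "L-alanine dehydrogenase",
            "probable epoxide hydrolase",
            "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
        ],
    },
    index=[3, 24, 1922, 2382],
)

# Direct filter with the answer's pattern (raw string keeps the backslashes).
mask = df["ORFDesc"].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)
orfs = df.loc[mask, "ORF"].tolist()

# Alternative sketch: extract the first word containing "hydro" into a
# 'word' column, then keep the rows where that word has 13 characters.
df["word"] = df["ORFDesc"].str.extract(r"\b(\w*hydro\w*)\b", expand=False)
orfs_by_length = df.loc[df["word"].str.len() == 13, "ORF"].tolist()

print(orfs)
print(orfs_by_length)
```

The extract-based variant only looks at the first word containing "hydro" in each description, so the two approaches can differ when a description has several such words. The lookahead pattern checks every word.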