I have this data frame (columns are strings):
          ORF    ORFDesc
    3     b1731  succinate-semialdehyde dehydrogenase
    4     b234   succinate-semialdehyde dehydrogenase
    24    b2780  L-alanine dehydrogenase
    27    b753   methylmalmonate semialdehyde dehydrogenase
    29    b1187  pyrroline-5-carboxylate dehydrogenase

    1922  b1124  probable epoxide hydrolase
    1923  b2214  probable epoxide hydrolase
    1924  b3670  probable epoxide hydrolase
    1925  b134   probable epoxide hydrolase
    2382  b2579  1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase
I need to get the 'ORF' values for rows whose 'ORFDesc' contains a word that includes "hydro" and is exactly 13 characters long. To be clear, the word itself must be 13 characters long, not the whole description.
I'm using

    df['IDClass'][df['ORFDesc'].str.contains("hydro", na=False)]
This matches the rows that contain "hydro", but I need to reject the ones where the matching word's length != 13.
I would like to use a regex so I can make a new column 'word' like:
          ORF    ORFDesc                                            word
    3     b1731  succinate-semialdehyde dehydrogenase               dehydrogenase
    4     b234   succinate-semialdehyde dehydrogenase               dehydrogenase
    24    b2780  L-alanine dehydrogenase                            dehydrogenase
    27    b753   methylmalmonate semialdehyde dehydrogenase         .
    29    b1187  pyrroline-5-carboxylate dehydrogenase              .

    1922  b1124  probable epoxide hydrolase                         hydrolase
    1923  b2214  probable epoxide hydrolase                         hydrolase
    1924  b3670  probable epoxide hydrolase                         .
    1925  b134   probable epoxide hydrolase                         ..
    2382  b2579  1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase   .
Then I could discard rows based on the length of the value in the 'word' column.
What pattern should I use?
EDIT:
I have tried this, but it still doesn't work:

    pattern = '\b(?=\w*hydro)\w+\b'
Answer
You can use

    \b(?=\w{13}\b)\w*hydro
See the regex demo
Details
\b - a word boundary
(?=\w{13}\b) - a positive lookahead that requires 13 word chars to be present immediately to the right of the current location, followed by a word boundary
\w*hydro - zero or more word chars and then hydro.
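As a quick sanity check of the breakdown above, here is a minimal sketch that runs the pattern with Python's built-in `re` module on three descriptions taken from the question. The names `PATTERN`, `samples`, and `matches` are only illustrative.

```python
import re

# Pattern from the answer: a word of exactly 13 word characters that contains "hydro".
PATTERN = re.compile(r"\b(?=\w{13}\b)\w*hydro")

samples = [
    "succinate-semialdehyde dehydrogenase",  # "dehydrogenase" has 13 characters
    "probable epoxide hydrolase",            # "hydrolase" has only 9 characters
    "L-alanine dehydrogenase",
]

# True where some 13-character word in the description contains "hydro".
matches = [bool(PATTERN.search(text)) for text in samples]
print(matches)  # [True, False, True]
```

Note that `\w` also matches digits and underscores, so a token such as `abc_hydro_123` would count as a 13-character word too.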
Python code:

    df['ORF'][df['ORFDesc'].str.contains(r"\b(?=\w{13}\b)\w*hydro", na=False)]
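If you also want the 'word' column described in the question, one possible approach (a sketch, not part of the original answer) is to capture the whole word containing "hydro" with `Series.str.extract` and then filter on its length. The small frame below is a hypothetical subset of the question's data, and the capturing pattern `\b(\w*hydro\w*)\b` is an assumption about what "word" should mean here.

```python
import pandas as pd

# Hypothetical subset of the data frame from the question.
df = pd.DataFrame({
    "ORF": ["b1731", "b2780", "b1124", "b2579"],
    "ORFDesc": [
        "succinate-semialdehyde dehydrogenase",
        "L-alanine dehydrogenase",
        "probable epoxide hydrolase",
        "1,3,4,6-tetrachloro-1,4-cyclohexadiene hydrolase",
    ],
})

# Capture the first whole word that contains "hydro" (any length); NaN if there is none.
df["word"] = df["ORFDesc"].str.extract(r"\b(\w*hydro\w*)\b", expand=False)

# Keep only the ORF values whose extracted word is exactly 13 characters long.
selected = df.loc[df["word"].str.len() == 13, "ORF"].tolist()
print(selected)  # ['b1731', 'b2780']
```

This only looks at the first "hydro" word in each description. If a description could hold several such words of different lengths, the lookahead pattern with `str.contains` is the safer filter.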