What I’ve managed to do
I’m new to both Python and regex. Using Python’s re.compile, I wanted to find all kinds of dashes surrounded by spaces in a massive number of text files. I used:
search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')
(Yeah, I know about the regex module on PyPI, but I’m trying to use what I know better.)
It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.
What I’d like to do now
Now I’m trying to do the opposite: find all the dash-like characters that are not surrounded by spaces (that is, with a space to the left, or a space to the right, or no spaces around them at all).
What I’ve tried
So I tried to use the same regex as above and swap the \s at the beginning, then the \s at the end, and then both of them, for \S (to match any character that is not whitespace). Now the regex suddenly seems to have gone mad: it finds all kinds of words rather than the dashes and their neighbouring letters, which is what I expected. I’ve no idea what’s going on.
What went wrong?
Answer
To match a specific single-char pattern not in between two chars you can use a pattern of the following type:
b(?!(?<=a.)c)
(?<!a)b|b(?!c)
where a and c can be the same chars.
The b(?!(?<=a.)c) pattern matches any b unless it is immediately followed with c while also being immediately preceded with a. The lookbehind (?<=a.) is checked at the position right after b, so it has to step back over a and one more char (here, . is fine to use because all we want from the lookbehind pattern is to reach the place after the b).
Here, if you wanted to match a normal regular hyphen not in between whitespaces, you could use -(?!(?<=\s.)\s).
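As a quick check of the hyphen case, here is a minimal sketch that runs both pattern shapes on a made-up sample string. The names alternation, nested and sample are illustrative assumptions, not part of the original question or answer.

```python
import re

# Two equivalent ways to match a hyphen that is NOT enclosed in whitespace on both sides.
alternation = re.compile(r'(?<!\s)-|-(?!\s)')   # no space before, OR no space after
nested = re.compile(r'-(?!(?<=\s.)\s)')         # fail only if space before AND space after

# Hypothetical sample: one fully spaced hyphen, then three that are not.
sample = "a - b, c- d, e -f, g-h"

alt_positions = [m.start() for m in alternation.finditer(sample)]
nested_positions = [m.start() for m in nested.finditer(sample)]

print(alt_positions)     # [8, 15, 20]
print(nested_positions)  # [8, 15, 20]
```

The hyphen at index 2 has whitespace on both sides, so both patterns skip it. The other three hyphens lack whitespace on at least one side and are reported.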
If you put the character class of your choice into the pattern, it will look like
(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!\s)
Or
[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!(?<=\s.)\s)
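A minimal sketch of the same idea in Python's re module, using a deliberately shortened character class for readability (the DASHES name, the reduced set of code points, and the sample text are assumptions for illustration). One detail worth keeping in mind: in Python's re, \u takes exactly four hex digits, so a code point above U+FFFF such as U+10EAD has to be written as \U00010EAD. Written as \u10EAD, it is read as U+10EA followed by a literal D.

```python
import re

# Shortened, illustrative class: soft hyphen, hyphen-minus, U+2010..U+2015,
# minus sign, and U+10EAD written with the eight-digit \U escape.
DASHES = r'[\u00AD\u002D\u2010-\u2015\u2212\U00010EAD]'

# Dash-like character that is NOT enclosed in whitespace on both sides.
not_spaced = re.compile(DASHES + r'(?!(?<=\s.)\s)')

# Hypothetical sample: en dash inside a word, spaced em dash,
# minus sign with a space only on the left, and a lone capital D.
text = "well\u2013known, a \u2014 b, x \u2212y, D"

found = [(m.start(), m.group()) for m in not_spaced.finditer(text)]
print(found)
```

In this sample the en dash (index 4) and the minus sign (index 21) are reported, the spaced em dash is skipped, and the capital D is not matched because the class uses \U00010EAD rather than \u10EAD.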
See the regex demo #1 and regex demo #2. The second is more efficient.
This technique is also described in the “Matching dots or commas as (not) part of numbers” YT video of mine.