
Dashes with(out) spaces with python’s regex

What I’ve managed to do

I’m new to both Python and regex. Using Python’s re.compile on a massive number of text files, I wanted to find all kinds of dashes surrounded by spaces. I used:

search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')

(Yeah, I know about the regex module on PyPI, but I’m trying to stick with what I know better.) It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.

What I’d like to do now

Now I’m trying to do the opposite: find all the dash-like characters that are not surrounded by spaces (that is, with a space to the left, or a space to the right, or no spaces around them at all).

What I’ve tried

So I took the regex above and replaced the \s at the beginning, then the \s at the end, and then both of them with \S (to match any character that is not whitespace). Now the regex suddenly seems to have gone mad: it finds all kinds of words rather than dashes and their neighbouring letters, which is what I expected. I’ve no idea what’s going on.

What went wrong?


Answer

To match a specific single-char pattern not in between two chars you can use a pattern of the following type:

b(?!(?<=a.)c)
(?<!a)b|b(?!c)

where a and c can be the same chars.

The b(?!(?<=a.)c) pattern matches any b unless it is immediately followed by c and, at the same time, immediately preceded by a. The lookbehind (?<=a.) is checked at the position right after b, so it has to see a followed by any one char. Using . is fine here because that one char is the b that was just matched, and all the lookbehind needs is to reach back past it.
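As a quick check that the two generic patterns behave the same way, here is a minimal sketch in Python. The sample strings and variable names are made up for illustration.

```python
import re

# Two ways to match "b" unless it sits between "a" and "c".
lookahead_form = re.compile(r'b(?!(?<=a.)c)')
alternation_form = re.compile(r'(?<!a)b|b(?!c)')

# Illustrative inputs: only "abc" has the b enclosed on both sides.
samples = ["abc", "abx", "xbc", "xbx"]

for text in samples:
    hits_lookahead = [m.start() for m in lookahead_form.finditer(text)]
    hits_alternation = [m.start() for m in alternation_form.finditer(text)]
    print(text, hits_lookahead, hits_alternation)
```

Both forms should skip the b in "abc" and report the b at index 1 in the other three strings.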

Here, if you wanted to match a regular hyphen that is not in between whitespace characters, you could use -(?!(?<=\s.)\s).
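A small sketch of the plain-hyphen case, using a made-up sample string that covers all four spacing combinations:

```python
import re

# A hyphen that is NOT enclosed by whitespace on both sides.
hyphen_not_spaced = re.compile(r'-(?!(?<=\s.)\s)')

# Illustrative text: spaced on both sides, no spaces, left only, right only.
text = "a - b, c-d, e -f, g- h"

positions = [m.start() for m in hyphen_not_spaced.finditer(text)]
print(positions)  # [8, 14, 19]
```

The hyphen at index 2 (in "a - b") is the only one skipped, since it has whitespace on both sides.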

If you put the character class of your choice into the pattern, it will look like

(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!\s)

Or

[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!(?<=\s.)\s)

See the regex demo #1 and regex demo #2. The second is more efficient.
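For use from Python, one possible way to wire this up is sketched below. The DASHES name and the sample text are illustrative, and duplicate entries in the class have been dropped. The final entry of the class is written as \U00010EAD on the assumption that code point U+10EAD was intended: Python's \u escape takes exactly four hex digits, so \u10EAD would be read as U+10EA followed by a literal D.

```python
import re

# Dash-like characters, defined once so all patterns stay in sync.
# Assumption: U+10EAD was intended, hence the eight-digit \U escape.
DASHES = (r'[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015'
          r'\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0'
          r'\uFE31\uFE32\uFE58\uFE63\uFF0D\U00010EAD]')

# Dash with whitespace on both sides (the search from the question).
spaced = re.compile(r'\s' + DASHES + r'\s')

# Dash NOT enclosed by whitespace: the two variants from the answer.
not_spaced_alt = re.compile(r'(?<!\s)' + DASHES + '|' + DASHES + r'(?!\s)')
not_spaced = re.compile(DASHES + r'(?!(?<=\s.)\s)')

# Illustrative text: spaced en dash, unspaced em dash,
# hyphen with a space on the left only, U+2010 with a space on the right only.
text = "one \u2013 two, three\u2014four, five -six, seven\u2010 eight"

print(spaced.findall(text))
print(not_spaced.findall(text))
print(not_spaced_alt.findall(text))
```

On this sample, the spaced pattern should find only the en dash with its surrounding spaces, while both "not spaced" variants should find the other three dashes and nothing else.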

This technique is also described in the “Matching dots or commas as (not) part of numbers” YT video of mine.