
Dashes with(out) spaces with python’s regex

What I’ve managed to do

I’m new to both Python and regex. Using Python’s re.compile on a massive number of text files, I wanted to find all kinds of dashes surrounded by spaces. I used:

search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')

(Yeah, I know about the regex module on PyPI, but I’m trying to use what I know better.) It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.

What I’d like to do now

Now I’m trying to do the opposite: find all the dash-like characters that are not surrounded by spaces (that is, with a space to the left, or a space to the right, or no spaces around them at all).

What I’ve tried

So I tried to use the same regex as above and just swap the \s at the beginning, then the \s at the end, and then both \s-es with \S (to find all characters that are not space characters). And now the regex suddenly seems to have gone mad and is finding all kinds of words rather than dashes and their neighbouring letters, which is what I expected it to find. I’ve no idea what’s going on.

What went wrong?


Answer

To match a specific single-char pattern not in between two chars you can use a pattern of the following type:

b(?!(?<=a.)c)
(?<!a)b|b(?!c)

where a and c can be the same chars.

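The b(?!(?<=a.)c) pattern matches any b unless it is both immediately preceded by a and immediately followed by c. The negative lookahead is checked at the position right after b: inside it, the lookbehind (?<=a.) steps back over b and requires a before it (here, . is fine to use because all we want from the lookbehind pattern is to get past the b), and c then checks the next char. The (?<!a)b|b(?!c) pattern expresses the same condition as an alternation: a b not preceded by a, or a b not followed by c.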
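As a quick sanity check, the two pattern shapes above can be compared in Python by brute force. This is only a sketch: the helper name starts and the test alphabet 'abcx' are made up for illustration, and a, b, c are used as literal letters.

```python
import re
from itertools import product

# The two generic shapes from the answer, with a, b, c as literal letters.
nested = re.compile(r'b(?!(?<=a.)c)')
alternation = re.compile(r'(?<!a)b|b(?!c)')

def starts(rx, text):
    """Return the start offsets of all matches of rx in text (hypothetical helper)."""
    return [m.start() for m in rx.finditer(text)]

# Compare both patterns on every string of length 1..5 over a small alphabet.
mismatches = [
    ''.join(chars)
    for n in range(1, 6)
    for chars in product('abcx', repeat=n)
    if starts(nested, ''.join(chars)) != starts(alternation, ''.join(chars))
]
print(mismatches)  # an empty list means the patterns agreed on every input

# Only the b in "abc" is skipped; the other three b's are matched.
example = starts(nested, 'abc xbc abx b')
print(example)  # [5, 9, 12]
```

Both patterns reject a b only when it sits between a and c, so the list of mismatches should come out empty.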

Here, if you wanted to match a regular hyphen that is not in between whitespaces, you could use -(?!(?<=\s.)\s).
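A minimal sketch of that hyphen pattern with Python’s re module follows. The sample string is invented for illustration; it contains one hyphen with spaces on both sides and three without.

```python
import re

# Hyphen that is NOT surrounded by whitespace on both sides.
pattern = re.compile(r'-(?!(?<=\s.)\s)')

# Hyphens sit at offsets 2 (spaced on both sides), 8, 15 and 20.
sample = "a - b, c- d, e -f, g-h"

positions = [m.start() for m in pattern.finditer(sample)]
print(positions)  # [8, 15, 20]
```

The hyphen at offset 2 is skipped because it has whitespace on both sides; a hyphen with whitespace on only one side, or on neither, is still matched.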

If you put the character class of your choice into the pattern, it will look like

(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!\s)

Or

[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!(?<=\s.)\s)

See the regex demo #1 and regex demo #2. The second is more efficient.
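For reference, here is a hedged sketch of how the character-class version might be wired up in Python. The DASHES string is a shortened, illustrative subset of the class from the question, not the full list, and the sample text is invented. One assumption is flagged in the comments: Python’s re reads exactly four hex digits after \u, so \u10EAD is parsed as \u10EA followed by a literal D. If U+10EAD was intended, the eight-digit form \U00010EAD is needed.

```python
import re

# Assumption: a shortened subset of the question's dash class, for readability.
# \U00010EAD is used instead of \u10EAD, which re would read as \u10EA plus "D".
DASHES = r'\u00AD\u002D\u2010-\u2015\u2212\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD'

# Dash with whitespace on both sides (the shape of the original search).
spaced = re.compile(rf'\s[{DASHES}]\s')
# Dash that is NOT surrounded by whitespace on both sides.
not_spaced = re.compile(rf'[{DASHES}](?!(?<=\s.)\s)')

# En dash with spaces, em dash without, minus sign with a space on the left only,
# and an ordinary hyphen inside a word.
sample = 'one \u2013 two, three\u2014four, five \u2212six, well-known'

spaced_hits = spaced.findall(sample)
unspaced_hits = not_spaced.findall(sample)
print(spaced_hits)    # the en dash together with its surrounding spaces
print(unspaced_hits)  # em dash, minus sign, hyphen

# Demonstration of the four-digit \u pitfall mentioned above.
four_digit_pitfall = bool(re.fullmatch(r'[\u10EAD]', 'D'))          # True: "D" is in the class
astral_ok = bool(re.fullmatch(r'[\U00010EAD]', '\U00010EAD'))      # True
print(four_digit_pitfall, astral_ok)
```

Keeping the class in one variable and interpolating it avoids repeating the long list, which is needed twice in the alternation form.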

This technique is also described in the “Matching dots or commas as (not) part of numbers” YT video of mine.
