Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in ‘()’.

Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx

If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like ‘(’ and ‘/’.

Answer

>>> import re
>>> foo = re.compile( r"(?<=\(K\()[^)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']

Explanation

In regex-world, a lookbehind is a way of saying “I want to match ham, but only if it’s preceded by spam”. We write this as (?<=spam)ham. So in this case, we want to match [^)]*, but only if it’s preceded by \(K\(.

Now \(K\( is a nice, easy regex, because it’s plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!

Finally, when you put something in square brackets in regex-world, it means “any of the characters in here is OK”. If you put something inside square brackets where the first character is ^, it means “any character not in here is OK”. So [^)] means “any character that isn’t a right-bracket”, and [^)]* means “as many characters as possible that aren’t right-brackets”.
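As a small illustration of the square-bracket rules above, here is a sketch in Python. The sample strings are made up for the example and are not from the question.

```python
import re

# [abc] matches any single character that is a, b, or c.
letters = re.findall(r"[abc]", "cabbage")

# [^)]* matches as many characters as possible that are not a right-bracket,
# so matching from the start of this string stops just before the first ")".
prefix = re.match(r"[^)]*", "ThinkCode))/profile").group(0)

print(letters)  # ['c', 'a', 'b', 'b', 'a']
print(prefix)   # ThinkCode
```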

Putting it all together, (?<=\(K\()[^)]* means “match as many characters as you can that aren’t right-brackets, preceded by the string (K(”.
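If it helps to see the pieces side by side, the same pattern can be written with the re.VERBOSE flag, which lets you spread a regex over several lines and comment each part. This is just one way to lay it out; the variable names are illustrative.

```python
import re

# Same regex as in the answer, split into its two parts with comments.
# re.VERBOSE ignores whitespace and "#" comments outside character classes.
pattern = re.compile(r"""
    (?<=\(K\()   # lookbehind: the match must be preceded by the text (K(
    [^)]*        # as many characters as possible that are not right-brackets
""", re.VERBOSE)

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
matches = pattern.findall(url)
print(matches)  # ['ThinkCode']
```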

Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings: r"spam" instead of just "spam". That tells Python to leave the \’s alone instead of treating them as escape characters, so they reach the regex parser untouched.

Another way

If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don’t have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!

To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
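Here is a minimal sketch of the capturing-group approach applied to the sample URL. The exact pattern is my own guess at how you might write it, not necessarily the one used in the other answer.

```python
import re

# Match the whole "(K(...)" chunk, but wrap the part we care about in
# unescaped brackets so it is remembered as group 1.
pattern = re.compile(r"\(K\(([^)]*)\)")

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
match = pattern.search(url)

groups = match.groups() if match else ()
code = match.group(1) if match else None

print(groups)  # ('ThinkCode',)
print(code)    # ThinkCode
```

Note that search() returns None when nothing matches, so it is worth checking the result before calling .groups() on it.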

User contributions licensed under: CC BY-SA