
Python regular expression help needed, multiple lines regex

I was trying to scrape a link out of a .eml file, but my search always returns None. I can't even get the link with the brackets around it; once that string is pulled out, getting the valid link from it is no problem.

One problem I see is that the string matched by the regex spans multiple lines, but the regex itself seems to be valid.

CODE/REGEX I USE:

import re

def get_url(raw):
    # get rid of spaces
    raw = raw.replace(' ', '')
    # search for the link
    url = re.search(r'href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url


Answer

First, the .eml file is encoded as MIME quoted-printable (the hint is the = signs at the ends of lines). Decode it first instead of working with the encoded raw text.
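
To see why decoding comes first, here is a minimal sketch using a made-up quoted-printable fragment (the sample bytes and the example.com URL are illustrative assumptions, not taken from the asker's email). In quoted-printable, `=3D` encodes a literal `=`, and a `=` at the end of a line is a soft line break, so an encoded URL can be split across lines.

```python
import quopri

# Hypothetical quoted-printable fragment: "=3D" encodes "=", and the "=" at
# the end of the first line is a soft line break that the decoder removes.
sample = (
    b'<a href=3D"https://example.com/optIn?token=3Dabc=\n'
    b'def">Confirm</a>'
)

decoded = quopri.decodestring(sample).decode("utf-8")
print(decoded)
# <a href="https://example.com/optIn?token=abcdef">Confirm</a>
```

After decoding, the link sits on a single line with a plain `href="..."`, which is much easier to search than the encoded text.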

Second, regex is overkill here. Some simple str.split() usage works just as well. Regex is extremely useful in its proper usage scenarios, but plain Python can often do the same job without regex's flavor of magic, which can be confusing as [REDACTED].

Note that if you're building a regex, it's always advisable to use one of the many regex editors, as these will help you build and test your pattern. My personal favorite is regex101.

EDIT: added a regex way to do it.

import quopri
import re


def get_url_by_regex(raw):
    # decode quoted-printable first, then match the first <a href="..."> in the text
    decoded = quopri.decodestring(raw).decode("utf-8")
    return re.search(r'(<a href=")(.*?)(")', decoded).group(2)


def get_url(raw):
    # decode quoted-printable first, then scan line by line for the token link
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None  # just in case this is needed


print(get_url(raw_email))
print(get_url_by_regex(raw_email))

The result is:

https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
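
Note that `re.search` returns `None` when nothing matches, and calling `.group()` on `None` raises `AttributeError`. A minimal sketch of a variant that checks for a match first might look like this; the function name `find_confirm_url`, the pattern, and the sample input are assumptions for illustration, not part of the original answer.

```python
import quopri
import re


def find_confirm_url(raw):
    """Return the first href value containing 'token=', or None if absent."""
    # Decode quoted-printable first; undecodable bytes are replaced, not fatal.
    decoded = quopri.decodestring(raw).decode("utf-8", errors="replace")
    # Capture the href value only if it contains "token=".
    match = re.search(r'<a\s+href="([^"]*token=[^"]*)"', decoded)
    return match.group(1) if match else None


# Hypothetical input, not the asker's real email.
sample = b'<a href=3D"https://example.com/optIn?token=3Dabc=\ndef">Confirm</a>'
print(find_confirm_url(sample))
print(find_confirm_url(b"no link in here"))
```

The caller can then test the result against `None` instead of catching an exception.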

User contributions licensed under: CC BY-SA