Python regular expression help needed, multiple lines regex




I was trying to scrape a link out of a .eml file, but my search always returns None. I can't even extract the raw string that contains the link; turning it into a valid link would be no problem once the string is pulled.

One problem I see is that the string the regex should match spans multiple lines, but the regex itself seems to be valid.

CODE/REGEX I USE:

def get_url(raw):
    #get rid of whitespaces
    raw = raw.replace(' ', '')
    #search for the link
    url = re.search(r'href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url

print(get_url(raw_email))

The EML file passage:

<table cellpadding=3D"0" cellspacing=3D"0" border=3D"0" width=3D"640" class=
=3D"responsive"><tbody><tr><td style=3D"padding:11px 10px 10px;" align=3D"c=
enter">
<table border=3D"0" cellspacing=3D"0" cellpadding=3D"0"><tbody><tr><td styl=
e=3D"padding: 10px 35px;background-color:#222222;color:#ffffff;" align=3D"c=
enter">
<div style=3D"font-size: 12px; font-family: Helvetica, Arial, sans-serif; f=
ont-weight: normal; color: #ffffff; text-decoration: none; display: inline-=
block;"><a href=3D"https://app.rule.io/subscriber/optIn?token=3DeyJ0eXAiOiJ=
KV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjE5NzQ0MDc2OSwic3Vic2NyaWJlckZvcm0iOjExO=
TAsImlzcyI6Imh0dHBzOi8vYXBwLnJ1bGUuaW8iLCJpYXQiOjE2MjM3ODc4NjUsImV4cCI6MTYy=
NDM5MjY2NSwibmJmIjoxwhrejdkjekdksLCJqdGkiOiJoYVRveXJCTUNFdU1GejJ1In0.fERYAI=
cfscU9kisH3CugY6DCjFOVnarhN5IA447Z_M" style=3D"text-decoration: none;" targ=
et=3D"_blank"><span style=3D"color:#FFFFFF;">CONFIRM</span></a></div>
</td>
</tr></tbody></table></td>
</tr></tbody></table></div>
</td>
</tr><tr class=3D"iguana iguana-duplicatable ng-scope" data-rule-block-name=
=3D"Text column" style=3D""><td style=3D"padding:10px">
<div align=3D"center">
<table cellpadding=3D"0" cellspacing=3D"0" border=3D"0" width=3D"640" class=
=3D"responsive"><tbody><tr><td style=3D"padding-top:5px;padding-bottom:0;pa=
dding-right:20px;padding-left:20px;">
<div style=3D"color:#222222;line-height:1.2em;font-family: Helvetica, Arial=
, sans-serif;font-size: 10px; font-weight: normal;" align=3D"center"><span =
style=3D"font-size:16px;">If you become one of the participants in our =E2=
=80=98First-come First-Served=E2=80=99 you are not guaranteed a pair, but t=
he opportunity to purchase one pair along with other participants. Since it=
 is first come first served, be sure to check your e-mail including your sp=
am folder, so you don=E2=80=99t risk missing out on this limited sneaker.</=

The link I need:

https://app.rule.io/subscriber/optIn?token=3DeyJ0eXAiOiJ= KV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjE5NzQ0MDc2OSwic3Vic2NyaWJlckZvcm0iOjExO= TAsImlzcyI6Imh0dHBzOi8vYXBwLnJ1bGUuaW8iLCJpYXQiOjE2MjM3ODc4NjUsImV4cCI6MTYy= NDM5MjY2NSwibmJmIjoxwhrejdkjekdksLCJqdGkiOiJoYVRveXJCTUNFdU1GejJ1In0.CfERYAI= cfscU9kisH3CugY6DCjFOVnarhN5IA447Z_M

Answer

First, the .eml is encoded in MIME quoted-printable (the hint is the = sign at the end of each wrapped line, and =3D in place of every literal =). You should decode this first instead of dealing with the encoded raw text.
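To see what the decoding step does, here is a minimal sketch using the standard-library quopri module. The sample bytes are made up for illustration (a shortened stand-in for the passage in the question, with example.com as a placeholder host); only the encoding style is the same.

```python
import quopri

# Made-up, shortened sample in the same style as the .eml passage.
# A trailing "=" is a soft line break; "=3D" is an encoded "=".
sample = (
    b'<a href=3D"https://example.com/optIn?token=3DabcDEF=\n'
    b'123" targ=\n'
    b'et=3D"_blank">CONFIRM</a>'
)

# decodestring() returns bytes, so decode them to get a str.
decoded = quopri.decodestring(sample).decode("utf-8")
print(decoded)
# <a href="https://example.com/optIn?token=abcDEF123" target="_blank">CONFIRM</a>
```

After decoding, the link sits on a single line with plain = signs, which is why the multi-line problem from the question goes away.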

Second, regex is overkill here. A couple of str.split() calls will work just as well. Regex is extremely useful in its proper usage scenarios, but simple Python can often do the same job without regex's particular flavor of magic, which can be confusing.

Note that if you're building a regex, it's always advisable to use one of the many online regex editors, as these will help you build and test it. My personal favorite is regex101.

EDIT: added regex way to do it.

import quopri
import re


def get_url_by_regex(raw):
    # Decode the quoted-printable text first ("=3D" becomes "=", soft line breaks vanish).
    decoded = quopri.decodestring(raw).decode("utf-8")
    # Group 2 is the URL between the quotes of the first <a href="..."> in the text.
    return re.search('(<a href=")(.*?)(")', decoded).group(2)


def get_url(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None  # just in case this is needed


print(get_url(raw_email))
print(get_url_by_regex(raw_email))

The result is:

https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
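If raw_email holds the complete .eml text including its headers (an assumption, the question only shows a passage of the body), the standard-library email package can do the quoted-printable decoding for you. The sketch below is an alternative to the functions above, not a replacement for them: the function name get_confirm_url and the sample message are made up for illustration, and the regex only looks for an href whose value contains token=.

```python
import re
from email import message_from_string, policy


def get_confirm_url(raw_email):
    # policy.default gives the modern EmailMessage API (get_body, get_content).
    msg = message_from_string(raw_email, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    if body is None:
        return None
    # get_content() undoes the Content-Transfer-Encoding and the charset.
    html = body.get_content()
    match = re.search(r'href="([^"]*token=[^"]*)"', html)
    return match.group(1) if match else None


# Made-up minimal message in the same encoding style as the question.
sample = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<a href=3D"https://example.com/optIn?token=3Dabc=\n'
    'DEF" target=3D"_blank">CONFIRM</a>\n'
)

print(get_confirm_url(sample))
# https://example.com/optIn?token=abcDEF
```

The advantage is that the headers decide how the body is decoded, so the same function would also cope with a base64-encoded HTML part.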


Source: stackoverflow