I was trying to scrape a link out of a .eml file, but my search always returns None. I can't even get the link together with the surrounding brackets; turning it into a valid link would be no problem once the string is pulled.
One problem I see is that the string the regex should find spans multiple lines, but the regex itself seems to be valid.
CODE/REGEX I USE:
    import re

    def get_url(raw):
        # get rid of whitespaces
        raw = raw.replace(' ', '')
        # search for the link
        url = re.search(r'href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
        return url
Answer
First, the .eml is encoded in MIME quoted-printable (the hint is the = signs at the end of the lines). You should decode this first, instead of dealing with the encoded raw text.
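To see what the decoding step does, here is a minimal sketch on a made-up fragment (the example.com URL and the token are placeholders, not taken from the actual mail). In quoted-printable, =3D stands for a literal = and a trailing = marks a soft line break, which is why the raw link appears to span several lines:

```python
import quopri

# Hypothetical quoted-printable fragment: "=3D" encodes "=", and the "="
# right before the newline is a soft line break that the decoder removes.
encoded = b'<a href=3D"https://example.com/optIn?token=3Dabc=\ndef">Confirm</a>'

decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```

After decoding, the link sits on a single line with a plain href=", so a simple search is enough.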
Second, regex is overkill. Some nice string.split() usage will work just as well. Regex is extremely useful in its proper usage scenarios, but some simple Python can usually do the same without regex's flavor of magic, which can be confusing as [REDACTED].
Note that if you're building a regex, it's always advisable to use one of the gazillion regex editors, as these will help you build it. My personal favorite is regex101.
EDIT: added regex way to do it.
    import quopri
    import re

    def get_url_by_regex(raw):
        # decode the quoted-printable text before searching
        decoded = quopri.decodestring(raw).decode("utf-8")
        return re.search('(<a href=")(.*?)(")', decoded).group(2)

    def get_url(raw):
        decoded = quopri.decodestring(raw).decode("utf-8")
        for line in decoded.split('\n'):
            if 'token=' in line:
                # take what sits between <a href=" and the next quote
                return line.split('<a href="')[1].split('"')[0]
        return None  # just in case this is needed

    # raw_email holds the raw contents of the .eml file
    print(get_url(raw_email))
    print(get_url_by_regex(raw_email))
result is:
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
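As a possible alternative, the standard library's email package can parse the .eml and undo the quoted-printable encoding of the body for you. This is only a sketch under assumptions: the message below is invented (addresses, subject and URL are placeholders), and a real mail may be multipart or use a different charset.

```python
from email import message_from_string, policy

# Hypothetical minimal .eml content with a quoted-printable HTML body.
raw_eml = (
    "From: sender@example.com\n"
    "To: me@example.com\n"
    "Subject: Confirm\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<a href=3D"https://example.com/optIn?token=3Dabc=\n'
    'def">Confirm</a>\n'
)

# policy.default gives an EmailMessage, which offers get_body()/get_content().
msg = message_from_string(raw_eml, policy=policy.default)
body = msg.get_body(preferencelist=("html", "plain"))

# get_content() decodes the transfer encoding and the charset.
html = body.get_content()

# Same split-based extraction as in get_url() above.
url = html.split('<a href="')[1].split('"')[0]
print(url)
```

The upside is that headers and body are separated before decoding, so only the actual message text is searched.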