I’m trying to extract multiple email addresses from a string. I’m using this regex:
re.findall(r'[\w.-]+@[\w.-]+(?:\.[\w]+)+', text)
It works fine, but sometimes the text contains user names with the same domain grouped in curly braces:
{annie,bonnie}@gmail.com
How can I parse this properly and extract the addresses as separate emails?
annie@gmail.com, bonnie@gmail.com
I’ve tried modifying the regex to account for the braces and the comma, followed by a simple function, but in that case I get a lot of garbage from the string.
Any help is appreciated.
Answer
You can use
(?:{([^{}]*)}|\b\w[\w.-]*)(@[\w.-]+\.\w+)
See the regex demo. Details:
- `(?:{([^{}]*)}|\b\w[\w.-]*)` - a non-capturing group matching:
  - `{([^{}]*)}` - a `{`, then Group 1 capturing any zero or more chars other than `{` and `}`, and then a `}`
  - `|` - or
  - `\b\w[\w.-]*` - a word boundary (it will make matching more efficient), a word char, and then zero or more word, dot or hyphen chars
- `(@[\w.-]+\.\w+)` - Group 2: a `@`, one or more word, dot or hyphen chars, then a `.` and one or more word chars.
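To make the group layout concrete, here is a small sketch that prints the whole match, Group 1 and Group 2 for each hit. The sample string and the `carol@example.org` address are made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer; the sample text below is an illustrative assumption.
rx_email = re.compile(r'(?:{([^{}]*)}|\b\w[\w.-]*)(@[\w.-]+\.\w+)')

sample = "{annie,bonnie}@gmail.com and carol@example.org"

# For each match: (whole match, Group 1, Group 2).
# Group 1 is None when the plain-name branch matched instead of the {...} branch.
captured = [(m.group(), m.group(1), m.group(2)) for m in rx_email.finditer(sample)]
for row in captured:
    print(row)
# ('{annie,bonnie}@gmail.com', 'annie,bonnie', '@gmail.com')
# ('carol@example.org', None, '@example.org')
```

Note that Group 1 is only populated for the brace form, which is what the branching on `m.group(1)` relies on.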
See a Python demo:
```python
import re

text = "Emails like {annie,bonnie}@gmail.com, annie2@gmail.com, then a bonnie2@gmail.com."
emails = []
rx_email = re.compile(r'(?:{([^{}]*)}|\b\w[\w.-]*)(@[\w.-]+\.\w+)')

for m in rx_email.finditer(text):
    if m.group(1):
        # Brace form: combine each comma-separated name with the shared domain
        for email in m.group(1).split(','):
            emails.append(f'{email}{m.group(2)}')
    else:
        # Plain form: the whole match is already a complete address
        emails.append(m.group())

print(emails)
# => ['annie@gmail.com', 'bonnie@gmail.com', 'annie2@gmail.com', 'bonnie2@gmail.com']
```
The logic is:
- Get the emails with `{...}` in front of `@` while capturing the contents inside the braces into Group 1 and the `@...` part into Group 2
- Check if Group 1 was matched, and if yes, split the contents with a comma and build the resulting matches by concatenating the comma-separated user names with the domain part
- If Group 1 did not match, just append the match value to the resulting list.
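If the input may contain spaces after the commas or stray empty entries inside the braces, the same steps can be wrapped in a helper that trims and filters the names. This is only a sketch: the function name `extract_emails` and the sample text are assumptions, not part of the original answer.

```python
import re

RX_EMAIL = re.compile(r'(?:{([^{}]*)}|\b\w[\w.-]*)(@[\w.-]+\.\w+)')

def extract_emails(text):
    """Return all addresses in text, expanding {a,b}@domain groups."""
    emails = []
    for m in RX_EMAIL.finditer(text):
        names, domain = m.group(1), m.group(2)
        if names is not None:
            # Brace form: split on commas, trim whitespace, skip empty entries
            for name in names.split(','):
                name = name.strip()
                if name:
                    emails.append(name + domain)
        else:
            # Plain form: the whole match is the address
            emails.append(m.group())
    return emails

result = extract_emails("Write to {annie, bonnie,}@gmail.com or carol@example.org.")
print(result)
# ['annie@gmail.com', 'bonnie@gmail.com', 'carol@example.org']
```

Testing `names is not None` rather than truthiness means an empty `{}` group is skipped instead of being appended as a literal `{}@domain` string; whether that is the desired behavior depends on the data.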