Parse multipart/related emails

I’m trying to parse emails and convert tables within them into pandas dataframes. Since some of the emails are multipart, I took some code from this answer.

The following code works for most emails, but it fails on multipart/related emails (no tables are found).

import email
import imaplib

import pandas as pd
from bs4 import BeautifulSoup

HOST = 'imap.gmail.com'
m = imaplib.IMAP4_SSL(HOST, 993)
m.login(USERNAME, PASSWORD)  # credentials defined elsewhere
m.select('Inbox')

result, data = m.uid('search', None, "UNSEEN", '(FROM "xxx@xxx.xxx")')
print(result)
if result == 'OK':
    for num in data[0].split():
        result, msg_data = m.uid('fetch', num, '(RFC822)')
        if result == 'OK':
            email_message = email.message_from_bytes(msg_data[0][1])
            b = email_message
            body = ""

            print(b.is_multipart())
            if b.is_multipart():
                for part in b.walk():
                    ctype = part.get_content_type()
                    cdispo = str(part.get('Content-Disposition'))

                    # skip any text/plain (txt) attachments
                    if ctype == 'text/plain' and 'attachment' not in cdispo:
                        body = part.get_payload(decode=True)  # decode
                        break
            else:
                body = b.get_payload(decode=True)
            soup = BeautifulSoup(body, 'html.parser')
            table = soup.find_all('table')
            df = pd.read_html(str(table))[0]
            display(df)  # notebook display

Here’s the header of one of the multipart/related emails:

Delivered-To: xxxxxx@gmail.com
Received: by 2002:a05:6a10:cc86:0:0:0:0 with SMTP id gj6csp6140432pxb;
        Mon, 27 Dec 2021 14:52:14 -0800 (PST)
X-Google-Smtp-Source: ABdhPJxPtKdKdVFNfgIE5xJdGrqDvekcD9MVkXdJaQyjJcVjc63N0KmOSN1LKvqLDbzssUU+6xjG
X-Received: by 2002:a05:620a:1132:: with SMTP id p18mr13912209qkk.778.1640645534051;
        Mon, 27 Dec 2021 14:52:14 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1640645534; cv=none;
        d=google.com; s=arc-20160816;
        b=JUwqNu9ZFFy3j5ke7GddEIhpUGSdzB0gby+k5PFr3AwQv+/JtDY6p9ksOhReeFkQpd
         2rNOhn9HknPnVpu1s+S9BT+YIrKWo8jrCzqJRWkaiY7MN80BGjw+oSkoD+WTNoo9rk7t
         ojil3vIatY02Unl5FfYlOUxZbFZ7Xb3xT44Zd9lRI7aQNrLZxSjeQAF/oL+N8eE0rMXo
         T5McU5R165sEb81twUpHrSkbp34/v31W25kOwx68Mb7hkuOTv/komZiQy1oiP+xzUKDH
         CxKOgF/UgzVD5mhyB6DSSEN22DQ4ybrmshmd+B5wugSVlY9hfw0t89kJQGChKUphk9GH
         /VWw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=feedback-id:mime-version:date:message-id:subject:to:from
         :dkim-signature:dkim-signature;
        bh=iqw+mlksCZlkG8lxD5rVcYUL5uh/jJYU8nLc+GpCr/4=;
        b=qnu0Xb2/dj8zwtelmnry7/okDbUj4QpsNPtWtovwrbtlDIpnSS8HRq4qzVzUy6TDFE
         flm0XO489XNMO/GJ8Jw0J5Duujhnto3PiBRrAtIcA4CXkKhRe3SpXYk7D+PjROg+Zngk
         5lqA9RgxerLMq+wMRD4WlcZVuWmmUtBhY/T9XbXOXUlJJJa9qn6AlKNOp5ZV8CDxweTp
         yCDuQpJSCrbp1mldDe3N6lQAUXfaoGIBu6Kv7hpdZHwdrNMIeuhyCHTI4JF1IV0lK+G0
         DzJg76RxnRQ3q0eacW9X/hzbMLZeljxfUO18BeDzRp45i3XqVyVsC53TirpmYv7OcB50
         MaWA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@xxxxxx.com header.s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s header.b=BATglTQY;
       dkim=pass header.i=@amazonses.com header.s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug header.b=YLjw7lGE;
       spf=pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) smtp.mailfrom=0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com;
       dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=xxxxxx.com
Return-Path: <0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com>
Received: from a11-40.smtp-out.amazonses.com (a11-40.smtp-out.amazonses.com. [54.240.11.40])
        by mx.google.com with ESMTPS id g19si4414275qtm.154.2021.12.27.14.52.13
        for <xxxxxx@gmail.com>
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Mon, 27 Dec 2021 14:52:14 -0800 (PST)
Received-SPF: pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) client-ip=54.240.11.40;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@xxxxxx.com header.s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s header.b=BATglTQY;
       dkim=pass header.i=@amazonses.com header.s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug header.b=YLjw7lGE;
       spf=pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) smtp.mailfrom=0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com;
       dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=xxxxxx.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s; d=xxxxxx.com; t=1640645533; h=Content-Type:From:To:Subject:Message-ID:Date:MIME-Version; bh=y7l8Len/FG0elemUfgWg28W0SEj5eOJRIMBt9xIFrQo=; b=BATglTQY6PkcRChCgrX9BMdkZVwppc3CCPZ2QliEN6VGtr4YxW7l0C1n3mMgeRCL 0fXjKZwX3enRf9cHfKFJQErkxlmUfyKkLbtKJ4xNd78r4D04aCgUBRgovY05e2lE2vq KZEiJhF7oUN+QyxE87GahoQ88S/7cVjVVIh0RSHQ=
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug; d=amazonses.com; t=1640645533; h=Content-Type:From:To:Subject:Message-ID:Date:MIME-Version:Feedback-ID; bh=y7l8Len/FG0elemUfgWg28W0SEj5eOJRIMBt9xIFrQo=; b=YLjw7lGEYZH+SQ4mx1EEdMVAo2v0EzbKGyGHmzH1CkvlnMv9yjMn4x3/BYhpOTxm yZ532qDZBGIIUPkCjoKOAz6K6a11xzPBREIl8Bz0O0kJyEcoShGahRbY4bgNCkOocx8 IJD+NREMTfVK6wlsxzoWRS+HAnVfg1pU80yORo7M=
Content-Type: multipart/related; type="text/html"; boundary="--_NmP-f890ebfb5c0d8a34-Part_1"
From: xxxxxx <noreply@xxxxxx.com>
To: xxxxxx@gmail.com
Subject: Watchlist Summary for Mon, December 27, 2021 (Futures)
Message-ID: <0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@email.amazonses.com>
Date: Mon, 27 Dec 2021 22:52:13 +0000
MIME-Version: 1.0
Feedback-ID: 1.us-east-1.xy6STr9N8VtfY9IEmltVU/dtudHWlVMH37XgJn5/ROY=:AmazonSES
X-SES-Outgoing: 2021.12.27-54.240.11.40

----_NmP-f890ebfb5c0d8a34-Part_1
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html><html lang=3D"en"><head><meta charset=3D"UTF-8"><meta http-e=
quiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8"><meta http-equ=
iv=3D"X-UA-Compatible" content=3D"IE=3Dedge"><meta name=3D"viewport" conten=
t=3D"width=3Ddevice-width, initial-scale=3D1.0"><!-- So that mobile webkit =
will display zoomed in--><meta name=3D"format-detection" content=3D"telepho=
ne=3Dno"><!-- disable auto telephone linking in iOS--><title></title><style=
 type=3D"text/css">}
.ad p {
  margin-top: 4px;
}
</style><style type=3D"text/css">#data-table,
.data-table { max-width:100%; min-width:100%; width:100%; border-collapse:c=
ollapse; }
#data-table th,
#data-table td,
.data-table th,
.data-table td { color:#000000; border-collapse:collapse; padding:4px; whit=
e-space:nowrap; border:1px solid #D8D8D8; }
#data-table .body tr:nth-of-type(odd),
.data-table .body tr:nth-of-type(odd) { background-color:#f3f3f3; }
#data-table table tbody .spacer td,
.data-table table tbody .spacer td { border:none; }

.preHeaderHide { display:none !important; mso-hide:all !important; }
/* Outlook link fix */
#outlook a { padding:0; }
/* Resets: see reset.css for details */
.ReadMsgBody { width:100%; background-color:#ebebeb; }
/* Hotmail background and line height fixes */
.ExternalClass { width:100%; background-color:#ebebeb; }
.ExternalClass, .ExternalClass p, .ExternalClass span, .ExternalClass font,=
 .ExternalClass td, .ExternalClass div { line-height:100%; }

Any ideas? Thanks

Answer

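You want to parse the text/html parts.

Your loop only accepts parts whose content type is 'text/plain', but the only part shown in this message is text/html (see the Content-Type: text/html line right after the boundary). Nothing matches, body stays empty, and no tables are found. You should check for content type == 'text/html' instead.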
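
Something along these lines should work. This is only a sketch using the standard library: extract_html is a name I made up, and the sample message is built by hand to mimic the multipart/related structure in the question, it is not the real email.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_html(msg):
    """Return the first non-attachment text/html part as a str, or None.

    extract_html is a hypothetical helper, not part of the email package.
    """
    # walk() yields the message itself and then every nested part,
    # so the same loop also handles non-multipart messages.
    for part in msg.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))
        if ctype == 'text/html' and 'attachment' not in cdispo:
            # decode=True undoes base64 / quoted-printable and returns bytes
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'utf-8'
            return payload.decode(charset, errors='replace')
    return None


# Hand-built stand-in for a multipart/related email with a single HTML part
sample = MIMEMultipart('related', type='text/html')
sample.attach(MIMEText(
    '<html><body><table><tr><td>ES</td></tr></table></body></html>',
    'html', 'utf-8'))
raw = sample.as_bytes()

html = extract_html(message_from_bytes(raw))
print('<table>' in html)
```

In the code from the question, the returned string would take the place of body before it is handed to BeautifulSoup and pd.read_html. If a message has no text/html part the function returns None, so check for that before parsing.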

User contributions licensed under: CC BY-SA