Extract Reply Chains from Emails

Say you only want to grab the newest message from an email like this, without the signature or the quoted reply chain.

Great talking with you. Let's catchup soon.

Thanks,
Mark Anderson
VP of Engineering
888-222-4444

On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:

> Let's talk at 11. 
> Thanks
> Paul Johnson

And you only want this…

Great talking with you. Let's catchup soon.

How do you do that? We’ll cover various programming solutions below.

Why is this hard?

Several issues make simple solutions difficult:

  • Signature identification
  • Various formats for headers
    • On Fri, Nov 19th…
    • On 10/9/2018
    • Headers that wrap across lines
    • From:, To:, Date: style headers
  • Headers appear in many human languages, and each one needs its own handling
  • Header formats change over time
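
To make these issues concrete, here is a minimal sketch of the kind of simple solution people usually try first. The function name naive_top_reply and the regular expression are ours, for illustration only, and are not taken from any of the libraries below.

```python
import re

# Matches only a single-line, English "On <date>, <name> wrote:" header.
REPLY_HEADER = re.compile(r"^On\s.+\swrote:\s*$")


def naive_top_reply(body: str) -> str:
    """Return the text above the first reply header or quoted line."""
    kept = []
    for line in body.splitlines():
        # Stop at the first recognized header or the first "> " quoted line.
        if REPLY_HEADER.match(line) or line.startswith(">"):
            break
        kept.append(line)
    return "\n".join(kept).strip()
```

On the sample email above this drops the quoted chain but keeps the signature block. If the "On … wrote:" header wraps onto a second line, neither line matches the pattern, so the header leaks into the result as well. Headers in other languages or in From:/To:/Date: style are missed entirely.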

We’ll start with SigParser…

SigParser

SigParser is a stateless API for parsing emails. POST an email body to our REST endpoint (https://api.sigparser.com/api/Email) and it converts that email body into a structured format with lots of information about the email. For example, the response has an “emails” array with an entry for each email embedded in the single email. It also has a property called “cleanedemailbody”.

A request looks like this:

{
    "from_address":"jsmith@example.com", 
    "from_name":"John Smith",
    "plainbody":"Here is the email body."
}
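
As a rough sketch, the request could be sent from Python with only the standard library. The endpoint and the three body fields come from the example above. The x-api-key header name and the helper names build_request and parse_email are assumptions on our part, so check the SigParser API documentation for how to authenticate.

```python
import json
import urllib.request

SIGPARSER_URL = "https://api.sigparser.com/api/Email"


def build_request(from_address, from_name, plain_body, api_key):
    """Build (but do not send) the POST request for one email body."""
    payload = {
        "from_address": from_address,
        "from_name": from_name,
        "plainbody": plain_body,
    }
    return urllib.request.Request(
        SIGPARSER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed header name, not confirmed here
        },
        method="POST",
    )


def parse_email(request):
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it easy to inspect or log the payload before anything is sent.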

And the response has this structure…

{
  ...
  "emails": [
    {
      "from_EmailAddress": "string",
      "from_Name": "string",
      "textBody": "string",
      "htmlLines": [
        "string"
      ],
      "date": "2018-11-09T21:41:36.078Z",
      "didParseCorrectly": true,
      "to": [
        {
          "name": "string",
          "emailAddress": "string"
        }
      ],
      "cc": [
        {
          "name": "string",
          "emailAddress": "string"
        }
      ]
    }
  ],
  "cleanedemailbody": "string",
  "cleanedemailbody_ishtml": true,
  "cleanedemailbody_plain": "string"
}
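
Once the response is decoded into a Python dict, pulling out the useful parts is straightforward. This is a sketch that relies only on the field names shown above. We are assuming “cleanedemailbody” holds the message with the reply chain removed and that “cleanedemailbody_ishtml” says whether that value is HTML. The helper name summarize_response is hypothetical.

```python
def summarize_response(response: dict) -> dict:
    """Pull out the cleaned body and the sender of each embedded email."""
    senders = [
        (email.get("from_Name"), email.get("from_EmailAddress"))
        for email in response.get("emails", [])
    ]
    return {
        "body": response.get("cleanedemailbody"),
        "body_is_html": response.get("cleanedemailbody_ishtml", False),
        "senders": senders,
    }
```

Using .get() throughout means a response that lacks one of these fields yields None or an empty list instead of raising a KeyError.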

SigParser also addresses many of the issues with the rest of the solutions in this list.

Stack Overflow Solutions

There are lots of answers on Stack Overflow about how to address this, but all of them are fairly hacky.

https://stackoverflow.com/questions/28182745/how-can-i-separate-an-email-from-its-previous-answer

GitHub’s Email Reply Parser (Ruby)

email_reply_parser finds the root email from plain text email bodies. This is what GitHub uses to display comments created from email replies.

Issues:

https://github.com/github/email_reply_parser

Install:

gem install email_reply_parser

To Use:

require 'email_reply_parser'
parsed_body = EmailReplyParser.parse_reply(email_body)

(Ported) Email Reply Parser for .NET

https://github.com/EricJWHuang/EmailReplyParser

A port of GitHub’s email reply parser but for .NET.

(Ported) Email Reply Parser for Python

https://github.com/zapier/email-reply-parser

From the team at Zapier. Does the same thing as GitHub’s but can also provide the reply chain excluding the root email contents if you need that.

pip install email_reply_parser

Tutorial

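parse_reply(email) returns the reply as a string of text. read(email) returns a parsed message object instead, whose reply property holds that same text and whose fragments let you inspect the rest of the email.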

from email_reply_parser import EmailReplyParser

...

EmailReplyParser.read(email_message)

OR

EmailReplyParser.parse_reply(email_message)
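
If you also need the quoted reply chain, one option is to walk the fragments of the object that read() returns. This is a minimal sketch, assuming the Zapier port exposes reply and fragments on that object and content and quoted on each fragment. Verify this against the version you install. The helper names quoted_text and split_reply are ours.

```python
def quoted_text(fragments):
    """Join the content of every fragment flagged as quoted text."""
    return "\n".join(f.content for f in fragments if f.quoted)


def split_reply(email_body):
    """Return (newest reply, quoted reply chain) for a plain-text body."""
    # Third-party import kept local so quoted_text works without the package.
    from email_reply_parser import EmailReplyParser

    message = EmailReplyParser.read(email_body)
    return message.reply, quoted_text(message.fragments)
```

Calling split_reply(email_message) would then give you both halves of the email in one pass.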

Conclusion

If you’re comfortable using a stateless REST API like SigParser, that will be your best option for reliable email parsing. But if you can’t call external services, then using something like “Email Reply Parser” from GitHub will be your next best option.