Extract Reply Chains from Emails

What is a reply chain?

Say you only want to grab the original message from an email like this.

Great talking with you. Let's catchup soon.

Thanks,
Mark Anderson
VP of Engineering
888-222-4444

On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:

> Let's talk at 11. 
> Thanks
> Paul Johnson

And you only want this…

Great talking with you. Let's catchup soon.

How do you do that? We’ll cover various programming solutions below.

Why is this hard?

There are a lot of issues that need to be solved when writing your own parser:

  • Signature identification
  • Various formats for headers
    • On Fri, Nov 19th…
    • On 10/9/2018
    • Headers that wrap across lines
    • From:, To:, Date: style headers
  • Reply chains indicated by > or multiple >>>
  • Some lines look like signatures but aren’t
  • Corrupted email headers
  • Plain text emails often split reply headers across lines
  • Multi-language support if required
  • Header formats change over time

Because of this, we suggest not writing your own reply chain or signature parsing algorithm. It is non-trivial.
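To make the difficulty concrete, here is a minimal sketch of the naive approach in Python. It is not how SigParser or GitHub's parser works. The regular expression and the function name naive_strip_reply are our own illustrative assumptions. It handles the sample email above but leaves the signature in place and fails as soon as the reply header wraps onto a second line.

```python
import re

# Matches one common single-line reply header, e.g.
# "On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:"
# (illustrative pattern only; real headers vary far more than this)
REPLY_HEADER = re.compile(r"^On\s.+\swrote:\s*$")


def naive_strip_reply(body: str) -> str:
    """Return the text above the first reply header or quoted line.

    Deliberately simplistic: no wrapped headers, no From:/To:/Date:
    blocks, no other languages, no HTML, no signature removal.
    """
    kept = []
    for line in body.splitlines():
        if REPLY_HEADER.match(line) or line.lstrip().startswith(">"):
            break
        kept.append(line)
    return "\n".join(kept).strip()


email = """Great talking with you. Let's catchup soon.

Thanks,
Mark Anderson
VP of Engineering
888-222-4444

On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:

> Let's talk at 11.
> Thanks
> Paul Johnson
"""

# Same kind of header, but wrapped across two lines by the mail client.
wrapped = (
    "Sounds good.\n\n"
    "On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson\n"
    "<paul@example.com> wrote:\n\n"
    "> Let's talk at 11.\n"
)

top = naive_strip_reply(email)            # reply removed, signature still there
leaky = naive_strip_reply(wrapped)        # wrapped header leaks into the result
print(top)
print("---")
print(leaky)
```

The first result still contains the signature block, and the second still contains the wrapped header text. Each item in the list above needs this kind of special handling.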

Solutions

There are really only two solutions for splitting email chains. The first is SigParser’s email parsing API. The other is GitHub’s email reply parser, or a port of it, which strips the replies off an email chain.

Feature                           | SigParser Email API                                         | GitHub’s Email Parser
----------------------------------|-------------------------------------------------------------|--------------------------------------
Extract First Email               | Yes                                                         | Yes
Plain Text Emails                 | Yes                                                         | Yes
HTML Emails                       | Yes                                                         | No
Extract Child Emails              | Yes                                                         | No
Scrape Contacts (Phones, Titles…) | Yes                                                         | No
Languages                         | English, German, Spanish                                    | English
Usage method                      | Stateless REST API                                          | Install code in your app
Updates                           | Updated automatically as email clients change their headers | Abandoned (last commit was July 2016)

SigParser

SigParser is a stateless API for parsing emails. POST an email body to our REST endpoint (https://api.sigparser.com/api/Email) and it converts that email body into a structured format with lots of information about the email. For example, the response has an “emails” array holding the emails embedded in the single email. It also has a property called “cleanedemailbody”. You can try the API without any coding on our try page.

A request looks like this:

{
    "from_address":"jsmith@example.com", 
    "from_name":"John Smith",
    "plainbody":"Here is the email body."
}

For parsing email reply chains, the response has these fields. The spec page lists other fields that are not related to this guide, such as phone numbers and titles of contacts.

{
  ...
  "emails": [
    {
      "from_EmailAddress": "string",
      "from_Name": "string",
      "textBody": "string",
      "htmlLines": [
        "string"
      ],
      "date": "2018-11-09T21:41:36.078Z",
      "didParseCorrectly": true,
      "to": [
        {
          "name": "string",
          "emailAddress": "string"
        }
      ],
      "cc": [
        {
          "name": "string",
          "emailAddress": "string"
        }
      ]
    }
  ],
  "cleanedemailbody": "string",
  "cleanedemailbody_ishtml": true,
  "cleanedemailbody_plain": "string"
}
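As a rough sketch of how these fields might be consumed, the Python snippet below walks a decoded response and collects the reply chain. The field names come from the response shown above. The helper name summarize_response and the sample values are hypothetical. Reading “cleanedemailbody_plain” as the plain-text version of the top message is our assumption, so check it against the spec page.

```python
from typing import Any, Dict, List


def summarize_response(parsed: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the reply-chain fields from a decoded response body.

    Entries whose didParseCorrectly flag is false or missing are skipped.
    """
    chain: List[Dict[str, Any]] = [
        {
            "from": item.get("from_EmailAddress"),
            "name": item.get("from_Name"),
            "text": item.get("textBody"),
        }
        for item in parsed.get("emails") or []
        if item.get("didParseCorrectly")
    ]
    return {
        # Assumption: this field holds the plain-text top message.
        "top_message": parsed.get("cleanedemailbody_plain"),
        "chain": chain,
    }


# Hypothetical sample shaped like the response above (values are made up).
sample = {
    "emails": [
        {
            "from_EmailAddress": "mark@example.com",
            "from_Name": "Mark Anderson",
            "textBody": "Great talking with you. Let's catchup soon.",
            "didParseCorrectly": True,
        },
        {
            "from_EmailAddress": "paul@example.com",
            "from_Name": "Paul Johnson",
            "textBody": "Let's talk at 11.",
            "didParseCorrectly": False,
        },
    ],
    "cleanedemailbody": "Great talking with you. Let's catchup soon.",
    "cleanedemailbody_ishtml": False,
    "cleanedemailbody_plain": "Great talking with you. Let's catchup soon.",
}

summary = summarize_response(sample)
print(summary["top_message"])
for entry in summary["chain"]:
    print(entry["name"], "<" + entry["from"] + ">")
```

Skipping entries that did not parse correctly is a design choice for this sketch. An application that needs the full chain may prefer to keep them and flag them instead.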

SigParser is a REST API so it can really work with any language.

SigParser API called with Curl

curl -i -H "Content-Type: application/json" -H "x-api-key: <ApiKey>" -X POST -d '{"from_address":"jsmith@example.com", "from_name":"John Smith","plainbody":"Here is the email body."}' 'https://api.sigparser.com/api/Email'

SigParser API called from C# and .NET

Use our NuGet library to easily connect to our REST API. Also read our guide on how to pull email from Google/Gmail email accounts.

var client = new SigParser.Client(ApiKey);
var result = client.Parse(new SigParser.EmailParseRequest
{
    // Verbatim string: the email text stays flush left so no extra indentation is sent.
    plainbody = @"
Hi John,

Lets get coffee tomorrow.

Thanks
Steve Johnson
888-333-3323 Mobile
San Diego, CA
",
    from_name = "Steve Johnson",
    from_address = "sjohnson@example.com"
}).Result; // .Result blocks until the call completes

SigParser API called with Python

import requests
import json

url = "https://api.sigparser.com/api/Email"

payload = json.dumps({
    "from_address": "jsmith@example.com",
    "from_name": "John Smith",
    "plainbody": "This is an email.",
    "htmlbody": None,  # Python uses None; json.dumps serializes it as null
})
headers = {
    'content-type': "application/json",
    'x-api-key': "212121212121212",
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)
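If the requests package is not available, the same call can be sketched with only the Python standard library. The endpoint, header names, and payload fields are the ones used in this guide. The helper names build_request and parse_email, the 10 second timeout, and the placeholder key are our own assumptions. Treat it as a starting point rather than an official client.

```python
import json
import urllib.request

API_URL = "https://api.sigparser.com/api/Email"


def build_request(api_key: str, from_address: str, from_name: str,
                  plainbody: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one email body."""
    payload = {
        "from_address": from_address,
        "from_name": from_name,
        "plainbody": plainbody,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def parse_email(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    so a bad API key surfaces as an exception instead of a silent failure.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# "YOUR_API_KEY" is a placeholder, not a working key.
req = build_request("YOUR_API_KEY", "jsmith@example.com", "John Smith",
                    "This is an email.")
print(req.get_method(), req.full_url)

# To actually call the API (requires a valid key and network access):
# result = parse_email(req)
# print(result.get("cleanedemailbody"))
```

Separating request construction from sending keeps the network call in one place, which makes the payload easy to inspect or unit test without contacting the API.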

SigParser API called with JavaScript

var request = require("request");

var options = { 
    method: 'POST',
    url: 'https://api.sigparser.com/api/Email',
    headers: 
    {   'cache-control': 'no-cache',
        'x-api-key': '232323232323',
        'content-type': 'application/json' },
    body: { from_address: 'jsmith@example.com', from_name: 'John Smith', plainbody: "This is the body of the email." },
    json: true 
};

request(options, function (error, response, body) {
    if (error) throw new Error(error);
    console.log(body);
});

SigParser Codeless

Use SigParser’s APIs without writing any code. Tools such as Integromat or Zapier make this easy. We have some tutorial videos explaining how to get started. We suggest Integromat because it makes working with collections easiest.

Watch a video on how to connect Gmail, SigParser and Google Sheets with Integromat

GitHub’s Email Reply Parser (Ruby)

GitHub’s github/email_reply_parser finds the root email from plain text email bodies. This is what GitHub uses to display comments created from email replies. It seems to have been abandoned or rarely updated even though there are a number of outstanding issues with it.

Issues:

https://github.com/github/email_reply_parser

Install:

gem install email_reply_parser

To Use:

parsed_body = EmailReplyParser.parse_reply(email_body)

The ports below have many of the same issues as the original GitHub email reply parser project but are okay if reliability isn’t a big deal.

(Ported) Email Reply Parser for .NET

A port of GitHub’s email reply parser for .NET (not Core). No HTML support and the same limitations as the original project. We don’t know who the maintainer of this project is.

(Ported) Email Reply Parser for Python

Port for Python from the team at Zapier. Does the same thing as GitHub’s but can also provide the reply chain excluding the root email contents if you need that.

No HTML support and most of the same limitations as the original project.

pip install email_reply_parser

Tutorial

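read(email) returns a parsed message object that exposes the individual fragments and the reply text, while parse_reply(email) returns only the reply as a string of text.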

from email_reply_parser import EmailReplyParser

...

EmailReplyParser.read(email_message)

OR

EmailReplyParser.parse_reply(email_message)

Conclusion

If you can use a stateless REST API like SigParser, that will be your best option for reliable email parsing. But if you can’t call external services, then using something like “Email Reply Parser” from GitHub will be your next best option.

Get Your API Key

Try the SigParser API. Sign up and get an API key with 1,500 free emails per month. Upgrade or downgrade at any time. Our API is entirely serverless and stateless.