What most people don't know about email forensics

My company was once suspected of manufacturing evidence. Our product was being used by the plaintiff in a civil litigation to identify responsive emails. Apparently, we found too many of them: many thousands more than the defense believed existed or, for that matter, wanted to acknowledge existed. The defense turned to two well-known competing vendors to validate our process, and neither came close to producing what we had found. So the defense moved to exclude the additional emails from evidence, arguing in essence that since our work could not be replicated, it could not be real. It didn't seem to matter that those two other vendors did not replicate each other's work either, just that their results were dramatically smaller than ours.

To a logical mind this must be confusing. After all, emails either exist in a collection or they don't. How can three different products processing the same collection of data find completely different sets of results?

This was not a case of the email collection being limited to a set of known custodian mailboxes. That technique is rife with problems that I won't go into in this posting. In this case, all three vendors had access to what was believed to be the complete set of email database backups for the time period in question (a belief that was later disproved).

The answer lies in an understanding of the nature of database systems.

Modern email systems (like Microsoft Exchange and IBM Notes) are based on database technology. Databases are sophisticated structures designed to efficiently create, modify and delete the data they contain. They do that by maintaining searchable indexes of the key attributes of all the data stored in them. In this case, the data being stored is the email plus its metadata and attachments.

Databases allocate space in fixed-size chunks called pages. Some of these pages are used for the index, and some are used to store the data. The pages that store data depend on those that store the index.

Without the index pages, emails stored in the data pages are generally inaccessible.

If the index pages get out of sync with the data pages, the data in the data pages can become inaccessible by normal means, and will therefore be missed by an eDiscovery collection process.
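To make that concrete, here is a minimal toy model in Python. It is not how Exchange or Notes actually lay out their data; the page structure, the `index` dictionary and the function names are all hypothetical, chosen only to show how a collection that walks the index can miss an email that is still sitting in a data page.

```python
from dataclasses import dataclass


@dataclass
class DataPage:
    page_no: int
    payload: str  # the email content held in this page (toy stand-in)


# Hypothetical toy database: four data pages, but the index references only three.
data_pages = [
    DataPage(0, "email A"),
    DataPage(1, "email B"),
    DataPage(2, "email C (index entry removed)"),
    DataPage(3, "email D"),
]
index = {"A": 0, "B": 1, "D": 3}  # message key -> page number


def collect_via_index(index, pages):
    """Return only the emails that the index can reach."""
    by_no = {p.page_no: p for p in pages}
    return [by_no[n].payload for n in index.values() if n in by_no]


def collect_via_page_scan(pages):
    """Return every email found by walking the data pages directly."""
    return [p.payload for p in pages]


via_index = collect_via_index(index, data_pages)
via_scan = collect_via_page_scan(data_pages)
missed = [e for e in via_scan if e not in via_index]
print("missed by index-based collection:", missed)
```

In this sketch the index-based collection returns three emails and the page scan returns four. The one in `missed` is still physically present; nothing simply points to it any more.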

How can this happen? With email, there are two common causes:

1. Emails deleted on a poorly administered or incompletely collected email system.

On a well-administered system, when a user permanently deletes an email, the email is moved into a save area akin to a recycle bin. It stays there until the next backup is made of the database, ensuring that a copy of every deleted email is preserved in that backup. Only then is the reference to that email permanently deleted from the index, but even then the pages storing the email will remain intact and forensically discoverable until they are eventually reallocated for some other purpose by the email system. On a poorly administered email system, these deleted emails will not be backed up and therefore may be missing from the collection. On the other hand, if the email system is well administered but the collection is still incomplete (for instance, because key backup sets are missing from it), then you risk missing the deleted emails that were preserved on those backup sets.

2. Indexes corrupted when improperly copying or collecting the email database.

The copy may have been made on an active, non-quiesced email server. The emails in question may not have been committed to the database at the time of the copy, and the transaction log containing the uncommitted emails may not be part of the collection, or the tool in use may not be able to process transaction logs. Or the backup of the email database may span more than one backup tape, with one or more of those tapes missing from the collection.

In these cases, there will be emails still contained in the data pages of the database that cannot be reached through the index pages.

In my experience, some of the above issues exist in just about every eDiscovery data collection. That means that if you use tools that rely on the integrity of the database index to process those collections, you are very likely missing emails. The litigation in question was a perfect storm combining all of the problems listed above. That was the reason the discrepancy was so large.

The solution is to use a forensic tool that examines every database page and reconstructs the emails it finds without requiring the database index to locate them. Here again, a knowledge of database architectures comes in handy. Enterprise email databases are designed with redundancies and data integrity safeguards built into them that allow them to find and reconnect pages that have accidentally been removed from the index. Using this knowledge, a forensic indexing tool can, by mere inspection of a page, tell whether it contains valid data, even if that data is inaccessible from the index.
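A rough sketch of that idea in Python follows. The page layout here (a `MAIL` signature, a CRC32 checksum and a length field in front of the payload) is invented for illustration and is not the on-disk format of any real email system, and `make_page` and `scan_pages` are hypothetical helpers. The point is only that a page can carry enough information to be validated on its own, with no index involved.

```python
import struct
import zlib

PAGE_SIZE = 64                   # hypothetical; real pages are several kilobytes
MAGIC = b"MAIL"                  # hypothetical page signature
HEADER = struct.Struct("<4sIH")  # signature, CRC32 of payload, payload length


def make_page(payload: bytes) -> bytes:
    """Build one fixed-size page in the toy layout (used only for the demo)."""
    header = HEADER.pack(MAGIC, zlib.crc32(payload), len(payload))
    return (header + payload).ljust(PAGE_SIZE, b"\x00")


def scan_pages(image: bytes):
    """Yield (offset, length, payload) for each page that validates by inspection."""
    for start in range(0, len(image) - PAGE_SIZE + 1, PAGE_SIZE):
        page = image[start:start + PAGE_SIZE]
        magic, crc, length = HEADER.unpack_from(page)
        if magic != MAGIC or length > PAGE_SIZE - HEADER.size:
            continue  # not a data page in this toy format
        payload = page[HEADER.size:HEADER.size + length]
        if zlib.crc32(payload) == crc:  # the integrity check passes
            yield start + HEADER.size, length, payload


# A toy database image: two valid pages separated by a page of unrelated bytes.
image = (
    make_page(b"Subject: Q3 forecast")
    + b"\xff" * PAGE_SIZE
    + make_page(b"Subject: RE: Q3 forecast")
)
found = list(scan_pages(image))
for offset, length, payload in found:
    print(offset, length, payload)
```

Note that the scan never consults an index. It also records the offset and length of each payload it recovers, which is what makes the result independently checkable later.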

The data found by non-forensic means was only the tip of a responsive iceberg.

That was the simple reason for the difference in results between the data we produced and the data produced by those two other vendors. Our tool did not rely on the database index to find emails. It forensically scanned each database page and found emails that had been deleted or that belonged to poorly made copies of the email database. The defense's vendors used tools that were limited to the data accessible via the database index.

Armed with that knowledge, we had the plaintiff ask the defense to give us a list of the emails they believed we had manufactured. With that information we could prove those emails did in fact exist. We could give them the offset and length of each piece of those emails so they could use any third-party binary editor to verify their existence.
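The check itself is simple enough to sketch in a few lines of Python. The helper names `extract_span` and `span_digest` and the demo file are hypothetical; this is an illustration of verifying bytes by offset and length, not the procedure we actually handed over.

```python
import hashlib
import os
import tempfile


def extract_span(path: str, offset: int, length: int) -> bytes:
    """Read exactly `length` bytes starting at `offset` from the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError("span extends past the end of the file")
    return data


def span_digest(path: str, offset: int, length: int) -> str:
    """SHA-256 of the span, so two parties can compare what they each read."""
    return hashlib.sha256(extract_span(path, offset, length)).hexdigest()


# Demo on a throwaway file standing in for a database backup.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 100 + b"Subject: hello" + b"\x00" * 50)
    demo_path = tmp.name

recovered = extract_span(demo_path, 100, 14)
digest = span_digest(demo_path, 100, 14)
os.remove(demo_path)
print(recovered, digest)
```

Anyone holding the same backup file can run the same read with the same offset and length and see the same bytes, with no need to trust the tool that found them.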

We never got that list. Upon hearing our response, the defense moved on to a different strategy.
