Near-Duplicate Detection Finds Documents No One Thought Could Be Found

Expert consultants inexpensively apply analytics to find documents that traditional review would likely have missed, saving the client $134,955.

Legal teams often come across important documents from third-party sources or other matters and need a content comparison to identify similar or duplicate documents within their client’s ESI. In many cases they assume that a comparison across datasets cannot be performed or that, if analytics offers a possible solution, it will be too expensive to implement.

So…what should legal teams do in this situation? Apply near-duplicate detection workflows.

What is Near-Duplicate Detection?

Near-duplicate detection is an advanced analytics technology that identifies near-duplicate (or duplicate) documents based purely on textual content and then groups those documents together for review according to similarity. Near-duplicate detection technology is ideal for cases of all sizes because it is powerful, yet inexpensive to implement.
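
To make the idea of scoring documents by textual similarity concrete, here is a minimal Python sketch. It assumes one common approach, overlapping word sequences ("shingles") compared with Jaccard similarity. The article does not say which algorithm D4 or the review platform used, so the function names and the shingle size k=3 are illustrative choices only.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Similarity in [0, 1] between two texts, based only on their words."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog"
    revised = "The quick brown fox jumps over the sleepy dog"
    print(round(similarity(original, revised), 2))
```

Because the score depends only on words, a clean native file and an image-derived copy of the same document can still score as similar, provided the text behind the image is of reasonable quality.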


Here is a case study where D4 used near-duplicate detection to compare key production documents gathered across similar matters against source ESI, in order to find the original, electronic versions of documents needed for review and production on a new matter.


D4’s client had approximately 690 sample documents that had been produced in other matters and were identified as relevant to a new matter. Many of these documents were TIFF images or PDFs bearing existing production Bates numbers, and our client wanted to find original, “clean” electronic versions in their client’s source ESI for production purposes. Our client had 33 GB of source ESI to compare the sample documents against and had two requirements for the project:

  1. Compare the text in the 690 sample documents against the extracted text from the 33 GB of source ESI, and locate the original, “clean” native files for review and production.
  2. If the original native files for the 690 sample documents were identified, D4 was to compare the extracted text from those native files against the entire 33 GB dataset to determine whether any additional versions of those documents existed. If additional versions were identified, those new documents would be included in review and production.
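
The two requirements above can be sketched as a two-step comparison. This is a rough illustration only, not D4's actual workflow: it uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure, and the function names, document IDs, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] between two texts, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def find_originals(samples, source, threshold=0.8):
    """Step 1: map each sample ID to its best-matching source ID, if similar enough."""
    matches = {}
    for sample_id, sample_text in samples.items():
        best_id, best_score = None, 0.0
        for source_id, source_text in source.items():
            score = text_similarity(sample_text, source_text)
            if score > best_score:
                best_id, best_score = source_id, score
        if best_id is not None and best_score >= threshold:
            matches[sample_id] = best_id
    return matches


def find_additional_versions(original_ids, source, threshold=0.8):
    """Step 2: find other source documents similar to the located originals."""
    originals = set(original_ids)
    found = set()
    for original_id in originals:
        for source_id, source_text in source.items():
            if source_id in originals:
                continue
            if text_similarity(source[original_id], source_text) >= threshold:
                found.add(source_id)
    return found


if __name__ == "__main__":
    # Hypothetical data: one produced sample (with a stamp in its text) and three source files.
    samples = {"PROD-001": "Quarterly sales report for the northeast region PROD-001"}
    source = {
        "doc-a": "Quarterly sales report for the northeast region",
        "doc-b": "Quarterly sales report for the northeast region draft",
        "doc-c": "Lunch menu for the company picnic on Friday",
    }
    originals = find_originals(samples, source)
    print(originals)
    print(find_additional_versions(originals.values(), source))
```

This brute-force version compares every pair of documents, which is fine for a handful of files; at the scale described here, a review platform would rely on indexing rather than pairwise comparison.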


Our client didn’t think there would be any way to perform comparisons across the datasets and had prepared for a traditional, linear review. Once near-duplicate technology and workflows were discussed, our client gave approval to move forward. The next challenge was validating the quality of the text for the 690 sample production documents, which were in Bates-numbered TIFF and PDF image format. Unfortunately, these sample documents were the only versions available to our client, so if the text was missing or of low quality, there was a chance we wouldn’t be able to move forward with near-duplicate detection.


After determining that most of the existing production text couldn’t be used, D4 ran OCR on a portion of the sample documents to improve the quality of the text, which gave us enough content for near-duplicate detection. We then had all text and metadata extracted from the 33 GB of client ESI, which yielded 190,000 documents. The text and metadata for both datasets were loaded into a litigation review platform, and near-duplicate detection was run across the entire document population. Using a few near-duplicate comparison workflows, D4 compared the 690 sample documents against the entire 33 GB of ESI (190,000 documents) and located native files for 644 of the sample documents. After we discussed the results with the client, a single reviewer was able to quickly search the metadata of the 190,000 documents and find native files for the 46 remaining sample documents.

Working with the extracted text of the 690 native files we had located, we expanded our near-duplicate comparisons across the entire 33 GB (190,000 documents) and identified an additional 840 documents with similar, relevant content that the client included in their production. In the end, both project requirements were completed successfully, using nothing more than text, metadata and near-duplicate analytics.


The entire project, from consulting through near-duplicate analysis, review and production, cost our client $7,545, which included the cost of analytics. A linear review of all 190,000 documents (33 GB) at $0.75 per document would have cost our client $142,500. By using near-duplicate detection analytics and working only with extracted text and metadata, our client saved $134,955 and found documents they most likely would not have found through traditional review.
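
The savings figure follows directly from the numbers above. As a quick check, the arithmetic can be reproduced in a few lines of Python (variable names are illustrative; the rate is held in cents to avoid floating-point rounding):

```python
documents = 190_000            # documents extracted from the 33 GB of source ESI
linear_rate_cents = 75         # $0.75 per document for linear review
actual_cost = 7_545            # total project cost in dollars, including analytics

linear_cost = documents * linear_rate_cents // 100  # dollars
savings = linear_cost - actual_cost

print(f"Linear review: ${linear_cost:,}")  # Linear review: $142,500
print(f"Savings:       ${savings:,}")      # Savings:       $134,955
```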

With data volumes and the percentage of non-relevant documents rapidly increasing, it has never been more important for legal teams to analyze source ESI prior to processing and review, to keep eDiscovery costs from spiraling out of control. I have said it once, but I will say it again: analytics is much more than predictive coding and assisted review, and this case study shows that if you work with consultants who understand the different analytics technologies and workflows, implementing some form of analytics will more than pay for itself.

It is important to note that near-duplicate detection does not remove documents; it only groups similar documents together. Therefore, while it is an effective strategy for litigation support, it should not be used to dedupe document collections.
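
The difference between grouping and deduplication can be illustrated with a small Python sketch. It is a simplified assumption of how grouping might work, not any particular platform's implementation: each document joins the first group whose lead document it resembles, and every document remains in the output. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """True when two texts are at least `threshold` similar, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio() >= threshold


def group_near_duplicates(docs, threshold=0.8):
    """Assign every document to a group of similar documents; nothing is dropped."""
    groups = []  # each group is a list of document IDs; the first ID is the lead document
    for doc_id, text in docs.items():
        for group in groups:
            if is_similar(docs[group[0]], text, threshold):
                group.append(doc_id)
                break
        else:
            # No existing group is similar enough, so this document starts a new one.
            groups.append([doc_id])
    return groups


if __name__ == "__main__":
    docs = {
        "doc-a": "Quarterly sales report for the northeast region",
        "doc-b": "Quarterly sales report for the northeast region draft",
        "doc-c": "Lunch menu for the company picnic on Friday",
    }
    groups = group_near_duplicates(docs)
    print(groups)
    # Every input document still appears exactly once across the groups.
    print(sum(len(g) for g in groups) == len(docs))
```

A reviewer can then look at each group side by side and decide which versions matter, which is a judgment that automatic removal would take away.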
