E-discovery — Taking Predictive Coding Out of the Black Box
Joseph H. Looby
Senior Managing Director
FTI TECHNOLOGY
IN CASES OF COMMERCIAL LITIGATION, the process of discovery can
place a huge burden on defendants. Often, they have to review millions
or even tens of millions of documents to find those that are relevant to a
case. These costs can spiral into millions of dollars quickly.
Manual reviews are time consuming and very expensive. And the computerized alternative, keyword search, is too imprecise a tool for extracting relevant documents from a huge pool. As the number of documents to be reviewed for a typical lawsuit increases, the legal profession needs a better way.
Predictive discovery is that approach. Predictive discovery is a machine-based method that can rapidly review a large pool of documents to determine which are relevant to a lawsuit. It takes attorneys' judgments about the relevance of a sample set of documents and builds a model that accurately extrapolates that expertise to the entire population of potential documents. Predictive discovery is faster, cheaper and more accurate than traditional discovery approaches, and this year courts in the United States have begun to endorse the use of this method. But according to a survey FTI Consulting recently conducted, a significant obstacle stands in the way of widespread adoption: most attorneys do not understand how predictive discovery works and are apprehensive about using it.

In this article, we explain how predictive discovery functions and why attorneys should familiarize themselves with it.
Companies defending themselves against litigation frequently must produce documents that may be relevant to a case by searching through archives. The same is true for companies being investigated by a regulatory body. The process of plumbing one's own archives to produce relevant documents is known as discovery. When the documents are in electronic form, the process is called e-discovery. Large companies today often must review millions or even tens of millions of letters, e-mails, records, manuscripts, transcripts and other documents for this purpose.

The discovery process once was entirely manual. In the 1980s, the typical commercial lawsuit might have entailed searching through tens of thousands of documents. A team of attorneys would pore through all the paper, with perhaps one or two subject matter experts to define the search parameters and oversee the procedure. A larger number of more junior lawyers frequently would do the bulk of the work. This process (known in the trade as linear human or manual review) would cost a few thousand dollars (today, it runs about $1 per document).

But now, the large number of documents to be reviewed for a single case can make manual searches a hugely expensive and time-consuming solution. For a case with 10 million documents, the bill could run about $10 million, a staggering sum. Conducting keyword searches of online documents, of course, is much faster. But in certain circumstances, such searches have been shown to produce as little as 20 percent of the relevant documents, along with many that are irrelevant.

With the number of documents in large cases soaring (we currently have a case involving 50 million documents), costs are getting out of hand. Yet the price of failing to produce relevant documents can be extraordinarily high in big-stakes litigation. The legal profession needs a new approach that improves reliability and minimizes costs.
There now is a third approach. Predictive discovery takes expert judgment from a sample set of documents relevant (or responsive) to a case and uses computer modeling to extrapolate that judgment to the remainder of the pool. In a typical discovery process using predictive coding, an expert reviews a sample of the full set of documents, say 10,000-20,000 of them, and ranks them as responsive (R) or non-responsive (NR) to the case. A computer model then builds a set of rules that reflects the attorney's judgment on the sample set. Working on the sample set and comparing its verdict on each document with the expert's, the software continually improves its model until the results meet the standards agreed to by the parties and sometimes a court. Then the software applies its rules to the entire population of documents to identify the full responsive set.

Studies show that predictive coding usually finds a greater proportion of the responsive documents than other approaches, typically in the range of 75 percent (a measure known as recall). A seminal 1985 study found keyword searches yielded average recall rates of about 20 percent.(1) A 2011 study showed recall rates for linear human review to be around 60 percent.(2) And, according to these same studies, the proportion of relevant documents in the pool of documents identified by predictive coding (a measure known as precision) is better, too.

The cost of predictive discovery is a fraction of that for linear manual review or keyword search, especially for large document sets, because most of the expense is incurred to establish the upfront search rules. Afterwards, the cost increases only slightly as the size of the collection grows.

In a recent engagement, we were asked to review 5.4 million documents. Using predictive discovery, we easily met the deadline and identified 480,000 relevant documents, around 9 percent of the total, for about $1 million less than linear manual review or keyword search would have cost.

Definition: Trial and Error
The process of experimenting with various methods of doing something until one finds the most successful.
So how does predictive discovery work? As suggested above, there are five key steps:

1. Experts code a representative sample of the document set as responsive or non-responsive.
2. Predictive coding software reviews the sample and assigns weights to features of the documents.
3. The software refines the weights by testing them against every document in the sample set.
4. The effectiveness of the model is measured statistically to assess whether it is producing acceptable results.
5. The software is run on all the documents to identify the responsive set.
1. David C. Blair (University of Michigan) and M.E. Maron (University of California at Berkeley), "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System," 1985.
2. Maura R. Grossman and Gordon V. Cormack, "Technology-Assisted Review in E-discovery Can Be More Effective and More Efficient than Exhaustive Manual Review," Richmond Journal of Law and Technology, Vol. XVII, Issue 3, 2011.
Let’s look at these steps in more detail.
1. Experts code a representative sample of the document set as responsive or non-responsive.

[Figure: A document collection of 1,000,000 items is split into a reference set: a 17,000-document sample training set and a 983,000-document prediction set. Attorney expert review of the sample finds 3,900 responsive documents (23%) and 13,100 non-responsive. Estimate of goal: we are 99% confident that 23% (+/- 1%) of the 1,000,000 documents are responsive.]

In a typical case, a plaintiff might ask a company to produce the universe of documents regarding some specifics about Project X. The company prepares a list of its employees involved in the project. From all the places where teams of employees working on Project X stored documents, the firm secures every one of the potentially relevant e-mails and documents. These can easily total 1 million or more items.
These documents are broken into two subsets: the sample or training set and the prediction set (i.e., the remainder). To have a high level of confidence in the sample, the training set must be around 17,000 documents. One or more expert attorneys codes the training set's documents as R or NR with respect to the litigation.

The results of this exercise will predict with reasonable accuracy the total number of Rs in the full set. A statistically valid random sample of 17,000 documents generally is sufficient to ensure that our answer will be accurate to +/- 1 percent at the 99 percent confidence level. For example, if the number of R documents in the sample set is 3,900 (23 percent of 17,000), it is 99 percent certain that between 22 and 24 percent of the 1 million documents are responsive.
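That estimate is the standard confidence interval for a sample proportion, which is easy to verify. Here is a minimal sketch of the arithmetic in Python (an illustration of ours; the article does not disclose the exact statistical method used):

    import math

    def margin_of_error(p_hat, n, z=2.576):
        """Normal-approximation margin of error for a sample proportion;
        z = 2.576 is the critical value for a 99 percent confidence level."""
        return z * math.sqrt(p_hat * (1 - p_hat) / n)

    p_hat = 3900 / 17000                     # 23 percent of the sample coded R
    moe = margin_of_error(p_hat, 17000)
    print(f"{p_hat:.0%} +/- {moe:.1%}")      # 23% +/- 0.8%, i.e. about one point

The sample size itself can be derived the same way: before any coding, assuming the worst-case proportion of 0.5 gives n = (2.576^2 x 0.25) / 0.01^2, or roughly 16,600, which rounds to the article's figure of about 17,000.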
2. Predictive coding software reviews the sample and assigns weights to features of the documents.

For the sample set of documents, the software creates a list of all the features of every document in the sample set. Features are single words or strings of consecutive words, for example, up to three words long. The sentence "Skilling's abrupt departure will raise suspicions of accounting improprieties and valuation issues ..." yields 33 features, as shown below. The total number of features in a set of 1 million documents is huge, but even large numbers of features easily can be processed today with relatively inexpensive technology.

Each feature is given a weight. If the feature tends to indicate a document is responsive, the weight will be positive. If the feature tends to indicate a document is non-responsive, the weight will be negative. At the start, all the weights are zero.

[Figure: Processing. (1) Extract machine-readable text. Example: an e-mail from Sherron Watkins to Kenneth Lay, sent August 2001: "Skilling's abrupt departure will raise suspicions of accounting improprieties and valuation issues..." (2) Make features: break the stream of text into tokens, remove whitespace and punctuation, normalize to lowercase, and convert the tokens into a feature set.]

Example feature set (words, phrases and metadata):

Unigrams (features 1-12): skillings, abrupt, departure, will, raise, suspicions, of, accounting, improprieties, and, valuation, issues
Bigrams (features 13-23): skillings abrupt, abrupt departure, departure will, will raise, raise suspicions, suspicions of, of accounting, accounting improprieties, improprieties and, and valuation, valuation issues
Trigrams (features 24-33): skillings abrupt departure, abrupt departure will, departure will raise, will raise suspicions, raise suspicions of, suspicions of accounting, of accounting improprieties, accounting improprieties and, improprieties and valuation, and valuation issues
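The tokenizing pass in the figure is straightforward to reproduce. Below is a minimal sketch in Python (the helper name make_features is ours, not FTI's) that turns the example sentence into the same 33 unigram, bigram and trigram features:

    import re

    def make_features(text, max_n=3):
        """Break a stream of text into tokens (drop punctuation, lowercase),
        then emit every run of 1 to max_n consecutive tokens as a feature."""
        cleaned = re.sub(r"[^\w\s]", "", text.lower())   # "Skilling's" -> "skillings"
        tokens = cleaned.split()
        return [" ".join(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)]

    sentence = ("Skilling's abrupt departure will raise suspicions of "
                "accounting improprieties and valuation issues")
    features = make_features(sentence)
    print(len(features))    # 33
    print(features[:3])     # ['skillings', 'abrupt', 'departure']

A production system would also draw features from metadata (sender, recipient, dates), as the figure's "words, phrases and metadata" label suggests.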
3. The software refines the weights by testing them against every document in the sample set.

The next step is one of trial and error. The model picks a document from the training set and tries to guess if the document is responsive. To do this, the software sums the importance weights for the words in the document to calculate a score. If the score is positive (+), the software guesses responsive, or R; this is the trial.

If the model guesses wrong (i.e., the model and attorney disagree), the software adjusts the importance weights of the words recorded in the model's weight table. This, of course, is the error. Gradually, through trial and error, the importance weights become better indicators of responsiveness. Pretty soon, the computer will have created a weight table with scores of importance and unimportance for the words and phrases in the training set, an excerpt of which might look like this:

[Weight table (excerpt). Feature: weight. will raise: -0.6; suspicions: +0.7; of: -...; improprieties and valuation: +1.8]

As the software reviews more documents, the weight table becomes an increasingly better reflection of the attorney expert's judgment. This process is called machine learning. The computer can take several passes through the sample set until the weight table is as good as it can be based on the sample.

Once the weight table is ready, the program is run on the full sample to generate a score for every document according to the final weight table. The machine score for a document is the sum of the weights of all its features. For example, a document with a score of -5.3 probably is non-responsive, and a document with a score of +8.7 probably is responsive.

Then, if predictive coding is to be used to select the highly responsive documents from the collection of 1 million and to discard the highly non-responsive documents, a line has to be drawn. The line is the minimum score for documents that will be considered responsive. Everything with a higher score (i.e., above the line), subject to review, will comprise the responsive set.

In our 17,000-document sample, of which the expert found 3,900 to be R, there might be 7,650 documents that scored more than zero. Of these, 3,519 are documents the expert coded R, and 4,131 are not. Therefore, with the line drawn at 0, we will achieve a recall of 90 percent (3,519 divided by 3,900) and a precision of 46 percent (3,519 divided by 7,650).
[Figure: Attorney expert review vs. computer model, with the line at >0.

Attorney expert review: responsive 3,900; non-responsive 13,100; total 17,000.
Computer model (at >0): 7,650 documents above the line; true positives 3,519 (recall 90%); false positives 4,131 (precision 46%).]
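The article does not name the learning algorithm, but the guess-and-correct loop it describes matches a simple perceptron. Here is a minimal sketch under that assumption, reusing the make_features() helper above; the labels are +1 for R and -1 for NR, and the learning rate is our own illustrative choice:

    def train_weights(training_set, passes=10, learning_rate=0.1):
        """Learn a weight table by trial and error over expert-coded documents.
        training_set is a list of (features, label) pairs, label = +1 (R) or -1 (NR)."""
        weights = {}                              # at the start, all weights are zero
        for _ in range(passes):                   # several passes through the sample
            for features, label in training_set:
                score = sum(weights.get(f, 0.0) for f in features)
                guess = 1 if score > 0 else -1    # the trial
                if guess != label:                # the error: model and expert disagree
                    for f in features:            # nudge each feature toward the expert's call
                        weights[f] = weights.get(f, 0.0) + learning_rate * label
        return weights

    def machine_score(weights, features):
        """A document's machine score is the sum of the weights of all its features."""
        return sum(weights.get(f, 0.0) for f in features)

Documents the expert coded R push their features' weights up, and NR documents push them down, which is how a term such as "suspicions" can end up at +0.7 while "will raise" lands at -0.6.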
4. The effectiveness of the model is measured statistically to assess whether it is producing acceptable results.

By drawing the line higher, one can increase precision at the expense of recall. For example, if the line is drawn at +0.3, the precision may be significantly higher (say 70 percent) but with lower recall (i.e., 80 percent, because some of the Rs scoring between 0 and +0.3 are missed). On the basis of the sample set, the software can determine recall and precision for any line. Typically, there is a trade-off; as precision goes up (and the program delivers fewer NR documents), recall declines (and more of the R documents are left behind). However, not only are recall and precision scores higher than for other techniques, the trade-off can be managed much more explicitly and precisely.

[Figure: Attorney expert review vs. computer model, with the line at >0.3.

Attorney expert review: responsive 3,900; non-responsive 13,100; total 17,000.
Computer model (at >0.3): 4,450 documents above the line; true positives 3,125 (recall 80%); false positives 1,325 (precision 70%).]
Recall measures how well a process retrieves relevant documents. For example, 80% recall means a process is returning 80% of the estimated responsive documents in the collection:

Recall = TP / (TP + FN) = 3,125 / (3,125 + 775) = 80%

Precision measures how well a process retrieves only relevant documents. For example, 70% precision means for every 70 responsive documents found, 30 non-responsive documents also were returned:

Precision = TP / (TP + FP) = 3,125 / (3,125 + 1,325) = 70%

(TP = true positives; FN = false negatives; FP = false positives.)
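Both measures can be read off the scored sample at any choice of line. A short sketch (our own illustration, plugging in the article's counts at the +0.3 line):

    def recall_and_precision(scored_sample, line):
        """scored_sample is a list of (machine_score, expert_coded_r) pairs;
        line is the minimum score for a document to count as responsive."""
        tp = sum(1 for score, is_r in scored_sample if score > line and is_r)
        fp = sum(1 for score, is_r in scored_sample if score > line and not is_r)
        fn = sum(1 for score, is_r in scored_sample if score <= line and is_r)
        return tp / (tp + fn), tp / (tp + fp)

    # At the +0.3 line: TP = 3,125, FP = 1,325, FN = 775
    # recall = 3,125 / (3,125 + 775) = 80%
    # precision = 3,125 / (3,125 + 1,325) = 70%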