This document discusses strategies for managing large volumes of data from varied sources. It recommends identifying the data that must be retained while securely disposing of the rest, and argues that technology-assisted review can be more effective and efficient than exhaustive manual review. Search techniques (exact, Boolean, conceptual, algorithmic, and seed-set-based) are examined in terms of their advantages and limitations for finding responsive documents within data sets. Finally, the use of sampling and statistics to estimate the number of responsive documents and to support the defensibility of the review process is covered.
2. Myriad Sources
• Employee sources (internal)
• Enterprise data sources (external, managed)
• Cloud sources (external): Gmail, Google Docs
3. The End Game? To Retain What’s Needed
• Know what you need to keep
• Employ the right resources to find it
– The right tools
– The right expertise
– Deployed effectively against diverse sources
• Securely dispose of the rest
4. “Overall, the myth that exhaustive manual review is the most effective – and therefore, the most defensible – approach to document review is strongly refuted. Technology-assisted review can (and does) yield more accurate results than exhaustive manual review, with much lower effort.”

Search found “superior to manual reviews” (Richmond Journal of Law and Technology, 2011)

Maura R. Grossman & Gordon V. Cormack, “Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review,” XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf, p. 48.
5. Search Results Vary

[Scatter plot: precision (y-axis, 0.0–1.0) vs. recall (x-axis, 0.0–1.0) for NIST TREC Legal Track Interactive Task results, 2008–2010, with reference points for keyword search (Blair & Maron, 1985) and manual review (Grossman & Cormack, 2011).]

(Sponsored by the National Institute of Standards and Technology; TREC Legal Track, http://trec-legal.umiacs.umd.edu)
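The precision and recall axes of the plot are the standard retrieval metrics. A minimal Python sketch of the two definitions, using illustrative counts that are not taken from the TREC results:

```python
# Precision and recall from a confusion-matrix view of a review.
# The counts below are invented for illustration only.
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of retrieved documents that are actually responsive."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of all responsive documents that were retrieved."""
    return true_pos / (true_pos + false_neg)

# A search that retrieves 400 documents, 300 of them responsive,
# out of 1,000 responsive documents in the collection:
print(precision(300, 100))  # 0.75
print(recall(300, 700))     # 0.3
```

A search can score high on one axis and low on the other, which is exactly the spread the TREC plot shows.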
6. Search Is Run on an Index

Token        | Locations
-------------|---------------------------
action       | 3:1; 24:10; 45:112
all          | 3:5; 4; 23
accountants  | 2:2; 41:33
business     | 2:3; 4:56
conferences  | 3:12; 7:1; 88:5; 95:1
date         | 1:1; 4:1; 5:3; 8:13
dec          | 1:3; 155:9

The same search query can return different results depending on the tool:
• Google
• Exact search
• Algorithmic search
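An inverted index like the token-to-locations table above can be sketched in a few lines of Python. The sample documents and the (doc, position) convention here are illustrative assumptions:

```python
from collections import defaultdict

# Minimal inverted index: maps each token to (doc_id, position) pairs,
# mirroring the "Token -> Locations" table on the slide.
def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, pos))
    return dict(index)

docs = {
    1: "date of action",
    2: "accountants business date",
}
index = build_index(docs)
print(index["date"])  # [(1, 1), (2, 3)]
```

A query engine then consults this table instead of scanning raw text, which is why different tools with different indexing choices (tokenization, stemming, stop words) return different results for the same query.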
7. Exact Search (Boolean, Rule-Based, Modeling Linguistic Patterns)

Target documents: common cold

Query and document terms: virus, cough!, fever, congest!, loss w/3 appetite, allergies, sneez!, smoking, flu, computers, traffic, malaise, sore throat, runny nose

Properties of exact search:
o known
o adjustable
o over-inclusive – anchor
o under-inclusive – add
8. Exact Search (Rule-Based, Modeling Linguistic Patterns)

Sample query (TreC09_204_ST_Retention_Deletion, BM):

enron #w5 [data, documents, e{ }mail{s}, record{s}, evidence{s}, info{rmation}, cop{y, ies}, file{s}] #w10 [shred{s, ded, dding}, destroy{s, ed, ing}]

o known
o adjustable
o over-inclusive – anchor
o under-inclusive – add
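A hedged sketch of how a proximity-with-wildcard clause like the one above might be evaluated. The trailing `!` wildcard and the `within` helper are simplifications of my own, not any vendor's actual query language:

```python
# Toy "within N words" (w/N) proximity test over a tokenized document,
# with trailing-! wildcard terms, in the spirit of the slide's query.
def positions(tokens: list[str], term: str) -> list[int]:
    """Positions of tokens matching term; a trailing '!' is a wildcard."""
    if term.endswith("!"):
        stem = term[:-1]
        return [i for i, t in enumerate(tokens) if t.startswith(stem)]
    return [i for i, t in enumerate(tokens) if t == term]

def within(tokens: list[str], a: str, b: str, n: int) -> bool:
    """True if some match of a and some match of b are within n words."""
    pa, pb = positions(tokens, a), positions(tokens, b)
    return any(abs(i - j) <= n for i in pa for j in pb)

tokens = "please shred the enron email records before friday".split()
print(within(tokens, "enron", "shred!", 5))    # True
print(within(tokens, "enron", "destroy!", 5))  # False
```

The "adjustable" property on the slide corresponds directly to editing the term lists and window sizes in such a rule.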
9. Concept Search: Thesaurus Addition

Target documents: common cold

Query terms: virus, cough!, fever, chills, congest!, loss w/3 appetite, sneez!, flu

Thesaurus terms added behind the scenes:
• fever: heat, hotness, torridness, delirium, ecstasy, excitement, febrile, disease, ferment, fervor, fire, flush, frenzy, intensity
• virus: germ, microorganism, bacterium, bug, microbe, bacillus, ailment, disease, illness, infection, pathogen, sickness, venom

o unknown
o embedded
o not adjustable
o over-inclusive
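The thesaurus addition described above can be sketched as a simple term-substitution step. The `THESAURUS` table is a toy built from the slide's synonym lists, not a real product's thesaurus:

```python
# Concept search as query expansion: each query term is replaced by
# itself plus embedded synonyms the user never sees or controls.
THESAURUS = {
    "fever": ["heat", "delirium", "febrile", "flush", "frenzy"],
    "virus": ["germ", "microbe", "bacterium", "pathogen", "bug"],
}

def expand(query_terms: list[str]) -> list[str]:
    expanded = []
    for term in query_terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand(["fever", "cough"]))
# ['fever', 'heat', 'delirium', 'febrile', 'flush', 'frenzy', 'cough']
```

This makes concrete why the slide flags concept search as unknown, embedded, not adjustable, and over-inclusive: the expansion happens inside the tool, and broad synonyms (delirium, frenzy) pull in off-target documents.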
10. Algorithmic Search

Computes a “total” for each document (Document 1 “total”, Document 2 “total”) and compares the totals, e.g., by the angles (α, β) between them.

o unknown
o embedded
o hard to adjust
o over-inclusive
o under-inclusive
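One common way to realize a document "total" is a term-frequency vector, compared by the cosine of the angle between vectors, as the α and β above suggest. A minimal sketch under that assumption; real tools use richer weightings such as TF-IDF:

```python
import math
from collections import Counter

# Document "total" as a term-frequency vector; documents are compared
# by cosine similarity (the angle between their vectors).
def vector(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(v1: Counter, v2: Counter) -> float:
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

d1 = vector("cough fever chills cough")
d2 = vector("fever chills sneezing")
d3 = vector("patent misuse counsel")
print(cosine(d1, d2) > cosine(d1, d3))  # True: d1 is closer to d2
```

Because the weighting and comparison live inside the algorithm, the user sees only the ranked result, which is why the slide marks this approach unknown, embedded, and hard to adjust.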
11. Algorithmic Search with “Seed Sets”

[Diagram: sample documents are hand-tagged responsive (R) or non-responsive (NR). On-topic documents contain terms such as cough, sneezed, malaise, congest, chill, virus, fever, runny, cold, dripping; off-topic documents contain terms such as computer, counsel, patent, misuse, cocaine, ice, crash, smoking, sleep. The R documents are combined into a responsive seed-set “total” and the NR documents into a non-responsive seed-set “total” (angles α, β), against which the remaining documents are compared.]

Seed set
o unknown
o embedded
o hard to adjust
o over-inclusive
o under-inclusive
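A nearest-centroid sketch of seed-set classification, assuming the same vector-and-cosine model as the previous slide. The seed texts are invented examples, and commercial tools use far more sophisticated models:

```python
import math
from collections import Counter

# Seed-set search: average reviewer-tagged responsive (R) and
# non-responsive (NR) seeds into two "totals", then classify new
# documents by which total they are closer to (cosine similarity).
def vec(text: str) -> Counter:
    return Counter(text.lower().split())

def cos(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(texts: list[str]) -> Counter:
    total = Counter()
    for t in texts:
        total += vec(t)
    return total

r_seed = centroid(["cough fever congestion", "runny nose sore throat fever"])
nr_seed = centroid(["patent misuse counsel", "computer crash traffic"])

def classify(text: str) -> str:
    v = vec(text)
    return "R" if cos(v, r_seed) >= cos(v, nr_seed) else "NR"

print(classify("fever and cough"))      # R
print(classify("patent counsel memo"))  # NR
```

Mistagged seeds (a cocaine document tagged R, say) shift the centroid, which is one way such a system becomes over- or under-inclusive without the user noticing.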
12. Statistics Supports Defensibility: Yield Estimate

Yield estimate: an estimate of the responsive documents in the data set.

Data set: 100,000 documents. A random 1,000-document sample contains 150 target documents.

150/1,000 = 15% target docs in the sample.
Hence an estimated 15% of 100,000 = 15,000 target docs in the data set.
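The yield arithmetic above, restated in Python, with a normal-approximation 95% confidence interval added as my own assumption (the slide gives only the point estimate):

```python
import math

# Yield estimate from a random sample, as on the slide.
population = 100_000
sample_size = 1_000
responsive_in_sample = 150

p = responsive_in_sample / sample_size   # 0.15 (15% of sample)
estimated_yield = round(p * population)  # scale up to the data set

# Normal-approximation 95% CI on the sample proportion (added here,
# not on the slide): p +/- 1.96 * sqrt(p(1-p)/n).
margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)

print(estimated_yield)  # 15000
print(f"{(p - margin) * 100:.1f}%..{(p + margin) * 100:.1f}%")
```

Reporting the interval, not just the point estimate, is what lets the sampling step support defensibility: the estimate of 15,000 responsive documents carries a quantified uncertainty.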
13. Statistics Supports Defensibility: Sample of Results

“Tagged” data: 10,000 documents. A 1,000-document sample contains 700 correctly tagged docs, so 70% of tagged docs are targets.
“Not tagged” data: 90,000 documents. A 1,000-document sample contains 90 missed target docs, so 9% of not-tagged docs are targets.

10,000 × 70% correct = 7,000 target docs tagged
90,000 × 9% missed = 8,100 target docs missed

Recall = 7,000 / 15,100 ≈ 46%. More target docs were missed than tagged.
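The recall arithmetic on this slide, as a short Python check:

```python
# Recall estimated from two samples, as on the slide: precision is
# measured on a sample of the tagged set, and the miss rate on a
# sample of the not-tagged set.
tagged, not_tagged = 10_000, 90_000

precision = 700 / 1_000  # 70% of the tagged sample correctly tagged
miss_rate = 90 / 1_000   # 9% of the not-tagged sample were targets

found = tagged * precision         # 7,000 target docs tagged
missed = not_tagged * miss_rate    # 8,100 target docs missed
recall = found / (found + missed)  # 7,000 / 15,100

print(round(recall, 2))  # 0.46
```

Sampling both the tagged and not-tagged populations is essential: measuring only the tagged set would show the 70% precision but hide the fact that more target documents were missed than found.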