Invited talk at TRUST Women’s Institute for Summer Enrichment (WISE), Cornell, NY, Jun 16, 2014. Infrastructure support for text-mining research on a big data repository like HathiTrust raises challenges in access and security when the bulk of the repository is protected by copyright.
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
1. Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts
Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014
Beth Plale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
HathiTrust Research Center
2. • Who are the Players? HathiTrust, Google, Authors Guild
• The Object of Attention: 11 M books from university libraries
• Rulings around copyright
• HTRC, or why I care
• Is security of HTRC Data Capsule good enough?
4. Books Digitization Project (2007)
Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …
[Diagram labels: digitize; digitized books]
5. Legal action: Google / Authors Guild
[Diagram label: digitized books]
Mar 2011: New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues
Nov 2013: same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law
6. Legal action
• June 2014: 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use
[Diagram label: digitized books]
7. Highlights 2014 ruling
• With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".
June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
8. Highlights 2014 ruling
• The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.
• Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.
June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
9. Does Authors Guild Represent All Authors?
• The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the Hathi Trust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.
• The Authors Alliance: new organization representing authors who are primarily concerned with being read.
Court finds full-book scanning is fair use | Cory Doctorow at 3:00 pm Sat, Jun 14, 2014
10. Highlight 2014 Ruling
• Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.
June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
11. • Who are the Players? HathiTrust, Google, Authors Guild
• The Object of Attention: 11 M books from university libraries
• Rulings around copyright
• HTRC, or why I care
• Is security of HTRC Data Capsule good enough?
12. HTRC, or why I care: HathiTrust digital library is “big data”; and text mining is the new library catalog search
13. Similar model, different ends
$$
HTRC goes beyond “full text searchable database”
Scholarly search
Scholarly mining
14. HathiTrust
#HTRC @HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from
16. Content of HathiTrust
• Books and journals
– Plus pilots around images, audio, born-digital
• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
19. Motivation for HTRC
→ HathiTrust repository is massive in scale: a latent goldmine for text-based research
→ The restricted nature of parts of HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
20. HathiTrust Research Center
• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
• HTRC Executive Committee
– Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
– J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
– Robert McDonald, Indiana University Libraries
– Beth Sandore Namachchivaya, University of Illinois Library
– John Unsworth, CIO, Dean of Library, Brandeis University
21. HTRC system
[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
23. Text mining at scale: quick tutorial on topic modeling of texts
24. Topic Modeling
• Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
• Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
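To make the idea of "extracting topics" concrete, here is a minimal sketch of one common topic-modeling technique, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The slides do not say which algorithm or library HTRC used, so the choice of LDA, the function names `fit_lda` and `top_words`, the hyperparameters, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch, not HTRC code).

    docs is a list of token lists. Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: P(topic) is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic size + beta * V).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words currently assigned to each topic."""
    return [[w for w, c in counts.most_common() if c > 0][:n] for counts in topic_word]


# Toy documents, invented for this sketch.
docs = [
    "letters household house letters house".split(),
    "illustrations volume literature volume".split(),
    "house letters household".split(),
    "literature illustrations volume".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

The `doc_topic` table is the kind of "underlying data" the slide refers to: for each document (or volume, or page) it records how many tokens fall under each topic, which is richer than a plain word count.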
25. Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
• More strongly focused on book (illustrations, volume, literature)
• More strongly focused on author himself (letters, household, house)
27. Digging into philosophy of science
Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
Colin Allen, IU
28. The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply topic modeling on 86-volume subset
• Using IPython Notebook
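The selection steps above can be sketched in a few lines of Python. This is only an illustration of the workflow (keyword search, then pruning uninteresting volumes before topic modeling). The volume records, the IDs, the `select_volumes` and `drop_uninteresting` helpers, and the title-based pruning rule are all hypothetical; the slides do not describe how the 86-volume subset was actually chosen.

```python
KEYWORDS = ["darwin", "romanes", "anthropomorphism", "comparative psychology"]


def select_volumes(volumes, keywords):
    """Return IDs of volumes whose text mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        vid
        for vid, record in volumes.items()
        if any(k in record["text"].lower() for k in lowered)
    ]


def drop_uninteresting(volumes, selected, stop_phrases):
    """Remove volumes whose title contains a phrase marking them as noise."""
    return [
        vid
        for vid in selected
        if not any(p in volumes[vid]["title"].lower() for p in stop_phrases)
    ]


# Invented toy records standing in for HathiTrust volumes.
volumes = {
    "vol-a": {"title": "Essay on animal minds",
              "text": "Darwin and Romanes on anthropomorphism."},
    "vol-b": {"title": "College course catalog",
              "text": "Courses offered in comparative psychology."},
    "vol-c": {"title": "Railway timetable",
              "text": "Departures and arrivals."},
}

selected = select_volumes(volumes, KEYWORDS)
subset = drop_uninteresting(volumes, selected, stop_phrases=["course catalog"])
print(selected)
print(subset)
```

Keyword search alone pulls in the course catalog because it mentions comparative psychology; the second pass is where a researcher's judgment about what counts as "uninteresting" enters the workflow.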
33. Copyright: A Reality
Full text download is limited by both size and by copyright
34. HTRC solution to fully-flexible text mining research on entire HT digital repository: HTRC Data Capsule
Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan
35. Questions driving HTRC Data Capsule
• Non-consumptive use: can framework provide safe handling of large amounts of protected data?
• Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?
• Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?
36. HTRC Data Capsules
• Trusts text mining researcher to not deliberately leak repository data
• Prevents malware acting on user’s behalf from leaking data.
• V1.0 limits analysis to running within single VM
37. HTRC Data Capsule Architectural Components
[Diagram labels: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Registry Services, worksets]
38. HTRC Data Capsule interaction
[Diagram labels: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Registry Services, worksets]
Numbered callouts 1) to 4) in the diagram (numbering not recoverable from the extraction):
• Researcher requests new VM of type X
• Image instance is created
• Researcher installs tools onto VM through window on her desktop.
• Upon run, Secure Capsule controls I/O behind the scenes
• Final location of results is registry
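One way to picture "controls I/O behind the scenes" is as a policy that ties what a VM may reach to the phase it is in. The sketch below is a toy model, not the HTRC implementation: the two mode names ("maintenance" for installing tools, "secure" for running against protected texts), the `Capsule` class, and its methods are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


class PolicyError(Exception):
    """Raised when an operation is not allowed in the capsule's current mode."""


@dataclass
class Capsule:
    # Hypothetical modes: "maintenance" (install tools) and "secure" (analyze texts).
    mode: str = "maintenance"
    registry: list = field(default_factory=list)

    def switch(self, mode):
        if mode not in ("maintenance", "secure"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def open_network(self, host):
        # Outbound network is the channel malware would use to leak data,
        # so this model blocks it whenever protected texts are readable.
        if self.mode == "secure":
            raise PolicyError("network access is blocked in secure mode")
        return f"connected to {host}"

    def read_corpus(self, volume_id):
        # Protected texts are readable only while the network is closed.
        if self.mode != "secure":
            raise PolicyError("corpus access requires secure mode")
        return f"<text of {volume_id}>"

    def release_results(self, summary):
        # Results leave the capsule only by way of the registry.
        self.registry.append(summary)
        return len(self.registry)


capsule = Capsule()
print(capsule.open_network("example.org"))   # installing tools: allowed
capsule.switch("secure")
print(capsule.read_corpus("vol-a"))          # analysis: allowed
capsule.release_results("topic counts for vol-a")
```

The design choice illustrated here is that no single state permits both reading protected text and talking to the outside world, which matches the slide's threat model: the researcher is trusted, but software running on her behalf is not.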
43. HTRC goes beyond “full text searchable database”. Security has to be top concern.
[Diagram label: scholarly research]