Bridging Digital Humanities Research and Big Data Repositories of Digital Text
1. Bridging Digital Humanities Research and Large Repositories of Digital Text
2nd Encuentro de Humanistas Digitales | 21.May.14
Biblioteca Vasconcelos, Mexico City
Beth Plale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
Tweet us - @HathiTrust #HTRC
HATHI TRUST RESEARCH CENTER
2. Setting Stage
• “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a given field.
• In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.
• I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher.
3. Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.
University of Illinois Library web site, 2014
4. Digital Humanities activities categorized
• Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
• Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
• Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
Interview with Brett Bobley, NEH, 2009 http://www.hastac.org/node/1934
5. Why does it matter?
“If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities
6. Bobley’s Prediction, cont.
In a world of big, massive scale, he asks:
• “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”
• “What if you are a historian and you now have access to every newspaper around the world?”
• “How might searching and mining that kind of dataset radically change your results?”
7. Goal of Talk
Introduce technical and architectural big data developments around HathiTrust and emerging examples of use, … to facilitate discussion of whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver advances for digital humanities.
8. HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from
10. Content of HathiTrust
• Books and journals
– Plus pilots around images, audio, born-digital
• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
13. Motivation for HTRC
→ HathiTrust repository is massive scale: a latent goldmine for text-based research
→ Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
14. HathiTrust Research Center
• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
• HTRC Executive Committee
– Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
– J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
– Robert McDonald, Indiana University Libraries
– Beth Sandore Namachchivaya, University of Illinois Library
– John Unsworth, CIO, Dean of Library, Brandeis University
15. HTRC system
[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
17. Return to categories of DH activity
HTRC in current form best at supporting:
• Access: by narrowing down to essential materials quickly, separating wheat from chaff
“big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”
• Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)
• Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
“The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”
Interview with Brett Bobley, NEH, 2009
19. EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
• Topic modeling
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
20. Topic Modeling
• Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
• Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
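As a rough illustration of the question "When did a new topic enter a research domain?", the sketch below assumes that topic-model output is already available as per-volume topic weights with a publication year. The data, the `first_year_above` helper, and the 0.05 threshold are all hypothetical; none of this is HTRC output or an HTRC API.

```python
from collections import defaultdict

def first_year_above(rows, topic, threshold=0.05):
    """Return the earliest year in which the mean weight of `topic`
    across that year's volumes reaches `threshold`, or None."""
    by_year = defaultdict(list)
    for year, weights in rows:
        # A volume with no weight recorded for the topic counts as 0.0.
        by_year[year].append(weights.get(topic, 0.0))
    for year in sorted(by_year):
        mean = sum(by_year[year]) / len(by_year[year])
        if mean >= threshold:
            return year
    return None

# Made-up per-volume topic weights: (publication year, {topic: weight}).
rows = [
    (1850, {"geology": 0.40, "evolution": 0.00}),
    (1855, {"geology": 0.35, "evolution": 0.02}),
    (1860, {"geology": 0.20, "evolution": 0.15}),
    (1860, {"geology": 0.30, "evolution": 0.05}),
    (1870, {"geology": 0.10, "evolution": 0.30}),
]

entry_year = first_year_above(rows, "evolution")
print(entry_year)
```

Averaging per year is only one possible definition of "entering" a domain; a real study would likely smooth over several years and weight by volume length.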
21. Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
More strongly focused on book (illustrations, volume, literature)
More strongly focused on author himself (letters, household, house)
23. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
24. Gender Identification of Text
• Question Investigated: Can we use author names in bibliographic records to identify gender?
• Looked at 2.6 million bibliographic records
– Extracted personal author data
– MARC 100 abcd and 700 abcd
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics, dates
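Because the author strings are not fielded, the pattern "last name, first name middle name, titles/honorifics, dates" has to be recovered from the string itself. The sketch below is one minimal way to do that, not the method used in the study; the `parse_author` helper, the tiny `TITLES` set, and the date pattern are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny list of titles; a real list would be longer.
TITLES = {"sir", "bart", "lady", "mrs", "miss", "baron", "count"}
# Life dates such as "1856-1924" or an open range such as "1856-".
DATES = re.compile(r"^\d{4}-(\d{4})?$")

def parse_author(s):
    """Split an author string into last name, given names, titles, dates."""
    parts = [p.strip() for p in s.split(",") if p.strip()]
    record = {"last": parts[0] if parts else "",
              "given": [], "titles": [], "dates": None}
    for part in parts[1:]:
        if DATES.match(part):
            record["dates"] = part
            continue
        for token in part.split():
            # Compare without the trailing period, so "bart." matches "bart".
            if token.rstrip(".").lower() in TITLES:
                record["titles"].append(token.rstrip("."))
            else:
                record["given"].append(token)
    return record

record = parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924")
print(record)
```

Strings that break the pattern (a missing comma after the surname, a misspelled surname) pass through this parser unrepaired, which is why variant forms of one author need separate handling.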
25. Authors vs Names
There is the author, then there are the names under which the author is published…
• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924
26. Sources of Data
• The Virtual International Authority File
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
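A lookup over these sources could be sketched as follows. The title sets reuse the examples listed on this slide; the first-name table, the `guess_gender` function, and the choice to check titles before names are assumptions for illustration, and the study's actual matching rules are not given here.

```python
# Titles taken from the examples on the slide.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Hypothetical, tiny table standing in for the much larger name lists
# (authority file, census and baby-name sites, patent-research lists).
FIRST_NAMES = {
    "algernon": "male",
    "charles": "male",
    "jane": "female",
    "charlotte": "female",
    "leslie": "ambiguous",
}

def guess_gender(given_names, titles):
    """Return 'male', 'female', 'ambiguous', or 'unknown'."""
    # Assumption: a gendered title, when present, decides the result.
    for title in titles:
        key = title.rstrip(".").lower()
        if key in MALE_TITLES:
            return "male"
        if key in FEMALE_TITLES:
            return "female"
    # Otherwise fall back to the first given name, if it is in the table.
    if given_names:
        key = given_names[0].rstrip(".").lower()
        return FIRST_NAMES.get(key, "unknown")
    return "unknown"

for given, titles in [(["A."], ["Sir"]), (["Jane"], []),
                      (["Leslie"], []), (["Zong"], [])]:
    print(given, titles, guess_gender(given, titles))
```

The four return values mirror the four result categories reported for the study: female, male, unknown, and ambiguous.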
27. Initial Gender Results
• Approximately 80% of name strings have initial gender identification
– Female: 59,365 (10%)
– Male: 425,994 (70%)
– Unknown: 114,204 (19%)
– Ambiguous: 5,965 (less than 1%)
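The rounded percentages can be checked directly from the counts. The short calculation below uses only the four counts on this slide; note that they sum to 605,528, slightly fewer than the 606,437 unique author strings reported earlier, and the slides do not say how the difference arises.

```python
# Counts as reported on the slide.
counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
# "Initial gender identification" is read here as female + male.
identified = shares["female"] + shares["male"]

for label, share in shares.items():
    print(f"{label:>9}: {share:.1%}")
print(f"identified (female + male): {identified:.1%}")
```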
28. Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate
• Web Names
– 54% hit rate
• Patents Names
– 8%
29. Colin Allen, Jamie Murdock
Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
30. Digging into philosophy of science
• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
31. The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply topic modeling on 86 volume subset
• Using IPython Notebook
41. Drop to sentence level
• Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
• Model the three books at the sentence level (uses machine learning)
* Start from 1315 texts, down to 86, then down to the most relevant 3
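The narrowing step, from a modeled subset down to the few most relevant books, can be sketched as a ranking over page-level topic weights. This is a minimal sketch under stated assumptions: the `top_volumes` helper, the 0.3 threshold, the volume identifiers, and the choice to count topic-rich pages (rather than sum their weights) are all hypothetical, and the project's own definition of "highest aggregate" may differ.

```python
from collections import Counter

def top_volumes(page_weights, threshold=0.3, k=3):
    """Rank volumes by how many of their pages are 'rich' in the topic.

    page_weights: iterable of (volume_id, page_number, topic_weight).
    A page counts as topic-relevant when topic_weight >= threshold.
    """
    rich_pages = Counter()
    for volume_id, _page, weight in page_weights:
        if weight >= threshold:
            rich_pages[volume_id] += 1
    # most_common returns (volume_id, count) pairs, highest count first.
    return rich_pages.most_common(k)

# Made-up page-level weights for one chosen topic.
page_weights = [
    ("vol_a", 1, 0.50), ("vol_a", 2, 0.45), ("vol_a", 3, 0.10),
    ("vol_b", 1, 0.05), ("vol_b", 2, 0.35),
    ("vol_c", 1, 0.60), ("vol_c", 2, 0.70), ("vol_c", 3, 0.40),
    ("vol_d", 1, 0.02),
]

ranking = top_volumes(page_weights, k=2)
print(ranking)
```

Volumes with no page above the threshold, such as a course catalog that matched only a keyword search, never enter the ranking at all.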
43. Copyright: A Reality
Full text download is limited by both size and copyright
44. Computing with Copyrighted materials: HTRC Data Capsule
• Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
• Needs a computational framework that enables computing but restricts human consumption
• A secure computing framework that:
– Trusts that researcher will not deliberately leak data
– Prevents malware acting on user's behalf from leaking data.
• Supports Openness: accepts user-contributed analysis
• Supports Large-scale and low cost: protections can be extended to utilization of public supercomputers
45. HTRC Data Capsule Architectural Components
[Diagram labels: Researcher; SSH; VM Manager; VM Image Manager; VM Image Store; VM Image Builder; VM instance; Secure Capsule cluster; Research results; Registry Services, worksets]
46. HTRC Data Capsule interaction
[Diagram labels: Researcher; SSH; VM Manager; VM Image Manager; VM Image Store; VM Image Builder; VM instance; Research results; Registry Services, worksets]
1) Researcher requests new VM of type X
2) Image instance is created
3) Researcher installs tools onto VM through window on her desktop.
4) Upon run, Secure Capsule: controls I/O behind scenes
Final location of results is registry
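The interaction on this slide (a VM is requested and created, the researcher installs tools, the capsule controls I/O once the run starts, and results end up in the registry) can be modeled as a toy state machine. This is only a sketch of the idea: the `DataCapsule` class, its phase names, and the rule that the registry is the sole permitted output are hypothetical simplifications, not the HTRC implementation, which enforces its protections at the VM level.

```python
class CapsuleError(Exception):
    """Raised when an action is not allowed in the current phase."""

class DataCapsule:
    # Hypothetical phases; the slide describes the interaction, not these states.
    def __init__(self, vm_type):
        self.vm_type = vm_type
        self.phase = "configuring"   # VM requested and image instance created
        self.tools = []
        self.registry = []           # stands in for the results registry

    def install(self, tool):
        # Tools may be installed only before the run starts.
        if self.phase != "configuring":
            raise CapsuleError("tools can only be installed before the run")
        self.tools.append(tool)

    def run(self):
        # From here on, output is controlled by the capsule.
        self.phase = "running"

    def write(self, destination, result):
        # During a run, the registry is the only permitted destination.
        if self.phase != "running":
            raise CapsuleError("nothing to write before the run")
        if destination != "registry":
            raise CapsuleError(f"output to {destination!r} is blocked")
        self.registry.append(result)

capsule = DataCapsule("X")
capsule.install("topic-model-tool")
capsule.run()
capsule.write("registry", {"topics": 100})
try:
    capsule.write("laptop", "page text")
except CapsuleError as err:
    print(err)
```

The point of the design is that the researcher is trusted to configure the environment, while anything running inside it during analysis, including malware, has no path for moving text out.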
47. HTRC secure data capsule: view from researcher desktop
49. 2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
→ Paradigm: computation moves to the data (not vice versa)
2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
Reality: Full text download is limited by size and copyright