Bridging Digital Humanities Research and Large Repositories of Digital Text
2nd Encuentro de Humanistas Digitales | 21.May.14
Biblioteca Vasconcelos, Mexico City

Beth Plale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
Tweet us - @HathiTrust #HTRC
HATHI TRUST RESEARCH CENTER

Setting Stage
• “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a given field.
• In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in humanities, whereas DH researchers bring humanities methodologies to problems.
• I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher

Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.
University of Illinois Library web site, 2014

Digital Humanities activities categorized
• Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
• Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
• Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
Interview with Brett Bobley, NEH, 2009
http://www.hastac.org/node/1934

Why does it matter?
“If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
2009 interview with Brett Bobley, Nat’l Endowment for the Humanities, US, on predictions for the future of Digital Humanities

Bobley’s Prediction, cont.
In a world of big, massive scale, he asks:
• “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”
• “What if you are a historian and you now have access to every newspaper around the world?”
• “How might searching and mining that kind of dataset radically change your results?”

Goal of Talk
Introduce technical architectural big data developments around HathiTrust, emerging examples of use,
… to facilitate discussion around whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver on advances for digital humanities

HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from

Content of HathiTrust
• Books and journals
– Plus pilots around images, audio, born-digital
• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
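The percentages on this slide can be checked from the volume counts alone. The short sketch below sums the three counts and recomputes each share; the variable names are only illustrative, and the numbers are the ones quoted above.

```python
# Volume counts per digitization source, as quoted on the slide.
sources = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total_volumes = sum(sources.values())

# Share of each source, rounded to one decimal place as on the slide.
shares = {
    name: round(100 * count / total_volumes, 1)
    for name, count in sources.items()
}

for name, share in shares.items():
    print(f"{name}: {share}% of {total_volumes:,} volumes")
```

The three counts add up to roughly 10.5 million volumes, which is the scale the rest of the talk refers to.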
  
Content Sources
Content distribution
360,000 volumes in Spanish

Motivation for HTRC
→ HathiTrust repository is massive scale -- latent goldmine for text-based research
→ Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve intimate nature of interaction with texts while at same time honoring restrictions on access
→ Size and restrictions demand new paradigm: computation moves to the data (not vice versa)
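One way to picture "computation moves to the data" is a repository that never hands out its texts but accepts an analysis function and returns only the derived result. The sketch below is a toy illustration under that assumption; the `Repository` class, its `run` method, and the two sample volumes are hypothetical and are not part of any HTRC interface.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


class Repository:
    """Hypothetical stand-in for a repository that keeps full text private."""

    def __init__(self, volumes: Dict[str, str]) -> None:
        self._volumes = volumes  # never returned to the caller

    def run(
        self,
        analysis: Callable[[Iterable[str]], Counter],
        top_n: int = 5,
    ) -> List[Tuple[str, int]]:
        # The analysis runs next to the data; only a small aggregate leaves.
        result = analysis(self._volumes.values())
        return result.most_common(top_n)


def word_frequencies(texts: Iterable[str]) -> Counter:
    """Researcher-supplied analysis: count lower-cased words."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


repo = Repository({
    "vol1": "the whale the sea",
    "vol2": "the sea and the ship",
})
top_words = repo.run(word_frequencies, top_n=2)
print(top_words)
```

The researcher sees word counts, not pages, which is the trade the slide describes: analysis at scale without moving or exposing the restricted text.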
HathiTrust Research Center
• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
• HTRC Executive Committee
– Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
– J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
– Robert McDonald, Indiana University Libraries
– Beth Namachchivaya Sandore, University of Illinois Library
– John Unsworth, CIO, Dean of Library, Brandeis University

HTRC system
[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]

Return to categories of DH activity
HTRC in current form best at supporting:
• Access: by narrowing down to essential materials quickly – separating wheat from chaff
“big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”
• Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)
• Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
“The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”
Interview with Brett Bobley, NEH, 2009

Workset manages engagement with texts
EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
• Topic modeling
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts

Topic Modeling
• Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
• Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
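For readers who have not seen topic modeling in code, here is a minimal sketch of one common approach, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The slides do not say which algorithm or toolkit HTRC uses, so treat this as an assumption; the function name `lda_gibbs`, the hyperparameters, and the four toy documents are all illustrative.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model; docs is a list of token lists.

    Returns (theta, top_words): per-document topic proportions and the
    three highest-count words of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        current = []
        for word in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(current)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[w] for w in sorted(range(vocab_size),
                                  key=lambda w, t=t: -topic_word[t][w])[:3]]
        for t in range(num_topics)
    ]
    return theta, top_words


# Toy documents; a real run would use tokenized volumes or pages.
docs = [
    "letters household house letters house".split(),
    "illustrations volume literature volume".split(),
    "house letters household".split(),
    "literature illustrations volume".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for t, words in enumerate(top_words):
    print("topic", t, words)
```

`theta` is the per-document topic mixture; applied to volumes or pages instead of toy sentences, it is the kind of "topics, volume, and page" data the slide says is available.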
  
Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
More strongly focused on book (illustrations, volume, literature)
More strongly focused on author himself (letters, household, house)
Ted Underwood, Univ of Illinois
GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013

Gender Identification of Text
• Question Investigated: Can we use author names in bibliographic records to identify gender?
• Looked at 2.6 million bibliographic records
– Extracted personal author data
– MARC 100 abcd and 700 abcd
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics, dates
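Because the author strings are not fielded, the cataloging convention (last name, given names, titles/honorifics, dates) has to be recovered by parsing. The sketch below shows one possible way to do that and then guess gender from honorifics first and a given-name list second. It is not the project's actual code: the function names, the tiny name list, and the inclusion of "bart" (baronet) among male titles are assumptions made for illustration.

```python
import re

# Honorifics named in the talk, plus "bart" (baronet), added as an assumption.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal", "bart"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
ALL_TITLES = MALE_TITLES | FEMALE_TITLES

# Tiny illustrative given-name list; a real list would be harvested.
GIVEN_NAMES = {"algernon": "male", "beth": "female"}

# Life dates such as "1856-1924" or the open-ended "1856-".
DATES = re.compile(r"\b\d{4}-(?:\d{4})?")


def parse_author(raw):
    """Split one author string into surname, given names, titles, dates."""
    match = DATES.search(raw)
    dates = match.group(0) if match else None
    text = DATES.sub("", raw)
    surname, _, remainder = text.partition(",")
    if not remainder:
        # No comma: assume the first token is the surname.
        parts = surname.split()
        surname = parts[0] if parts else ""
        remainder = " ".join(parts[1:])
    tokens = [t.strip(".,") for t in remainder.replace(",", " ").split()]
    tokens = [t for t in tokens if t]
    return {
        "surname": surname.strip(),
        "given": [t for t in tokens if t.lower() not in ALL_TITLES],
        "titles": [t for t in tokens if t.lower() in ALL_TITLES],
        "dates": dates,
    }


def guess_gender(raw):
    """Return 'male', 'female', 'ambiguous', or 'unknown' for one string."""
    parsed = parse_author(raw)
    votes = set()
    for title in parsed["titles"]:
        votes.add("male" if title.lower() in MALE_TITLES else "female")
    if not votes and parsed["given"]:
        hit = GIVEN_NAMES.get(parsed["given"][0].lower())
        if hit:
            votes.add(hit)
    if len(votes) == 1:
        return votes.pop()
    return "ambiguous" if votes else "unknown"


full = parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924")
print(full)
print(guess_gender("Methuen Marshall, Sir, bart., 1856-"))
print(guess_gender("Smith, J."))
```

A string with only initials falls through to "unknown", and conflicting honorifics produce "ambiguous", which mirrors the four result categories reported later in the talk.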
  
Authors vs Names
There is the author, then there are the names under which the author is published…
• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924

Sources of Data
• The Virtual International Authority File
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al, 2009; Naldi et al. 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc
– Lady, Mrs. Miss, Countess, Duchess, Sister, etc

Initial Gender Results
• Approximately 80% of name strings have initial gender identification
– Female: 59,365 (10%)
– Male: 425,994 (70%)
– Unknown: 114,204 (19%)
– Ambiguous: 5,965 (less than 1%)
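The rounded percentages follow directly from the four counts, as the short check below shows. One observation from the arithmetic: the counts sum to 605,528, slightly below the 606,437 unique author strings reported earlier; the slides do not explain the difference, so the shares here are computed against the sum of the four categories.

```python
# Counts per category, as reported on the slide.
counts = {
    "Female": 59_365,
    "Male": 425_994,
    "Unknown": 114_204,
    "Ambiguous": 5_965,
}

total_strings = sum(counts.values())
shares = {label: 100 * n / total_strings for label, n in counts.items()}

# "Initial gender identification" is taken here to mean Female plus Male.
identified_share = shares["Female"] + shares["Male"]

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
print(f"Identified: {identified_share:.1f}%")
```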
  
Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate
• Web Names
– 54% hit rate
• Patents Names
– 8%

Colin Allen, Jamie Murdock
Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013

Digging into philosophy of science
• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures

The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply topic modeling on 86 volume subset
• Using IPython Notebook
Volume level topic modeling on ‘anthropomorphism’ yields set of topics
.. Of set of topics, choose ‘16’ as best
Volumes most similar to topic 16
Repeat topic modeling at page level
Topic model at page level for topics anthropomorphism, animal, and psychology
Pick top 3: topics 16, 10, 26
Show documents of topics 10, 16, 26
Drop to sentence level
• Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
• Model the three books at the sentence level (uses machine learning)
* Start from 1315 texts, down to 86, then down to most relevant 3
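The last narrowing step, from 86 volumes to 3, can be sketched as a ranking problem. The slide does not spell out how the "aggregate" is computed, so the sketch assumes it is the sum of each volume's highest page-level weights for the chosen topic; the function name, parameter names, and toy weights are hypothetical.

```python
import heapq


def top_volumes(page_weights, pages_per_volume=20, num_volumes=3):
    """Rank volumes by the summed weight of their most topic-relevant pages.

    page_weights maps a volume id to a list of per-page weights for one topic.
    """
    scores = {}
    for volume, weights in page_weights.items():
        best_pages = heapq.nlargest(pages_per_volume, weights)
        scores[volume] = sum(best_pages)
    # Highest score first; ties broken by volume id for a stable order.
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return ranked[:num_volumes]


# Toy per-page weights for one topic; real values would come from a page-level model.
pages = {
    "vol_a": [0.9, 0.8, 0.1],
    "vol_b": [0.2, 0.1],
    "vol_c": [0.7, 0.7, 0.6],
    "vol_d": [0.5, 0.4, 0.3],
}
selected = top_volumes(pages, pages_per_volume=2, num_volumes=3)
print(selected)
```

Capping the number of pages per volume keeps very long books from winning simply because they have more pages.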
  
Promising early results …
Copyright: A Reality
Full text download is limited by both size and by copyright

Computing with Copyrighted materials: HTRC Data Capsule
• Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
• Needs computational framework to enable computing but restricting human consumption
• A secure computing framework that:
– Trusts that researcher will not deliberately leak data
– Prevents malware acting on user's behalf from leaking data.
• Supports Openness: accepts user-contributed analysis
• Supports Large-scale and low cost: protections can be extended to utilization of public supercomputers
HTRC Data Capsule Architectural Components
[Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets]

HTRC Data Capsule interaction
[Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher; Registry Services, worksets]
[Diagram annotations, numbered 1) to 4) on the slide: Researcher requests new VM of type X; Image instance is created; Researcher installs tools onto VM through window on her desktop; Upon run, Secure Capsule controls I/O behind scenes; Final location of results is registry]
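The interaction in the diagram can be read as a small state machine: tools are installed first, and once a run starts the only way out for results is the registry. The toy class below sketches that reading. It is a simplification for illustration, not the HTRC implementation; the class, its methods, and the phase names are hypothetical.

```python
class DataCapsule:
    """Toy model of one capsule VM. All names here are hypothetical."""

    def __init__(self, image_type):
        self.image_type = image_type
        self.phase = "setup"   # researcher is still installing tools
        self.tools = []
        self.registry = []     # stands in for the registry that receives results

    def install(self, tool):
        if self.phase != "setup":
            raise PermissionError("tools are installed before the run starts")
        self.tools.append(tool)

    def start_run(self):
        self.phase = "run"     # from here on, output is controlled

    def release(self, destination, result):
        if self.phase != "run":
            raise RuntimeError("no run in progress")
        if destination != "registry":
            raise PermissionError("results may only go to the registry")
        self.registry.append(result)


capsule = DataCapsule(image_type="X")
capsule.install("topic-modeling-toolkit")
capsule.start_run()
capsule.release("registry", {"topic": 16, "volumes": 3})

# An attempt to send text anywhere else is refused.
try:
    capsule.release("researcher-desktop", "full text of a page")
    leak_blocked = False
except PermissionError:
    leak_blocked = True
```

The point of the design is that the check lives in the capsule rather than in the researcher's code, so malware running on the researcher's behalf hits the same wall.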
  
HTRC secure data capsule: view from researcher desktop
Thanks to our sponsors
2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
→ Paradigm: computation moves to the data (not vice versa)
2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
Reality: Full text download is limited by size and copyright

Towards FAIR Open Science with PID Kernel Information: RPID TestbedBeth Plale
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsBeth Plale
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADBeth Plale
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceBeth Plale
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for scienceBeth Plale
 
HathiTrust Reserach Center Nov2013
HathiTrust Reserach Center Nov2013HathiTrust Reserach Center Nov2013
HathiTrust Reserach Center Nov2013Beth Plale
 

Más de Beth Plale (9)

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open Science
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science research
 
Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail Science
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for science
 
HathiTrust Reserach Center Nov2013
HathiTrust Reserach Center Nov2013HathiTrust Reserach Center Nov2013
HathiTrust Reserach Center Nov2013
 

Último

If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Vipesco
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfSenaatti-kiinteistöt
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxmohammadalnahdi22
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsaqsarehman5055
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMoumonDas2
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyPooja Nehwal
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Chameera Dedduwage
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 

Último (20)

If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animals
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 

Bridging Digital Humanities Research and Big Data Repositories of Digital Text

  • 1. Bridging Digital Humanities Research and Large Repositories of Digital Text
      2nd Encuentro de Humanistas Digitales | 21.May.14
      Biblioteca Vasconcelos, Mexico City
      Beth Plale
      Professor, School of Informatics and Computing
      Director, Data To Insight Center
      Indiana University
      Tweet us - @HathiTrust #HTRC
      HATHI TRUST RESEARCH CENTER
  • 2. Setting Stage
      • “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a given field.
      • In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.
      • I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher.
  • 3. Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.
      University of Illinois Library web site, 2014
  • 4. Digital Humanities activities categorized
      • Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
      • Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
      • Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
      Interview with Brett Bobley, NEH, 2009
      http://www.hastac.org/node/1934
  • 5. Why does it matter?
      “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
      2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities
  • 6. Bobley’s Prediction, cont.
      In a world of big, massive scale, he asks:
      • “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”
      • “What if you are a historian and you now have access to every newspaper around the world?”
      • “How might searching and mining that kind of dataset radically change your results?”
  • 7. Goal of Talk
      Introduce technical architectural big data developments around HathiTrust and emerging examples of use, to facilitate discussion around whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver on advances for digital humanities.
  • 8. HathiTrust
      • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
          – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
      http://www.hathitrust.org/htrc
      http://www.hathitrust.org
      → Distinguished from
  • 10. Content of HathiTrust
      • Books and journals
          – Plus pilots around images, audio, born-digital
      • Digitization sources
          – Google (96.8%, 10,162,104)
          – Internet Archive (2.9%, 301,972)
          – Local (0.3%, 31,840)
  • 11. Content Sources
  • 12. Content distribution: 360,000 volumes in Spanish
  • 13. Motivation for HTRC
      → The HathiTrust repository is massive in scale: a latent goldmine for text-based research
      → The restricted nature of parts of HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while honoring restrictions on access
      → Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
  • 14. HathiTrust Research Center
      • The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
      • HTRC Executive Committee
          – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
          – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
          – Robert McDonald, Indiana University Libraries
          – Beth Namachchivaya Sandore, University of Illinois Library
          – John Unsworth, CIO, Dean of Library, Brandeis University
  • 15. HTRC system [diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
  • 16. Complexity hiding interface
  • 17. Return to categories of DH activity
      HTRC in current form best at supporting:
      • Access: by narrowing down to essential materials quickly, separating wheat from chaff
          “big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”
      • Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)
      • Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
          “The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”
      Interview with Brett Bobley, NEH, 2009
  • 18. Workset manages engagement with texts
  • 19. EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
      • Topic modeling
      • Author Gender Identification
      • Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  • 20. Topic Modeling
      • Can answer more complex or nuanced questions
          – What are the primary themes of an author?
          – What are the primary themes of a research domain?
          – When did a new topic enter a research domain?
      • Provides more data than word counts
          – 100s of topics can be extracted.
          – Underlying data (topics, volume, and page) is available
  • 21. Themes for Authors
      Two topics with identical centralities (e.g., Dickens) but separate themes:
      • More strongly focused on book (illustrations, volume, literature)
      • More strongly focused on author himself (letters, household, house)
  • 22. Ted Underwood, Univ of Illinois
  • 23. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
      Stacy Kowalczyk, Asst. Professor, Dominican University
      Zong Peng, HTRC, Indiana University
      Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  • 24. Gender Identification of Text
      • Question Investigated: Can we use author names in bibliographic records to identify gender?
      • Looked at 2.6 million bibliographic records
          – Extracted personal author data
          – MARC 100 abcd and 700 abcd
      • 606,437 unique personal author strings
      • Bibliographic data is not fielded like patent names
      • Relying on standard cataloging practice
          – Last name, first name middle name, titles/honorifics, dates
  • 25. Authors vs Names
      There is the author, then there are the names under which the author is published…
      • Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
      • Methuem, Algernon
      • Methuen Algernon
      • Methuen Marshall, Sir, bart., 1856-
      • Methuen, A. Sir, 1856-1924
      • Methuen, A. Sir, bart., 1856-1924
      • Methuen Marshall, Sir bart 1856-1924
      • Methuen, Algernon Methuen Marshall, Sir, 1856-1924
      • Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
      • Methuen, Algernon, 1856-1924
  • 26. Sources of Data
      • The Virtual International Authority File
          – Hosted by OCLC
      • Harvested names from multiple data sources
          – Census bureau
          – Baby name sites
      • EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
          – Developed an extensive list of European names
      • Titles and honorifics
          – Multiple web resources
          – Sir, Baron, Count, Duke, Father, Cardinal, etc.
          – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
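The name-based approach on slides 24 to 26 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the title and first-name sets are tiny hypothetical stand-ins for the harvested lists, treating "bart" (baronet) as a male title is an assumption, and `parse_author` and `guess_gender` are made-up helper names. It assumes the "Last name, first name middle name, titles/honorifics, dates" cataloging pattern.

```python
import re

# Hypothetical, tiny stand-ins for the harvested title and first-name lists.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal", "bart"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
FIRST_NAMES = {"algernon": "male", "stacy": "female"}

# Matches "1856-1924" and the open-ended form "1856-".
DATES = re.compile(r"^\d{4}-(\d{4})?$")


def parse_author(raw):
    """Split 'Last, First Middle, titles, dates' into labeled parts."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    record = {"last": parts[0] if parts else "", "given": [], "titles": [], "dates": None}
    for part in parts[1:]:
        if DATES.match(part):
            record["dates"] = part
            continue
        for token in part.split():
            word = token.rstrip(".").lower()
            if word in MALE_TITLES or word in FEMALE_TITLES:
                record["titles"].append(word)
            else:
                record["given"].append(word)
    return record


def guess_gender(record):
    """Titles win over first names; anything unmatched stays 'unknown'."""
    male = any(t in MALE_TITLES for t in record["titles"])
    female = any(t in FEMALE_TITLES for t in record["titles"])
    if male and female:
        return "ambiguous"
    if male:
        return "male"
    if female:
        return "female"
    for name in record["given"]:
        if name in FIRST_NAMES:
            return FIRST_NAMES[name]
    return "unknown"


for raw in ["Methuen, A. Sir, 1856-1924", "Methuen, Algernon, 1856-1924", "Methuen Algernon"]:
    print(raw, "->", guess_gender(parse_author(raw)))
```

The comma-less variant "Methuen Algernon" is parsed as a bare last name and stays "unknown", which is one way the variant strings on slide 25 can end up in the unidentified share.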
  • 27. Initial Gender Results
      • Approximately 80% of name strings have initial gender identification
          – Female: 59,365 (10%)
          – Male: 425,994 (70%)
          – Unknown: 114,204 (19%)
          – Ambiguous: 5,965 (less than 1%)
  • 28. Results by Data Source
      Against the whole set of name strings
      • VIAF
          – 19% hit rate
      • Web Names
          – 54% hit rate
      • Patents Names
          – 8%
  • 29. Colin Allen, Jamie Murdock
      Cognitive Science, Indiana University
      Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
  • 30. Digging into philosophy of science
      • Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
      • Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
      • Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  • 31. The How
      • 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
      • Set contains lots of uninteresting books: e.g., college course catalogs
      • Apply topic modeling on an 86-volume subset
      • Using IPython Notebook
  • 32. Volume level topic modeling on ‘anthropomorphism’ yields set of topics
  • 33. … Of the set of topics, choose ‘16’ as best
  • 34. Volumes most similar to topic 16
  • 35.
  • 36.
  • 37. Repeat topic modeling at page level
  • 38. Topic model at page level for topics anthropomorphism, animal, and psychology
  • 39. Pick top 3: topics 16, 10, 26
  • 40. Show documents of topics 10, 16, 26
  • 41. Drop to sentence level
      • Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
      • Model the three books at the sentence level (uses machine learning)
      * Start from 1315 texts, narrow to 86, then to the 3 most relevant
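The narrowing step on slide 41 (from page-level topic weights down to a handful of books) could look something like the sketch below. It is only an illustration under stated assumptions: the input layout, the `threshold` value, and the function name `top_volumes` are hypothetical, and "topic-relevant" is read here as "summed weight of the chosen topics on a page meets a threshold", which the slides do not spell out.

```python
from collections import Counter


def top_volumes(page_weights, topics, threshold=0.3, k=3):
    """Rank volumes by how many of their pages are rich in the chosen topics.

    page_weights maps (volume_id, page_number) to {topic_id: weight}.
    This layout is an assumption made for the sketch.
    """
    relevant_pages = Counter()
    for (volume_id, _page), weights in page_weights.items():
        score = sum(weights.get(topic, 0.0) for topic in topics)
        if score >= threshold:
            relevant_pages[volume_id] += 1
    # Sort by page count (descending), then by volume id so ties are stable.
    ranked = sorted(relevant_pages.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Made-up page-level weights for three volumes.
pages = {
    ("vol_a", 1): {16: 0.5},
    ("vol_a", 2): {10: 0.2, 26: 0.2},
    ("vol_b", 1): {16: 0.1},
    ("vol_b", 2): {3: 0.9},
    ("vol_c", 7): {26: 0.4},
}
print(top_volumes(pages, topics=(10, 16, 26), k=2))
```

Counting pages rather than summing weights keeps one very long, mildly relevant volume from outranking a short one that is dense in the topic; other scoring choices are equally plausible.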
  • 43. Copyright: A Reality
      Full text download is limited by both size and by copyright
  • 44. Computing with Copyrighted materials: HTRC Data Capsule
      • Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
      • Needs a computational framework that enables computing but restricts human consumption
      • A secure computing framework that:
          – Trusts that researcher will not deliberately leak data
          – Prevents malware acting on user's behalf from leaking data.
      • Supports Openness: accepts user-contributed analysis
      • Supports Large-scale and low cost: protections can be extended to utilization of public supercomputers
  • 45. HTRC Data Capsule Architectural Components [diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets]
  • 46. HTRC Data Capsule interaction [diagram: components as on the previous slide, annotated with steps 1) to 4). Annotations: Researcher requests new VM of type X; Image instance is created; Researcher installs tools onto VM through window on her desktop; Upon run, Secure Capsule controls I/O behind the scenes; Final location of results is registry]
  • 47. HTRC secure data capsule: view from researcher desktop
  • 48. Thanks to our sponsors
  • 49. 2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
      → Paradigm: computation moves to the data (not vice versa)
      2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
      Reality: Full text download is limited by size and copyright
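The "computation moves to the data" paradigm can be illustrated with a toy sketch. This is not the HTRC Data Capsule, which the slides describe as a VM-based system with controlled I/O. The `ToyCapsule` class, its release rule (only a size-limited table of single-word keys and numeric values leaves the capsule), and the `word_counts` analysis are all hypothetical. Like the slides, it trusts the researcher's code and does not model protection against malware.

```python
from collections import Counter


class ToyCapsule:
    """Holds texts and releases only small tables of per-word numbers."""

    def __init__(self, texts, max_items=1000):
        self._texts = dict(texts)  # volume id -> full text, never returned
        self._max_items = max_items

    def run(self, analysis):
        """Run researcher-supplied code next to the data, then vet the result."""
        result = analysis(list(self._texts.values()))
        return self._release(result)

    def _release(self, result):
        if not isinstance(result, dict) or len(result) > self._max_items:
            raise ValueError("result must be a dict within the size limit")
        for key, value in result.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("values must be numbers")
            if not isinstance(key, str) or len(key.split()) != 1:
                raise ValueError("keys must be single words, not passages")
        return dict(result)


def word_counts(texts):
    """Example user-contributed analysis: corpus-wide word frequencies."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return dict(counts)


capsule = ToyCapsule({"vol_1": "the whale the sea", "vol_2": "the sea"})
print(capsule.run(word_counts))
```

An analysis that tried to return whole passages as dictionary keys would be rejected by `_release`, which is the point of the sketch: results leave, readable text does not.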