HathiTrust and HTRC: the changing Digital Library
El Colegio de Mexico | 20.May.14

Beth Plale, @bplale
Professor, School of Informatics and Computing
Director, HathiTrust Research Center
Indiana University
Tweet us - @HathiTrust #HTRC
  
  
HathiTrust
•  HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc → distinguished from http://www.hathitrust.org
  
Take a look at details of the HathiTrust Collection
  
Content
•  Books and journals
– Pilots around images, audio, born-digital
•  Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
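The percentages quoted for the digitization sources can be checked against the volume counts. The short sketch below only redoes that arithmetic; the variable names are illustrative and not part of the talk.

```python
# Volume counts per digitization source, as quoted on the slide.
volumes_by_source = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total_volumes = sum(volumes_by_source.values())

# Share of the collection contributed by each source, in percent.
shares = {
    source: round(100 * count / total_volumes, 1)
    for source, count in volumes_by_source.items()
}

print(total_volumes)
print(shares)
```

The three counts add up to 10,495,916 volumes, and the rounded shares agree with the 96.8%, 2.9%, and 0.3% on the slide.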
  
Content Sources

Content Package
  
Metadata
•  Bibliographic
•  Structural
•  Rights
•  Administrative (preservation)
•  Holdings
  
HathiTrust Repository Organization

File System

Content distribution
[Figure labels: "Not public domain", "outside", "available"]
  
à HathiTrust repository is a latent
goldmine for text mining analysis,
analysis of large-scale corpi through
computational tools, and time-based
analysis
à Restricted nature of HT content
suggests need for new forms of
access that preserve intimate nature
of research investigation while
honoring restrictions
à Paradigm: computation moves to
the data (not vice versa)
#HTRC	
  	
  @HathiTrust	
  
	
Mission of HT Research Center
•  Research arm of HathiTrust
•  Goal: enable researchers world-wide to carry out computational investigation of the HT repository by:
–  Developing a model for access: the ‘workset’
–  Developing tools that facilitate research by digital humanities and informatics communities
–  Developing secure cyberinfrastructure that allows computational investigation of the entire copyrighted and public domain HathiTrust repository
•  Established: July 2011
•  Collaborative effort of Indiana University and University of Illinois
	
  
	
  
HTRC system
[Diagram labels: Request; Complexity hiding interface; Workset builder; The complexity; Tabular info; Statistical plots; Spatial plots]
  
HTRC Timeline
•  Phase I: development, 01 Jul 2011 to 31 Mar 2013
–  HTRC software and services release v1.0: http://sourceforge.net/p/htrc/code/
•  Phase II: outreach, 01 Apr 2013 - present
–  2nd HTRC UnCamp Sep ‘13

Attendees of UnCamp’13
  
Access to copyrighted materials: HTRC Data Capsule
A secure computing framework that:
•  Trusts that the researcher will not deliberately leak repository data, but
•  Prevents malware acting on the user's behalf from leaking data.

Enforces:
•  Non-consumptive use: framework provides safe handling of large volumes of protected data
•  Openness: framework supports user-contributed analysis tools (that is, it does not limit use to a known set of algorithms)
•  Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
•  Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
  
HTRC Secure Capsule Architectural Components
[Diagram labels: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Registry Services, worksets]

HTRC Secure Capsule Architecture
•  Researcher requests new VM of type X
•  Image instance is created
•  Researcher installs tools onto VM through window on her desktop.
•  Upon run, Secure Capsule controls I/O behind scenes
•  Final location of results is registry

HTRC secure data capsule: view from researcher desktop
  
EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER
•  Author Gender Identification
•  Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  
GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES

Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  
Gender Identification of Text
•  Question investigated: Can we use author names in bibliographic records to identify gender?
•  2.6 million bibliographic records
–  Extracted personal author data
–  MARC fields 100 and 700, subfields a, b, c, d
•  606,437 unique personal author strings
•  Bibliographic data is not fielded like patent names
•  Relying on standard cataloging practice
–  Last name, first name middle name, titles/honorifics, dates
  
Why interesting to HTRC? Introduces a new source of metadata, drawn from sources with varying authority

Raises questions:
1)  How should community contributed metadata be distinguished from more authoritative sources?
2)  How should variability of quality even within a single contribution be conveyed to the community?
  
Authors vs Names
•  Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
•  Methuem, Algernon
•  Methuen Algernon
•  Methuen Marshall, Sir, bart., 1856-
•  Methuen, A. Sir, 1856-1924
•  Methuen, A. Sir, bart., 1856-1924
•  Methuen Marshall, Sir bart 1856-1924
•  Methuen, Algernon Methuen Marshall, Sir, 1856-1924
•  Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
•  Methuen, Algernon, 1856-1924
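To see why one author can surface as many distinct name strings, it may help to split a few of these headings along the cataloging convention mentioned in the talk (last name, first and middle names, titles/honorifics, dates). The sketch below is an illustration only and not the project's code: parse_heading, TITLE_WORDS, and DATE_PATTERN are hypothetical names, and the title list is a tiny stand-in.

```python
import re

# Hypothetical, deliberately tiny title list used only for this illustration.
TITLE_WORDS = {"sir", "bart", "lady", "baron", "count"}
# Matches "1856-1924" as well as an open-ended "1856-".
DATE_PATTERN = re.compile(r"^\d{4}-(\d{4})?$")

def parse_heading(heading):
    """Split 'Last, First Middle, titles, dates' into labelled parts."""
    parts = [p.strip() for p in heading.split(",") if p.strip()]
    parsed = {"surname": "", "forenames": "", "titles": [], "dates": ""}
    if not parts:
        return parsed
    parsed["surname"] = parts[0]
    for part in parts[1:]:
        if DATE_PATTERN.match(part):
            parsed["dates"] = part
            continue
        name_words = []
        for word in part.split():
            if word.rstrip(".").lower() in TITLE_WORDS:
                parsed["titles"].append(word.rstrip("."))
            else:
                name_words.append(word)
        # Keep only the first run of non-title words as the forenames.
        if name_words and not parsed["forenames"]:
            parsed["forenames"] = " ".join(name_words)
    return parsed

headings = [
    "Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924",
    "Methuen, A. Sir, 1856-1924",
    "Methuen Marshall, Sir, bart., 1856-",
    "Methuen Algernon",
]
parsed = [parse_heading(h) for h in headings]
surnames = {p["surname"] for p in parsed}
print(surnames)
```

Even on these four variants the naive comma split yields three different "surnames", which is the gap between name strings and authors that the slide points at. A heading without commas, such as "Methuen Algernon", cannot be split by punctuation at all.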
  
Sources of Data
•  The Virtual International Authority File
–  Hosted by OCLC
•  Harvested names from multiple data sources
–  Census bureau
–  Baby name sites
•  EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
–  Developed an extensive list of European names
•  Titles and honorifics
–  Multiple web resources
–  Sir, Baron, Count, Duke, Father, Cardinal, etc.
–  Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
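A lookup of this kind could be sketched as follows. This is a hedged illustration, not the method used in the study: the title sets contain only the examples listed above, GIVEN_NAME_GENDER is a made-up three-entry stand-in for the harvested name lists, and guess_gender is a hypothetical helper.

```python
import re

# Titles quoted in the talk; the lists actually used were longer ("etc.").
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Made-up stand-in for the harvested given-name lists (assumption).
GIVEN_NAME_GENDER = {"algernon": "male", "mary": "female", "leslie": "ambiguous"}

def guess_gender(heading):
    """Return 'male', 'female', 'ambiguous', or 'unknown' for a name string."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", heading)]
    # Skip the first word, assumed to be the surname under "Last, First" order,
    # so that a surname such as "Duke" is not read as a title.
    rest = words[1:]
    has_male = any(w in MALE_TITLES for w in rest)
    has_female = any(w in FEMALE_TITLES for w in rest)
    if has_male and has_female:
        return "ambiguous"
    if has_male:
        return "male"
    if has_female:
        return "female"
    # No title found: fall back to the given-name table.
    for word in rest:
        if word in GIVEN_NAME_GENDER:
            return GIVEN_NAME_GENDER[word]
    return "unknown"

print(guess_gender("Methuen, A. Sir, 1856-1924"))
print(guess_gender("Methuen, Algernon, 1856-1924"))
print(guess_gender("Methuen, A."))
```

Checking titles before given names lets a heading with only an initial still be classified when an honorific is present. A heading with neither a title nor a known given name falls through to "unknown".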
  
Initial Gender Results
•  Approximately 80% of name strings have initial gender identification
–  Female: 59,365 (10%)
–  Male: 425,994 (70%)
–  Unknown: 114,204 (19%)
–  Ambiguous: 5,965 (less than 1%)
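The rounded percentages can be reproduced from the counts, taking the 606,437 unique personal author strings quoted earlier in the talk as the denominator (that choice of denominator is an assumption; the slide does not state it). Variable names below are illustrative.

```python
unique_name_strings = 606_437  # quoted earlier in the talk

counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}

# Percentage of all unique name strings falling in each category.
percentages = {
    label: 100 * n / unique_name_strings for label, n in counts.items()
}

# "Initial gender identification" is read here as female plus male.
identified_share = 100 * (counts["female"] + counts["male"]) / unique_name_strings

# Name strings not covered by any of the four categories.
unaccounted = unique_name_strings - sum(counts.values())

print({label: round(p, 1) for label, p in percentages.items()})
print(round(identified_share, 1), unaccounted)
```

Under that assumption the shares round to 10%, 70%, 19%, and just under 1%, and female plus male comes to about 80%. The four counts sum to 605,528, which leaves 909 strings that the slide does not account for.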
  
Results by Data Source
Against the whole set of name strings
•  VIAF: 19% hit rate
•  Web Names: 54% hit rate
•  Patents Names: 8% hit rate
  
	
  
Colin Allen, Jamie Murdock
Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013

The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, multistep, and multi-modal.

The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.
  
Digging into philosophy of science
•  Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
•  Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
•  Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  
The How
•  1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
•  Set contains lots of uninteresting books, e.g., college course catalogs
•  Apply LDA on 86-volume subset
•  Using IPython Notebook
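The keyword selection step, and the reason it also pulls in uninteresting volumes, can be pictured with a small sketch. Everything in it apart from the four search terms is an assumption: the volume ids and text snippets are invented, and matches_keywords is a hypothetical helper rather than the HTRC search service.

```python
# The four search terms quoted in the talk.
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")

def matches_keywords(text, keywords=KEYWORDS):
    """True when any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Invented volume ids and snippets, for illustration only.
volumes = {
    "vol-001": "Romanes on animal intelligence and the mental powers of insects",
    "vol-002": "College course catalog: Comparative Psychology 101, three credits",
    "vol-003": "A treatise on bridge construction",
}

selected = [vid for vid, text in volumes.items() if matches_keywords(text)]
print(selected)
```

The course catalog matches as readily as the monograph, which is the kind of noise that motivates narrowing the set before topic modeling.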
  
LDA topic modeling
•  LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of “topics”: probability distributions over the set of terms in a corpus
•  Number of topics is a parameter in the modeling technique
•  Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus
•  Documents may be whole volumes, chapters, articles, single pages, even individual sentences (modeler’s choice)
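As a rough illustration of these points, the sketch below fits LDA to four toy documents with collapsed Gibbs sampling. This is one common inference method and an assumption here; the talk does not say which implementation or settings were used. fit_lda, its hyperparameters alpha and beta, and the toy documents are all hypothetical.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]       # tokens of doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    topic_total = [0] * K                     # tokens assigned to topic k overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed estimates: each topic over terms, each document over topics.
    topics = [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + V * beta)
         for w in vocab}
        for k in range(K)
    ]
    mixtures = [
        [(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    return topics, mixtures

# Toy documents; a "document" could equally be a volume, a page, or a sentence.
docs = [
    "animal mind instinct animal reason".split(),
    "instinct animal mind habit".split(),
    "course credit semester course catalog".split(),
    "catalog semester credit course".split(),
]
topics, mixtures = fit_lda(docs, num_topics=2)
```

Each entry of topics is a probability distribution over the vocabulary, and each entry of mixtures gives one document's proportions across the topics. num_topics is the parameter the modeler has to choose up front.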
  
Volume level topic modeling on ‘anthropomorphism’ yields set of topics

Of the set of topics, choose ‘16’ as best

Volumes most similar to topic 16

Repeat LDA at page level

Topic model at page level for topics anthropomorphism, animal, and psychology

Words sorted by similarity

Pick top 3: topics 16, 10, 26

Show documents of topics 10, 16, 26
  
Drop to sentence level
•  Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis
•  Manually augment argument analysis
– Remodeling of three volumes at sentence level
– Training other methods using human analysis plus sentence similarity
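One way to read "highest aggregate" is to sum the scores of each volume's most topic-relevant pages and keep the three volumes with the largest sums. The sketch below follows that reading, which is an assumption; the talk does not define the aggregate. top_volumes, the volume ids, and the page scores are invented for illustration.

```python
from heapq import nlargest

def top_volumes(page_scores, pages_per_volume=20, num_volumes=3):
    """Rank volumes by the summed scores of their most topic-relevant pages."""
    aggregate = {
        volume: sum(nlargest(pages_per_volume, scores))
        for volume, scores in page_scores.items()
    }
    # Iterating a dict yields its keys, so this returns volume ids by aggregate.
    return nlargest(num_volumes, aggregate, key=aggregate.get)

# Invented per-page topic weights for four volumes.
page_scores = {
    "vol-a": [0.9, 0.8, 0.1],
    "vol-b": [0.4, 0.3, 0.2],
    "vol-c": [0.7, 0.7, 0.6],
    "vol-d": [0.05, 0.02],
}

# Two pages per volume here to keep the toy data small.
chosen = top_volumes(page_scores, pages_per_volume=2)
print(chosen)
```

With real data, pages_per_volume would sit somewhere in the 20 to 40 range mentioned on the slide.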
  
Promising early results …

Scholarly Commons User Support Service
•  Develop training materials
•  Educational workshops
•  Tool and workset creation
•  Collaborate with librarians and DH centers at HT institutions
•  Assist researchers in HTRC text data mining research projects
•  Based at University of Illinois Library
  
Scholarly Commons User Support
•  Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.
•  Physically located in the University of Illinois Library’s Scholarly Commons.
•  Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and of HathiTrust digital library affiliates who seek to start their own HTRC research services.
•  Effort involves planning, implementation, and continuous development of training materials, educational workshops, potential tools, and outreach activities in support of the usage of HTRC tools and datasets.
  
Thanks to sponsors
http://www.hathitrust.org/htrc
http://www.hathitrust.org
  

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Plale HathiTrust El Colegio de Mexico May2014

  • 1. HathiTrust and HTRC: the changing Digital Library. El Colegio de Mexico | 20 May 2014. Beth Plale (@bplale), Professor, School of Informatics and Computing; Director, HathiTrust Research Center, Indiana University. Tweet us - @HathiTrust #HTRC. HATHI TRUST RESEARCH CENTER
  • 2. HathiTrust • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia http://www.hathitrust.org/htrc http://www.hathitrust.org → Distinguished from
  • 4. Take a look at the details of the HathiTrust collection
  • 5. Content • Books and journals – Pilots around images, audio, born-digital • Digitization sources – Google (96.8%, 10,162,104) – Internet Archive (2.9%, 301,972) – Local (0.3%, 31,840)
  • 6. Content Sources
  • 7. Content Package
  • 8. Metadata • Bibliographic • Structural • Rights • Administrative (preservation) • Holdings
  • 9. HathiTrust Repository Organization
  • 10. HathiTrust Repository Organization
  • 11. File System
  • 12. Content distribution
  • 13. Content distribution • Not public domain outside available
  • 14. → HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpora through computational tools, and time-based analysis → Restricted nature of HT content suggests need for new forms of access that preserve the intimate nature of research investigation while honoring restrictions → Paradigm: computation moves to the data (not vice versa)
  • 15. Mission of HT Research Center • Research arm of HathiTrust • Goal: enable researchers worldwide to carry out computational investigation of the HT repository through – Developing a model for access: the ‘workset’ – Developing tools that facilitate research by digital humanities and informatics communities – Developing secure cyberinfrastructure that allows computational investigation of the entire copyrighted and public domain HathiTrust repository • Established: July 2011 • Collaborative effort of Indiana University and University of Illinois
  • 16. HTRC system (diagram labels): Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots
  • 17. Complexity hiding interface
  • 19. HTRC Timeline • Phase I: development, 01 Jul 2011 to 31 Mar 2013 – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/ • Phase II: outreach, 01 Apr 2013 to present – 2nd HTRC UnCamp, Sep ’13 (photo: attendees of UnCamp ’13)
  • 20. Access to copyrighted materials: HTRC Data Capsule. A secure computing framework that: • Trusts that the researcher will not deliberately leak repository data, but • Prevents malware acting on the user's behalf from leaking data. Enforces: • Non-consumptive use: framework provides safe handling of large volumes of protected data • Openness: framework supports user-contributed analysis tools (that is, it does not limit uses to a known set of algorithms) • Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance • Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
  • 21. HTRC Secure Capsule Architectural Components (diagram labels): VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher; Registry Services, worksets
  • 22. HTRC Secure Capsule Architecture (diagram labels): VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher; Registry Services, worksets. Annotations (numbered 1 to 4 in the diagram): Researcher requests new VM of type X. Image instance is created. Researcher installs tools onto VM through window on her desktop. Upon run, Secure Capsule controls I/O behind the scenes. Final location of results is registry.
  • 23. HTRC secure data capsule: view from researcher desktop
  • 24. EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER • Author Gender Identification • Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
  • 25. GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES. Stacy Kowalczyk, Asst. Professor, Dominican University; Zong Peng, HTRC, Indiana University. Ref: talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  • 26. Gender Identification of Text • Question investigated: Can we use author names in bibliographic records to identify gender? • 2.6 million bibliographic records – Extracted personal author data – MARC 100 abcd and 700 abcd • 606,437 unique personal author strings • Bibliographic data is not fielded like patent names • Relying on standard cataloging practice – Last name, first name middle name, titles/honorifics, dates
  • 27. Why interesting to HTRC? Introduces a new source of metadata, and from sources with varying authority. Raises questions: 1) How should community-contributed metadata be distinguished from more authoritative sources? 2) How should variability of quality even within a single contribution be conveyed to the community?
  • 28. Authors vs Names • Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924 • Methuem, Algernon • Methuen Algernon • Methuen Marshall, Sir, bart., 1856- • Methuen, A. Sir, 1856-1924 • Methuen, A. Sir, bart., 1856-1924 • Methuen Marshall, Sir bart 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924 • Methuen, Algernon, 1856-1924
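The cataloging convention named on slide 26 (last name, first and middle names, titles/honorifics, dates) can be illustrated with a small parser run over the variants on slide 28. This is only a sketch under stated assumptions, not the method used in the talk: the function `parse_author`, the `AuthorName` type, and the short `HONORIFICS` set are hypothetical, and plain ASCII hyphens are assumed in the date ranges.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical, deliberately short list; the talk drew on multiple web resources.
HONORIFICS = {"sir", "bart", "baron", "count", "duke", "father", "cardinal",
              "lady", "mrs", "miss", "countess", "duchess", "sister"}

# Birth year, a hyphen, and an optional death year, e.g. "1856-1924" or "1856-".
DATES = re.compile(r"\b(\d{4})-(\d{4})?")


class AuthorName(NamedTuple):
    surname: str
    forenames: str
    honorifics: Tuple[str, ...]
    birth: Optional[int]
    death: Optional[int]


def parse_author(raw: str) -> AuthorName:
    """Split a catalog-style author string into rough components."""
    birth = death = None
    match = DATES.search(raw)
    if match:
        birth = int(match.group(1))
        death = int(match.group(2)) if match.group(2) else None
        raw = raw[:match.start()]  # keep only the text before the dates

    # Comma-separated parts: the first is taken as the surname.
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    surname = parts[0] if parts else ""

    forenames, titles = [], []
    for part in parts[1:]:
        for token in part.split():
            bare = token.rstrip(".")
            if bare.lower() in HONORIFICS:
                titles.append(bare)
            else:
                forenames.append(token)
    return AuthorName(surname, " ".join(forenames), tuple(titles), birth, death)


print(parse_author("Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924"))
```

Even this simple rule set shows why the slide separates authors from names: "Methuen Marshall, Sir, bart., 1856-" parses to a different surname than the other variants, so matching strings to a single author needs more than field splitting.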
  • 29. Sources of Data • The Virtual International Authority File – Hosted by OCLC • Harvested names from multiple data sources – Census bureau – Baby name sites • EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005) – Developed an extensive list of European names • Titles and honorifics – Multiple web resources – Sir, Baron, Count, Duke, Father, Cardinal, etc. – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
  • 30. Initial Gender Results • Approximately 80% of name strings have initial gender identification – Female: 59,365 (10%) – Male: 425,994 (70%) – Unknown: 114,204 (19%) – Ambiguous: 5,965 (less than 1%)
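The percentages on slide 30 can be checked with a few lines of arithmetic. The sketch below assumes the denominator is the 606,437 unique personal author strings reported on slide 26; the slides do not state this explicitly, and the four category counts sum to 605,528, slightly less than that total. The names `counts`, `share`, and `identified` are illustrative only.

```python
# Category counts as reported on slide 30.
counts = {"female": 59_365, "male": 425_994, "unknown": 114_204, "ambiguous": 5_965}

# Assumed denominator: unique personal author strings from slide 26.
TOTAL_UNIQUE_STRINGS = 606_437


def share(count: int, total: int = TOTAL_UNIQUE_STRINGS) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {label: share(n) for label, n in counts.items()}
identified = share(counts["female"] + counts["male"])

for label, pct in shares.items():
    print(f"{label:>9}: {pct:5.2f}%")
print(f"identified (female + male): {identified:5.2f}%")
```

Under that assumption the rounded shares agree with the slide: female and male together come to roughly 80%, and the ambiguous group stays below 1%.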
  • 31. Results by Data Source. Against the whole set of name strings • VIAF – 19% hit rate • Web Names – 54% hit rate • Patents Names – 8%
  • 32. Colin Allen, Jamie Murdock, Cognitive Science, Indiana University. Ref: talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
  • 33. The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, multistep, and multi-modal. The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.
  • 34. Digging into philosophy of science • Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts • Use topic modeling to identify the volumes, and pages within these volumes, that are “rich” in a chosen topic • Use a semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
  • 35. The How • 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’ • Set contains lots of uninteresting books, e.g., college course catalogs • Apply LDA on an 86-volume subset • Using IPython Notebook
  • 36. LDA topic modeling • LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of “topics”, which are probability distributions over the set of terms in a corpus • Number of topics is a parameter in the modeling technique • Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus • Documents may be whole volumes, chapters, articles, single pages, even individual sentences (modeler’s choice)
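To make slide 36 concrete, here is a minimal pure-Python sketch of LDA fitted by collapsed Gibbs sampling. The slides do not say which inference method or library the InPhO work used, so the sampler, the function name `lda_gibbs`, and the hyperparameters `alpha` and `beta` are all assumptions chosen for illustration. It returns one topic distribution per document and one term distribution per topic, with the number of topics supplied as a parameter, as the slide describes.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word):
    doc_topic[d][k] is the weight of topic k in document d, and
    topic_word[k][w] is the probability of term w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][w] += 1
            topic_totals[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][w] -= 1
                topic_totals[z] -= 1

                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][w] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][w] += 1
                topic_totals[z] += 1

    # Turn the final counts into smoothed probability estimates.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for doc, row in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        {w: (topic_word_counts[k][w] + beta) / (topic_totals[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return doc_topic, topic_word


if __name__ == "__main__":
    toy_docs = [
        ["animal", "mind", "instinct", "animal", "psychology"],
        ["college", "course", "catalog", "course", "credit"],
        ["animal", "psychology", "mind", "instinct"],
    ]
    doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
    for row in doc_topic:
        print([round(p, 2) for p in row])
```

Because the "document" is whatever token list is passed in, the same function can be called on volumes, pages, or sentences, which mirrors the modeler's choice noted on the slide. The toy corpus is invented; a real run would need tokenized text and far more data.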
  • 37. Volume-level topic modeling on ‘anthropomorphism’ yields a set of topics
  • 38. Of the set of topics, choose ‘16’ as best
  • 39. Volumes most similar to topic 16
  • 40. Repeat LDA at page level
  • 41. Topic model at page level for topics anthropomorphism, animal, and psychology
  • 42. Words sorted by similarity
  • 43. Pick top 3: topics 16, 10, 26
  • 44. Show documents of topics 10, 16, 26
  • 45. Drop to sentence level • Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis • Manually augment argument analysis – Remodeling of three volumes at sentence level – Training other methods using human analysis plus sentence similarity
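The selection step on slide 45 can be sketched as a simple ranking: count the topic-relevant pages in each volume and keep the top three. The slide does not define "topic-relevant" or explain how the aggregate was computed, so the weight threshold, the tuple layout, and the function name `top_volumes` are assumptions for illustration.

```python
from collections import Counter


def top_volumes(page_scores, threshold=0.5, k=3):
    """Rank volumes by how many of their pages are topic-relevant.

    page_scores is an iterable of (volume_id, page_number, topic_weight).
    A page counts as relevant when its weight reaches the (assumed) threshold.
    Returns up to k (volume_id, relevant_page_count) pairs, highest first.
    """
    relevant_pages = Counter()
    for volume_id, _page_number, weight in page_scores:
        if weight >= threshold:
            relevant_pages[volume_id] += 1
    return relevant_pages.most_common(k)


# Invented page-level weights for the chosen topics.
example_scores = [
    ("volA", 1, 0.9), ("volA", 2, 0.8), ("volA", 3, 0.1),
    ("volB", 1, 0.7),
    ("volC", 1, 0.2),
    ("volD", 1, 0.6), ("volD", 2, 0.6), ("volD", 3, 0.9),
]
print(top_volumes(example_scores))
```

The selected volumes would then be remodeled at sentence level, as the slide goes on to describe.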
  • 47. Scholarly Commons User Support Service • Develop training materials • Educational workshops • Tool and workset creation • Collaborate with librarians and DH centers at HT institutions • Assist researchers in HTRC text data mining research projects • Based at University of Illinois Library
  • 48. Scholarly Commons User Support • Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers. • Physically located in the University of Illinois Library’s Scholarly Commons. • Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and HathiTrust digital library affiliates who seek to start their own HTRC research services. • Effort involves planning, implementation, and continuous development of training materials, educational workshops, potential tools, and outreach activities in support of the usage of HTRC tools and datasets.
  • 50. http://www.hathitrust.org/htrc http://www.hathitrust.org