Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts
Women’s Institute for Summer Enrichment
Cornell University, Jun 16, 2014

Beth Plale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
  
HathiTrust Research Center
•  Who are the Players? HathiTrust, Google, Authors Guild
•  The Object of Attention: 11 M books from university libraries
•  Rulings around copyright
•  HTRC, or why I care
•  Is security of HTRC Data Capsule good enough?
  
The Players
Books Digitization Project (2007)
Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …
[Diagram labels: digitize; digitized books; legal action]
  
Mar 2011: New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues
Nov 2013: same judge issued ruling saying that Google's use of the works was a "fair use" under copyright law
[Diagram label: Google/Authors Guild]
  
•  June 2014: 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use
[Diagram labels: digitized books; legal action]
  
Highlights 2014 ruling
•  With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".
June 10, 2014 | By Parker Higgins
Another Fair Use Victory for Book Scanning in HathiTrust
  
Highlights 2014 ruling
•  The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.
•  Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.
June 10, 2014 | By Parker Higgins
Another Fair Use Victory for Book Scanning in HathiTrust
  
Does Authors Guild Represent All Authors?
•  The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the Hathi Trust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.
•  The Authors Alliance: new organization representing authors who are primarily concerned with being read.
Court finds full-book scanning is fair use
Cory Doctorow at 3:00 pm Sat, Jun 14, 2014
  
Highlight 2014 Ruling
•  Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.
June 10, 2014 | By Parker Higgins
Another Fair Use Victory for Book Scanning in HathiTrust
  
•  Who are the Players? HathiTrust, Google, Authors Guild
•  The Object of Attention: 11 M books from university libraries
•  Rulings around copyright
•  HTRC, or why I care
•  Is security of HTRC Data Capsule good enough?
  
HTRC, or why I care:
HathiTrust digital library is “big data”; and
Text mining is the new library catalog search
Similar model, different ends
[Diagram labels: $$; Scholarly search; Scholarly mining; HTRC goes beyond “full text searchable database”]
#HTRC @HathiTrust
  
HathiTrust
•  HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org
→ Distinguished from http://www.hathitrust.org/htrc
  
Content of HathiTrust
•  Books and journals
– Plus pilots around images, audio, born-digital
•  Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
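The percentages on this slide follow from the volume counts. A minimal sketch that recomputes them is given here; the variable names are ours and not from the slides, and the total it prints (about 10.5 million) is of the same order as the 11 M figure in the outline.

```python
# Volume counts per digitization source, as given on the slide.
volumes_by_source = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total = sum(volumes_by_source.values())

# Share of each source as a percentage, rounded to one decimal place.
shares = {
    source: round(100 * count / total, 1)
    for source, count in volumes_by_source.items()
}

for source, pct in shares.items():
    print(f"{source}: {pct}%")
print(f"Total volumes: {total:,}")
```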
  
Content Sources

Content distribution
360,000 volumes in Spanish
  
Motivation for HTRC
→ HathiTrust repository is massive in scale: a latent goldmine for text-based research
→ Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve intimate nature of interaction with texts while at same time honoring restrictions on access
→ Size and restrictions demand new paradigm: computation moves to the data (not vice versa)
  
HathiTrust Research Center
•  The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
•  HTRC Executive Committee
– Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
– J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
– Robert McDonald, Indiana University Libraries
– Beth Namachchivaya Sandore, University of Illinois Library
– John Unsworth, CIO, Dean of Library, Brandeis University
  
HTRC system
[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
  
Text mining at scale: quick tutorial on topic modeling of texts
  
Topic Modeling
•  Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
•  Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
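To make the idea concrete, here is a minimal sketch of one common topic model, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling. The slides do not say which algorithm or toolkit HTRC uses, so the choice of LDA, the function name `fit_lda`, its parameters, and the four-document toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA topic model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] is the share of topic k in document d and
    topic_word[k] is a Counter of word counts assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables used by the sampler.
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][word] -= 1
                topic_totals[k] -= 1

                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][word] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][word] += 1
                topic_totals[k] += 1

    # Normalize per-document counts into smoothed topic proportions.
    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_word_counts


# Hypothetical toy corpus: two "science" texts and two "course catalog" texts.
docs = [
    "darwin species selection evolution species".split(),
    "evolution darwin selection animal species".split(),
    "course catalog credit semester course".split(),
    "semester credit catalog course tuition".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(topic_word):
    print(k, [w for w, _ in words.most_common(3)])
```

The per-document proportions in `doc_topic` and the per-topic word counts in `topic_word` correspond to the "underlying data" the slide mentions. A real run over full volumes would use an established toolkit and far more topics.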
  
  
Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
More strongly focused on book (illustrations, volume, literature)
More strongly focused on author himself (letters, household, house)
Ted Underwood, Univ of Illinois
Digging into philosophy of science
Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
Colin Allen, IU
  
The How
•  1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
•  Set contains lots of uninteresting books: e.g., college course catalogs
•  Apply topic modeling on 86 volume subset
•  Using iPy Notebook
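The first step, keyword selection, can be sketched as follows. This is an illustration only: the function `select_volumes`, the case-insensitive substring matching, the "any keyword" rule, and the three-volume corpus are assumptions. The slides do not say how the search was run or how the 86-volume subset was then chosen.

```python
# Search terms listed on the slide.
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")


def select_volumes(volumes, keywords=KEYWORDS):
    """Return ids of volumes whose text mentions at least one keyword.

    volumes maps a volume id to its full text. Matching is a
    case-insensitive substring test, which is an assumption of this sketch.
    """
    selected = []
    for volume_id, text in volumes.items():
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            selected.append(volume_id)
    return selected


# Hypothetical miniature corpus standing in for HTRC volumes.
volumes = {
    "vol-a": "Romanes wrote on animal intelligence and Comparative Psychology.",
    "vol-b": "Course catalog: Darwin Hall houses the biology department.",
    "vol-c": "A treatise on bridge construction.",
}

print(select_volumes(volumes))
```

Note that `vol-b` is selected even though it is a course catalog. That is the kind of uninteresting hit the second bullet describes, and it is why a narrower subset is modeled afterwards.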
  
.. Of set of topics, choose ‘16’ as best
Volumes most similar to topic 16
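Once a topic has been chosen, finding the volumes most similar to it can be sketched as a ranking over per-volume topic proportions. The slides do not state the similarity measure, so ranking by the topic's share in each volume is an assumption, as are the function name and the made-up proportions. The toy data has only three topics, so index 1 stands in for topic 16.

```python
def most_similar_volumes(doc_topic, topic, top_n=3):
    """Rank volume ids by the share of `topic` in each volume, highest first.

    doc_topic maps a volume id to a list of topic proportions.
    """
    ranked = sorted(
        doc_topic.items(),
        key=lambda item: item[1][topic],
        reverse=True,
    )
    return [(volume_id, props[topic]) for volume_id, props in ranked[:top_n]]


# Hypothetical topic proportions for four volumes over three topics.
doc_topic = {
    "vol-a": [0.10, 0.70, 0.20],
    "vol-b": [0.60, 0.30, 0.10],
    "vol-c": [0.05, 0.90, 0.05],
    "vol-d": [0.80, 0.10, 0.10],
}

print(most_similar_volumes(doc_topic, topic=1, top_n=2))
```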
  
Copyright: A Reality
Full text download is limited by both size and by copyright
HTRC solution to fully-flexible text mining research on entire HT digital repository:
HTRC Data Capsule
Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan
  
Questions driving HTRC Data Capsule
•  Non-consumptive use: can framework provide safe handling of large amounts of protected data?
•  Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?
•  Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?
  
HTRC Data Capsules
•  Trusts text mining researcher to not deliberately leak repository data
•  Prevents malware acting on user’s behalf from leaking data.
•  V1.0 limits analysis to running within single VM
  
HTRC Data Capsule Architectural Components
[Diagram labels: Researcher; SSH; Research results; VM Manager; VM instance; Secure Capsule cluster; VM Image Manager; VM Image Store; VM Image Builder; Registry Services, worksets]

HTRC Data Capsule interaction
[Diagram: same components, with steps numbered 1) to 4) and a setup label]
•  Researcher requests new VM of type X
•  Image instance is created
•  Researcher installs tools onto VM through window on her desktop.
•  Upon run, Secure Capsule controls I/O behind the scenes
•  Final location of results is registry
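The interaction above can be pictured as a VM whose permitted I/O depends on its phase. The toy model below is a sketch under stated assumptions, not the HTRC implementation. The class `DataCapsule`, the phase names, the method names, and the rule that protected data is readable only during a run are all hypothetical. The slides say only that the Secure Capsule controls I/O upon run and that results end up in the registry, and they do not describe how that is enforced.

```python
from enum import Enum


class Phase(Enum):
    SETUP = "setup"  # researcher installs tools; no protected data readable
    RUN = "run"      # analysis reads protected data; outbound I/O blocked


class DataCapsule:
    """Toy model of a capsule VM whose I/O depends on its phase."""

    def __init__(self, vm_type, registry):
        self.vm_type = vm_type
        self.registry = registry  # plain list standing in for the registry
        self.phase = Phase.SETUP
        self.tools = []

    def install_tool(self, name):
        if self.phase is not Phase.SETUP:
            raise PermissionError("tools can only be installed during setup")
        self.tools.append(name)

    def start_run(self):
        self.phase = Phase.RUN

    def read_volume(self, corpus, volume_id):
        if self.phase is not Phase.RUN:
            raise PermissionError("protected data is only readable during a run")
        return corpus[volume_id]

    def send_to_network(self, payload):
        if self.phase is Phase.RUN:
            raise PermissionError("outbound network I/O is blocked during a run")
        return len(payload)

    def publish_result(self, result):
        # In this sketch the registry is the only exit for results.
        self.registry.append(result)


# Hypothetical walk-through of the interaction steps.
registry = []
corpus = {"vol-1": "text of a protected volume"}

capsule = DataCapsule(vm_type="X", registry=registry)  # request VM of type X
capsule.install_tool("topic-modeling-toolkit")         # setup
capsule.start_run()                                    # capsule controls I/O

word_count = len(capsule.read_volume(corpus, "vol-1").split())
capsule.publish_result({"volume": "vol-1", "words": word_count})

# Code running inside the VM (for example malware) cannot ship raw text out.
try:
    capsule.send_to_network(corpus["vol-1"])
    leaked = True
except PermissionError:
    leaked = False
print(leaked, registry)
```

The point of the design is that the researcher is trusted but the software on the VM is not. Derived results can leave through a controlled channel, while the protected text cannot leave over the network during a run.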
  
  
HTRC secure data capsule: view from researcher desktop
Thanks to our sponsors
HTRC goes beyond “full text searchable database”. Security has to be top concern.
[Diagram labels: scholarly research; HTRC goes beyond “full text searchable database”]
  

Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

  • 1. Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts
      Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014
      Beth Plale, Professor, School of Informatics and Computing; Director, Data To Insight Center, Indiana University
      HathiTrust Research Center
  • 2.
      • Who are the Players? HathiTrust, Google, Authors Guild
      • The Object of Attention: 11 M books from university libraries
      • Rulings around copyright
      • HTRC, or why I care
      • Is security of HTRC Data Capsule good enough?
  • 4. Books Digitization Project (2007)
      Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …
      [Diagram labels: digitize; digitized books]
  • 5. Google / Authors Guild
      [Diagram labels: digitized books; legal action]
      Mar 2011: A New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues.
      Nov 2013: The same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law.
  • 6.
      • June 2014: The 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use.
      [Diagram labels: digitized books; legal action]
  • 7. Highlights: 2014 ruling
      • With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".
      June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
  • 8. Highlights: 2014 ruling
      • The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.
      • Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.
      June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
  • 9. Does Authors Guild Represent All Authors?
      • The Authors Guild members are overwhelmingly trade-book authors; the books scanned by HathiTrust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.
      • The Authors Alliance: a new organization representing authors who are primarily concerned with being read.
      Court finds full-book scanning is fair use | Cory Doctorow at 3:00 pm Sat, Jun 14, 2014
  • 10. Highlight: 2014 ruling
      • Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.
      June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust
  • 11.
      • Who are the Players? HathiTrust, Google, Authors Guild
      • The Object of Attention: 11 M books from university libraries
      • Rulings around copyright
      • HTRC, or why I care
      • Is security of HTRC Data Capsule good enough?
  • 12. HTRC, or why I care:
      HathiTrust digital library is “big data”; and text mining is the new library catalog search
  • 13. Similar model, different ends
      [Diagram labels: $$; scholarly search; scholarly mining]
      HTRC goes beyond “full text searchable database”
  • 14. HathiTrust
      #HTRC @HathiTrust
      • HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
          – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
      http://www.hathitrust.org/htrc
      http://www.hathitrust.org
      → Distinguished from
  • 16. Content of HathiTrust
      • Books and journals
          – Plus pilots around images, audio, born-digital
      • Digitization sources
          – Google (96.8%, 10,162,104)
          – Internet Archive (2.9%, 301,972)
          – Local (0.3%, 31,840)
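The percentages on slide 16 can be checked against the volume counts with a few lines of arithmetic. The sketch below is only a consistency check on the slide's own numbers; the variable names are illustrative and not part of the deck.

```python
# Volume counts per digitization source, copied from slide 16.
source_counts = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total_volumes = sum(source_counts.values())

# Share of each source as a percentage, rounded to one decimal place.
source_shares = {
    name: round(100 * count / total_volumes, 1)
    for name, count in source_counts.items()
}

for name, share in source_shares.items():
    print(f"{name}: {source_counts[name]:,} volumes ({share}%)")
print(f"Total: {total_volumes:,} volumes")
```

The three counts sum to 10,495,916 volumes, and the rounded shares reproduce the 96.8%, 2.9%, and 0.3% shown on the slide.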
  • 17. Content Sources
  • 18. Content distribution
      360,000 volumes in Spanish
  • 19. Motivation for HTRC
      → HathiTrust repository is massive scale: a latent goldmine for text-based research
      → Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
      → Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
  • 20. HathiTrust Research Center
      • The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
      • HTRC Executive Committee
          – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
          – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
          – Robert McDonald, Indiana University Libraries
          – Beth Sandore Namachchivaya, University of Illinois Library
          – John Unsworth, CIO, Dean of Library, Brandeis University
  • 21. HTRC system
      [Diagram labels: request; complexity hiding interface; the complexity; tabular info; statistical plots; spatial plots]
  • 22. Complexity hiding interface
  • 23. Text mining at scale: quick tutorial on topic modeling of texts
  • 24. Topic Modeling
      • Can answer more complex or nuanced questions
          – What are the primary themes of an author?
          – What are the primary themes of a research domain?
          – When did a new topic enter a research domain?
      • Provides more data than word counts
          – 100s of topics can be extracted.
          – Underlying data (topics, volume, and page) is available
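To make the idea on slide 24 concrete, here is a minimal sketch of topic modeling on a handful of toy "pages". The deck does not say which algorithm or tool HTRC uses; latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling is assumed here as one common choice, and the page keys, word lists, and parameter values are all made up for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]    # topic counts per document
    topic_word = [{} for _ in range(num_topics)]    # word counts per topic
    topic_total = [0] * num_topics                  # token counts per topic
    assignments = []                                # current topic of every token

    def update(d, word, topic, delta):
        doc_topic[d][topic] += delta
        topic_word[topic][word] = topic_word[topic].get(word, 0) + delta
        topic_total[topic] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, topic in zip(doc, assignments[d]):
            update(d, word, topic, +1)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, word, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(word, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new_topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, word, new_topic, +1)
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, ignoring words with a zero count."""
    return [
        [w for w, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) if c > 0][:n]
        for counts in topic_word
    ]


# Toy pages keyed by (volume, page); both the keys and the text are hypothetical.
pages = {
    ("vol-A", 1): "letters household house letters house".split(),
    ("vol-A", 2): "house household letters family house".split(),
    ("vol-B", 1): "illustrations volume literature volume illustrations".split(),
    ("vol-B", 2): "literature illustrations volume edition literature".split(),
}

doc_topic, topic_word = lda_gibbs(list(pages.values()), num_topics=2)
for key, counts in zip(pages, doc_topic):
    print(key, counts)
print(top_words(topic_word))
```

Each row of `doc_topic` records how many tokens of one page were assigned to each topic, which is the kind of topic, volume, and page data the slide refers to. Real collections would need tokenization, stop-word removal, and a tuned number of topics.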
  • 25. Themes for Authors
      Two topics with identical centralities (e.g., Dickens) but separate themes
      • More strongly focused on book (illustrations, volume, literature)
      • More strongly focused on author himself (letters, household, house)
  • 26. Ted Underwood, Univ of Illinois
  • 27. Digging into philosophy of science
      Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
      Colin Allen, IU
  • 28. The How
      • 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
      • Set contains lots of uninteresting books, e.g., college course catalogs
      • Apply topic modeling on an 86-volume subset
      • Using IPython Notebook
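The selection steps on slide 28 can be sketched as a keyword filter followed by a noise filter. The keywords come from the slide; everything else (the `Volume` record, the title markers used to spot course catalogs, the sample volumes) is a hypothetical stand-in, and the deck does not say how the 86-volume subset was actually chosen.

```python
from dataclasses import dataclass

# Search terms listed on slide 28.
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")

# Hypothetical title markers for "uninteresting" volumes such as course catalogs.
NOISE_TITLE_MARKERS = ("catalog", "catalogue", "announcement")


@dataclass
class Volume:
    volume_id: str
    title: str
    text: str


def matches_keywords(volume, keywords=KEYWORDS):
    """True if any search term occurs in the volume text (case-insensitive)."""
    haystack = volume.text.lower()
    return any(keyword in haystack for keyword in keywords)


def looks_uninteresting(volume, markers=NOISE_TITLE_MARKERS):
    """Crude heuristic: flag volumes whose title suggests a catalog."""
    title = volume.title.lower()
    return any(marker in title for marker in markers)


def select_workset(volumes):
    """Keyword search first, then drop volumes flagged as noise."""
    hits = [v for v in volumes if matches_keywords(v)]
    return [v for v in hits if not looks_uninteresting(v)]


# Made-up sample volumes for illustration only.
volumes = [
    Volume("vol-1", "Animal Intelligence",
           "Romanes discusses anthropomorphism in comparative psychology."),
    Volume("vol-2", "University Course Catalog 1902",
           "Courses: Darwin and evolution; comparative psychology."),
    Volume("vol-3", "A Treatise on Bridges",
           "Masonry arches and iron trusses."),
]

workset = select_workset(volumes)
print([v.volume_id for v in workset])
```

In this toy run the catalog matches the keywords but is dropped by the title heuristic, mirroring the slide's point that a raw keyword search pulls in volumes that are not useful for topic modeling.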
  • 29. Of the set of topics, choose ‘16’ as best
  • 30. Volumes most similar to topic 16
  • 33. Copyright: A Reality
      Full text download is limited by both size and by copyright
  • 34. HTRC solution to fully-flexible text mining research on entire HT digital repository: HTRC Data Capsule
      Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan
  • 35. Questions driving HTRC Data Capsule
      • Non-consumptive use: can framework provide safe handling of large amounts of protected data?
      • Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?
      • Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?
  • 36. HTRC Data Capsules
      • Trusts text mining researcher to not deliberately leak repository data
      • Prevents malware acting on user’s behalf from leaking data.
      • V1.0 limits analysis to running within single VM
  • 37. HTRC Data Capsule Architectural Components
      [Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Researcher; Research results; Registry Services, worksets]
  • 38. HTRC Data Capsule interaction
      [Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Researcher; Research results; Registry Services, worksets]
      Steps annotated on the diagram (numbered 1 to 4 on the slide):
      • Researcher requests new VM of type X
      • Image instance is created
      • Researcher installs tools onto VM through window on her desktop
      • Upon run, Secure Capsule controls I/O behind the scenes
      Final location of results is registry
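The interaction on slide 38 can be read as a small state machine. The sketch below is a toy model under stated assumptions, not the HTRC implementation: the class, method, and destination names are hypothetical, and the rule that the registry is the only permitted output during a run is an interpretation of "controls I/O behind the scenes" and "final location of results is registry".

```python
from enum import Enum


class Phase(Enum):
    REQUESTED = 1  # researcher has asked for a new VM of some type
    CREATED = 2    # image instance exists; tools can be installed
    RUNNING = 3    # analysis runs; the capsule controls I/O


class CapsuleSketch:
    """Toy model of the capsule interaction; every name here is hypothetical."""

    def __init__(self, vm_type):
        self.vm_type = vm_type
        self.phase = Phase.REQUESTED
        self.tools = []
        self.registry = []  # stands in for the registry that holds results

    def create_instance(self):
        if self.phase is not Phase.REQUESTED:
            raise RuntimeError("instance already created")
        self.phase = Phase.CREATED

    def install_tool(self, name):
        # Assumption: tools are installed before the run starts.
        if self.phase is not Phase.CREATED:
            raise RuntimeError("tools can only be installed before the run")
        self.tools.append(name)

    def run(self):
        if self.phase is not Phase.CREATED:
            raise RuntimeError("create the instance before running")
        self.phase = Phase.RUNNING

    def emit(self, destination, payload):
        # Assumption: while running, the registry is the only allowed output.
        if self.phase is not Phase.RUNNING:
            raise RuntimeError("nothing to emit before the run")
        if destination != "registry":
            raise PermissionError(f"output to {destination!r} is blocked")
        self.registry.append(payload)


capsule = CapsuleSketch("type-X")
capsule.create_instance()
capsule.install_tool("topic-modeler")
capsule.run()
capsule.emit("registry", "topic summary")

try:
    capsule.emit("external-host", "raw page text")
    blocked = False
except PermissionError:
    blocked = True
print(capsule.registry, blocked)
```

The point of the model is the asymmetry between phases: the researcher has freedom to install tools before the run, while during the run any output path other than the results registry is refused, which is how the capsule would stop malware acting on the user's behalf from leaking data.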
  • 41. HTRC secure data capsule: view from researcher desktop
  • 42. Thanks to our sponsors
  • 43. HTRC goes beyond “full text searchable database”. Security has to be top concern.
      [Diagram label: scholarly research]