Text Analysis Methods
for Digital Humanities
Helen Bailey and Sands Fish
MIT Libraries
Examples of Data Narratives
•  Visualizing Emancipation
•  Narrative Visualization of Whaling Ship Logs
•  Out of Sight, Out of Mind
Approaches to Storytelling w/ Data
•  EDA - Exploratory Data Analysis
•  Exploring data from a number of perspectives:
o  Temporal
o  Geographical
o  Statistical
o  Categorical
o  Relational
•  80% - Data Hacking, 20% - Narrative Construction, Visualization,
etc.
"To use any sort of historical data, we must above all understand the
constraints under which it was collected. In this case, that means
retelling the history of why and how the ship's logs were first collected, and
how the constraints of digitization in the punch card era radically shape the
sort of evidence we can draw from them. The important thing about this sort
of work is that it helps us understand the overall biases of a particular
data set, which is crucial for limiting our interpretive leaps."
- Ben Schmidt, “Reading digital sources: a case study in ship's logs”
Inherent Biases & Limitations
•  Data capture methods and format
•  Purpose of data collection
•  Transformation over time
•  Authenticity and trust
Understand provenance
“Rather than replace humans, computers amplify human abilities. The
most productive line of inquiry, therefore, is not in identifying how automated
methods can obviate the need for researchers to read their text. Rather, the
most productive line of inquiry is to identify the best way to use both
humans and automated methods for analyzing texts.”
- Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis
Methods for Political Texts”
Acquiring Text
•  Full-text resources:
o  DSpace@MIT http://dspace.mit.edu/
o  Dome http://dome.mit.edu/
o  Digital Public Library of America http://dp.la
o  Europeana http://www.europeana.eu/portal/
o  HathiTrust http://www.hathitrust.org/
•  http://libguides.mit.edu/apis - metadata only
•  http://libguides.mit.edu/digitalhumanities
Data Management and Sharing
•  Assumption of sharing and data management plan as a
funding requirement
•  Data storage options - anticipate interaction
o  Storage formats - non-proprietary and repurposable
whenever possible
o  File system storage vs. database
•  Documentation of process
http://libraries.mit.edu/guides/subjects/data-management/
Formatting / Pre-Processing
•  Tool input requirements
•  Assumptions:
o  Text as a “bag of words”
o  Unigrams, bigrams
o  Word order (or not)
o  Stop words, capitalization, punctuation
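
The pre-processing assumptions above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any particular tool: the stop-word list, the sample sentence, and the function names `tokenize` and `ngrams` are all assumptions made for the example.

```python
import string

# Illustrative stop-word list (an assumption); real tools ship longer lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in stop_words]

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The ship's log recorded the weather in the Atlantic.")
unigrams = ngrams(tokens, 1)  # bag-of-words units: word order is discarded downstream
bigrams = ngrams(tokens, 2)   # keeps a little local word order

print(tokens)
print(bigrams)
```

Note that the bigrams here are built after stop words are removed, so a pair may join words that were not adjacent in the original text. Whether that is acceptable depends on the input requirements of the tool being used.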
Featurizing Text
•  Each distinct word becomes a feature, also called a "dimension"
•  With one dimension per word, the result is "high dimensional" data
•  Each document is represented as a vector in Euclidean space,
with one component per feature
•  Euclidean mathematics scales beyond 3 dimensions
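
As a rough sketch of what featurizing looks like in practice, the Python below turns tokenized documents into count vectors over a shared vocabulary and measures Euclidean distance between them. The three toy documents and the helper names `vectorize` and `euclidean` are assumptions for illustration only.

```python
import math
from collections import Counter

def vectorize(docs):
    """Map tokenized documents to count vectors over a shared, sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

def euclidean(u, v):
    """Euclidean distance; the same formula works for any number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy corpus (an assumption): each document is already a list of tokens.
docs = [
    ["whale", "ship", "ship"],
    ["ship", "harbor"],
    ["whale", "whale", "ship"],
]
vocab, vectors = vectorize(docs)

print(vocab)    # one dimension per distinct word
print(vectors)  # one vector per document
print(euclidean(vectors[0], vectors[2]))
```

With a real corpus the vocabulary runs to thousands of words, so each vector has thousands of components, most of them zero.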
The Shape of Data
•  Data structures and formats
•  Informed (in part) by:
o  Tools
o  Co-occurrence
o  Data output formats
o  Entity type
o  Temporal, geographical perspective, etc.
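
To make one of these shapes concrete, here is a small Python sketch that reshapes tokenized documents into co-occurrence counts, that is, how many documents contain each pair of words. The toy documents and the function name `cooccurrence` are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for each unordered word pair, the number of documents containing both."""
    counts = Counter()
    for doc in docs:
        # sorted(set(...)) gives each pair once per document, in a stable order.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

# Toy corpus (an assumption): each document is a list of tokens.
docs = [
    ["ship", "whale", "harbor"],
    ["ship", "whale"],
    ["harbor", "ship"],
]
pairs = cooccurrence(docs)

for pair, n in sorted(pairs.items()):
    print(pair, n)
```

The same text that was a list of words per document is now a list of weighted pairs, a shape that feeds naturally into network tools.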
Validation
From Ben Schmidt’s “Machine Learning at Sea”
Network Models
•  Representing data as a network
o  Types: technological, communication, transportation, energy, airplane routes,
web linking patterns
o  Social:
§  non-human animal interaction
§  membership in larger groups
§  sexually transmitted diseases
§  co-authorship of scientific publications
§  trade agreements between nations
•  Mapping the News - Berkman's Controversy Work
o  Spidering
o  Influential actors over time
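
As an illustration of representing data as a network, the Python sketch below builds a co-authorship graph as an adjacency dictionary and uses node degree as a very crude stand-in for influence. The author names, the paper list, and the function names are invented for the example; this is not the method used in the Berkman work.

```python
from collections import defaultdict
from itertools import combinations

def build_network(papers):
    """Build an undirected co-authorship graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for authors in papers:
        # Every pair of co-authors on a paper gets an edge.
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree(graph):
    """Number of distinct co-authors per author."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

# Hypothetical data: each inner list is the author list of one paper.
papers = [["Ada", "Ben"], ["Ada", "Cy"], ["Ada", "Ben", "Dee"]]
graph = build_network(papers)
degrees = degree(graph)

print(sorted(degrees.items(), key=lambda item: -item[1]))
```

Single-author papers add no edges in this sketch, so their authors do not appear in the graph unless they also co-author elsewhere.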
Topic Modeling Tools
•  MALLET
o  Can run on unstructured plain text files
o  http://mallet.cs.umass.edu/topics.php
•  Stanford Topic Modeling Toolbox
o  Requires data in a CSV or TSV file
o  http://nlp.stanford.edu/software/tmt/tmt-0.4/
Entity Extraction
•  Identifies known entities in specific categories
o  Locations
o  People
o  Organizations
o  Dates/times
•  Creates annotated text from unstructured text
•  Domain-specific
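
The idea of turning unstructured text into annotated text can be sketched with a simple dictionary lookup in Python. This is a deliberately simplified stand-in: trained recognizers such as Stanford NER use statistical models rather than a fixed list. The gazetteer entries, the tag format, and the function name `annotate` are assumptions for the example.

```python
import re

# Hypothetical gazetteer: known entity strings mapped to a category.
GAZETTEER = {
    "New Bedford": "LOCATION",
    "Herman Melville": "PERSON",
    "MIT Libraries": "ORGANIZATION",
}

def annotate(text, gazetteer=GAZETTEER):
    """Wrap each known entity in inline tags such as <PERSON>...</PERSON>."""
    if not gazetteer:
        return text
    # Longest names first, so a longer entity wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    def tag(match):
        label = gazetteer[match.group(0)]
        return f"<{label}>{match.group(0)}</{label}>"

    return pattern.sub(tag, text)

annotated = annotate("Herman Melville sailed from New Bedford.")
print(annotated)
```

Because the lookup only finds names already in the dictionary, the sketch also shows why entity extraction is domain-specific: a gazetteer built for one corpus will miss the entities of another.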
Entity Extraction Tools
•  Stanford Named Entity Recognizer
http://nlp.stanford.edu/software/CRF-NER.shtml
•  Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/download_view/NETagger
•  DBPedia Spotlight
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Geo-Parsing
•  Common Pitfalls
o  Set of places (GeoNames dictionary)
o  Dictionary determines how broad or narrow your
search is
•  Enhancements to CLAVIN by Civic Media
o  Aboutness (uses mention counting)
o  HTTP access used for more advanced workflows
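
Mention counting as a measure of aboutness can be sketched as follows in Python. This is only an illustration of the general idea under simplifying assumptions, not CLAVIN's implementation or API: the place dictionary is a three-entry stand-in for a gazetteer such as GeoNames, only single-word place names are handled, and the function names are invented.

```python
from collections import Counter

# Hypothetical place dictionary; its contents decide how broad the search is.
PLACES = {"boston", "nantucket", "lima"}

def place_mentions(text, places=PLACES):
    """Count how often each known (single-word) place name appears in the text."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return Counter(w for w in words if w in places)

def aboutness(text, places=PLACES):
    """Return the most-mentioned place, or None if no known place appears."""
    counts = place_mentions(text, places)
    return counts.most_common(1)[0][0] if counts else None

text = ("The ship left Nantucket for Lima. "
        "Nantucket merchants financed it; Boston insured it.")
mentions = place_mentions(text)

print(mentions)
print(aboutness(text))
```

A place missing from the dictionary is never counted, which is the pitfall noted above: the dictionary determines how broad or narrow the search is.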
