Duke University Libraries, Digital Scholarship
Text > Data, October 25




HIGH-LEVEL TEXT ANALYSIS
AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins Library
angela.zoss@duke.edu
DOCUMENTS AS CONTEXT
But first,

ANGELA AS CONTEXT
How I learned to love the
document.
B.A. courses:         Linguistics, Communication

M.S. courses:         Communication, Human-Computer
Interaction

Employment:           arXiv.org Administrator

Ph.D. courses:        • Bibliometrics/Scientometrics
                      • Computer Mediated Discourse Analysis
                      • Latent Structure Analysis
                      • Natural Language Processing
Now,

DOCUMENTS AS CONTEXT
Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)
Using documents to learn about
language
(or other social phenomena)
Analyzing documents as records/proxies of
language, social structures, events, etc.

Linguistic studies:
morphology, word counts, syntax, etc. …
      over time (e.g., Google ngram viewer)
language across corpora (e.g., political
speeches)

Underwood, T. (2012). Where to start with text mining.
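
As a rough illustration of using word counts to trace language over time, the sketch below computes the share of tokens taken up by one word in each period of a corpus. The toy corpus, the decade keys, and the function names are all hypothetical; a real study would need far more text and more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(term, docs_by_period):
    """Return {period: share of all tokens in that period equal to `term`}."""
    result = {}
    for period, docs in docs_by_period.items():
        counts = Counter(tok for doc in docs for tok in tokenize(doc))
        total = sum(counts.values())
        result[period] = counts[term] / total if total else 0.0
    return result

# Hypothetical toy corpus, grouped by period (assumption for illustration).
corpus = {
    1900: ["He said that he would go.", "She stayed."],
    2000: ["She said that she would go.", "They stayed."],
}
trend = relative_frequency("she", corpus)
print(trend)
```

Dividing by the total token count per period keeps periods with more text from dominating the comparison.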
Using documents to learn about
language
  Historical culturomics of pronoun frequencies
Using documents to learn about
language
 Universal properties of mythological networks
Using language to learn about
documents
Analyzing documents as artifacts themselves, with
their own properties and dynamics

Literary, documentary studies:
Structural/rhetorical/stylistic analysis
Document categorization, classification
Detecting clusters of document features (topic
modeling)


Underwood, T. (2012). Where to start with text mining.
Using language to learn about
documents
   Literary Empires, Mapping Temporal and
         Spatial Settings in Swinburne
Using language to learn about
documents
 Using Word Clouds for Topic Modeling Results
What are documents?
For this discussion,
     digital versions of works of
     spoken or written language
Examples:
     books, articles, transcripts, emails, tweets…
Documents as context
Documents have:
• form(at)
• style
• provenance
• entities
• intentions
STUDIES OF DOCUMENTS
Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out
  irrelevant information
Describing a corpus
• Finding regularities/differences across
  groups of documents
• Developing theories of structure, style, etc.
  that can then be tested or applied
• May be manual (content analysis) or
  computer-assisted (statistical)
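
A computer-assisted description of a corpus can start with very simple statistics. The sketch below is one possible approach, not a standard recipe: it summarizes two hypothetical groups of documents so their regularities and differences can be compared. The group names, sample texts, and chosen measures are assumptions.

```python
import re
from statistics import mean

def describe(docs):
    """Summarize a non-empty list of document strings with a few statistics."""
    token_lists = [re.findall(r"\w+", d.lower()) for d in docs]
    all_tokens = [t for toks in token_lists for t in toks]
    return {
        "documents": len(docs),
        "tokens": len(all_tokens),
        "mean_length": mean(len(toks) for toks in token_lists),
        # Distinct words divided by total words: a crude vocabulary-richness measure.
        "type_token_ratio": len(set(all_tokens)) / len(all_tokens),
    }

# Hypothetical groups of documents (assumption for illustration).
groups = {
    "abstracts": ["We propose a method.", "We evaluate a method on data."],
    "tweets": ["so tired", "so so tired today"],
}
summary = {name: describe(docs) for name, docs in groups.items()}
for name, stats in summary.items():
    print(name, stats)
```

Type-token ratio is sensitive to document length, so comparisons across groups of very different sizes should be read with caution.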
Example: Storylines




            http://xkcd.com/657/
Differences of
format, genre, participants…
• Articles may have sections, but these will
  vary by discipline and type of article
• Books may be fiction or non-fiction (or
  both)
• Transcripts may refer to multiple speakers,
  non-text content
• …ad infinitum
Example: Literature
Fingerprinting




 Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE
 Symposium on Visual Analytics Science and Technology, VAST 2007 (pp.115-122). doi:
 10.1109/VAST.2007.4389004
Organizing documents
Detect similarity between documents and a
known category (or simply among
themselves)

Supports browsing, sentiment
analysis, authorship detection
Example: Bohemian Bookshelf




Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book
Discoveries through
Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, to appear.
Similarity based on…
• common document attributes
    authorship, genre
• common language patterns
    topics, phrases
• common entity references
    characters, citations
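
Two of the similarity bases listed above can be sketched in a few lines. Cosine similarity over word counts stands in for "common language patterns", and Jaccard overlap of name sets stands in for "common entity references". Both measures are common choices but are assumptions here, as are the sample texts and character names.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens in a string."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(x, y):
    """Size of the intersection divided by size of the union of two sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0

# Hypothetical inputs (assumption for illustration).
language_sim = cosine(bag_of_words("the whale and the sea"),
                      bag_of_words("the sea and the storm"))
entity_sim = jaccard({"Ahab", "Ishmael"}, {"Ahab", "Starbuck", "Ishmael"})
print(round(language_sim, 3), round(entity_sim, 3))
```

The two scores can disagree: documents may share vocabulary without sharing any named entities, which is one reason to combine several bases of similarity.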
Example: Quantitative
Formalism




Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An
experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
Example: Clinton’s DNC Speech




                http://b.globe.com/TogUqq
Example: View DHQ




      http://digitalliterature.net/viewDHQ/vis3.html
Classification
• assigning an object to a single class
• often supervised, using an existing
  classification scheme and a tagged corpus
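
To make the supervised case concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, trained on a tiny tagged corpus and assigning each new text to a single class. Naive Bayes is only one possible technique; the labels, training sentences, and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"\w+", text.lower())

def train(tagged_corpus):
    """Collect per-class word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in tagged_corpus:
        word_counts[label].update(tokens(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the single most probable class for `text`."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for tok in tokens(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical tagged corpus (assumption for illustration).
tagged = [
    ("the court ruled on the appeal", "legal"),
    ("the judge dismissed the appeal", "legal"),
    ("the team won the match", "sports"),
    ("the striker scored in the match", "sports"),
]
model = train(tagged)
print(classify("the judge ruled", *model))
print(classify("the striker won", *model))
```

The classifier can only return labels that occur in the tagged corpus, which reflects the dependence on an existing classification scheme.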
Example: Relative signatures




Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level
of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012
(pp. 103-112).
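
For readers unfamiliar with character n-grams, the sketch below builds a relative-frequency profile of overlapping character n-grams and compares a document's profile with a base profile. This is only an illustration of the general idea; it is not the signature computation from the cited paper, and the function names and sample strings are hypothetical.

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_difference(doc, base):
    """For every n-gram in either profile: frequency in doc minus frequency in base."""
    return {g: doc.get(g, 0.0) - base.get(g, 0.0) for g in doc.keys() | base.keys()}

# Hypothetical strings standing in for a document and a reference text.
profile = char_ngram_profile("banana", n=2)
base = char_ngram_profile("bandana", n=2)
diff = profile_difference(profile, base)
print(sorted(diff.items()))
```

Positive values mark n-grams that are relatively more frequent in the document than in the base text, and negative values mark the reverse.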
Categorization
• assigning documents to one or more
  categories
• suggestive of unsupervised clustering
  techniques
• design choices made to fit particular tasks
  or goals
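
An unsupervised grouping can be sketched without any predefined labels. The example below links documents whose word sets overlap above a threshold and returns the resulting connected components as groups. The Jaccard measure, the threshold value, and the sample titles are all assumptions, which echoes the point above about design choices. This simple version places each document in exactly one group, so assigning a document to several categories would need a different method.

```python
import re

def word_set(text):
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.3):
    """Group document indices: two documents are linked when their similarity
    is at least `threshold`; groups are the connected components of the links."""
    sets = [word_set(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical document titles (assumption for illustration).
docs = [
    "gene expression in yeast",
    "gene expression in mice",
    "galaxy cluster survey",
    "survey of galaxy formation",
]
print(cluster(docs))
```

Raising the threshold splits groups apart and lowering it merges them, so the "right" number of categories depends on the task.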
Example: UCSD Map of
Science




Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., &
Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS
ONE, 7(7), e39464.
Example: NIH Map Viewer




        https://app.nihmaps.org/nih/browser/
Reference
systems, infrastructure
What do we gain by adding structure?

What do we lose?
SUMMARIZING DOCUMENTS
Text is only one component of a document.

Research questions often push us to be
creative with how we operationalize
constructs.

The richness of language and documents is
best preserved by using
multiple, complementary approaches.
QUESTIONS?
angela.zoss@duke.edu

Editor's notes

  1. why categorize/organize?