Towards greater transparency in
digital literary analysis
John Lavagnino, King’s College London
8 May 2014
Slides at http://goo.gl/dPGhPw
The plan
1 General reasons for doing digital
analysis, and some present-day trends
2 A recent study that went badly wrong
3 Open and closed techniques
4 Open and closed data
Things not in the plan
Lots of things that aren’t analysis are
valuable:
1 publication and rediscovery (as by the
Women Writers Project, Northeastern
University)
2 discussion, argument, interaction
3 studies of digital culture
4 …
Why people do this
Above all, because you can: a byproduct of the
web and the widespread use of computers is a
wealth of textual data. Without books in
transcribed form much less would happen.
Yes, you can always transcribe some new stuff
yourself, but then you immediately need time
and money before doing anything at all.
You can also work with small amounts of
text, but it tends to get less notice.
What’s harder to do
Texts not in English are less widely available in
digital form and so get analyzed less.
Texts much later than the nineteenth century
are in copyright.
Texts before the nineteenth century pose OCR
problems and have more variable spelling.
It’s not an accident that there are so many
digital studies of nineteenth-century novels.
Why it’s worth doing
When there’s too much to read
When a different kind of attention is
valuable (more systematic? or just very
different from normal reading?)
When it can locate or arrange material as the
basis for more traditional approaches
A recent study that went
badly wrong
The study
Matjaž Perc, “Evolution of the most common
English words and phrases over the
centuries”, Journal of the Royal Society
Interface, 7 December 2012: see
http://goo.gl/7S0RT
Based on Google ngram data: see
www.culturomics.org
A surprising claim about English
Perc, in his abstract: “We find that the most
common words and phrases in any given
year had a much shorter popularity
lifespan in the sixteenth century than they
had in the twentieth century.”
Top 3-grams, 2007 and 2008
See: http://goo.gl/iUS3e
Top 3-grams, early 1520s
See: http://goo.gl/r4eyh
(Note that the 3-grams are case-sensitive.)
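Why case-sensitivity matters can be shown with a minimal sketch. The sample text is invented, and the whitespace tokenization is an assumption of this sketch, not a description of how the Google ngram data was built.

```python
from collections import Counter

def trigram_counts(text, fold_case=False):
    """Count 3-grams over whitespace-separated tokens."""
    if fold_case:
        text = text.lower()
    tokens = text.split()  # naive tokenization, an assumption of this sketch
    return Counter(" ".join(gram) for gram in zip(tokens, tokens[1:], tokens[2:]))

# Invented sample text, not taken from the ngram data.
sample = ("In the name of the Lord . in the name of the king . "
          "In the name of the law")

cased = trigram_counts(sample)
folded = trigram_counts(sample, fold_case=True)

print(cased["In the name"], cased["in the name"])  # counted separately
print(folded["in the name"])                        # merged after case folding
```

In a case-sensitive table a phrase that often begins a sentence is split across two entries, so its rank can differ from what a reader would expect.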
From 1541’s top 3-grams
See: http://goo.gl/r4eyh
Birthdate of Sir Thomas Bodley: 2 March 1545
Top trigram frequencies, 1800-2000
Top trigram frequencies, 1520-1800
Evolution of popularity of the top 100 n-grams over the past five centuries.
Perc M, J. R. Soc. Interface, doi:10.1098/rsif.2012.0491
See: http://goo.gl/2URVT
©2012 by The Royal Society
Some alternative conclusions
about this research
The world’s best mass OCR is bad for books
before 1800
You should read what the providers of your
data say about it: Steven Levitt does
Interdisciplinary journals need to have
reviewers from many fields
Real 1520 trigrams
Perc’s data set contains no true 1520
imprints: his 1520 book is An Open Letter
to the Christian Nobility of the German
Nation, an early-twentieth-century
translation of a book by Martin Luther
published in German in 1520.
Another conclusion
Perc’s publication of his data and an
interface for exploring it is praiseworthy:
this study is very transparent. It’s not just
that the Google data is readily available:
Perc constructed his own tables of the top
ngrams year-by-year and published them
online.
Some very rough numbers for 1520
STC titles published in 1520: 114
In English: 47
(And figures for both 1519 and 1521 are
considerably smaller, because 1520
includes many items dated c.1520.)
Limitations of knowledge
The kind of naïve statistical study Perc
performed assumes an entirely reliable
and consistent data set. The Google ngram
data isn’t like that; and while it could be done
far better, an entirely reliable and consistent
data set for early-sixteenth-century English
is not even possible.
Open and closed techniques
When is language unusual?
A man fires an arrow at a Neanderthal in William
Golding’s novel The Inheritors:
A stick rose upright and there was a lump of bone
in the middle. Lok peered at the stick and the
lump of bone and the small eyes in the bone
things over the face. Suddenly Lok understood
that the man was holding the stick out to him but
neither he nor Lok could reach across the river.
He would have laughed if it were not for the echo
of the screaming in his head. The stick began to
grow shorter at both ends. Then it shot out to
full length again.
An obvious but useful method
David Hoover, “The End of the Irrelevant Text:
Electronic Texts, Linguistics, and Literary
Theory”, Digital Humanities Quarterly 1:2
(2007), used Google to find other instances of
the oxymoronic phrase “grew shorter”.
When referring to physical objects (and not
lectures, distances, patience, …) it’s not about
sticks, it’s about fuses, candles, cigarettes…
(in use), and articles of clothing, hair… (over
time).
Literary significance
Hoover: “Part of the power of ‘the stick
began to grow shorter at both ends’ is in
the shape of Lok’s incomprehension. For
Lok, the whole world is alive, so that a
stick that changes length is perfectly
comprehensible.”
Problems of technique
What forms do you look for? Hoover’s
investigation looked both at the words
Golding used and at the concept of objects
growing shorter.
Searches can give very different results with
slight differences in query.
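A small sketch of how much the form of the query matters. Only the third sentence is Golding's; the other three are invented for illustration, and the regular expressions are one possible way of widening the search, not Hoover's procedure.

```python
import re

# Sentence 3 is from The Inheritors; the others are invented examples.
text = ("The candle grew shorter as the night wore on. "
        "His patience was growing shorter by the hour. "
        "The stick began to grow shorter at both ends. "
        "Her hair grows shorter every summer.")

queries = {
    "exact phrase": r"\bgrew shorter\b",
    "base form": r"\bgrow shorter\b",
    "any inflection": r"\bgr(?:ow|ows|ew|owing|own) shorter\b",
}

# Count how many matches each query finds in the same text.
hits = {name: len(re.findall(pattern, text)) for name, pattern in queries.items()}

for name, count in hits.items():
    print(name, count)
```

Each narrow query finds a single sentence here, while the inflection-aware one finds all four, including the figurative use that a study of physical objects would then have to set aside by hand.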
It really is true
Geoffrey Pullum, “The sparseness of
linguistic data”, Language Log, 7 April
2014: “it really is true that the probability
for most grammatical sequences of words
actually having turned up on the web
really is approximately zero, so
grammaticality cannot possibly be reduced
to probability of having actually occurred.”
Complex techniques: PCA
Larry L. Stewart, “Charles Brockden Brown:
Quantitative Analysis and Literary
Interpretation”, Literary and Linguistic
Computing, June 2003: among other
things, a study of Brown’s novels Wieland
and Carwin, and the distinctiveness of the
narrating voices of Clara and Carwin.
Clara and Wieland as narrators
What is that graph based on?
PCA, or Principal Component
Analysis, takes as input numerous textual
features you choose, and tries to create
“components” that capture as much of the
variation in the texts as possible: reducing
the dozens of dimensions needed to show
all these things down to two that roll
together a lot of what’s going on (about
half of it, in this case).
Principal components
This reduction is automatic, and is not really
a statistical analysis, only a rearrangement
of the data. But it does show us groupings
of the chapters based on part of the actual
data, with Clara’s narration in Wieland
having more exclamation points and
dashes and fewer instances of “our”;
combining these into one feature makes it
easier to see.
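A minimal sketch of what "combining these into one feature" means in practice. The chapter figures are invented and do not reproduce Stewart's data or results; the three features are the ones named above. Power iteration is used here only to keep the example dependency-free; PCA software normally uses an eigen-decomposition or SVD, and this sketch finds the first component only.

```python
import math

# Invented rates per 1,000 words for six hypothetical chapters:
# (exclamation points, dashes, occurrences of "our").
chapters = {
    "Clara 1": (12, 9, 2), "Clara 2": (14, 11, 1), "Clara 3": (13, 10, 3),
    "Carwin 1": (4, 3, 7), "Carwin 2": (5, 2, 8), "Carwin 3": (3, 4, 6),
}

def center(rows):
    """Subtract each feature's mean so the data is centred on zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: direction of greatest variance and its variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

centred = center([list(map(float, row)) for row in chapters.values()])
cov = covariance(centred)
weights, variance = first_component(cov)

# Share of the total variation captured by this one component.
share = variance / sum(cov[i][i] for i in range(len(cov)))

# Each chapter's score is a weighted combination of its three features.
scores = {name: sum(w * x for w, x in zip(weights, row))
          for name, row in zip(chapters, centred)}

print("weights:", [round(w, 2) for w in weights])
print("share of variation:", round(share, 2))
for name, score in scores.items():
    print(name, round(score, 2))
```

The weights show which features pull a chapter towards one end of the axis or the other. Nothing is tested or inferred: the same numbers are simply re-expressed along a new axis, which is why the component still has to be read back against the text.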
What is that graph based on?
Can we get back to the text?
Yes, in that Stewart tells us what goes into
the first principal component (though not
the others).
No, in that he doesn’t show any passages
and analyze them in these terms.
And no, in that a component is a complex
weighted combination of parts of features.
Graphs need analysis
It is still common to treat graphs and other
visualizations as results, not as texts that
themselves need interpretation. Yet they’re
only of interest if they support substantial
discussion and analysis, and that ought to
appear in the article. Stewart has a
literary-critical discussion of the novels in
light of this analysis: but why not a few
pages first on the graph?
Graphs need interaction
You publish one or two or six graphs in an
article, not two hundred, because they take
up a lot of space. But if a graph’s worth
doing at all it’s worth doing
differently, and the best way to explore
this kind of study is to try out variations
yourself.
For all its flaws, this is one thing the Google
ngrams resource got right.
Open and closed data
Big uncurated data
Ted Underwood, Michael L. Black, Loretta
Auvil, and Boris Capitanu, “Mapping
Mutable Genres in Structurally Complex
Volumes” (2013), at
http://arxiv.org/abs/1309.3323: the study
analyzes “a collection of 469,200 volumes
drawn from HathiTrust Digital Library”.
That’s an open data collection provided by
libraries involved in Google Books.
How do you read 469,200 books?
You start by figuring out how to find the text
in them, by skipping things like bookplates
and tables of contents. (The bookplates are
a reason why Google Books and Google
ngrams studies of the word “library” run
into problems.) Without doing that first
you can’t go on to study (as they are) the
percentage of first-person novels over
time.
But it’s not really transparent now
If you need to do that much to the books
before you can analyze them, others either
need to duplicate all of that preliminary
work or get the results of your preliminary
work.
Much work on big data elsewhere is based
on data that is simpler in form than books
are, or has been prepared for use first (at
someone’s expense).
Curated rather than raw texts
These exist in the humanities, but not
necessarily where you want to work or in
the numbers you desire. Another C19-novel
study by Matthew Wilkens used
texts fixed up at Indiana University, with
fewer textual errors and clearly-defined
structure; but that meant he also had a lot
fewer of them.
Specially prepared data
Once it was more common for digital-humanities
work to involve creation of
new data for analysis: not just basic
texts, but also analysis or extraction of
features by hand as a basis for analysis.
For example, Brad Pasanek and D.
Sculley, “Mining millions of
metaphors”, Literary and Linguistic
Computing, September 2008.
Pasanek’s collection
See http://metaphors.lib.virginia.edu/ for
his Mind is a Metaphor
collection, assembled to support a study of
C18 thinking on the subject; a collection
based in the first instance on doing lots of
searches, extended over the course of
many years by several hands.
A little on how it’s done
Pasanek: “At present I still spend a fair amount of time
conducting proximity searches for two character strings. I search
one term from a set list ("mind," "heart," "soul," "thought,"
"idea," "imagination," "fancy," "reason," "passion," "head,"
"breast," "bosom," or "brain") against another word that I hope
will prove metaphorical. For example, I search for "mind" within
one hundred characters of "mint" and find the following couplet
in William Cowper's poetry:
“The mind and conduct mutually imprint
And stamp their image in each other's mint.””
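The proximity search Pasanek describes can be sketched as follows. This is not his tooling: the function name is hypothetical, and "within one hundred characters" is read here as the number of characters between the end of one term and the start of the other.

```python
import re

def proximity_hits(text, term_a, term_b, window=100):
    """Return passages where term_a and term_b lie within `window` characters."""
    def spans(term):
        # Whole-word, case-insensitive matches (an assumption of this sketch).
        pattern = r"\b%s\b" % re.escape(term)
        return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]

    hits = []
    for a_start, a_end in spans(term_a):
        for b_start, b_end in spans(term_b):
            # Characters separating the two terms, whichever comes first.
            gap = b_start - a_end if b_start >= a_end else a_start - b_end
            if 0 <= gap <= window:
                hits.append(text[min(a_start, b_start):max(a_end, b_end)])
    return hits

couplet = ("The mind and conduct mutually imprint\n"
           "And stamp their image in each other's mint.")

hits = proximity_hits(couplet, "mind", "mint")
for passage in hits:
    print(passage)
```

A search like this only proposes candidates; deciding whether "mint" is being used metaphorically of the mind is still a matter of reading each hit.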
Creating data as a scholarly activity
The collection itself is a major effort (and not
everyone would have made it public in this
way prior to publishing their monograph).
Creation of this kind of resource is not yet
widely recognized as valuable scholarship:
the usual focus is on “uninterpreted”
transcription.
And some data comes from sources that cannot
be made generally available (copyright
again).
Are we satisfied?
Over half the metaphors come from
searching Chadwyck-Healey collections of
texts; about a third from reading.
There’s transparency in that Pasanek
explains in detail how he assembled his
collection; but it would be a challenge to
assemble a rival corpus to compare with
this one. Such an effort shouldn’t really be
an individual one, but usually will be.
Conclusions
There’s a potential for openness in new
approaches but some challenges: new
forms of publication appropriate for new
kinds of work, balancing openness and
scholarly recognition, copyright.
We need to find out interesting things to
motivate the changes greater transparency
requires.
Thank you!
Please contact me at
John.Lavagnino@kcl.ac.uk
Slides at http://goo.gl/dPGhPw
  • 1. Towards greater transparency in digital literary analysis John Lavagnino, King‟s College London 8 May 2014 Slides at http://goo.gl/dPGhPw
  • 2. The plan 1 General reasons for doing digital analysis, and some present-day trends 2 A recent study that went badly wrong 3 Open and closed techniques 4 Open and closed data
  • 3. Things not in the plan Lots of things that aren‟t analysis are valuable: 1 publication and rediscovery (as by the Women Writers Project, Northeastern University) 2 discussion, argument, interaction 3 studies of digital culture 4 …
  • 4. Why people do this Above all, because you can: a byproduct of the web and the widespread use of computers is a wealth of textual data. Without books in transcribed form much less would happen. Yes, you can always transcribe some new stuff yourself, but then you immediately need time and money before doing anything at all. You can also work with small amounts of text, but it tends to get less notice.
  • 5. What‟s harder to do Texts not in English are less widely available in digital form and so get analyzed less. Texts much later than the nineteenth century are in copyright. Texts before the nineteenth century pose OCR problems and have more variable spelling. It‟s not an accident that there are so many digital studies of nineteenth-century novels.
  • 6. Why it‟s worth doing When there‟s too much to read When a different kind of attention is valuable (more systematic? or just very different from normal reading?) When it can locate or arrange material as the basis for more traditional approaches
  • 7. A recent study that went badly wrong
  • 8. The study Matjaž Perc, “Evolution of the most common English words and phrases over the centuries”, Journal of the Royal Society Interface, 7 December 2012: see: http://goo.gl/7S0RT Based on Google ngram data: see www.culturomics.org
  • 9. A surprising claim about English Perc, in his abstract: “We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century.”
  • 10. Top 3-grams, 2007 and 2008 See: http://goo.gl/iUS3e
  • 11. Top 3-grams, early 1520s See: http://goo.gl/r4eyh (Note that the 3-grams are case-sensitive.)
  • 12. From 1541‟s top 3-grams See: http://goo.gl/r4eyh Birthdate of Sir Thomas Bodley: 2 March 1545
  • 15. Evolution of popularity of the top 100 n-grams over the past five centuries. Perc M J. R. Soc. Interface doi:10.1098/rsif.2012.0491 See: http://goo.gl/2URVT ©2012 by The Royal Society
  • 16. Some alternative conclusions about this research The world‟s best mass OCR is bad for books before 1800 You should read what the providers of your data say about it: Steven Levitt does Interdisciplinary journals need to have reviewers from many fields
  • 17. Real 1520 trigrams Perc‟s data set contains no true 1520 imprints: his 1520 book is An Open Letter to the Christian Nobility of the German Nation, an early-twentieth-century translation of a book by Martin Luther published in German in 1520.
  • 18. Another conclusion Perc‟s publication of his data and an interface for exploring it is praiseworthy: this study is very transparent. It‟s not just that the Google data is readily available: Perc constructed his own tables of the top ngrams year-by-year and published them online.
  • 19. Some very rough numbers for 1520 STC titles published in 1520: 114 In English: 47 (And figures for both 1519 and 1521 are considerably smaller, because 1520 includes many items dated c.1520.)
  • 20. Limitations of knowledge The kind of naïve statistical study Perc performed assumes an entirely reliable and consistent data set. The Google ngram data isn‟t like that, but while it can be done far better, a data set for early-sixteenth- century English of that kind is not even possible.
  • 21. Open and closed techniques
  • 22. When is language unusual? A man fires an arrow at a Neanderthal in William Golding‟s novel The Inheritors: A stick rose upright and there was a lump of bone in the middle. Lok peered at the stick and the lump of bone and the small eyes in the bone things over the face. Suddenly Lok understood that the man was holding the stick out to him but neither he nor Lok could reach across the river. He would have laughed if it were not for the echo of the screaming in his head. The stick began to grow shorter at both ends. Then it shot out to full length again.
  • 23. An obvious but useful method David Hoover, “The End of the Irrelevant Text: Electronic Texts, Linguistics, and Literary Theory”, Digital Humanities Quarterly 1:2 (2007), used Google to find other instances of the oxymoronic phrase “grew shorter”. When referring to physical objects (and not lectures, distances, patience, …) it’s not about sticks, it’s about fuses, candles, cigarettes… (in use), and articles of clothing, hair... (over time).
  • 24. Literary significance Hoover: “Part of the power of ‘the stick began to grow shorter at both ends’ is in the shape of Lok’s incomprehension. For Lok, the whole world is alive, so that a stick that changes length is perfectly comprehensible.”
  • 25. Problems of technique What forms do you look for? Hoover’s investigation looked both at the words Golding used and at the concept of objects growing shorter. Searches can give very different results with slight differences in query.
  • 26. It really is true Geoffrey Pullum, “The sparseness of linguistic data”, Language Log, 7 April 2014: “it really is true that the probability for most grammatical sequences of words actually having turned up on the web really is approximately zero, so grammaticality cannot possibly be reduced to probability of having actually occurred.”
  • 27. Complex techniques: PCA Larry L. Stewart, “Charles Brockden Brown: Quantitative Analysis and Literary Interpretation”, Literary and Linguistic Computing, June 2003: among other things, a study of Brown’s novels Wieland and Carwin, and the distinctiveness of the narrating voices of Clara and Carwin.
  • 28. Clara and Wieland as narrators
  • 29. What is that graph based on? PCA, or Principal Component Analysis, takes as input numerous textual features you choose, and tries to create “components” that capture as much of the variation in the texts as possible: reducing the dozens of dimensions needed to show all these things down to two that roll together a lot of what’s going on (about half of it, in this case).
  • 30. Principal components This reduction is automatic, and it is not really a statistical analysis, only a rearrangement of the data. But it does show us groupings of the chapters based on part of the actual data, with Clara’s narration in Wieland having more exclamation points and dashes and fewer instances of “our”; combining these into one feature makes it easier to see.
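A minimal sketch of the kind of reduction described on slides 29 and 30, assuming NumPy is available. The per-chapter feature values below are invented for illustration and are not Stewart's data; standardising the features and computing the components by SVD are also assumptions here, since the slides do not say how Stewart's analysis was set up. With so few toy features the share of variation captured will not resemble the "about half" reported for the real study.

```python
import numpy as np


def pca(features, n_components=2):
    """Return (scores, loadings, explained) for a chapters-by-features matrix."""
    X = np.asarray(features, dtype=float)
    # Standardise each feature so that rates on different scales are comparable.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardised matrix: the rows of Vt are the component directions.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Share of the total variation captured by each component.
    explained = S**2 / np.sum(S**2)
    # Coordinates of each chapter on the retained components.
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained[:n_components]


# Hypothetical rates per 1,000 words: exclamation points, dashes, "our".
feature_names = ["exclamation points", "dashes", "our"]
chapters = [
    [9.0, 14.0, 1.0],   # invented "narrator A" chapters
    [8.0, 12.0, 2.0],
    [10.0, 15.0, 1.0],
    [2.0, 4.0, 6.0],    # invented "narrator B" chapters
    [3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0],
]

scores, loadings, explained = pca(chapters)

# The loadings show how each feature is weighted within a component.
for name, weight in zip(feature_names, loadings[0]):
    print(f"{name}: {weight:+.2f}")
print("share of variation on two components:", round(float(explained.sum()), 2))
```

Plotting the two columns of `scores` against each other gives a chart of the same general kind as the one on slide 28, and the `loadings` rows are the weighted combinations of features that slide 32 returns to.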
  • 31. What is that graph based on?
  • 32. Can we get back to the text? Yes, in that Stewart tells us what goes into the first principal component (though not the others). No, in that he doesn’t show any passages and analyze them in these terms. And no, in that a component is a complex weighted combination of parts of features.
  • 33. Graphs need analysis It is still common to treat graphs and other visualizations as results, not as texts that themselves need interpretation. Yet they’re only of interest if they support substantial discussion and analysis, and that ought to appear in the article. Stewart has a literary-critical discussion of the novels in light of this analysis: but why not a few pages first on the graph?
  • 34. Graphs need interaction You publish one or two or six graphs in an article, not two hundred, because they take up a lot of space. But if a graph’s worth doing at all it’s worth doing differently, and the best way to explore this kind of study is to try out variations yourself. For all its flaws, this is one thing the Google ngrams resource got right.
  • 36. Big uncurated data Ted Underwood, Michael L. Black, Loretta Auvil, and Boris Capitanu, “Mapping Mutable Genres in Structurally Complex Volumes” (2013), at http://arxiv.org/abs/1309.3323: the study analyzes “a collection of 469,200 volumes drawn from HathiTrust Digital Library”. That’s an open data collection provided by libraries involved in Google Books.
  • 37. How do you read 469,200 books? You start by figuring out how to find the text in them, by skipping things like bookplates and tables of contents. (The bookplates are a reason why Google Books and Google ngrams studies of the word “library” run into problems.) Without doing that first you can’t go on to study (as they are) the percentage of first-person novels over time.
  • 38. But it’s not really transparent now If you need to do that much to the books before you can analyze them, others either need to duplicate all of that preliminary work or get the results of your preliminary work. Much work on big data elsewhere is based on data that is simpler in form than books are, or has been prepared for use first (at someone’s expense).
  • 39. Curated rather than raw texts These exist in the humanities, but not necessarily where you want to work or in the numbers you desire. Another C19-novel study by Matthew Wilkens used texts fixed up at Indiana University, with fewer textual errors and clearly-defined structure; but that meant he also had a lot fewer of them.
  • 40. Specially prepared data Once it was more common for digital-humanities work to involve creation of new data for analysis: not just basic texts, but also analysis or extraction of features by hand as a basis for analysis. For example, Brad Pasanek and D. Sculley, “Mining millions of metaphors”, Literary and Linguistic Computing, September 2008.
  • 41. Pasanek’s collection See http://metaphors.lib.virginia.edu/ for his Mind is a Metaphor collection, assembled to support a study of C18 thinking on the subject; a collection based in the first instance on doing lots of searches, extended over the course of many years by several hands.
  • 42. A little on how it‟s done Pasanek: “At present I still spend a fair amount of time conducting proximity searches for two character strings. I search one term from a set list ("mind," "heart," "soul," "thought," "idea," "imagination," "fancy," "reason," "passion," "head," "breast," "bosom," or "brain") against another word that I hope will prove metaphorical. For example, I search for "mind" within one hundred characters of "mint" and find the following couplet in William Cowper's poetry: “The mind and conduct mutually imprint And stamp their image in each other's mint.””
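The proximity search Pasanek describes can be sketched roughly as follows. This is an illustration only, not the search interface he actually used: the function names are hypothetical, and "within one hundred characters" is assumed here to mean the gap between the end of the earlier word and the start of the later one.

```python
import re


def word_spans(text, word):
    """Return (start, end) positions of whole-word, case-insensitive matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def proximity_search(text, term, other, window=100):
    """Return passages in which `term` and `other` lie within `window` characters."""
    hits = []
    for t_start, t_end in word_spans(text, term):
        for o_start, o_end in word_spans(text, other):
            # Gap between the two words, whichever of them comes first.
            gap = o_start - t_end if o_start >= t_end else t_start - o_end
            if 0 <= gap <= window:
                hits.append(text[min(t_start, o_start):max(t_end, o_end)])
    return hits


# The Cowper couplet quoted on the slide.
couplet = (
    "The mind and conduct mutually imprint\n"
    "And stamp their image in each other's mint."
)

hits = proximity_search(couplet, "mind", "mint")
for passage in hits:
    print(passage)
```

A search like this only proposes candidates: as the slide says, the second word is one the searcher hopes will prove metaphorical, so each hit still has to be read and judged by hand.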
  • 43. Creating data as a scholarly activity The collection itself is a major effort (and not everyone would have made it public in this way prior to publishing their monograph). Creation of this kind of resource is not yet widely recognized as valuable scholarship: the usual focus is on “uninterpreted” transcription. And some data comes from sources that cannot be made generally available (copyright again).
  • 44. Are we satisfied? Over half the metaphors come from searching Chadwyck-Healey collections of texts; about a third from reading. There’s transparency in that Pasanek explains in detail how he assembled his collection; but it would be a challenge to assemble a rival corpus to compare with this one. Such an effort shouldn’t really be an individual one, but usually will be.
  • 45. Conclusions There’s a potential for openness in new approaches but some challenges: new forms of publication appropriate for new kinds of work, balancing openness and scholarly recognition, copyright. We need to find out interesting things to motivate the changes greater transparency requires.
  • 46. Thank you! Please contact me at John.Lavagnino@kcl.ac.uk Slides: at http://goo.gl/dPGhPw