The Bones of a Bestseller:
Visualizing Fiction
Lynn Cherny
@arnicas
OpenvisConf 2013
Monday, June 17, 13
Language, Sex, Violence
(also spoilers)
TEXT
Book stars for today
Study what’s popular, because it tells us something
about people.
Additionally, I want to illustrate using some statistical tricks on data, particularly some simple machine learning – and the tools I
built to visualize those results.
http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations
Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the
Economist’s blog!
Can we do this automatically?
Text Classification (Commonly)
§ “Bag of words” – each document is considered a collection of words, independent of order
§ Frequencies of certain words are used to identify the texts
Seems like this should work with sex scenes,
right? Only so many body parts and behaviors,
right?!
The way everyone would start this problem…
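A minimal sketch of the "bag of words" idea, using only the Python standard library. The two sample sentences and the tokenizing regex are made up for illustration; the point is that word order is thrown away and only counts remain.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word -> count for one document, ignoring word order."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Two illustrative snippets with the same words in a different order
chunk_a = "He kissed her. She kissed him back."
chunk_b = "She kissed him back. He kissed her."

bag_a = bag_of_words(chunk_a)
bag_b = bag_of_words(chunk_b)

print(bag_a.most_common(1))  # the most frequent word and its count
print(bag_a == bag_b)        # same bag, even though the order differs
```

Because both snippets reduce to the same counts, any classifier built on these features cannot tell them apart, which is exactly the trade-off the technique makes.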
Data                          Label
Estdsgfd fdsatreatret dfds    Yes
Dsrdsf drerear ewrewtrew      No
Reret retdrtd rewrewrtew      Yes
Dsfgdg fdsfd                  Yes
Algorithm
Train
Test
New data in the wild
Supervised learning: Have some data you label with the truth, and feed it into some code to learn what the truth is all about.
To do this properly, you divide the data up in a training set, and an evaluation set – and you see how your code did on the
evaluation set: how much did it get right?
Once you’re satisfied with the tweaks on the classifier code, you can use it on new data in the wild.
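The train/evaluate split described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the talk: the 25% hold-out fraction, the seed, the toy labeled data, and the placeholder `always_no` "classifier" are all hypothetical.

```python
import random

def split_train_test(labeled, test_fraction=0.25, seed=13):
    """Shuffle labeled examples and hold out a fraction for evaluation."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    """Fraction of held-out examples the predictor labels correctly."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

def always_no(text):
    """Placeholder predictor; a trained classifier would go here."""
    return "no"

# Toy labeled data: (text, label) pairs
labeled = [("chunk %d" % i, "yes" if i % 2 else "no") for i in range(20)]
train, test = split_train_test(labeled)

print(len(train), len(test))
print(accuracy(always_no, test))
```

Scoring only on the held-out set answers "how much did it get right?" on data the classifier never saw during training.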
Sex Scene Detection First Steps
1. Buy 50 Shades on Amazon, unlock text in
Calibre, save as TXT file.
2. Cut up a doc into 500 “word” chunks using
Python
3. Try to label each chunk:
   § “not sexy” (e.g., paperwork, taxes, calls to Mom)
   § “maybe steamy” (e.g., kissing, limited touching, long looks)
   § “sexy!” (fill in the ____ here)
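Step 2 above might look something like the following sketch. It assumes "word" simply means a whitespace-separated token, and it works on an in-memory string rather than the TXT file from step 1; the function name is hypothetical.

```python
def chunk_words(text, size=500):
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Stand-in for the book text: 1200 dummy "words"
sample = " ".join("word%d" % i for i in range(1200))
chunks = chunk_words(sample)

print([len(c.split()) for c in chunks])  # → [500, 500, 200]
```

The last chunk is usually shorter than the rest, which is worth remembering when comparing ratings across chunks.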
“Would you like to sit?” He waves me toward an L-shaped white leather couch.
His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a
modern dark wood desk that six people could comfortably eat around. It matches the
coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the
wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a
square. They are exquisite—a series of mundane, forgotten objects painted in such precise
detail they look like photographs. Displayed together, they are breathtaking.
“A local artist. Trouton,” says Grey when he catches my gaze.
“They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and
the paintings. He cocks his head to one side and regards me intently.
“I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable
reason I find myself blushing.
Sample of 50 Shades of Grey
What the text looks like…
Outsourced to Mechanical Turk
Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It
was super easy (to spend a lot of money on this). So I did (spend a lot of money).
WHAT’S A SEX SCENE,
ANYWAY?
But let’s step back a little…
Zara.com
Lots would say this is sexy (maybe not all women, though).
http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a
Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm.
Also, this set is on Ebay in the UK if you’re into it.
trendir.com
So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose
there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is
that in a sex scene, or not? Tough to say.
Sexually Exxxplicit,
but still a
http://www.icts.uiowa.edu/sites/default/files/contract.jpg
Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a
contract. Sexy, or not? Probably not to most…
Results from Mechanical Turk as a CSV file.
How’d the raters do?
Sex Scenes
Steamy Scenes
We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of
the book.
Comparing to “Pornographic”…
A pretty good match, actually. Good for the Turkers and the porno-graphic team!
Comparing:
Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.
On to the learning algorithm…
The training data:
-The text chunks
-The score the raters gave it (averaged) as “truth”
I started with Python’s NLTK (Natural Language
Toolkit) and Naïve Bayes for classifying (working
in an ipython notebook).
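The averaged "truth" score could be computed along these lines. This is a sketch only: the column names `chunk_id`, `rater`, and `score` and the sample rows are hypothetical, and a real Mechanical Turk results file has a different layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: one row per (chunk, rater) judgment
turk_csv = io.StringIO(
    "chunk_id,rater,score\n"
    "0,A,0\n"
    "0,B,1\n"
    "1,A,2\n"
    "1,B,2\n"
)

# Collect every rater's score for each chunk
scores = defaultdict(list)
for row in csv.DictReader(turk_csv):
    scores[int(row["chunk_id"])].append(float(row["score"]))

# Average across raters to get one "truth" value per chunk
truth = {chunk: sum(vals) / len(vals) for chunk, vals in scores.items()}

# Chunks where the raters did not give the same score
disagreements = [chunk for chunk, vals in scores.items() if len(set(vals)) > 1]

print(truth)          # → {0: 0.5, 1: 2.0}
print(disagreements)  # → [0]
```

Keeping the list of disagreements around makes it easy to go back and inspect the chunks where the raters diverged.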
NLTK Naïve Bayes not so great
on 50 Shades… 68%.
“packet” (they use a lot of condoms)
NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.
Python’s sklearn (scikit-learn)
Lots of classifiers for
sparse data like text!
http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
This is an illustration (not by me) of how many classifiers in scikit-learn can be used on text. I picked one with generally good performance, Stochastic Gradient Descent, to see how it compared to Naïve Bayes. Notice there’s a Passive
Aggressive Classifier, too. Best.name.ever.
Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my
later samples was in present tense)
Pipelines in sklearn make it incredibly easy to run lots of experiments.
Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”)
Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the
answers.
Now we’re at 88%
Just to show you how little code it is to run a classifier pipeline – and check the results.
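The slide's code is not in this transcript, so here is a rough sketch of what a scikit-learn text pipeline with an SGD classifier can look like. The toy sentences and labels are invented, the TF-IDF step is an assumption, and the lemmatizer step mentioned above is left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Toy stand-ins for labeled chunks of the first book (1 = scene, 0 = not)
train_texts = [
    "he kissed her neck and pulled her close",
    "she signed the tax forms and called her mother",
    "his hands slid down her back as they kissed",
    "the meeting about the quarterly paperwork ran long",
]
train_labels = [1, 0, 1, 0]

# Toy stand-ins for labeled chunks of a second book
test_texts = ["they kissed in the elevator", "more paperwork arrived by mail"]
test_labels = [1, 0]

# Bag-of-words features feeding a linear classifier trained with SGD
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])

pipeline.fit(train_texts, train_labels)     # fit on training data and targets
predicted = pipeline.predict(test_texts)    # test on new data
score = accuracy_score(test_labels, predicted)
print(score)
```

Swapping the classifier or the feature step means changing one entry in the `Pipeline` list, which is what makes running many experiments cheap.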
Interpreting the results…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the
sex scenes, and roll over each little block to inspect the text itself. Useful!
Really amazing P.S. here…
I paid for coding of a bunch of fan-fiction for sex
scenes too, and fed them in to the SGD classifier.
(Recall that 50 Shades started life as Twilight
fanfic.)
97% accuracy achieved!*
*cross-validating with entire set, not just 50 Shades books.
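For readers unfamiliar with cross-validation, this is a small standard-library sketch of how k-fold splits are formed. The number of folds used in the talk is not stated; `k=5` and the function name are assumptions for illustration.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    # Interleaved folds: item i goes to fold i % k
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(test_idx, "held out;", len(train_idx), "used for training")
```

The reported accuracy is then the average over the folds, so every labeled chunk contributes to the evaluation once.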
I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…
But hey -
I SAID I’D TALK ABOUT
STORY ARCS TOO
Switching back to another theme… overall arc of action in a novel.
http://www.musik-therapie.at/PederHill/Structure&Plot.htm
The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…
PLEASE. IF YOU WRITING SCREENPLAY.
HULK TELLING YOU. THE 3 ACT
STRUCTURE = GARBAGE.
STOP CITING IT IN ARTICLES.
STOP TALKING ABOUT IT WITH
FRIENDS.
IT WILL NOT HELP YOU.
STAY THE FUCK AWAY FROM ANYONE
WHO EVEN CLAIM IT EXIST. IF THEY
SAY IT DO. SAY “OF COURSE SHIT HAS
BEGINNING, MIDDLE, AND ENDING
YOU INSUFFERABLE TURD” THEN
THROW A DRINK IN THEIR FACE AND
RUN AWAY…
http://filmcrithulk.wordpress.com/2011/07/07/hulk-presents-the-myth-of-3-act-structure/
“The HULK Presents the Myth of the 3-Act Structure”
But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject.
Totally worth reading the essay.
Vonnegut - http://thedesigngym.com/simpleshapesofstories/
Vonnegut is here talking about the sentiment of events, not really “tension” or “excitement” or rising action – but there are still
structural differences across each book/story. The question is, what do best sellers look like over the course of
the whole story, by whatever measure best illustrates the pacing/movement of the story?
Lower right hand corner is me working on this talk.
Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.
http://24.media.tumblr.com/ba77d04cb210b8e24ff73a49a19b3111/tumblr_mfc6dv2SER1qh66wqo1_1280.jpg
Did this cheer you up? It’s better than a kitten! IMO.
So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?
Can we detect exciting scenes?
Back to Mechanical Turk, with Dan Brown books:
   § 2 raters again, chunks of 500 words
Odd factoid: I got ratings of sex scenes in 2-4 hours.
It took ~13 hours to get Dan Brown action scenes.
Using action/exciting scenes as proxy for major events in a book…
“ACTION” SCENES ARE
TOUGH, TOO
Brief digression on how hard this is.
Raven.theraider.net
This seems obvious… fights, chases, etc.
Objects in the mirror are closer than they
appear
www.badhaven.com / Jurassic Park
Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without
context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You
might look up “bathos” here too.)
Almost naked, Silas hurled his pale body down the staircase. He knew he
had been betrayed, but by whom? When he reached the foyer, more
officers were surging through the front door. Silas turned the other way and
dashed deeper into the residence hall. The women's entrance. Every Opus
Dei building has one. Winding down narrow hallways, Silas snaked through
a kitchen, past terrified workers, who left to avoid the naked albino as he
knocked over bowls and silverware, bursting into a dark hallway near the
boiler room. He now saw the door he sought, an exit light gleaming at the
end.
Running full speed through the door out into the rain, Silas leapt off the
low landing, not seeing the officer coming the other way until it was too
late. The two men collided, Silas's broad, naked shoulder grinding into the
man's sternum with crushing force. He drove the officer backward onto the
pavement, landing hard on top of him. The officer's gun clattered away. Silas
could hear men running down the hall shouting. Rolling, he grabbed the
loose gun just as the officers emerged. A shot rang out on the stairs, and
Silas felt a searing pain below his ribs. Filled with rage, he opened fire at
all three officers, their blood spraying.
A dark shadow loomed behind, coming out of nowhere. The angry hands
that grabbed at his bare shoulders felt as if they were infused with the
power of the devil himself. The man roared in his ear. SILAS, NO!
Silas spun and fired. Their eyes met. Silas was already screaming in
horror as Bishop Aringarosa fell.
Chapter 96
DaVinci Code
A sample of the text in question… This is an action scene.
SO WHAT ABOUT “BAGS OF
WORDS” HERE?
Text content worked for sex scenes…..
SGD Classifier on “exciting” scenes washed out –
about 60% accuracy on Dan Brown.
It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.
LDA Topic Analysis
Topic analysis produces associations between words
and chunks of text, by probabilistic methods.
“Topics” are described by lists of most informative
words.
A topic may be associated with multiple documents.
I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
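To make the idea concrete, here is a small sketch of LDA topic analysis using scikit-learn. This is a stand-in and not the tool used for the talk (the references list a separate topic-modeling tool); the four one-line "chapters" and the choice of two topics are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented one-line "chapters" standing in for real chapter text
chapters = [
    "the symbologist studied the cipher and the ancient symbol",
    "the albino monk ran from the police through the church",
    "the cipher hid a symbol inside the ancient painting",
    "police chased the monk past the church into the rain",
]

# Bag-of-words counts, one row per chapter
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(chapters)

# Fit a 2-topic model; doc_topics has one row per chapter, one column per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Invert the vocabulary so column indices map back to words
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Describe each topic by its highest-weighted words
for topic_id, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(topic_id, [index_to_word[i] for i in top])

# The best-matching topic for each chapter
print(doc_topics.argmax(axis=1))
```

The `doc_topics` matrix is the piece that links topics back to ordered chapters, which is the relationship the visualizations that follow try to keep visible.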
Blei (2011) from http://www.scottbot.net/HIAL/?p=221
A snippet from a classic article.
Elijah Meeks: https://dhs.stanford.edu/comprehending-the-digital-humanities/topics/
A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text.
But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…
Another tool: DaVinci Code topics to chapters mapping
“Excitement” rating color scale, avg by chapter, ordered (obviously)
Topics (48ish) per chapter (108)
Chapter 1… to Chapter 108
Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching
topics.
Ah, but since it’s svg/d3… var chart = chart.append("g").attr("transform", "translate(0," + y + ") rotate(90 600 600)");
But, maybe I need chapter
summaries…. So I can relate
them to the topics?
Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as
text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use svg cute rotate trick and some
resizing…!
Add some topic-tooltips
and fade-outs….
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created
for each chapter; the originals were pretty hilarious…
This project featured a Crayola color scheme.
http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors
This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I
used till they were all more or less readable and distinguishable!
Maybe I need One More Tool. Any word relations of interest?
Let’s try a hairball…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
For this tool I used Jim Vallandingham’s network code from the flowingdata tutorial – my first major use of CoffeeScript. I
intended to try the radial layout, but ended up not doing so. There was a lot of preprocessing of the stats to get the topics related to
chapter “excitingness.”
Small “constellations” show shared words (an accident that’s useful!)
Filtered to only the “exciting” nodes…
On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little
constellations for them, which I liked. This could’ve been simplified, of course.
Maybe pretty, but
THAT FELT LIKE A DEAD
END.
No structure visible, no story structure… In using just the bag of words, I’d lost all structure or relationships across time, which
seems important to me for things like pacing in a novel.
Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html
Covered up for
cheap theatrics…
Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for another
talk. It was a simple visual of dialog vs. exposition.
Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html
At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the end
was because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?
Back to Python.
§ Chunk book by chapter, get POS tags, punctuation,
and word counts + more for each chapter…
§ Import scores from Turkers id’ing which bits are
exciting/action, incorporate with the other data.
New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
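A rough sketch of the per-chapter counting step, using only the standard library. It assumes chapters start on lines like "Chapter 12" and counts straight double quotes as a dialog signal; both are assumptions about the text files, and POS tagging (which would need a tagger such as NLTK's) is left out.

```python
import re

def split_chapters(book_text):
    """Split on lines like 'Chapter 12'; the heading format is an assumption."""
    parts = re.split(r"(?m)^Chapter \d+\s*$", book_text)
    return [p.strip() for p in parts if p.strip()]

def chapter_features(chapter):
    """Simple word and punctuation counts for one chapter."""
    return {
        "words": len(chapter.split()),
        "quotes": chapter.count('"'),
        "exclamations": chapter.count("!"),
        "questions": chapter.count("?"),
    }

# Tiny made-up "book" with two chapters
book = 'Chapter 1\n"Run!" he said.\nChapter 2\nThe rain fell. Why now?\n'
features = [chapter_features(c) for c in split_chapters(book)]

for number, row in enumerate(features, start=1):
    print(number, row)
```

Each dictionary becomes one row of a per-chapter table, to which the averaged Turker scores can be added as another column.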
Some pretty big rater differences, actually.
First note the ratings differences (using ipython notebook and pandas). I used the avg scores.
Item = chapter
Magic: Pandas’ rolling_mean function
on different window sizes!
After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get a
smoother curve!
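A short sketch of the smoothing step. The `rolling_mean` function named on the slide was the pandas API at the time of the talk; later pandas releases replaced it with the `.rolling(...).mean()` method used here. The noun counts are made up.

```python
import pandas as pd

# Made-up noun counts, one value per chapter
noun_counts = pd.Series([10, 30, 20, 40, 30, 50], name="nouns")

# Rolling means over two different window sizes
smooth3 = noun_counts.rolling(window=3).mean()
smooth2 = noun_counts.rolling(window=2).mean()

# The first window-1 entries are NaN because the window is not yet full
print(smooth3.tolist())
print(smooth2.tolist())
```

Larger windows give smoother curves at the cost of blurring short spikes, so trying several window sizes side by side is a reasonable way to pick one.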
Well, this is a mess.
These are hard to plot together on the same scale – giant mess.
Standardize the data with a small transform, to get it all on a comparable scale.
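The notes do not say which transform was used. One common choice is the z-score, sketched here as an assumption: subtract each series' mean and divide by its standard deviation so that series with very different ranges share one scale. The counts are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: (value - mean) / standard deviation, per series."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread; map it to all zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up per-chapter series on very different scales
word_counts = [1800, 2400, 2100, 3000]
quote_counts = [12, 40, 25, 3]

z_words = standardize(word_counts)
z_quotes = standardize(quote_counts)

print([round(z, 2) for z in z_words])
print([round(z, 2) for z in z_quotes])
```

After this step both series are centered on zero with unit spread, so they can be drawn on the same axis.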
Still kind of a mess.
Hey, now I want to play with it live, with UI controls…
Enter Bearcart (by Rob Story/@oceankidbilly)
Packaging to generate rickshaw.js graphs built on d3.
Notice the nice checkbox per series controls – what I needed!
The result in a browser – just showing you Twilight for example, back to Brown in a sec.
A few oddities… nouns & verbs
[Two plots, Angels & Demons and DaVinci Code: verbs and nouns series]
There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes. These
plots are proportional numbers.)
Basic “excitement/action” arcs
DaVinci Code
Angels & Demons
Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in the
first half, it seems. But does climb at the end.
So I was right – lots of running around and stuff!
[Two plots, Angels & Demons and DaVinci Code: action “score” and quotes by chapter number]
Demo: http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_dav.html
Demo: http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_ang.html
So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logistic
regression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
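As an illustration of checking for an inverse relationship, here is a plain Pearson correlation in standard-library Python. The per-chapter numbers are invented to show a negative correlation and are not results from the books.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented per-chapter values: dialog falls as the action score rises
quotes_per_chapter = [40, 35, 20, 10, 5]
action_score = [0.1, 0.3, 0.5, 0.8, 0.9]

r = pearson(quotes_per_chapter, action_score)
print(round(r, 3))  # negative: more dialog goes with less action in this toy data
```

A value near -1 means a strong inverse relationship; a "mild" inverse correlation like the one described above would sit closer to zero.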
Twilight
The talky bits…
So… The action?
Invert --
Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all the
vampire/werewolf stories come out.
Yet.Another.Tool. !
Demo:
http://www.ghostweather.com/essays/talks/openvisconf/chapter_scores/score_rollover_dav.html
Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with the
excitement, and roll over the blocks again to see the text.
Some final thoughts
Create minimum viable tools (to help you
visualize/analyse) in whatever you can use, fast.
And boy, machine learning sure can use
interactive visual tools!
A browser can easily hold an entire trashy novel.
THANKS!
@arnicas, Lynn@ghostweather.com
My thanks to….
Luminosity (help with Dan Brown summaries)
Yves Fey (help with romance genre conventions)
Fan friends with sex-filled long fanfic refs (Dorinda, Movies_Michelle, Gwyn Rhys)
Rob Story/@oceankidbilly (for help with Bearcart under pressure)
Jim Vallandingham/@vlandham for his code/advice
Irene and Bocoup for hosting!
A Few References
§ Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
§ Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
§ Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
§ Nice tutorial overview of working with text data: scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
§ Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas datastructure in 4 lines (https://github.com/wrobstory/bearcart)
§ LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/
§ Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
§ Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
§ Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
§ Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw
THE VIDEO OF THE TALK:
http://www.youtube.com/watch?v=f41U936WqPM
P.S. SEE THE BLOG POST/EXAMPLES LIVE…
http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesExploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 

The Bones of a Bestseller: Visualizing Fiction

  • 1. The Bones of a Bestseller: Visualizing Fiction Lynn Cherny @arnicas OpenvisConf 2013 Monday, June 17, 13
  • 3. Notes: Book stars for today
  • 4. Study what’s popular, because it tells us something about people. Notes: Additionally, I want to illustrate using some statistical tricks on data, particularly some simple machine learning – and the tools I built to visualize those results.
  • 5. http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations Notes: Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the Economist’s blog! Can we do this automatically?
  • 6. Text Classification (Commonly) § “Bag of words” – each document is considered a collection of words, independent of order § Frequencies of certain words are used to identify the texts. Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?! Notes: The way everyone would start this problem…
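The “bag of words” idea on slide 6 can be sketched in a few lines of Python. This is only an illustration, not the code used in the talk; the tokenizing regex and the two sample sentences are made up here.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out word tokens, and count them (word order is discarded)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two made-up sentences with the same words in a different order.
a = bag_of_words("He kissed her. She kissed him back.")
b = bag_of_words("Back she kissed him; her he kissed.")

print(a == b)          # same bag, because order does not matter
print(a["kissed"])     # frequency of one word
```

A classifier built on this representation only ever sees the counts, which is why it can work for vocabulary-heavy scenes and why it loses anything that depends on sequence.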
  • 7. [Diagram: a table of labelled examples (columns: Data, Label; placeholder text rows labelled Yes, No, Yes, Yes) feeds an Algorithm, with Train and Test stages, followed by “New data in the wild”.] Notes: Supervised learning: Have some data you label with the truth, and feed it into some code to learn what the truth is all about. To do this properly, you divide the data up into a training set and an evaluation set – and you see how your code did on the evaluation set: how much did it get right? Once you’re satisfied with the tweaks on the classifier code, you can use it on new data in the wild.
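The train/evaluate split described in the notes for slide 7 might look like the sketch below. The labelled chunks and the keyword rule standing in for a trained classifier are invented for illustration; the point is only the split and the “how much did it get right?” check.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the labelled examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, labelled):
    """Fraction of held-out examples where the prediction matches the label."""
    hits = sum(1 for text, label in labelled if predict(text) == label)
    return hits / len(labelled)

# Hypothetical labelled chunks (text, label).
labelled = [
    ("chunk about taxes", "No"),
    ("chunk about kissing", "Yes"),
    ("chunk about email", "No"),
    ("chunk about the bedroom", "Yes"),
] * 2

train, test = train_test_split(labelled)

def keyword_predict(text):
    """Stand-in for a trained model: a hand-written keyword rule."""
    return "Yes" if ("kissing" in text or "bedroom" in text) else "No"

print(len(train), len(test))
print(accuracy(keyword_predict, test))
```

In a real run the model would be fitted on `train` only, so that the accuracy on `test` says something about data it has not seen.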
  • 8. Sex Scene Detection First Steps: 1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file. 2. Cut up a doc into 500 “word” chunks using Python. 3. Try to label each chunk: “not sexy” (e.g., paperwork, taxes, calls to Mom); “maybe steamy” (e.g., kissing, limited touching, long looks); “sexy!” (fill in the ____ here).
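Step 2 above (cutting a text into 500-“word” chunks) could be done roughly as follows. This is a sketch that assumes a “word” is simply a whitespace-separated token; the talk does not say how the original script split the text.

```python
def chunk_words(text, size=500):
    """Split text on whitespace and regroup it into chunks of `size` tokens each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Stand-in for the book text: 1200 identical tokens.
book = "word " * 1200
chunks = chunk_words(book)
print([len(c.split()) for c in chunks])
```

The last chunk is usually shorter than the rest, which is worth remembering when comparing raw counts across chunks.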
  • 9. “Would you like to sit?” He waves me toward an L-shaped white leather couch. His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a modern dark wood desk that six people could comfortably eat around. It matches the coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a square. They are exquisite—a series of mundane, forgotten objects painted in such precise detail they look like photographs. Displayed together, they are breathtaking. “A local artist. Trouton,” says Grey when he catches my gaze. “They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and the paintings. He cocks his head to one side and regards me intently. “I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable reason I find myself blushing. (Sample of 50 Shades of Grey.) Notes: What the text looks like…
  • 10. Outsourced to Mechanical Turk. Notes: Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It was super easy (to spend a lot of money on this). So I did (spend a lot of money).
  • 11. WHAT’S A SEX SCENE, ANYWAY? Notes: But let’s step back a little…
  • 12. [Image from Zara.com] Notes: Lots would say this is sexy (maybe not all women, though).
  • 13. http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a Notes: Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm. Also, this set is on Ebay in the UK if you’re into it.
  • 14. [Image from trendir.com] Notes: So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is that in a sex scene, or not? Tough to say.
  • 15. Sexually Exxxplicit, but still a… [image: http://www.icts.uiowa.edu/sites/default/files/contract.jpg] Notes: Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a contract. Sexy, or not? Probably not to most…
  • 16. Notes: Results from Mechanical Turk as a CSV file.
  • 17. How’d the raters do? [Charts: Sex Scenes; Steamy Scenes] Notes: We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of the book.
  • 18. Comparing to “Pornographic”… Notes: A pretty good match, actually. Good for the Turkers and the porno-graphic team!
  • 19. Comparing: Notes: Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.
  • 20. On to the learning algorithm… The training data: the text chunks, and the score the raters gave each one (averaged) as “truth”. I started with Python’s NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working in an IPython notebook).
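Averaging the raters’ scores per chunk, as slide 20 describes, might look like this. The CSV layout (columns `chunk_id`, `rater`, `score`) and the sample rows are assumptions made for the sketch; the actual Mechanical Turk export is not shown in the slides.

```python
import csv
import io
from collections import defaultdict

def average_scores(csv_text):
    """Return {chunk_id: mean score across raters} from CSV text."""
    by_chunk = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_chunk[int(row["chunk_id"])].append(float(row["score"]))
    return {chunk: sum(scores) / len(scores) for chunk, scores in by_chunk.items()}

# Hypothetical ratings: two raters ("blue" and "red") scoring two chunks.
sample = "chunk_id,rater,score\n0,blue,2\n0,red,0\n1,blue,1\n1,red,1\n"
truth = average_scores(sample)
print(truth)
```

Note that chunk 0 and chunk 1 end up with the same average even though the raters disagreed strongly on one of them; averaging hides that spread.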
  • 21. NLTK Naïve Bayes not so great on 50 Shades… 68%. “packet” (they use a lot of condoms). Notes: NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.
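To show what a Naïve Bayes text classifier is doing with word frequencies, here is a small pure-Python sketch with add-one smoothing. It is not NLTK’s implementation and not the code from the talk; the tiny training set and labels are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """labelled_docs: list of (tokens, label). Returns (label counts, per-label word counts, vocabulary)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing so unseen label/word pairs are not zero
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training chunks, already tokenized.
training = [
    ("kiss lips skin".split(), "sexy"),
    ("touch skin bed".split(), "sexy"),
    ("contract email office".split(), "not"),
    ("coffee office car".split(), "not"),
]
model = train_nb(training)
print(predict_nb(model, "skin kiss office".split()))
print(predict_nb(model, "email car".split()))
```

Words that are frequent under one label and rare under the other dominate the score, which is the same reason a single distinctive term can surface at the top of a list of informative features.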
  • 22. Python’s sklearn (scikit-learn): Lots of classifiers for sparse data like text! http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html Notes: This is an illustration (not by me) of how many classifiers there are that can be used on text, in scikit-learn… Picked one that has general good performance, to see how it compared to Naïve Bayes – Stochastic Gradient Descent. Notice there’s a Passive Aggressive Classifier, too. Best.name.ever.
  • 23. Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense). Pipelines in sklearn make it incredibly easy to run lots of experiments. Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”). Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers. Now we’re at 88%. Notes: Just to show you how little code it is to run a classifier pipeline – and check the results.
  • 24. Interpreting the results… Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html Notes: To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the sex scenes, and roll over each little block to inspect the text itself. Useful!
  • 25. Really amazing P.S. here… I paid for coding of a bunch of fan-fiction for sex scenes too, and fed them in to the SGD classifier. (Recall that 50 Shades started life as Twilight fanfic.) 97% accuracy achieved!* (*cross-validating with entire set, not just 50 Shades books.) Notes: I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…
  • 26. But hey - I SAID I’D TALK ABOUT STORY ARCS TOO. Notes: Switching back to another theme… overall arc of action in a novel.
  • 27. http://www.musik-therapie.at/PederHill/Structure&Plot.htm Notes: The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…
  • 28. PLEASE. IF YOU WRITING SCREENPLAY. HULK TELLING YOU. THE 3 ACT STRUCTURE = GARBAGE. STOP CITING IT IN ARTICLES. STOP TALKING ABOUT IT WITH FRIENDS. IT WILL NOT HELP YOU. STAY THE FUCK AWAY FROM ANYONE WHO EVEN CLAIM IT EXIST. IF THEY SAY IT DO. SAY “OR COURSE SHIT HAS BEGINNING, MIDDLE, AND ENDING YOU INSUFFERABLE TURD” THEN THROW A DRINK IN THEIR FACE AND RUN AWAY… http://filmcrithulk.wordpress.com/2011/07/07/hulk-presents-the-myth-of-3-act-structure/ “The HULK Presents the Myth of the 3-Act Structure” Notes: But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject. Totally worth reading the essay.
  • 29. Vonnegut - http://thedesigngym.com/simpleshapesofstories/ Notes: Vonnegut is here talking about sentiment of events, not really “tension” or “excitement” or rising action – but there’s still some kind of structural differences going on across each book/story. The question is, what do best sellers look like over the course of the whole story? By whatever measure will illustrate the pacing/movement of the story. Lower right hand corner is me working on this talk.
  • 30. Notes: Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.
  • 32. Notes: So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?
  • 33. Can we detect exciting scenes? Back to Mechanical Turk, with Dan Brown books: 2 raters again, chunks of 500 words. Odd factoid: I got ratings of sex scenes in 2-4 hours. It took ~13 hours to get Dan Brown action scenes. Notes: Using action/exciting scenes as proxy for major events in a book…
  • 34. “ACTION” SCENES ARE TOUGH, TOO. Notes: Brief digression on how hard this is.
  • 35. [Image from Raven.theraider.net] Notes: This seems obvious… fights, chases, etc.
  • 36. Objects in the mirror are closer than they appear. [Image: www.badhaven.com / Jurassic Park] Notes: Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You might look up “bathos” here too.)
  • 37. Almost naked, Silas hurled his pale body down the staircase. He knew he had been betrayed, but by whom? When he reached the foyer, more officers were surging through the front door. Silas turned the other way and dashed deeper into the residence hall. The women's entrance. Every Opus Dei building has one. Winding down narrow hallways, Silas snaked through a kitchen, past terrified workers, who left to avoid the naked albino as he knocked over bowls and silverware, bursting into a dark hallway near the boiler room. He now saw the door he sought, an exit light gleaming at the end. Running full speed through the door out into the rain, Silas leapt off the low landing, not seeing the officer coming the other way until it was too late. The two men collided, Silas's broad, naked shoulder grinding into the man's sternum with crushing force. He drove the officer backward onto the pavement, landing hard on top of him. The officer's gun clattered away. Silas could hear men running down the hall shouting. Rolling, he grabbed the loose gun just as the officers emerged. A shot rang out on the stairs, and Silas felt a searing pain below his ribs. Filled with rage, he opened fire at all three officers, their blood spraying. A dark shadow loomed behind, coming out of nowhere. The angry hands that grabbed at his bare shoulders felt as if they were infused with the power of the devil himself. The man roared in his ear. SILAS, NO! Silas spun and fired. Their eyes met. Silas was already screaming in horror as Bishop Aringarosa fell. (Chapter 96, DaVinci Code.) Notes: A sample of the text in question… This is an action scene.
  • 38. SO WHAT ABOUT “BAGS OF WORDS” HERE? Text content worked for sex scenes…..
  • 39. SGD Classifier on “exciting” scenes washed out – about 60% accuracy on Dan Brown. Notes: It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.
  • 40. LDA Topic Analysis: Topic analysis produces associations between words and chunks of text, by probabilistic methods. “Topics” are described by lists of most informative words. A topic may be associated with multiple documents. Notes: I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
  • 41. Blei (2011) from http://www.scottbot.net/HIAL/?p=221 Notes: A snippet from a classic article.
  • 42. Elijah Meeks: https://dhs.stanford.edu/comprehending-the-digital-humanities/topics/ Notes: A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text. But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…
  • 43. Another tool: DaVinci Code topics to chapters mapping. [Figure labels: “Excitement” rating color scale, avg by chapter, ordered (obviously); Topics (48ish) per chapter (108); Chapter 1… to Chapter 108] Notes: Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching topics.
  • 44. Ah, but since it’s svg/d3… var chart = chart.append("g").attr("translate","0," + y).attr("transform","rotate(90 600 600)"); But, maybe I need chapter summaries…. So I can relate them to the topics? Notes: Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use svg cute rotate trick and some resizing…!
  • 45. Add some topic-tooltips and fade-outs…. Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html Notes: Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created for each chapter; the originals were pretty hilarious…
  • 46. This project featured a Crayola color scheme. http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors Notes: This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I used till they were all more or less readable and distinguishable!
  • 47. Maybe I need One More Tool. Any word relations of interest? Let’s try a hairball… Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html Notes: For this tool I used Jim Vallandingham’s network code from the flowingdata tutorial – my first major use of coffeescript. I intended to try the radial layout, but ended up not. There was a lot of preprocessing of the stats to get the topics related to chapter “excitingness.”
  • 48. Small “constellations” show shared words (an accident that’s useful!) Filtered to only the “exciting” nodes… Notes: On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little constellations for them, which I liked. This could’ve been simplified, of course.
  • 49. Maybe pretty, but THAT FELT LIKE A DEAD END. Notes: No structure visible, no story structure… In using just the bag of words, I’d lost all structure or relationships across time, which seems important to me for things like pacing in a novel.
  • 50. Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html Covered up for cheap theatrics… Notes: Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for another talk. It was a simple visual of dialog vs. exposition.
  • 51. Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html Notes: At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the end was because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?
  • 52. Back to Python. § Chunk book by chapter, get POS tags, punctuation, and word counts + more for each chapter… § Import scores from Turkers id’ing which bits are exciting/action, incorporate with the other data. Notes: New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
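The per-chapter feature step on slide 52 could start out something like the sketch below. It covers only word counts, punctuation counts, and the share of words inside quotation marks (a rough dialog measure); POS tagging would need a tagger such as NLTK’s and is left out. The `Chapter N` heading pattern and the sample text are assumptions for illustration.

```python
import re

def split_chapters(book_text):
    """Split on lines that look like 'Chapter 12'; text before the first heading is dropped."""
    parts = re.split(r"(?m)^[ \t]*Chapter[ \t]+\d+[ \t]*$", book_text)
    return [p.strip() for p in parts[1:]]

def chapter_features(chapter):
    """Simple counts for one chapter, plus the fraction of words inside double quotes."""
    words = chapter.split()
    quoted = re.findall(r'["“]([^"”]*)["”]', chapter)  # straight or curly double quotes
    quoted_words = sum(len(q.split()) for q in quoted)
    return {
        "words": len(words),
        "exclamations": chapter.count("!"),
        "questions": chapter.count("?"),
        "quote_share": quoted_words / len(words) if words else 0.0,
    }

# Made-up two-chapter "book".
book = 'Preface\nChapter 1\n"Run!" he said.\nChapter 2\nThey ran to the car.\n'
chapters = split_chapters(book)
features = [chapter_features(c) for c in chapters]
print(features)
```

Keeping one feature dictionary per chapter, in chapter order, preserves the time axis that the bag-of-words approaches threw away.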
  • 53. Some pretty big rater differences, actually. Notes: First note the ratings differences (using ipython notebook and pandas). I used the avg scores.
  • 54. Item = chapter. Magic: Pandas’ rolling_mean function on different window sizes! Notes: After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get a smoother curve!
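For readers without pandas at hand, the smoothing on slide 54 can be sketched in plain Python. Note that the `rolling_mean` function named on the slide was later removed from pandas; in current versions the equivalent is `Series.rolling(window).mean()`. The per-chapter counts below are made up.

```python
def rolling_mean(values, window):
    """Trailing moving average; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            span = values[i + 1 - window:i + 1]
            out.append(sum(span) / window)
    return out

# Hypothetical noun counts for five consecutive chapters.
nouns_per_chapter = [1, 2, 3, 4, 5]
print(rolling_mean(nouns_per_chapter, 3))
```

Larger windows give smoother curves but blur short peaks, so it helps to try a few window sizes, as the slide suggests.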
  • 55. Well, this is a mess. Notes: These are hard to plot together on the same scale – giant mess.
  • 56. Notes: Standardize the data with a small transform, to get it all on a comparable scale. Still kind of a mess.
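The slide does not say which transform was used to standardize the series. One common choice, assumed here purely for illustration, is the z-score: subtract the mean and divide by the standard deviation, so every series ends up centered on 0 with unit spread.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score each value: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant series has no spread to scale by
    return [(v - mu) / sigma for v in values]

# Two made-up series on very different scales.
word_counts = [1200, 1500, 900, 2100]
quote_share = [0.40, 0.35, 0.50, 0.10]
z_words = standardize(word_counts)
z_quotes = standardize(quote_share)
print(z_words)
print(z_quotes)
```

After this step both series can share one y-axis, although as the notes say, many overlapping lines can still be hard to read.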
  • 57. Hey, now I want to play with it live, with UI controls… Enter Bearcart (by Rob Story/@oceankidbilly). Notes: Packaging to generate rickshaw.js graphs built on d3.
  • 58. Notice the nice checkbox per series controls – what I needed! Notes: The result in a browser – just showing you Twilight for example, back to Brown in a sec.
  • 59. A few oddities… nouns & verbs. [Charts: Angels & Demons; DaVinci Code; series: verbs, nouns] Notes: There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes. These plots are proportional numbers.)
  • 60. Basic “excitement/action” arcs. [Charts: DaVinci Code; Angels & Demons] Notes: Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in the first half, it seems. But does climb at the end.
  • 61. [Charts: Angels & Demons and DaVinci Code, each plotting Action “score” and Quotes against chapter number.] Demo: http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_dav.html Demo: http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_ang.html So I was right – lots of running around and stuff! Notes: So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logistic regression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
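A quick way to check an inverse relation like the one between action and dialog is a Pearson correlation over the per-chapter series. The numbers below are invented to show the calculation; they are not the Turker scores or quote counts from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-chapter values: action score rising, share of quoted text falling.
action = [1, 2, 3, 4, 5]
quotes = [0.50, 0.45, 0.30, 0.35, 0.20]
r = pearson(action, quotes)
print(round(r, 2))
```

A value near -1 means the two series move in opposite directions, near 0 means no linear relation; a “mild” inverse correlation would sit somewhere in between.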
  • 62. [Chart: Twilight] The talky bits… So… The action? Invert -- Notes: Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all the vampire/werewolf stories come out.
  • 63. Yet.Another.Tool.! Demo: http://www.ghostweather.com/essays/talks/openvisconf/chapter_scores/score_rollover_dav.html Notes: Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with the excitement, and roll over the blocks again to see the text.
  • 64. Some final thoughts: Create minimum viable tools (to help you visualize/analyse) in whatever you can use, fast. And boy, machine learning sure can use interactive visual tools! A browser can easily hold an entire trashy novel.
  • 65. THANKS! @arnicas, Lynn@ghostweather.com My thanks to…. Luminosity (help with Dan Brown summaries); Yves Fey (help with romance genre conventions); Fan friends with sex-filled long fanfic refs (Dorinda, Movies_Michelle, Gwyn Rhys); Rob Story/@oceankidbilly (for help with Bearcart under pressure); Jim Vallandingham/@vlandham for his code/advice; Irene and Bocoup for hosting!
  • 66. A Few References § Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html § Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes § Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html § Nice tutorial overview of working with text data: scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html § Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas datastructure in 4 lines (https://github.com/wrobstory/bearcart) § LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/ § Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221 § Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/ § Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/ § Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw
  • 67. P.S. SEE THE BLOG POST/EXAMPLES LIVE… http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html THE VIDEO OF THE TALK: http://www.youtube.com/watch?v=f41U936WqPM

Editor's notes

  1. Book stars for today
  2. Additionally, I want to illustrate using some statistical tricks on data, particularly some simple machine learning – and the tools I built to visualize those results.
  3. Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the Economist’s blog! Can we do this automatically?
  4. The way everyone would start this problem…
  5. Then you can use a good model to predict new things you haven’t seen before. Spam classifiers work this way.
  6. Supervised learning: Have some data you label with the truth, and feed it into some code to learn what the truth is all about. To do this properly, you divide the data up into a training set and an evaluation set – and you see how your code did on the evaluation set: how much did it get right? Once you’re satisfied with the tweaks on the classifier code, you can use it on new data in the wild.
  7. What the text looks like…
  8. Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It was super easy (to spend a lot of money on this). So I did (spend a lot of money).
  9. But let’s step back a little…
  10. Lots would say this is sexy (maybe not all women, though).
  11. Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm. Also, this set is on Ebay in the UK if you’re into it.
  12. So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is that in a sex scene, or not? Tough to say.
  13. Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a contract. Sexy, or not? Probably not to most…
  14. Results from Mechanical Turk as a CSV file.
  15. We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of the book.
  16. A pretty good match, actually. Good for the Turkers and the porno-graphic team!
  17. Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.
  18. NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.
  19. This is an illustration (not by me) of how many classifiers there are that can be used on text, in scikit-learn… Picked one that has general good performance, to see how it compared to Naïve Bayes – Stochastic Gradient Descent. Notice there’s a Passive Aggressive Classifier, too. Best.name.ever.
  20. Just to show you how little code it is to run a classifier pipeline – and check the results.
  21. To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the sex scenes, and rollover each little block to inspect the text itself. Useful!
  22. I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…
  23. Switching back to another theme… overall arc of action in a novel.
  24. The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…
  25. But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject. Totally worth reading the essay.
  26. Vonnegut is here talking about sentiment of events, not really “tension” or “excitement” or rising action – but there’s still some kind of structural differences going on across each book/story. The question is, what do best sellers look like over the course of the whole story? By whatever measure will illustrate the pacing/movement of the story. Lower right hand corner is me working on this talk.
  27. Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.
  28. Did this cheer you up? It’s better than a kitten! IMO.
  29. So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?
  30. Using action/exciting scenes as proxy for major events in a book…
  31. Brief digression on how hard this is.
  32. This seems obvious… fights, chases, etc.
  33. Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You might look up “bathos” here too.)
  34. A sample of the text in question… This is an action scene.
  35. It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.
  36. I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
  37. A snippet from a classic article.
  38. A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text. But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…
  39. Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching topics.
  40. Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use a cute SVG rotate trick and some resizing…!
  41. Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created for each chapter; the originals were pretty hilarious…
  42. This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I used till they were all more or less readable and distinguishable!
  43. For this tool I used Jim Vallandingham’s network code from the FlowingData tutorial – my first major use of CoffeeScript. I intended to try the radial layout, but ended up not. There was a lot of preprocessing of the stats to get the topics related to chapter “excitingness.”
  44. On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little constellations for them, which I liked. This could’ve been simplified, of course.
  45. No structure visible, no story structure… In using just the bag of words, I’d lost all structure or relationships across time, which seems important to me for things like pacing in a novel.
  46. Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for another talk. It was a simple visual of dialog vs. exposition.
  47. At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the end was because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?
  48. New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
  49. First note the ratings differences (using IPython Notebook and pandas). I used the average scores.
  50. After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get a smoother curve!
  51. These are hard to plot together on the same scale – giant mess.
  52. Standardize the data with a small transform, to get it all on a comparable scale. Still kind of a mess.
  53. Packaging to generate rickshaw.js graphs built on d3.
  54. The result in a browser – just showing you Twilight for example, back to Brown in a sec.
  55. There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes. These plots are proportional numbers.)
  56. Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in the first half, it seems. But it does climb at the end.
  57. So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logistic regression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
  58. Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all the vampire/werewolf stories come out.
  59. Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with the excitement, and roll over the blocks again to see the text.
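
For anyone who wants to try the number-crunching from notes 50, 52 and 57 at home, here is a minimal sketch of the three steps: a rolling mean to smooth a per-chapter series, a standardizing transform to put series on a comparable scale, and a correlation between two series. It is not the code from the talk (that used pandas in a notebook); it uses only the Python standard library, the function names are made up for illustration, and the per-chapter numbers are invented, not real measurements from any book.

```python
from statistics import mean, pstdev


def rolling_mean(values, window=3):
    """Trailing rolling mean. Early positions average over whatever is
    available so far, so the output has the same length as the input."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(mean(values[start:i + 1]))
    return smoothed


def standardize(values):
    """Z-scores: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A flat series has no variation to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical per-chapter values, invented for this example only.
dialog_share = [0.60, 0.55, 0.50, 0.40, 0.35, 0.20]
excitement = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]

smooth_dialog = rolling_mean(dialog_share, window=3)
smooth_excitement = rolling_mean(excitement, window=3)

# A negative value here means dialog drops as excitement rises.
r = pearson(smooth_dialog, smooth_excitement)
print(round(r, 2))
```

The trailing window keeps the first chapters in the plot instead of dropping them, which is a design choice for short series like a chapter list; a centered window or a larger `window` value would give a different, smoother curve.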