The Bones of a Bestseller: Visualizing Fiction

The Bones of a Bestseller:
Visualizing Fiction
Lynn Cherny
@arnicas
OpenvisConf 2013
Monday, June 17, 13

Language, Sex, Violence
(also spoilers)
TEXT

Book stars for today

Study what’s popular, because it tells us something
about people.
I also want to illustrate some statistical tricks on data, particularly some simple machine learning, and the tools I built to visualize the results.

http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations
Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the
Economist’s blog!
Can we do this automatically?

Text Classification (Commonly)
§ “Bag of words” – each document is considered a collection of words, independent of order
§ Frequencies of certain words are used to identify the texts
Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?!
The way everyone would start this problem…
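A minimal sketch of the bag-of-words idea, using only the Python standard library. The two sample sentences and the `bag_of_words` helper are made up for illustration; they are not from the talk's code.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, pull out the words, and count them, ignoring order."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Two made-up "documents" with the same words in a different order.
doc_a = "He kissed her. She kissed him back."
doc_b = "Back she kissed him; he kissed her."

# Order is thrown away, so both documents produce the same bag.
print(bag_of_words(doc_a) == bag_of_words(doc_b))

# Word frequencies are what a classifier would use as features.
print(bag_of_words(doc_a).most_common(3))
```

The point of the sketch is that the representation keeps counts and nothing else, which is why it is cheap to compute and why it loses any structure across time.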

Data                          Label
Estdsgfd fdsatreatret dfds    Yes
Dsrdsf drerear ewrewtrew      No
Reret retdrtd rewrewrtew      Yes
Dsfgdg fdsfd                  Yes
Algorithm
Train
Test
New data in the wild
Supervised learning: you have some data labeled with the truth, and you feed it into some code that learns what the truth is all about.
To do this properly, you divide the data into a training set and an evaluation set, and you see how your code did on the evaluation set: how much did it get right?
Once you’re satisfied with the tweaks to the classifier code, you can use it on new data in the wild.
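The train/evaluate split described here can be sketched in a few lines of plain Python. Everything below is an assumption for illustration: the 25% test fraction, the seed, the placeholder chunks, and the stand-in "classifier" that always answers "No".

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=13):
    """Shuffle a copy of the labeled examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, labeled_examples):
    """Fraction of held-out examples where the prediction matches the label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)


# Hypothetical labeled chunks: (text, label) pairs.
labeled = [("chunk %d" % i, "Yes" if i % 2 else "No") for i in range(20)]
train, test = train_test_split(labeled)


def always_no(text):
    """A stand-in classifier, only here to exercise the scoring code."""
    return "No"


print(len(train), len(test), accuracy(always_no, test))
```

A real classifier would be fit on `train` only, and `accuracy` on `test` is the "how much did it get right?" number.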

Sex Scene Detection First Steps
1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file.
2. Cut up a doc into 500 “word” chunks using Python
3. Try to label each chunk:
   “not sexy” (e.g., paperwork, taxes, calls to Mom)
   “maybe steamy” (e.g., kissing, limited touching, long looks)
   “sexy!” (fill in the ____ here)
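Step 2 above can be sketched like this. It is a guess at the approach, not the original script: "word" here just means whitespace-separated tokens, and the sample text is synthetic.

```python
def chunk_words(text, chunk_size=500):
    """Split text on whitespace and regroup it into chunks of chunk_size words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Synthetic stand-in for the contents of the TXT file saved from Calibre.
sample = "word " * 1200
chunks = chunk_words(sample)

# The last chunk holds whatever is left over.
print([len(chunk.split()) for chunk in chunks])  # [500, 500, 200]
```

Chunking by word count rather than by scene is what makes the labeling hard later on: a chunk can start or end in the middle of anything.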

“Would you like to sit?” He waves me toward an L-shaped white leather couch.
His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a
modern dark wood desk that six people could comfortably eat around. It matches the
coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the
wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a
square. They are exquisite—a series of mundane, forgotten objects painted in such precise
detail they look like photographs. Displayed together, they are breathtaking.
“A local artist. Trouton,” says Grey when he catches my gaze.
“They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and
the paintings. He cocks his head to one side and regards me intently.
“I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable
reason I find myself blushing.
Sample of 50 Shades of Grey
What the text looks like…

Outsourced to Mechanical Turk
Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It
was super easy (to spend a lot of money on this). So I did (spend a lot of money).

WHAT’S A SEX SCENE,
ANYWAY?
But let’s step back a little…

Zara.com
Lots would say this is sexy (maybe not all women, though).

http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a
Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm.
Also, this set is on eBay in the UK if you’re into it.

trendir.com
So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose
there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is
that in a sex scene, or not? Tough to say.

Sexually Exxxplicit,
but still a
http://www.icts.uiowa.edu/sites/default/files/contract.jpg
Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a
contract. Sexy, or not? Probably not to most…

Results from Mechanical Turk as a CSV file.

How’d the raters do?
Sex Scenes
Steamy Scenes
We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of
the book.

Comparing to “Pornographic”…
A pretty good match, actually. Good for the Turkers and the porno-graphic team!

Comparing:
Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.

On to the learning algorithm…
The training data:
-The text chunks
-The score the raters gave it (averaged) as “truth”
I started with Python’s NLTK (Natural Language
Toolkit) and Naïve Bayes for classifying (working
in an ipython notebook).
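To make the training setup concrete, here is a small from-scratch Naïve Bayes in plain Python. It is a sketch, not NLTK's implementation: the chunks, the rater scores, the 0.5 threshold for turning an averaged score into a label, and all function names are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def label_from_scores(scores, threshold=0.5):
    """Average the raters' scores for one chunk and turn the mean into a label."""
    return "sexy" if sum(scores) / len(scores) >= threshold else "not sexy"


def train_naive_bayes(labeled_chunks):
    """Count words per label; these counts are all the model needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_chunks:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    word_counts, label_counts, vocabulary = model
    total_chunks = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_chunks)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # add-one (Laplace) smoothing so unseen word/label pairs are not zero
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up chunks with made-up scores from two raters (0 = not sexy, 1 = sexy).
rated = [
    ("he signed the paperwork and filed the taxes", [0, 0]),
    ("she called her mom about the taxes", [0, 0]),
    ("he kissed her neck and she gasped", [1, 1]),
    ("she kissed him and touched his skin", [1, 0.5]),
]
training = [(text, label_from_scores(scores)) for text, scores in rated]
model = train_naive_bayes(training)

print(classify("more paperwork and taxes", model))
print(classify("he kissed her skin", model))
```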

NLTK Naïve Bayes not so great on 50 Shades… 68%.
“packet” (they use a lot of condoms)
NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.

Python’s sklearn (scikit-learn)
Lots of classifiers for sparse data like text!
http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
This is an illustration (not by me) of how many classifiers in scikit-learn can be used on text… I picked one with generally good performance, Stochastic Gradient Descent, to see how it compared to Naïve Bayes. Notice there’s a Passive Aggressive Classifier, too. Best.name.ever.

Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense)
Pipelines in sklearn make it incredibly easy to run lots of experiments.
Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”)
Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers.
Now we’re at 88%
Just to show you how little code it is to run a classifier pipeline – and check the results.
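The code on the slide did not survive in this transcript, so here is a hedged sketch of what a pipeline like that can look like in a current scikit-learn. It is not the talk's code: the tiny texts and labels are made up, the lemmatizer step is left out, and `TfidfVectorizer` is my choice of vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up stand-ins for labeled 500-word chunks (1 = sex scene, 0 = not).
train_texts = [
    "he signed the contract and filed the paperwork",
    "she called her mother about the interview",
    "he kissed her neck and she gasped",
    "she pulled him close and kissed him hard",
]
train_labels = [0, 0, 1, 1]

# Stand-ins for chunks from a second book used only for testing.
test_texts = ["they read the contract again", "he kissed her slowly"]
test_labels = [0, 1]

# Vectorize the text, then classify with stochastic gradient descent.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])

pipeline.fit(train_texts, train_labels)               # learn from the first book
predicted = pipeline.predict(test_texts)              # label the second book
accuracy = pipeline.score(test_texts, test_labels)    # fraction labeled correctly

print(predicted, accuracy)
```

Swapping the classifier or adding a preprocessing step is a one-line change to the list of steps, which is what makes running lots of experiments cheap.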

Interpreting the results…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the sex scenes, and roll over each little block to inspect the text itself. Useful!

Really amazing P.S. here…
I paid for coding of a bunch of fan-fiction for sex scenes too, and fed them into the SGD classifier. (Recall that 50 Shades started life as Twilight fanfic.)
97% accuracy achieved!*
*cross-validating with entire set, not just 50 Shades books.
I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…

But hey - I SAID I’D TALK ABOUT STORY ARCS TOO
Switching back to another theme… overall arc of action in a novel.

http://www.musik-therapie.at/PederHill/Structure&Plot.htm
The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…

PLEASE. IF YOU WRITING SCREENPLAY.
HULK TELLING YOU. THE 3 ACT
STRUCTURE = GARBAGE.
STOP CITING IT IN ARTICLES.
STOP TALKING ABOUT IT WITH
FRIENDS.
IT WILL NOT HELP YOU.
STAY THE FUCK AWAY FROM ANYONE
WHO EVEN CLAIM IT EXIST. IF THEY
SAY IT DO. SAY “OF COURSE SHIT HAS
BEGINNING, MIDDLE, AND ENDING
YOU INSUFFERABLE TURD” THEN
THROW A DRINK IN THEIR FACE AND
RUN AWAY…
http://filmcrithulk.wordpress.com/2011/07/07/hulk-presents-the-myth-of-3-act-structure/
“The HULK Presents the Myth of the 3-Act Structure”
But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject.
Totally worth reading the essay.

Vonnegut - http://thedesigngym.com/simpleshapesofstories/
Vonnegut is here talking about sentiment of events, not really “tension” or “excitement” or rising action – but there’s still some
kind of structural differences going on across each book/story. The question is, what do best sellers look like over the course of
the whole story? By whatever measure will illustrate the pacing/movement of the story.
Lower right hand corner is me working on this talk.

Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.

http://24.media.tumblr.com/ba77d04cb210b8e24ff73a49a19b3111/tumblr_mfc6dv2SER1qh66wqo1_1280.jpg
Did this cheer you up? It’s better than a kitten! IMO.

So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?

Can we detect exciting scenes?
Back to Mechanical Turk, with Dan Brown books:

2 raters again, chunks of 500 words
Odd factoid: I got ratings of sex scenes in 2-4 hours.
It took ~13 hours to get Dan Brown action scenes.
Using action/exciting scenes as proxy for major events in a book…

“ACTION” SCENES ARE TOUGH, TOO
Brief digression on how hard this is.

Raven.theraider.net
This seems obvious… fights, chases, etc.

Objects in the mirror are closer than they appear
www.badhaven.com / Jurassic Park
Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without
context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You
might look up “bathos” here too.)

Almost naked, Silas hurled his pale body down the staircase. He knew he
had been betrayed, but by whom? When he reached the foyer, more
officers were surging through the front door. Silas turned the other way and
dashed deeper into the residence hall. The women's entrance. Every Opus
Dei building has one. Winding down narrow hallways, Silas snaked through
a kitchen, past terrified workers, who left to avoid the naked albino as he
knocked over bowls and silverware, bursting into a dark hallway near the
boiler room. He now saw the door he sought, an exit light gleaming at the
end.
Running full speed through the door out into the rain, Silas leapt off the
low landing, not seeing the officer coming the other way until it was too
late. The two men collided, Silas's broad, naked shoulder grinding into the
man's sternum with crushing force. He drove the officer backward onto the
pavement, landing hard on top of him. The officer's gun clattered away. Silas
could hear men running down the hall shouting. Rolling, he grabbed the
loose gun just as the officers emerged. A shot rang out on the stairs, and
Silas felt a searing pain below his ribs. Filled with rage, he opened fire at
all three officers, their blood spraying.
A dark shadow loomed behind, coming out of nowhere. The angry hands
that grabbed at his bare shoulders felt as if they were infused with the
power of the devil himself. The man roared in his ear. SILAS, NO!
Silas spun and fired. Their eyes met. Silas was already screaming in
horror as Bishop Aringarosa fell.
Chapter 96
DaVinci Code
A sample of the text in question… This is an action scene.

SO WHAT ABOUT “BAGS OF WORDS” HERE?
Text content worked for sex scenes…..

SGD Classifier on “exciting” scenes washed out – about 60% accuracy on Dan Brown.
It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.

LDA Topic Analysis
Topic analysis produces associations between words
and chunks of text, by probabilistic methods.
“Topics” are described by lists of most informative
words.
A topic may be associated with multiple documents.
I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
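For a feel of what topic analysis produces, here is a small sketch using scikit-learn's `LatentDirichletAllocation`. This is an assumption on my part: the references point to a separate GUI topic modeling tool, so this is not necessarily how the talk's topics were made. The four "chunks" and the choice of two topics are made up.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for chapters or chunks of a thriller.
chunks = [
    "museum painting curator gallery painting symbol",
    "symbol cipher code painting museum secret",
    "gun chase police car gun running",
    "police chase running street car escape",
]

# LDA works on bag-of-words counts, one row per chunk.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(chunks)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: chunks, columns: topic weights

# Invert the vocabulary so column indices can be turned back into words.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# A topic is described by its most heavily weighted words.
for topic_number, weights in enumerate(lda.components_):
    top_indices = weights.argsort()[::-1][:4]
    print(topic_number, [index_to_word[i] for i in top_indices])

# Each chunk gets a weight for every topic, so one topic can span many chunks.
print(doc_topics.round(2))
```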

Blei (2011) from http://www.scottbot.net/HIAL/?p=221
A snippet from a classic article.

Elijah Meeks: https://dhs.stanford.edu/comprehending-the-digital-humanities/topics/
A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text.
But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…

Another tool: DaVinci Code topics to chapters mapping
“Excitement” rating color scale, avg by chapter, ordered (obviously)
Topics (48ish) per chapter (108)
Chapter 1… to Chapter 108
Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching
topics.

But, maybe I need chapter summaries…. So I can relate them to the topics?
Ah, but since it’s svg/d3…
var chart = chart.append("g").attr("translate","0," + y).attr("transform","rotate(90 600 600)");
Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as
text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use svg cute rotate trick and some
resizing…!

Add some topic-tooltips and fade-outs….
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created
for each chapter; the originals were pretty hilarious…

This project featured a Crayola color scheme.
http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors
This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I
used till they were all more or less readable and distinguishable!

Maybe I need One More Tool. Any word relations of interest?
Let’s try a hairball…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
For this tool I used Jim Vallandingham’s network code from the FlowingData tutorial – my first major use of CoffeeScript. I intended to try the radial layout, but ended up not. There was a lot of preprocessing of the stats to get the topics related to chapter “excitingness.”

Small “constellations” show shared words (an accident that’s useful!)
Filtered to only the “exciting” nodes…
On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little
constellations for them, which I liked. This could’ve been simplified, of course.

Maybe pretty, but THAT FELT LIKE A DEAD END.
No structure visible, no story structure… In using just the bag of words, I’d lost all structure or relationships across time, which
seems important to me for things like pacing in a novel.

Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html
Covered up for cheap theatrics…
Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for another
talk. It was a simple visual of dialog vs. exposition.

Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.html
At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the end
was because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?

Back to Python.
§ Chunk book by chapter, get POS tags, punctuation,
and word counts + more for each chapter…
§ Import scores from Turkers id’ing which bits are
exciting/action, incorporate with the other data.
New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
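A rough sketch of the per-chapter counting, in plain Python. It assumes chapters are marked by lines like "Chapter 12", which may not match the real files, and it skips POS tagging (that needs a tagger such as NLTK's). The feature names and the two-chapter sample are made up.

```python
import re


def split_chapters(book_text):
    """Split on lines that look like 'Chapter 12' (an assumed heading format)."""
    parts = re.split(r"(?m)^Chapter \d+\s*$", book_text)
    return [part.strip() for part in parts if part.strip()]


def chapter_features(chapter_text):
    """Simple counts per chapter: words plus a few punctuation marks."""
    return {
        "words": len(chapter_text.split()),
        "quote_marks": chapter_text.count('"'),
        "exclamations": chapter_text.count("!"),
        "questions": chapter_text.count("?"),
    }


# A tiny made-up "book" with two chapters.
book = 'Chapter 1\n"Run!" he said.\nChapter 2\nShe read the old map. Why here?\n'

rows = [chapter_features(chapter) for chapter in split_chapters(book)]
for number, row in enumerate(rows, start=1):
    print(number, row)
```

One row per chapter, in book order, is the shape that makes the later time-series plots possible; the Turker "excitement" scores would be averaged per chapter and added as another column.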

Some pretty big rater differences, actually.
First note the ratings differences (using ipython notebook and pandas). I used the avg scores.

Item = chapter
Magic: Pandas’ rolling_mean function on different window sizes!
After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get a
smoother curve!
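What a rolling mean does, sketched in plain Python so it runs without pandas. The noun counts are made up. In current pandas versions the same operation is spelled `Series.rolling(window).mean()`; the standalone `rolling_mean` function named on the slide belongs to older releases.

```python
def rolling_mean(values, window):
    """Mean of each trailing window; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            recent = values[i + 1 - window:i + 1]
            result.append(sum(recent) / window)
    return result


# Made-up noun counts for five consecutive chapters.
noun_counts = [10, 20, 30, 40, 50]

print(rolling_mean(noun_counts, 3))  # [None, None, 20.0, 30.0, 40.0]
```

A bigger window gives a smoother curve but hides short spikes, which is why trying several window sizes is worthwhile.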

Well, this is a mess.
These are hard to plot together on the same scale – giant mess.

Standardize the data with a small transform, to get it all on a comparable scale.
Still kind of a mess.
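One common way to put series on a comparable scale is a z-score: subtract the mean, divide by the standard deviation. The notes do not say which transform was used, so treat this as an assumption; the per-chapter numbers are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a series to mean 0 and (population) standard deviation 1."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]  # a flat series has nothing to rescale
    return [(value - center) / spread for value in values]


# Made-up series on very different scales.
nouns_per_chapter = [400, 500, 600]
quotes_per_chapter = [4, 5, 6]

# After standardizing, both series trace the same shape on the same axis.
print(standardize(nouns_per_chapter))
print(standardize(quotes_per_chapter))
```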

Hey, now I want to play with it live, with UI controls…
Enter Bearcart (by Rob Story/@oceankidbilly)
Packaging to generate rickshaw.js graphs built on d3.

Notice the nice checkbox per series controls – what I needed!
The result in a browser – just showing you Twilight for example, back to Brown in a sec.

A few oddities… nouns & verbs
Angels & Demons: verbs, nouns
DaVinci Code: verbs, nouns
There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes. These plots are proportional numbers.)

Basic “excitement/action” arcs
DaVinci Code
Angels & Demons
Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in the
first half, it seems. But does climb at the end.

So I was right – lots of running around and stuff!
Angels & Demons: Action “score” vs. Quotes, by chapter number
Demo http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_ang.html
DaVinci Code: Action “score” vs. Quotes, by chapter number
Demo http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_dav.html
So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logistic
regression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
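For the stats-minded, a plain-Python Pearson correlation is enough to check for an inverse relationship like this one. The per-chapter numbers below are invented to show the shape of the calculation; they are not the talk's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented per-chapter values: action score goes up while quote counts go down.
action_scores = [1, 2, 3, 4, 5]
quote_counts = [50, 45, 30, 35, 10]

# A negative value means the two series move in opposite directions.
print(round(pearson(action_scores, quote_counts), 3))
```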

Twilight: The talky bits…
Invert --
So… The action?
Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all the
vampire/werewolf stories come out.

Yet.Another.Tool. !
Demo:
http://www.ghostweather.com/essays/talks/openvisconf/chapter_scores/score_rollover_dav.html
Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with the excitement, and roll over the blocks again to see the text.

Some final thoughts
Create minimum viable tools (to help you
visualize/analyse) in whatever you can use, fast.
And boy, machine learning sure can use
interactive visual tools!
A browser can easily hold an entire trashy novel.

THANKS!
@arnicas, Lynn@ghostweather.com
My thanks to….
Luminosity (help with Dan Brown summaries), Yves Fey (help with romance genre
conventions), Fan friends with sex-filled long fanfic refs (Dorinda, Movies_Michelle,
Gwyn Rhys), Rob Story/@oceankidbilly (for help with Bearcart under pressure), Jim
Vallandingham/@vlandham for his code/advice, Irene and Bocoup for hosting!

A Few References
§ Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
§ Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
§ Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
§ Nice tutorial overview of working with text data: scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
§ Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas datastructure in 4 lines (https://github.com/wrobstory/bearcart)
§ LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/
§ Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
§ Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
§ Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
§ Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw

P.S. SEE THE BLOG POST/EXAMPLES LIVE…
http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html
THE VIDEO OF THE TALK:
http://www.youtube.com/watch?v=f41U936WqPM
