SlideShare una empresa de Scribd logo
1 de 37
Natural Language Visualization with
Scattertext
Jason S. Kessler*
Global AI Conference
April 27, 2018
Code for all visualizations is available at:
https://github.com/JasonKessler/GlobalAI2018
$ pip3 install scattertext
@jasonkessler*No, not that Jason Kessler
Lexicon speculation
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification
using machine learning techniques. EMNLP. 2002. (ACL 2018 Test of Time Award Winner)
@jasonkessler
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification
using machine learning techniques. EMNLP. 2002.
Lexicon mining ≈ lexicon speculation
@jasonkessler
Language and Demographics
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
hobos
almond
butter 100 Years of
Solitude
Bikram yoga
@jasonkessler
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
OKCupid: Words and phrases that distinguish white
men.
@jasonkessler
Explanation
OKCupid: Words and phrases that
distinguish Latin men.
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010) @jasonkessler
Ranking with everyone else
The smaller the distance from the top left, the
higher the association with white men
Source: Christian Rudder. Dataclysm. 2014.
Phish is highly associated with white men
Kpop is not
@jasonkessler
@jasonkessler
my blue eyes
Source: Christian Rudder. Dataclysm. 2014.
Scattertext
pip install scattertext
github.com/JasonKessler/scattertext
@jasonkessler
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.
- Interactive, d3-
based scatterplot
- Concise Python
API
- Automatically
displays non-
overlapping labels
Scaled F-Score
• Term-class* associations:
• “Good” is associated with the “positive” class
• “Bad” with the class “negative”
• Core intuition: association relies on two necessary factors
• Frequency: How often a term occurs in a class
• Precision: P(class|document contains term)
• F-Score:
• Information retrieval evaluation metric
• Harmonic mean between precision and recall
• Requires both metrics to be high
• *Term: is defined to be a word, phrase or other discrete linguistic element
@jasonkessler
@jasonkessler
Precision
Frequency
Naïve approach
@jasonkessler
Precision
Frequency
Naïve approach
Y-axis: - Precision, i.e. P(class|term), roughly normal distribution
- Mean ≅ 0.5, sd ≅ 0.4,
X-axis: - Frequency, i.e. P(term|class), roughly power distribution
- Mean ≅ 0.00008, sd ≅ 0.008
Color: - Harmonic mean of precision and frequency (blue=high, red=low)
@jasonkessler
Precision
Frequency
Naïve approach
Y-axis: - Precision, i.e. P(class|term), roughly normal distribution
- Mean ≅ 0.5, sd ≅ 0.4,
X-axis: - Frequency, i.e. P(term|class), roughly power distribution
- Mean ≅ 0.00008, sd ≅ 0.008
Color: - Harmonic mean of precision and frequency (blue=high, red=low)
@jasonkessler
Precision
Frequency
Problem:
• Top words are just stop words.
• Why?
• Harmonic mean of uneven distributions.
• Most words have prec of ~0.5, leads harmonic
mean to rely on frequency.
Fix: Normalize Precision and Frequency
• Task: make precision
and frequency
similarly distributed
• How: take normal CDF
of each term’s
precision and
frequency
• Mean and std.
computed from data
• Right: log normal CDF
@jasonkessler
This area is the log-normal
CDF of the term “beauty”
(0.938 ∈ [0,1]).
Each tick mark is the log-
frequency of a term.
*log-normal CDF isn’t used in these charts
Scaled-F-Score
@jasonkessler
NormCDF-Precision
NormCDF - Frequency
Positive Scaled-F-Score
Good: positive terms
make sense!
Still some function words,
but that’s okay.
Note These frequent
terms are all very close
to 1 on the x-axis, but
are ordered
NormCDF-Precision
NormCDF - Frequency @jasonkessler
Pos
Freq.
Neg
Freq. Prec.
Freq
%
Raw
Hmean
Prec.
CDF
Freq.
CDF
Scaled
F-Score
best 108 36 75.00% 0.22% 0.44% 71.95% 99.50% 83.51%
entertaining 58 13 81.69% 0.12% 0.24% 77.07% 90.94% 83.43%
fun 73 26 73.74% 0.15% 0.30% 70.92% 95.63% 81.44%
heart 45 11 80.36% 0.09% 0.18% 76.09% 84.49% 80.07%
Top Scaled-F Score Terms
Note: normalized precision and frequency are on comparable
scales, allowing for the harmonic mean to take both into account.
Problem: highly negative terms are
all low frequency
NormCDF-Precision
NormCDF - Frequency @jasonkessler
Solution:
• Compute Scaled F-
Score association
scores for negative
reviews.
• Use the highest score
Scaled-F-Score
Positive Scaled F-Score
@jasonkessler
Negative Scaled F-
Score
Note: only one
obviously negative
term
@jasonkessler
Scaled-F-Score by log-
frequency
The score can be overly
sensitive to very frequent
terms, but still doesn’t score
them very highly
ScaledF-Score
Log Frequency
Why not use TF-IDF?
• Drastically favors low frequency
terms
• term in all classes -> idf=1 ->
score=0
TF-IDF(Positive)-TF-IDF(Negative)
Log Frequency
TF IDF
Burt Monroe, Michael Colaresi and Kevin Quinn. Fightin'
words: Lexical feature selection and evaluation for
identifying the content of political conflict. Political
Analysis. 2008.@jasonkessler
Monroe et. al (2009) approach
• Bayesian approach to term-
association
• Likelihood: Z-score of log-odds-
ratio
• Prior: Term frequency in a
background corpus
• Posterior: Z-score of log-odds-ratio
with background counts as
smoothing values
Popular, but much more tweaking to
get to work than Scaled F Score.
@jasonkessler
Scattertext reimplementation of Monroe et al. See
http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for code.
Scattertext
implementation (with prior
weighting modifications)
In defense of stop words
Cindy K. Chung and James W. Pennebaker. Counting
Little Words in Big Data: The Psychology of
Communities, Culture, and History. EASP. 2012
In times of
shared crisis,
“we” use
increases, while
“I” use decreases.
I/we: age, social
integration
I: lying, social
rank
@jasonkessler
Function words and gender
Newman, ML; Groom, CJ; Handelman LD, Pennebaker, JW. Gender
Differences in Language Use: An Analysis of 14,000 Text Samples. 2008.
LIWC Dimension
Bold: entirely stop words
Effect Size (Cohen’s d)
(>0 F, <0 M) MANOVA p<.001
All Pronouns (esp. 3rd person) 0.36
Present tense verbs (walk, is, be) 0.18
Feeling (touch, hold, feel) 0.17
Certainty (always, never) 0.14
Word count NS
Numbers -0.15
Prepositions -0.17
Words >6 letters -0.24
Swear words -0.22
Articles -0.24
• Performed on a
variety of
language
categories,
including
speech.
• Other studies
have found that
function words
are the best
predictors of
gender.
@jasonkessler
Function word usage is counter-intuitive
- James W. Pennebaker, Carla J. Groom, Daniel Loew, James M. Dabbs. Testosterone as a Social
Inhibitor: Two Case Studies of the Effect of Testosterone Treatment on Language. 2004.
- Susan C. Herring, Anna Martinson. Assessing Gender Authenticity in Computer-Mediated Language
Use: Evidence From an Identity Game. Journal of Language and Social Psychology. 2004.
• Pennebaker et al:
• Testosterone levels (in two therapeutic settings) predict:
• Modest but significant decreases in:
• Pronouns referring to others (ex: we, she, they)
• Communication verbs (ex: hear, say)
• Modest but significant decreases in:
• Optimism words (ex: energy, upbeat)
• Negative (but statistically insignificant) correlation with subject’s beliefs about testosterone and
category usage!
• Does not entirely explain gender differences.
• Herring et al:
• Subjects tasked with impersonating opposite gender in online game
• Discussed stereotypical topics (cars, shopping) but didn’t change stylistic cues
@jasonkessler
Clickbait: what works?
@jasonkessler
Clickbait corpus
• Facebook posts from BuzzFeed, NY Times, etc/ from 2010s.
• Includes headline and the number of Facebook likes
• Scraped by researcher Max Woolf at github.com/minimaxir/clickbait-cluster.
• We’ll separate articles from 2016 into the upper third and lower third
of likes.
• Identify words and phrases that predict likes.
• Begin with noun phrases identified from Phrase Machine (Handler et
al. 2016)
• Filter out redundant NPs.
Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun
phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016.
@jasonkessler
@jasonkessler
Scaled-F-Score of engagement by
noun phrase
@jasonkessler
Scaled-F-Score of engagement by
unigram
Messier, but psycholinguist
information: 3rd person pronouns ->
high engagement; 2nd person low
“dies”: obit
Can, guess, how: questions.
Clickbait corpus
• How do terms with similar meanings differ in terms of their
engagement rates?
• Use Gensim (https://radimrehurek.com/gensim/) to find word embeddings
• Use UMAP (McInnes and Healy 2018) to project them into two
dimensions, and explore them with Scattertext.
• Locally groups words with similar embeddings together.
• Better alternative to T-SNE; allows for cosine instead of Euclidean distance
criteria
Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for
Dimension Reduction. Arxiv. 2018.
@jasonkessler
@jasonkessler
This island is mostly
food related.
“Chocolate” and
“cake” are highly
engaging, but
“breakfast” has
predictive of low
engagement.
Term positions from determined by UMAP,
color by Scaled F-Score for engagement.
Clickbait corpus
• How do the Times and Buzzfeed differ in what they talk about, and
their content engages their readers?
• Scattertext can easily create visualizations to help answer these
questions.
• First, we’ll look at how what engages for Buzzfeed contrasts with what
engages for the Times, and vice versa
Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for
Dimension Reduction. Arxiv. 2018.
@jasonkessler
Oddly, NY Times readers distinctly
like articles about sex, death, and
which are written in a smug tone.
This chart doesn’t give a good sense
of what language is more associated
with one site.
@jasonkessler
This chart let’s you know how Buzzfeed and the
Times are distinct, while still distinguishing
engaging content,
@jasonkessler
Thank you! Questions?
@jasonkessler
Jason S. Kessler
Global AI Conference
April 27, 2018
https://github.com/JasonKessler/GlobalAI2018

Más contenido relacionado

Similar a Natural Language Visualization with Scattertext

Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)Tyler Schnoebelen
 
Gender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methodsGender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methodsIdibon1
 
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in betweenVariation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in betweenTyler Schnoebelen
 
BibleTech2013.pptx
BibleTech2013.pptxBibleTech2013.pptx
BibleTech2013.pptxAndi Wu
 
Development and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressionsDevelopment and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressionsRon Martinez
 
Speech act-andrew-d-cohen-1206263596754620-5
Speech act-andrew-d-cohen-1206263596754620-5Speech act-andrew-d-cohen-1206263596754620-5
Speech act-andrew-d-cohen-1206263596754620-5xiongying
 
(1) assignment 7.1 a identifying letters, phonemes, and grapheme
(1) assignment 7.1 a identifying letters, phonemes, and grapheme(1) assignment 7.1 a identifying letters, phonemes, and grapheme
(1) assignment 7.1 a identifying letters, phonemes, and graphemessuserfa5723
 
Creating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music VisualizationCreating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music Visualizationicchp2012
 
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...Takehiko Ito
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
L05 word representation
L05 word representationL05 word representation
L05 word representationananth
 
Quarter Two: Week Two: Day 4-5 English.pptx
Quarter Two: Week Two: Day 4-5 English.pptxQuarter Two: Week Two: Day 4-5 English.pptx
Quarter Two: Week Two: Day 4-5 English.pptxHarleyLaus1
 
Rigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentRigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentSandy Man
 
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...MLconf
 
(1)assignment 7.1 a identifying letters, phonemes, and graphemes
(1)assignment 7.1 a identifying letters, phonemes, and graphemes (1)assignment 7.1 a identifying letters, phonemes, and graphemes
(1)assignment 7.1 a identifying letters, phonemes, and graphemes ssuserfa5723
 
NLP in Practice - Part II
NLP in Practice - Part IINLP in Practice - Part II
NLP in Practice - Part IIDelip Rao
 
Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Sunil Kumar Kopparapu
 

Similar a Natural Language Visualization with Scattertext (20)

Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)
 
Gender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methodsGender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methods
 
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in betweenVariation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
 
BibleTech2013.pptx
BibleTech2013.pptxBibleTech2013.pptx
BibleTech2013.pptx
 
Development and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressionsDevelopment and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressions
 
Speech act-andrew-d-cohen-1206263596754620-5
Speech act-andrew-d-cohen-1206263596754620-5Speech act-andrew-d-cohen-1206263596754620-5
Speech act-andrew-d-cohen-1206263596754620-5
 
(1) assignment 7.1 a identifying letters, phonemes, and grapheme
(1) assignment 7.1 a identifying letters, phonemes, and grapheme(1) assignment 7.1 a identifying letters, phonemes, and grapheme
(1) assignment 7.1 a identifying letters, phonemes, and grapheme
 
Creating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music VisualizationCreating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music Visualization
 
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...
G284 Hori, K & Ito, T. (2018, June) Text mining of children's essays about an...
 
Lsat prep session 1
Lsat prep session 1Lsat prep session 1
Lsat prep session 1
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
L05 word representation
L05 word representationL05 word representation
L05 word representation
 
Quarter Two: Week Two: Day 4-5 English.pptx
Quarter Two: Week Two: Day 4-5 English.pptxQuarter Two: Week Two: Day 4-5 English.pptx
Quarter Two: Week Two: Day 4-5 English.pptx
 
01 Jas
01 Jas01 Jas
01 Jas
 
Rigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentRigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deployment
 
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...
Jacob Eisenstein, Assistant Professor, School of Interactive Computing, Georg...
 
(1)assignment 7.1 a identifying letters, phonemes, and graphemes
(1)assignment 7.1 a identifying letters, phonemes, and graphemes (1)assignment 7.1 a identifying letters, phonemes, and graphemes
(1)assignment 7.1 a identifying letters, phonemes, and graphemes
 
NLP in Practice - Part II
NLP in Practice - Part IINLP in Practice - Part II
NLP in Practice - Part II
 
ENGLISH WEEK 1
ENGLISH WEEK 1ENGLISH WEEK 1
ENGLISH WEEK 1
 
Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.
 

Más de Jason Kessler

Jason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with TwitterJason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with TwitterJason Kessler
 
Discovering Persuasive Language through Observing Customer Behavior
Discovering Persuasive Language through Observing Customer BehaviorDiscovering Persuasive Language through Observing Customer Behavior
Discovering Persuasive Language through Observing Customer BehaviorJason Kessler
 
Scattertext: A Tool for Visualizing Differences in Language
Scattertext: A Tool for Visualizing Differences in LanguageScattertext: A Tool for Visualizing Differences in Language
Scattertext: A Tool for Visualizing Differences in LanguageJason Kessler
 
From Sentiment to Persuasion Analysis: A Look at Idea Generation Tools
From Sentiment to Persuasion Analysis: A Look at Idea Generation ToolsFrom Sentiment to Persuasion Analysis: A Look at Idea Generation Tools
From Sentiment to Persuasion Analysis: A Look at Idea Generation ToolsJason Kessler
 
The 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainThe 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainJason Kessler
 
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...Jason Kessler
 
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...Jason Kessler
 

Más de Jason Kessler (7)

Jason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with TwitterJason Kessler Problems: What's Wrong with Twitter
Jason Kessler Problems: What's Wrong with Twitter
 
Discovering Persuasive Language through Observing Customer Behavior
Discovering Persuasive Language through Observing Customer BehaviorDiscovering Persuasive Language through Observing Customer Behavior
Discovering Persuasive Language through Observing Customer Behavior
 
Scattertext: A Tool for Visualizing Differences in Language
Scattertext: A Tool for Visualizing Differences in LanguageScattertext: A Tool for Visualizing Differences in Language
Scattertext: A Tool for Visualizing Differences in Language
 
From Sentiment to Persuasion Analysis: A Look at Idea Generation Tools
From Sentiment to Persuasion Analysis: A Look at Idea Generation ToolsFrom Sentiment to Persuasion Analysis: A Look at Idea Generation Tools
From Sentiment to Persuasion Analysis: A Look at Idea Generation Tools
 
The 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive DomainThe 2010 JDPA Sentiment Corpus for the Automotive Domain
The 2010 JDPA Sentiment Corpus for the Automotive Domain
 
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Conf...
 
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
Polling the Blogosphere: a Rule-Based Approach to Belief Classification, By J...
 

Último

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 

Último (20)

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 

Natural Language Visualization with Scattertext

  • 1. Natural Language Visualization with Scattertext Jason S. Kessler* Global AI Conference April 27, 2018 Code for all visualizations is available at: https://github.com/JasonKessler/GlobalAI2018 $ pip3 install scattertext @jasonkessler*No, not that Jason Kessler
  • 2. Lexicon speculation Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002. (ACL 2018 Test of Time Award Winner) @jasonkessler
  • 3. Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002. Lexicon mining ≈ lexicon speculation @jasonkessler
  • 4. Language and Demographics Christian Rudder: http://blog.okcupid.com/index.php/page/7/ hobos almond butter 100 Years of Solitude Bikram yoga @jasonkessler
  • 5. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010) OKCupid: Words and phrases that distinguish white men. @jasonkessler
  • 6. Explanation OKCupid: Words and phrases that distinguish Latin men. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010) @jasonkessler
  • 7. Ranking with everyone else The smaller the distance from the top left, the higher the association with white men Source: Christian Rudder. Dataclysm. 2014. Phish is highly associated with white men Kpop is not @jasonkessler
  • 8. @jasonkessler my blue eyes Source: Christian Rudder. Dataclysm. 2014.
  • 9. Scattertext pip install scattertext github.com/JasonKessler/scattertext @jasonkessler Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. - Interactive, d3- based scatterplot - Concise Python API - Automatically displays non- overlapping labels
  • 10. Scaled F-Score • Term-class* associations: • “Good” is associated with the “positive” class • “Bad” with the class “negative” • Core intuition: association relies on two necessary factors • Frequency: How often a term occurs in a class • Precision: P(class|document contains term) • F-Score: • Information retrieval evaluation metric • Harmonic mean between precision and recall • Requires both metrics to be high • *Term: is defined to be a word, phrase or other discrete linguistic element @jasonkessler
  • 12. @jasonkessler Precision Frequency Naïve approach Y-axis: - Precision, i.e. P(class|term), roughly normal distribution - Mean ≅ 0.5, sd ≅ 0.4, X-axis: - Frequency, i.e. P(term|class), roughly power distribution - Mean ≅ 0.00008, sd ≅ 0.008 Color: - Harmonic mean of precision and frequency (blue=high, red=low)
  • 13. @jasonkessler Precision Frequency Naïve approach Y-axis: - Precision, i.e. P(class|term), roughly normal distribution - Mean ≅ 0.5, sd ≅ 0.4, X-axis: - Frequency, i.e. P(term|class), roughly power distribution - Mean ≅ 0.00008, sd ≅ 0.008 Color: - Harmonic mean of precision and frequency (blue=high, red=low)
  • 14. @jasonkessler Precision Frequency Problem: • Top words are just stop words. • Why? • Harmonic mean of uneven distributions. • Most words have prec of ~0.5, leads harmonic mean to rely on frequency.
  • 15. Fix: Normalize Precision and Frequency • Task: make precision and frequency similarly distributed • How: take normal CDF of each term’s precision and frequency • Mean and std. computed from data • Right: log normal CDF @jasonkessler This area is the log-normal CDF of the term “beauty” (0.938 ∈ [0,1]). Each tick mark is the log- frequency of a term. *log-normal CDF isn’t used in these charts
  • 17. Positive Scaled-F-Score Good: positive terms make sense! Still some function words, but that’s okay. Note These frequent terms are all very close to 1 on the x-axis, but are ordered NormCDF-Precision NormCDF - Frequency @jasonkessler
  • 18. Pos Freq. Neg Freq. Prec. Freq % Raw Hmean Prec. CDF Freq. CDF Scaled F-Score best 108 36 75.00% 0.22% 0.44% 71.95% 99.50% 83.51% entertaining 58 13 81.69% 0.12% 0.24% 77.07% 90.94% 83.43% fun 73 26 73.74% 0.15% 0.30% 70.92% 95.63% 81.44% heart 45 11 80.36% 0.09% 0.18% 76.09% 84.49% 80.07% Top Scaled-F Score Terms Note: normalized precision and frequency are on comparable scales, allowing for the harmonic mean to take both into account.
  • 19. Problem: highly negative terms are all low frequency NormCDF-Precision NormCDF - Frequency @jasonkessler Solution: • Compute Scaled F- Score association scores for negative reviews. • Use the highest score
  • 20. Scaled-F-Score Positive Scaled F-Score @jasonkessler Negative Scaled F- Score Note: only one obviously negative term
  • 21. @jasonkessler Scaled-F-Score by log- frequency The score can be overly sensitive to very frequent terms, but still doesn’t score them very highly ScaledF-Score Log Frequency
  • 22. Why not use TF-IDF? • Drastically favors low frequency terms • term in all classes -> idf=1 -> score=0 TF-IDF(Positive)-TF-IDF(Negative) Log Frequency TF IDF
  • 23. Burt Monroe, Michael Colaresi and Kevin Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis. 2008.@jasonkessler Monroe et. al (2009) approach • Bayesian approach to term- association • Likelihood: Z-score of log-odds- ratio • Prior: Term frequency in a background corpus • Posterior: Z-score of log-odds-ratio with background counts as smoothing values Popular, but much more tweaking to get to work than Scaled F Score.
  • 24. @jasonkessler Scattertext reimplementation of Monroe et al. See http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for code. Scattertext implementation (with prior weighting modifications)
  • 25. In defense of stop words Cindy K. Chung and James W. Pennebaker. Counting Little Words in Big Data: The Psychology of Communities, Culture, and History. EASP. 2012 In times of shared crisis, “we” use increases, while “I” use decreases. I/we: age, social integration I: lying, social rank @jasonkessler
  • 26. Function words and gender Newman, ML; Groom, CJ; Handelman LD, Pennebaker, JW. Gender Differences in Language Use: An Analysis of 14,000 Text Samples. 2008. LIWC Dimension Bold: entirely stop words Effect Size (Cohen’s d) (>0 F, <0 M) MANOVA p<.001 All Pronouns (esp. 3rd person) 0.36 Present tense verbs (walk, is, be) 0.18 Feeling (touch, hold, feel) 0.17 Certainty (always, never) 0.14 Word count NS Numbers -0.15 Prepositions -0.17 Words >6 letters -0.24 Swear words -0.22 Articles -0.24 • Performed on a variety of language categories, including speech. • Other studies have found that function words are the best predictors of gender. @jasonkessler
  • 27. Function word usage is counter-intuitive - James W. Pennebaker, Carla J. Groom, Daniel Loew, James M. Dabbs. Testosterone as a Social Inhibitor: Two Case Studies of the Effect of Testosterone Treatment on Language. 2004. - Susan C. Herring, Anna Martinson. Assessing Gender Authenticity in Computer-Mediated Language Use: Evidence From an Identity Game. Journal of Language and Social Psychology. 2004. • Pennebaker et al: • Testosterone levels (in two therapeutic settings) predict: • Modest but significant decreases in: • Pronouns referring to others (ex: we, she, they) • Communication verbs (ex: hear, say) • Modest but significant decreases in: • Optimism words (ex: energy, upbeat) • Negative (but statistically insignificant) correlation with subject’s beliefs about testosterone and category usage! • Does not entirely explain gender differences. • Herring et al: • Subjects tasked with impersonating opposite gender in online game • Discussed stereotypical topics (cars, shopping) but didn’t change stylistic cues @jasonkessler
  • 29. Clickbait corpus • Facebook posts from BuzzFeed, NY Times, etc/ from 2010s. • Includes headline and the number of Facebook likes • Scraped by researcher Max Woolf at github.com/minimaxir/clickbait-cluster. • We’ll separate articles from 2016 into the upper third and lower third of likes. • Identify words and phrases that predict likes. • Begin with noun phrases identified from Phrase Machine (Handler et al. 2016) • Filter out redundant NPs. Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016. @jasonkessler
  • 31. @jasonkessler Scaled-F-Score of engagement by unigram Messier, but psycholinguist information: 3rd person pronouns -> high engagement; 2nd person low “dies”: obit Can, guess, how: questions.
  • 32. Clickbait corpus • How do terms with similar meanings differ in terms of their engagement rates? • Use Gensim (https://radimrehurek.com/gensim/) to find word embeddings • Use UMAP (McInnes and Healy 2018) to project them into two dimensions, and explore them with Scattertext. • Locally groups words with similar embeddings together. • Better alternative to T-SNE; allows for cosine instead of Euclidean distance criteria Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Arxiv. 2018. @jasonkessler
  • 33. @jasonkessler This island is mostly food related. “Chocolate” and “cake” are highly engaging, but “breakfast” has predictive of low engagement. Term positions from determined by UMAP, color by Scaled F-Score for engagement.
  • 34. Clickbait corpus • How do the Times and Buzzfeed differ in what they talk about, and their content engages their readers? • Scattertext can easily create visualizations to help answer these questions. • First, we’ll look at how what engages for Buzzfeed contrasts with what engages for the Times, and vice versa Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Arxiv. 2018. @jasonkessler
  • 35. Oddly, NY Times readers distinctly like articles about sex, death, and which are written in a smug tone. This chart doesn’t give a good sense of what language is more associated with one site. @jasonkessler
  • 36. This chart let’s you know how Buzzfeed and the Times are distinct, while still distinguishing engaging content, @jasonkessler
  • 37. Thank you! Questions? @jasonkessler Jason S. Kessler Global AI Conference April 27, 2018 https://github.com/JasonKessler/GlobalAI2018

Notas del editor

  1. Selected termsWhich words and phrases statistically distinguish ethnic groups and genders?
  2. Selected terms
  3. Selected terms
  4. No link to aggression, achievement, sexual behavior, anger, or markers of cognitive complexity.