This session was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/aXPE6IiKRmI
The 2018 Brazilian Presidential Elections represented a tangible demonstration of radical change in the way candidates conduct their campaigns, as the shift from traditional media to social media hit the shore of the largest country in the southern hemisphere.
Analyzing the political agenda, the broadcast TV debates and the exchanges on social media networks was an NLP feast that The AI Academy reckoned was too good to pass up. In this panel, we present the work we conducted and show how Driverless AI helped us accelerate our NLP experiments thanks to the recent introduction of advanced text analytics recipes.
Bio: Maker/Dreamer/Iconoclast/Chaordic Leader with over 20 years of experience across a number of high-tech industries around the world. Curiosity about new technologies and the ability to adapt to different cultural and social environments have taken him from a research lab in Italy to a startup in Denmark, to a multinational technology company in Silicon Valley, and ultimately to a leading broadband and video service provider in Brazil. Time and again his career has demonstrated his ability to recognize high-potential disruptive ideas at a very early stage and his determination to transform an idea into a real product or service.
Over the past seven years, Carmelo has cultivated his passion for innovation by leading major technology incubations at a large telecom operator, supporting the Brazilian startup ecosystem as a mentor at a startup accelerator, and continuously extending his business and technology knowledge through a blend of formal learning and hands-on project implementations. His focus over the past few years has been on Data Science and Artificial Intelligence, carrying out in-depth technology investigations, product incubations and solutions development.
By establishing The AI Academy, Carmelo intends to create and foster a rich environment for the study, research and application of Machine/Deep Learning techniques to real-life use cases, bridging the AI gap between talent and enterprises, raising Brazil's "AIQ" and putting São Paulo on the world's AI map.
Carmelo Iaria, AI Academy - How The AI Academy is accelerating NLP projects with Driverless AI - H2O World San Francisco
1. Accelerating NLP projects
with Driverless AI
Carmelo IARIA
data artisan,
The AI Academy
https://www.linkedin.com/in/carmeloiaria/
#H2OWORLD
2. The AI Academy is an integrated system of
education, research and consulting services
focusing on the application of Artificial Intelligence
to challenging real-life problems
We believe the speed at which technology is
evolving is rendering current linear education
systems obsolete, and we are rediscovering the
approach of the 15th-century artists' workshops,
where continuous learning was integrated with
project development
4. 2018 Brazilian Presidential Elections Project
THE SCOPE
We studied, through the use of NLP and Visualization techniques, the
following use cases
1. The Ideological Position – [medium post]
• Where the political programs of the 13 presidential candidates stand in an ideological
spectrum
2. The Media War – [medium post]
• Visualize the political attacks by the candidates during the presidential campaign
• Comparing the attacks on traditional media with those on social media
5. USE CASE 1 – the ideological position
STRATEGY
Problem Statement:
Extract a single metric representing the ideological position of each political program
Data Source
• the 13 government programs
Analysis
Dimensions
• Economy and Employment
• Education and Health
• External Policy and
Environment
• Political System and
Corruption
• Social Policy and Human
Rights
• Safety
Visual Encoding
• Weight – how much
importance was given to each
policy in the political program
• Ideology Position – where
does the political program
stand in a left-right ideological
spectrum
Techniques
• Weight – Simple score
representing the number of
words on each specific policy
normalized to the size of the
program
• Ideology – Slapin, Proksch -
A Poisson Scaling Model for
Estimating Time-Series Party
Positions from Text [1];
Wordfish R package [2]
• Visualizations – d3.js [3]
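The Weight score described above (words devoted to a policy, normalized by the size of the program) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function name `policy_weights` is hypothetical, the program text is assumed to be already split by policy (the slides do not say how that assignment was done), and words are counted by simple whitespace splitting.

```python
def policy_weights(sections):
    """Return each policy's share of the program's total word count.

    `sections` maps a policy name to the program text devoted to it
    (assumed to be segmented beforehand).
    """
    counts = {policy: len(text.split()) for policy, text in sections.items()}
    total = sum(counts.values())
    if total == 0:
        # An empty program gives every policy a weight of zero.
        return {policy: 0.0 for policy in counts}
    return {policy: n / total for policy, n in counts.items()}


# Toy program with invented text, for illustration only.
program = {
    "Economy and Employment": "reduce taxes create jobs open markets",
    "Education and Health": "build schools and hospitals",
    "Safety": "more police",
}
weights = policy_weights(program)
```

Because the weights are shares of the same program, they sum to 1, which makes programs of very different lengths comparable.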
6. USE CASE 1 – the ideological position
ABOUT WORDFISH
The proposed methodology uses a Poisson scaling technique to estimate party positions in a single
left-right dimension based on word frequencies in political text. The main advantages of this method
are that
• It is language independent: it can be applied to any language with the same efficacy
• It is an unsupervised technique: unlike other methodologies analyzed in the paper, Wordfish
doesn't require hand-coding (i.e. providing training text that "defines" what's left and what's right in
a political text). This avoids the bias that hand-coding can introduce
It is important to highlight that the technique does not attempt to provide an absolute classification of
"left" or "right" but rather to measure the relative position of the different parties (government programs
in our case) as emerging from the political text.
You can read the paper here.
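For readers who want the model behind the description above, the Poisson scaling idea can be written compactly as below. This is a simplified sketch that drops the time index used in the paper; the symbols follow the usual Wordfish notation and do not appear on the slides.

```latex
% y_{ij}: count of word j in document (government program) i
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\lambda_{ij} = \exp\left(\alpha_i + \psi_j + \beta_j \, \omega_i\right)
% \alpha_i : document fixed effect (controls for program length)
% \psi_j   : word fixed effect (controls for overall word frequency)
% \beta_j  : word weight (how strongly word j discriminates between positions)
% \omega_i : estimated position of document i on the single left-right dimension
```

Only the relative values of the estimated positions are meaningful, which is why the technique yields relative rather than absolute placements.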
7. USE CASE 1 – the ideological position
SAMPLE RESULTS
[Figure: sample results by policy area: Economy & Employment; Human Rights & Social Welfare; Security]
8. USE CASE 1 – the ideological position
SAMPLE RESULTS
Users can navigate a summarized
version of the political programs by
Policy and by Candidate, making them
easy to compare
See the full Interactive Visualization
9. USE CASE 2 – the media war
STRATEGY
Problem Statement:
How do presidential candidates make use of traditional and social media to attack opponents
Data Source
• Presidential candidates'
television debates run on the
main broadcast channels prior
to the first round of the elections
• Twitter accounts of the two
presidential candidates
running in the second round
Analysis
Dimensions
• Political Attack
• Media Type (traditional vs
social)
Visual Encoding
• Participation – indicates which
TV debates the candidate
participated in
• Attacker/Attacked – who
attacked whom
• Number of Attacks – how
many attacks each
presidential candidate
launched during each
television debate or from each
Twitter account
Techniques
• Attacks (mentions + negative
sentiment) – Out-of-box
(Google and IBM SA APIs)
and internally developed
(Driverless AI) Sentiment
Analysis for Portuguese
language
• Visualizations – d3.js [3]
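The "mentions + negative sentiment" rule above can be sketched as follows. This is a minimal illustration, not the project's code: `find_attacks`, the substring-based mention matching and the keyword-based `toy_sentiment` stand-in are all assumptions. In the project the sentiment label came from the Google, IBM or Driverless AI classifiers.

```python
from typing import Callable, List, Tuple


def find_attacks(
    speaker: str,
    utterances: List[str],
    candidates: List[str],
    classify_sentiment: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (attacker, attacked, text) for each negative mention of an opponent."""
    attacks = []
    for text in utterances:
        # Step 1: keep only utterances the classifier labels as negative.
        if classify_sentiment(text) != "negative":
            continue
        # Step 2: look for mentions of any candidate other than the speaker.
        lowered = text.lower()
        for name in candidates:
            if name != speaker and name.lower() in lowered:
                attacks.append((speaker, name, text))
    return attacks


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in classifier based on a tiny keyword list."""
    negative_words = {"disaster", "corrupt", "failed"}
    words = {w.strip(".,!?'").lower() for w in text.split()}
    return "negative" if words & negative_words else "neutral"


# Placeholder names and invented sentences, for illustration only.
candidates = ["Candidate A", "Candidate B", "Candidate C"]
utterances = [
    "Candidate B ran a corrupt administration.",
    "Candidate C and I agree on education.",
    "Our economy failed under the last government.",
]
attacks = find_attacks("Candidate A", utterances, candidates, toy_sentiment)
```

Passing the classifier in as a function keeps the mention logic unchanged when one sentiment model is swapped for another.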
10. USE CASE 2 – the media war
DEVELOPING OUR SENTIMENT ANALYSIS CLASSIFIER
Sentiment Analysis on Brazilian Portuguese corpora
We tested a number of approaches for extracting political attacks from the political debates on
television and on social media. In particular, we wanted to validate whether it made sense to develop our own
classifier for Sentiment Analysis or whether the out-of-box solutions were good enough when applied to an
NLP problem in Brazilian Portuguese
A. Out-of-Box Sentiment Analysis classifiers
1) IBM Watson Sentiment Analysis API
2) Google Sentiment Analysis API
B. Internally Developed Sentiment Analysis classifier – public dataset
1) Driverless AI NLP recipes to train a model on publicly available datasets
C. Internally Developed Sentiment Analysis classifier – large own corpus (*)
1) Driverless AI NLP recipes to automatically annotate a large corpus of Brazilian Portuguese political documents
(*) future work
11. Internally Developed Sentiment Analysis
classifier
EXPERIMENTS SET UP
Environment
• AWS EC2 p2.8xlarge instance:
• 8x GPUs
• 32 vCPUs
• 488GB RAM
Experiments
1. Airline sentiment analysis dataset [4]:
• It has 14,640 valid tweets from 2/17/2015 to 2/24/2015 related to reviews of major U.S. airlines, containing a sentiment
label, a negative reason label, the tweet content and other meta information such as location and user ID. The class split is
roughly 15% positive, 65% negative, and 20% neutral.
2. Political Social Media Posts dataset [5]:
• This dataset, from Crowdflower's Data For Everyone Library, provides text of 5000 messages from politicians' social
media accounts, along with human judgments about the purpose, partisanship, and audience of the messages
3. TweetSentBR dataset [6] (selected):
• The annotated dataset is composed of 15,000 tweets split into two sets: a training set with 12,999 documents
labeled as positive (44%), neutral (26%) and negative (29%), and a test set of 2,001 documents with a similar
distribution (45%, 25% and 29% respectively)
Tools
• Driverless AI release 1.5 AMI
• R/Python – data preparation
12. Internally Developed Sentiment Analysis
classifier
DRIVERLESS AI - NLP
The support for NLP on Driverless AI allowed us to extract
features from raw text that were used to carry forward our
sentiment analysis classification task
NLP Features
• Word Count
• TFIDF
• Word Embeddings
NLP specific models
• Truncated SVD on word count
• Linear models on TFIDF vectors
• Convolutional neural network models on word embeddings
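To make the TFIDF feature listed above concrete, here is a minimal pure-Python sketch of one common TF-IDF weighting (term frequency times log of inverse document frequency). It is an assumption-laden illustration: the function name `tfidf` is hypothetical, tokenization is plain whitespace splitting, and the exact weighting and preprocessing used inside Driverless AI are not stated in the slides.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


# Invented toy tweets in Portuguese, for illustration only.
docs = ["voo bom", "voo ruim", "voo atrasado ruim"]
vectors = tfidf(docs)
```

A term that appears in every document ("voo" here) gets weight zero, so the vectors emphasize the words that distinguish one tweet from another; a linear model is then trained on these vectors.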
Variable Importance
+----+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------+
|    | Relative Importance | Feature             | Description                                                                                                  |
|----+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------|
|  0 | 1                   | 1_TxtCNN_TE:tweet.2 | Predicted probabilities of class #3 based on CNN model on text column ['tweet']                              |
|  1 | 0.78723             | 1_TxtCNN_TE:tweet.0 | Predicted probabilities of class #1 based on CNN model on text column ['tweet']                              |
|  2 | 0.66161             | 2_TxtTE:tweet.0     | Predicted probabilities of class #1 based on linear model on Tfidf features from text column ['tweet']       |
|  3 | 0.518799            | 1_TxtCNN_TE:tweet.1 | Predicted probabilities of class #2 based on CNN model on text column ['tweet']                              |
|  4 | 0.518165            | 2_TxtTE:tweet.2     | Predicted probabilities of class #3 based on linear model on Tfidf features from text column ['tweet']       |
|  5 | 0.239108            | 2_TxtTE:tweet.1     | Predicted probabilities of class #2 based on linear model on Tfidf features from text column ['tweet']       |
|  6 | 0.0201291           | 0_Txt:tweet.1       | Feature #2 of tf-idf-based word embedding (followed by dimensionality reduction to 75 dimensions) of 'tweet' |
+----+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------+
TOP 6 Engineered Features during model building
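+--------------------------+--------------------+------------------+
| Driverless AI Stage      | Timing (seconds)   | Number of Models |
|--------------------------+--------------------+------------------|
| Data Preparation         | 7.73               | 0                |
| Model and Feature Tuning | 2,699.92 (45 mins) | 730              |
| Feature Evolution        | 2,283.49 (38 mins) | 1566             |
| Final Pipeline Training  | 1,386.20 (23 mins) | 12               |
+--------------------------+--------------------+------------------+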
This classification experiment completed in 1 hour and 49 minutes (1:49:52), using 0 of the
1 original features, and 167 of the 1,140 engineered features.
Driverless AI built a stacked ensemble of 2 XGBoostModels and 2 LightGBMModels to predict
sentiment given 1 original feature from the input dataset trainTT.csv.
Experiment Summary
13. USE CASE 2 – the media war
MODELS PERFORMANCE COMPARISON
F1-Score
+-------------------+---------------------+--------------------+---------------------+
| Framework         | F1-score [negative] | F1-score [neutral] | F1-score [positive] |
|-------------------+---------------------+--------------------+---------------------|
| Google Cloud      | 0.4313725           | 0.2357320          | 0.6467569           |
| IBM Watson        | 0.6921381           | 0.3324022          | 0.7383943           |
| H2O Driverless AI | 0.6854839           | 0.5028185          | 0.7881669           |
+-------------------+---------------------+--------------------+---------------------+
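For reference, a per-class F1-score like the ones reported above can be computed from true and predicted labels as sketched below. This is a generic illustration with invented toy labels, not the evaluation script used in the project; the function name `f1_per_class` is hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} where F1 = 2*TP / (2*TP + FP + FN) for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A label that is never true and never predicted gets a score of 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented labels, for illustration only.
y_true = ["negative", "negative", "neutral", "positive", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive", "positive", "negative"]
scores = f1_per_class(y_true, y_pred)
```

Reporting the score per class, rather than a single accuracy figure, exposes weak spots such as the neutral class, where the three frameworks differ the most.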
14. USE CASE 2 – the media war
TV DEBATES VISUALIZATION
An attack is represented by a
line that connects distinct
regions
The external ring of the plot is divided
into regions, each one representing a
candidate
The regions are further separated
into sections (divided by thin lines),
each section representing a debate
Visualization inspired by a 2007
article published by The New
York Times. In that article each
line represented a
mention. We detected
mentions and then applied
sentiment analysis to
show only those classified as
negative
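A ring-and-line layout of this kind is typically driven by a square matrix of counts between participants. The sketch below shows one way to aggregate detected attacks into such a matrix before handing it to a plotting library. It is an assumption, not the project's pipeline: the `(attacker, attacked, debate)` record format and the function name `attack_matrix` are hypothetical, and the candidate names are placeholders.

```python
def attack_matrix(attacks, candidates):
    """Count attacks per (attacker, attacked) pair.

    matrix[i][j] is the number of attacks from candidates[i] on candidates[j].
    """
    index = {name: i for i, name in enumerate(candidates)}
    matrix = [[0] * len(candidates) for _ in candidates]
    for attacker, attacked, _debate in attacks:
        matrix[index[attacker]][index[attacked]] += 1
    return matrix


# Placeholder records, for illustration only.
candidates = ["A", "B", "C"]
attacks = [
    ("A", "B", "debate 1"),
    ("A", "B", "debate 2"),
    ("C", "A", "debate 1"),
]
matrix = attack_matrix(attacks, candidates)
```

Keeping the debate in each record means the same list can also be filtered per debate, matching the per-debate sections of the ring.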
15. USE CASE 2 – the media war
TV DEBATES VISUALIZATION
There’s a small circular scale on the outside of the section
- minor tick marks represent ten words from attack phrases
- major tick marks represent one hundred words.
For instance, candidate Ciro Gomes spoke about 1100
words in attack phrases across five television debates
Attacks received are pointing
to the candidate name in the
middle of the segment
Attacks from a specific debate can
be highlighted, and a specific attack
can be viewed by selecting its line
16. USE CASE 2 – the media war
TV DEBATES VISUALIZATION
Not surprisingly, Jair Bolsonaro was the candidate who received the most attacks (despite having participated in
only 2 television debates), and he was also the candidate who made the fewest attacks in the debates
[Figure panels: Attacks on Bolsonaro during pre-election television debates | Attacks by Bolsonaro during pre-election television debates]
17. USE CASE 2 – the media war
ATTACKS ON TWITTER VISUALIZATION
[Figure panels: TV DEBATES ATTACKS | TWITTER ATTACKS]
18. 2018 Brazilian Presidential Elections Project
CONCLUSIONS
• Performing NLP projects for non-English languages presents a number of additional
challenges.
• While out-of-box Cognitive/Natural Language APIs offer a number of powerful NLP
functionalities, in this project we have validated that better performance can be achieved
by building your own model to classify sentiment in Brazilian Portuguese text.
• Leveraging the NLP recipes built into the Driverless AI Automated Machine Learning
pipeline, we've been able to tremendously accelerate the experimentation cycle, allowing
us to focus on machine learning strategy definition, interpretation of the results and the
creation of powerful interactive visualizations to extract insights on the topic analyzed.
19. The project team
Data Lens
xan
Data Artisan
carmelo
Data Ninja
kubo
Lady Data
carol
ferrArI
driverless
IA enthusiast &
Data believer
santiago
Kung Fu Pandas
bruno
20. 2018 Brazilian Presidential Elections Project
REFERENCES
Techniques
• Slapin, Proksch - A Poisson Scaling Model for Estimating Time-Series Party Positions from Text [link]
• NLP techniques in Driverless AI [link]
• Jonathan Corum and Farhana Hossain (NYT): Naming Names - names used by major presidential candidates in
the series of Democratic and Republican debates leading up to the Iowa caucuses [link]
Tools
• Driverless AI [link]
• Wordfish R package [link]
• d3.js [link]
• Circos [link]
Datasets
• Airline sentiment analysis dataset [link]
• Political Social Media Posts dataset [link]
• Building a Sentiment Corpus of Tweets in Brazilian Portuguese [link]
Editor's notes
The 2018 Brazilian Presidential Elections represented a significant change not only in the election results but also in the way the candidates conducted their campaigns
The winning candidate deserted television debates and conducted his campaign from social media platforms in a way nobody had ever done in Brazil