Carmelo Iaria, AI Academy - How The AI Academy is accelerating NLP projects with Driverless AI - H2O World San Francisco

Accelerating NLP projects
with Driverless AI
Carmelo IARIA
data artisan,
The AI Academy
https://www.linkedin.com/in/carmeloiaria/
#H2OWORLD

The AI Academy is an integrated system of
education, research and consulting services
focused on applying Artificial Intelligence
to challenging real-life problems.
We believe the speed at which technology is
evolving is making current linear education
systems obsolete, so we are rediscovering the
approach of 15th-century artists' workshops,
where continuous learning was integrated with
project development.

2018 Brazilian Presidential Elections Project
THE CONTEXT

THE SCOPE
We studied the following use cases using NLP and visualization techniques:
1. The Ideological Position – [medium post]
• Where the political programs of the 13 presidential candidates stand on an ideological
spectrum
2. The Media War – [medium post]
• Visualize the political attacks made by the candidates during the presidential campaign
• Compare the attacks on traditional media with those on social media

USE CASE 1 – the ideological position
STRATEGY
Problem Statement:
Extract a single metric representing the ideological position of each political program
Data Source
• the 13 government programs
Analysis
Dimensions
• Economy and Employment
• Education and Health
• External Policy and
Environment
• Political System and
Corruption
• Social Policy and Human
Rights
• Safety
Visual Encoding
• Weight – how much
importance was given to each
policy in the political program
• Ideology Position – where
does the political program
stand in a left-right ideological
spectrum
Techniques
• Weight – Simple score
representing the number of
words on each specific policy
normalized to the size of the
program
• Ideology – Slapin, Proksch -
A Poisson Scaling Model for
Estimating Time-Series Party
Positions from Text [1],
Wordfish R package [2]
• Visualizations – d3.js [3]
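
As a minimal sketch of the Weight score described above, the snippet below counts the words a program devotes to each policy and divides by the total number of words in the program. The function name `policy_weights`, the input format (a mapping from policy name to that policy's text) and the sample text are assumptions made for illustration; how each program was split by policy is not covered here.

```python
def policy_weights(sections):
    """Return the share of a program's words devoted to each policy.

    `sections` maps a policy name to the text the program devotes to it
    (hypothetical input format, assumed for this sketch).
    """
    # Whitespace tokenization is a simplification; any tokenizer would do.
    counts = {policy: len(text.split()) for policy, text in sections.items()}
    total = sum(counts.values())
    if total == 0:
        return {policy: 0.0 for policy in counts}
    return {policy: n / total for policy, n in counts.items()}


# Toy program with made-up text, only to show the shape of the output.
program = {
    "Economy and Employment": "create jobs reduce taxes support small business",
    "Safety": "more police",
    "Education and Health": "build schools",
}
weights = policy_weights(program)
```

Because each score is a share of the program's total length, the weights of one program sum to 1 and programs of different sizes can be compared directly.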

ABOUT WORDFISH
The proposed methodology uses a Poisson scaling technique to estimate party positions on a single
left-right dimension based on word frequencies in political text. The main advantages of this method
are that
• It is language independent: it can be applied to any language with the same efficacy
• It is an unsupervised technique: unlike other methodologies analyzed in the paper, Wordfish
doesn't require hand-coding (i.e. providing training text that "defines" what's left and what's right in
a political text). This removes the bias that hand-coded training text can introduce
It is important to highlight that the technique does not attempt to provide an absolute classification of
"left" or "right" but rather to measure the relative position of the different parties (government programs
in our case) as emerging from the political text.
You can read the paper here.
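
The model behind Wordfish can be sketched in a few lines. In the paper's formulation, the count of word j in document i is Poisson-distributed with rate exp(alpha_i + psi_j + beta_j * omega_i), where alpha_i is a document fixed effect, psi_j a word fixed effect, beta_j the word's weight in discriminating between positions, and omega_i the document's position. The snippet below only evaluates that rate and the resulting log-likelihood for given parameters; it is not the Wordfish R package and it does not perform the estimation. The function names and the toy numbers are assumptions.

```python
import math


def wordfish_rate(alpha_i, psi_j, beta_j, omega_i):
    """Expected count of word j in document i under the Poisson scaling model."""
    return math.exp(alpha_i + psi_j + beta_j * omega_i)


def log_likelihood(counts, alpha, psi, beta, omega):
    """Poisson log-likelihood of a document-by-word count matrix.

    counts[i][j] is how often word j occurs in document i.
    """
    total = 0.0
    for i, row in enumerate(counts):
        for j, y in enumerate(row):
            lam = wordfish_rate(alpha[i], psi[j], beta[j], omega[i])
            # log Poisson pmf: y*log(lam) - lam - log(y!)
            total += y * math.log(lam) - lam - math.lgamma(y + 1)
    return total


# Toy data (made up): 2 documents, 2 words.
counts = [[8, 1], [1, 8]]
alpha = [0.0, 0.0]   # document fixed effects
psi = [1.0, 1.0]     # word fixed effects
beta = [-1.0, 1.0]   # word 0 signals one side, word 1 the other
separated = log_likelihood(counts, alpha, psi, beta, omega=[-1.0, 1.0])
identical = log_likelihood(counts, alpha, psi, beta, omega=[0.0, 0.0])
```

With these toy counts, positions that separate the two documents give a higher log-likelihood than placing both at the same point. Only the relative ordering of the omega values is meaningful, which matches the note above that the technique measures relative rather than absolute positions.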

SAMPLE RESULTS
[Figure panels: Economy & Employment; Human Rights & Social Welfare; Security]

SAMPLE RESULTS
Users can navigate a summarized
version of the political programs by
policy and by candidate, making them
easy to compare
See the full Interactive Visualization

USE CASE 2 – the media war
STRATEGY
Problem Statement:
How do presidential candidates make use of traditional and social media to attack opponents?
Data Source
• Presidential candidates'
television debates aired on the
main broadcast channels before
the first round of the elections
• Twitter accounts of the two
presidential candidates
running in the second round
Analysis
Dimensions
• Political Attack
• Media Type (traditional vs
social)
Visual Encoding
• Participation – indicates which
TV debates the candidate
participated in
• Attacker/Attacked – who
attacked whom
• Number of Attacks – how
many attacks each
presidential candidate
launched during each
television debate or from their
Twitter account
Techniques
• Attacks (mentions + negative
sentiment) – Out-of-box
(Google and IBM SA APIs)
and internally developed
(Driverless AI) Sentiment
Analysis for Portuguese
language
• Visualizations – d3.js [3]
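
The attack definition above (a mention of another candidate combined with negative sentiment) can be sketched as follows. The sentiment classifier is passed in as a plain function returning "negative", "neutral" or "positive". The `toy_sentiment` function, the substring name-matching rule, the candidate names and the sample sentences are all hypothetical stand-ins, not the project's actual pipeline or any vendor API.

```python
from collections import Counter


def count_attacks(utterances, candidates, classify_sentiment):
    """Count (attacker, attacked) pairs.

    utterances: iterable of (speaker, sentence) tuples.
    candidates: names to look for in each sentence.
    classify_sentiment: callable returning "negative", "neutral" or "positive".
    """
    attacks = Counter()
    for speaker, sentence in utterances:
        lowered = sentence.lower()
        for name in candidates:
            # A mention is another candidate's name appearing in the sentence.
            mentioned = name != speaker and name.lower() in lowered
            if mentioned and classify_sentiment(sentence) == "negative":
                attacks[(speaker, name)] += 1
    return attacks


def toy_sentiment(sentence):
    """Keyword stand-in for a real classifier (illustration only)."""
    return "negative" if "failed" in sentence.lower() else "neutral"


# Made-up debate lines with made-up candidate names.
debate = [
    ("Silva", "Costa failed the country."),
    ("Costa", "I thank Silva for the question."),
    ("Silva", "We need more schools."),
]
attacks = count_attacks(debate, ["Silva", "Costa"], toy_sentiment)
```

Keeping the classifier as a parameter makes it straightforward to swap an out-of-box API for an internally trained model and compare the resulting attack counts.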

DEVELOPING OUR SENTIMENT ANALYSIS CLASSIFIER
Sentiment Analysis on Brazilian Portuguese corpora
We tested several approaches to extracting political attacks from the political debates on
television and on social media. In particular, we wanted to check whether it made sense to develop our own
Sentiment Analysis classifier or whether the out-of-box solutions were good enough when applied to an
NLP problem in Brazilian Portuguese.
A. Out-of-Box Sentiment Analysis classifiers
1) IBM Watson Sentiment Analysis API
2) Google Sentiment Analysis API
B. Internally Developed Sentiment Analysis classifier – public dataset
1) Driverless AI NLP recipes to train a model on publicly available datasets
C. Internally Developed Sentiment Analysis classifier – large own corpus (*)
1) Driverless AI NLP recipes to automatically annotate a large corpus of Brazilian Portuguese political documents
(*) future work

Internally Developed Sentiment Analysis
classifier
EXPERIMENT SETUP
Environment
• AWS EC2 p2.8xlarge instance:
• 8x GPUs
• 32 vCPUs
• 488GB RAM
Experiments
1. Airline sentiment analysis dataset [4]:
• It has 14,640 valid tweets from 2/17/2015 to 2/24/2015 reviewing major U.S. airlines, with a sentiment
label, a negative-reason label, the tweet content and other metadata such as location and user ID. The class split is
roughly 15% positive, 65% negative, and 20% neutral.
2. Political Social Media Posts dataset [5]:
• This dataset, from Crowdflower's Data For Everyone Library, provides text of 5000 messages from politicians' social
media accounts, along with human judgments about the purpose, partisanship, and audience of the messages
3. TweetSentBR dataset [6] (selected):
• The annotated dataset is composed of 15,000 tweets split into two sets: a training set with 12,999 documents
labeled as positive (44%), neutral (26%) or negative (29%), and a test set of 2,001 documents with a similar
distribution (45%, 25% and 29% respectively)
Tools
• Driverless AI release 1.5 AMI
• R/Python – data preparation

Internally Developed Sentiment Analysis
classifier
DRIVERLESS AI - NLP
The NLP support in Driverless AI allowed us to extract
features from raw text, which we then used for our
sentiment analysis classification task
NLP Features
• Word Count
• TFIDF
• Word Embeddings
NLP specific models
• Truncated SVD on word count
• Linear models on TFIDF vectors
• Convolutional neural network models on word embeddings
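
To make the TFIDF feature concrete, here is a minimal pure-Python sketch of term frequency times inverse document frequency. It uses one common variant (term count divided by document length, multiplied by log(N / df)); the implementation and weighting scheme inside Driverless AI may differ, and the function name and sample tweets are made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Made-up tweets, only to show the call.
tweets = ["great debate tonight", "terrible debate tonight", "great crowd"]
vectors = tfidf(tweets)
```

In the second tweet, "terrible" appears in only one document and therefore gets a higher weight than "debate", which appears in two. Vectors of this kind are the sort of input a linear model on TFIDF features would consume.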
Variable Importance
+---+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------+
|   | Relative Importance | Feature             | Description                                                                                                  |
+---+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------+
| 0 | 1                   | 1_TxtCNN_TE:tweet.2 | Predicted probabilities of class #3 based on CNN model on text column ['tweet']                              |
| 1 | 0.78723             | 1_TxtCNN_TE:tweet.0 | Predicted probabilities of class #1 based on CNN model on text column ['tweet']                              |
| 2 | 0.66161             | 2_TxtTE:tweet.0     | Predicted probabilities of class #1 based on linear model on Tfidf features from text column ['tweet']       |
| 3 | 0.518799            | 1_TxtCNN_TE:tweet.1 | Predicted probabilities of class #2 based on CNN model on text column ['tweet']                              |
| 6 | 0.0201291           | 0_Txt:tweet.1       | Feature #2 of tf-idf-based word embedding (followed by dimensionality reduction to 75 dimensions) of 'tweet' |
+---+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------+
TOP 6 Engineered Features during model building
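+--------------------------+--------------------+------------------+
| Driverless AI Stage      | Timing (seconds)   | Number of Models |
+--------------------------+--------------------+------------------+
| Data Preparation         | 7.73               | 0                |
| Model and Feature Tuning | 2,699.92 (45 mins) | 730              |
| Feature Evolution        | 2,283.49 (38 mins) | 1,566            |
| Final Pipeline Training  | 1,386.20 (23 mins) | 12               |
+--------------------------+--------------------+------------------+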
This classification experiment completed in 1 hour and 49 minutes (1:49:52), using 0 of the
1 original features, and 167 of the 1,140 engineered features.
Driverless AI built a stacked ensemble of 2 XGBoostModels and 2 LightGBMModels to predict
sentiment given 1 original feature from the input dataset trainTT.csv.
Experiment Summary

MODELS PERFORMANCE COMPARISON
F1-Score
+-------------------+---------------------+--------------------+---------------------+
| Framework         | F1-score [negative] | F1-score [neutral] | F1-score [positive] |
+-------------------+---------------------+--------------------+---------------------+
| Google Cloud      | 0.4313725           | 0.2357320          | 0.6467569           |
| IBM Watson        | 0.6921381           | 0.3324022          | 0.7383943           |
| H2O Driverless AI | 0.6854839           | 0.5028185          | 0.7881669           |
+-------------------+---------------------+--------------------+---------------------+
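
For reference, the per-class F1-score reported above is the harmonic mean of precision and recall for one class treated as the positive label. The sketch below computes it from true and predicted labels and also macro-averages the three per-class values from the table. The toy labels are invented, and the macro average is a derived convenience figure that does not appear in the original slides.

```python
def f1_per_class(y_true, y_pred, label):
    """F1-score for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up) to show the call.
y_true = ["neg", "neg", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neu", "neu", "pos", "pos", "neg"]
toy_f1_neg = f1_per_class(y_true, y_pred, "neg")

# Per-class scores copied from the table: negative, neutral, positive.
table = {
    "Google Cloud": (0.4313725, 0.2357320, 0.6467569),
    "IBM Watson": (0.6921381, 0.3324022, 0.7383943),
    "H2O Driverless AI": (0.6854839, 0.5028185, 0.7881669),
}
macro_f1 = {name: sum(scores) / len(scores) for name, scores in table.items()}
```

Read per class, the table shows the largest gap on neutral tweets, while on negative tweets IBM Watson and Driverless AI are within 0.01 of each other.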

TV DEBATES VISUALIZATION
An attack is represented by a
line that connects distinct
regions
The external ring of the plot is divided
into regions, each one representing a
candidate
The regions are further separated
into sections (divided by thin lines),
each section representing a debate
Visualization inspired by a 2007
article published by The New
York Times. In that article each
line represented a
mention. We detected
mentions and then applied
sentiment analysis to show
only those classified as
negative

There’s a small circular scale on the outside of the section
- minor tick marks represent ten words from attack phrases
- major tick marks represent one hundred words.
For instance, candidate Ciro Gomes spoke about 1100
words in attack phrases across five television debates
Attacks received point
to the candidate's name in the
middle of the segment
Attacks from a specific debate can
be highlighted, and a specific attack
can be viewed by selecting its line

Not surprisingly, Jair Bolsonaro was the candidate who received the most attacks (despite having participated in
only 2 television debates), and he was also the candidate who made the fewest attacks during the debates
[Figure panels: Attacks on Bolsonaro during pre-election television debates; Attacks by Bolsonaro during pre-election television debates]

ATTACKS ON TWITTER VISUALIZATION
[Figure panels: TV debates attacks; Twitter attacks]

CONCLUSIONS
• Performing NLP projects for non-English languages presents a number of additional
challenges.
• While out-of-box Cognitive/Natural Language APIs offer a number of powerful NLP
functionalities, in this project a model we built ourselves classified sentiment in Brazilian
Portuguese text better overall: it had the highest F1-score on the neutral and positive classes
and was within 0.01 of IBM Watson on the negative class.
• By leveraging the NLP recipes built into the Driverless AI Automated Machine Learning
pipeline, we were able to greatly accelerate the experimentation cycle, allowing
us to focus on machine learning strategy definition, interpretation of the results and the
creation of powerful interactive visualizations to extract insights on the topic analyzed.

The project team
• Data Lens – xan
• Data Artisan – carmelo
• Data Ninja – kubo
• Lady Data – carol
• ferrArI – driverless
• IA enthusiast & Data believer – santiago
• Kung Fu Pandas – bruno

REFERENCES
Techniques
• Slapin, Proksch - A Poisson Scaling Model for Estimating Time-Series Party Positions from Text [link]
• NLP techniques in Driverless AI [link]
• Jonathan Corum and Farhana Hossain (NYT): Naming Names - names used by major presidential candidates in a
series of Democratic and Republican debates leading up to the Iowa caucuses [link]
Tools
• Driverless AI [link]
• Wordfish R package [link]
• d3.js [link]
• Circos [link]
Datasets
• Airline sentiment analysis dataset [link]
• Political Social Media Posts dataset [link]
• Building a Sentiment Corpus of Tweets in Brazilian Portuguese [link]
