by Zignal Labs
Today, machine learning solves a range of everyday business challenges. Companies leverage machine learning to understand how their brands are perceived in the marketplace across key stakeholder segments: How does the brand resonate with customers and the media? What product feedback and enhancement opportunities can be learned?
By harnessing the power of machine learning, Zignal monitors and analyzes – in real time – brand conversations across social, broadcast, digital, and traditional media channels. In this session, learn how Zignal leverages Amazon SageMaker, Amazon Mechanical Turk, AWS CodePipeline, and AWS Lambda to accurately measure the brand health of major enterprises such as NVIDIA and Airbnb. Zignal will dive deep into how Amazon SageMaker and these services work together to run machine learning models in a real-time media environment.
2. Starbucks - Racial Profiling
The shutdown for racial-bias training was estimated to cost an additional $16.7 million in lost revenue.
3. Agenda
1. Why are we building yet another sentiment API?
2. How we leverage Amazon Mechanical Turk to collect labeled data
3. Utilizing Amazon SageMaker to regularly retrain and update models in a resilient fashion
www.linkedin.com/in/jeffreyfenchel
5. Customer Feedback
“The sentiment is too neutral.”
“I have removed sentiment from all my reports.”
“I spend hours doing manual sentiment overrides.”
“Why was tweet X labeled neutral/positive/negative?”
8. Rule Based Sentiment
● Positive if it:
○ mentions the company and no dissatisfaction is expressed
○ portrays the company as being sustainable
○ is introducing a new executive
● Negative if it:
○ equates the company to something negative, e.g. world hunger
● Neutral if it:
○ focuses on a new facility being opened
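Rules like these can be sketched as a simple keyword matcher. This is a minimal illustration only: the function name and keyword lists below are hypothetical, not Zignal's actual rules.

```python
def rule_based_sentiment(text: str) -> str:
    """Toy rule-based polarity classifier (keywords are illustrative)."""
    t = text.lower()

    # Negative: equates the company to something negative, e.g. world hunger
    if any(kw in t for kw in ("world hunger", "scandal", "disaster")):
        return "negative"

    # Positive: sustainability messaging or a new-executive announcement
    if "sustainab" in t or "new executive" in t or "new ceo" in t:
        return "positive"

    # Neutral: coverage of a new facility being opened
    if "opens" in t and ("facility" in t or "store" in t):
        return "neutral"

    # Default when no rule fires
    return "neutral"
```

Hand-written rules like these are brittle, which is part of why the talk moves on to learned models trained on labeled data.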
9. Reputation Polarity
"Polarity for reputation: Does the information (facts, opinions) in the text have positive, negative,
or neutral implications for the image of the company? This problem is related to sentiment
analysis and opinion mining, but has substantial differences with the mainstream research in that
areas: polar facts are ubiquitous (for instance, “Lehmann Brothers goes bankrupt” is a fact with
negative implications for reputation), perspective plays a key role. The same information may
have negative implications from the point of view of clients and positive from the point of view of
investors, negative sentiments may have positive polarity for reputation (for example, “R.I.P.
Michael Jackson. We’ll miss you” has a negative associated sentiment - sadness -, but a positive
implication for the reputation of Michael Jackson.)”
-- RepLab 2012
10. Reputation Polarity
"Polarity for reputation: Does the information (facts, opinions) in the text have positive, negative,
or neutral implications for the image of the company? This problem is related to sentiment
analysis and opinion mining, but has substantial differences with the mainstream research in that
areas: polar facts are ubiquitous (for instance, “Lehmann Brothers goes bankrupt” is a fact with
negative implications for reputation), perspective plays a key role. The same information may
have negative implications from the point of view of clients and positive from the point of view of
investors, negative sentiments may have positive polarity for reputation (for example, “R.I.P.
Michael Jackson. We’ll miss you” has a negative associated sentiment - sadness -, but a positive
implication for the reputation of Michael Jackson.)”
-- RepLab 2012
Polarity scale: Negative / Neutral / Positive
15. Where do we start?
$ click_
https://github.com/pallets/click
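A labeling workflow tool built on click might start like this. The command name and options below are illustrative, not Zignal's actual tooling.

```python
import click


@click.command()
@click.option("--batch-size", default=100, show_default=True,
              help="Number of documents per labeling batch.")
@click.option("--dry-run", is_flag=True,
              help="Print the batch plan without submitting HITs.")
def label(batch_size: int, dry_run: bool) -> None:
    """Submit a batch of documents to Mechanical Turk for labeling."""
    click.echo(f"Submitting batch of {batch_size} documents (dry_run={dry_run})")
```

click handles argument parsing, `--help` text, and composition of subcommands, which keeps small operational tools like this easy to grow.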
16. Quality Control
● Fleiss’ Kappa Agreement
● Worker quality and bias assessment
with expectation maximization
● Qualification test and training
○ 21% pass rate
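Fleiss' kappa measures agreement among a fixed number of raters across many items, corrected for chance. A minimal textbook implementation (this is the standard formula, not Zignal's code):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    Each row gives, for one item, how many raters chose each category;
    every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Mean observed agreement across items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items

    # Expected chance agreement from marginal category frequencies
    total = n_items * n_raters
    p_e = sum((sum(row[j] for row in ratings) / total) ** 2 for j in range(n_cats))

    return (p_bar - p_e) / (1 - p_e)
```

Kappa is 1 for perfect agreement, 0 for chance-level agreement, and negative when raters agree less than chance would predict.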
17. Continuous Labeling
Complete records are critical, including:
● Raw assignment answers from MTurk + HIT info
● Computed worker evaluations (quality + bias + support)
● Best-fit answers
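The "complete record" above can be sketched as a small schema. All class and field names here are illustrative, inferred from the bullet list, not Zignal's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WorkerEvaluation:
    """Per-worker statistics computed across labeling batches."""
    quality: float            # agreement-derived score in [0, 1]
    bias: Dict[str, float]    # per-category bias, e.g. {"positive": 0.1}
    support: int              # number of assignments behind the estimate


@dataclass
class LabelingRecord:
    """Everything retained for one labeled item."""
    hit_id: str                                          # MTurk HIT identifier
    raw_answers: Dict[str, str] = field(default_factory=dict)          # worker_id -> label
    evaluations: Dict[str, WorkerEvaluation] = field(default_factory=dict)
    best_fit: Optional[str] = None                       # consensus label after weighting
```

Keeping the raw answers alongside the derived fields means consensus labels can be recomputed whenever the worker-evaluation model improves.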
19. Quality in Test != Quality at Scale
● Label accuracy: 92% → 73%
● We get repeat workers!
[Charts: cumulative histogram of workers by repeat work (ratio of workers vs. number of batches with contribution); worker score [0, 1] vs. cross-polarity error rate]
47. Model Deployment (Review)
1) Deploy: new models introduced as a 5% variant
2) Promote: promoted if 5xx replies < 10%
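In SageMaker, the 5% split corresponds to weighted ProductionVariants on one endpoint, and promotion to shifting the weights (e.g. via the UpdateEndpointWeightsAndCapacities API). The decision logic itself can be sketched as a pure function; the name and threshold default below are a hypothetical reading of the slide, not Zignal's code.

```python
def should_promote(total_replies: int, replies_5xx: int,
                   threshold: float = 0.10) -> bool:
    """Promote the canary variant when its 5xx error rate stays under threshold.

    total_replies: requests served by the 5% variant during the trial window
    replies_5xx:   how many of those returned a 5xx error
    """
    if total_replies == 0:
        return False  # no traffic observed yet; keep waiting
    return replies_5xx / total_replies < threshold
```

Gating on server errors rather than model quality keeps the check cheap; quality regressions are caught upstream by the retraining evaluation.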
48. Serving Model Architecture
● Amazon SageMaker provided framework:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own
● API consumer-side batching
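Consumer-side batching means the API client buffers individual documents and sends them to the endpoint in groups, amortizing per-request overhead. A minimal sketch (class and parameter names are illustrative; `invoke` would wrap something like the sagemaker-runtime `invoke_endpoint` call):

```python
class BatchingClient:
    """Buffer single documents and flush them as one endpoint call."""

    def __init__(self, invoke, batch_size: int = 32):
        self._invoke = invoke        # callable taking a list of docs, returning labels
        self._batch_size = batch_size
        self._buffer = []

    def submit(self, doc: str):
        """Queue a document; returns the batch's results once it fills up."""
        self._buffer.append(doc)
        if len(self._buffer) >= self._batch_size:
            return self.flush()
        return None

    def flush(self):
        """Send whatever is buffered, even a partial batch."""
        if not self._buffer:
            return []
        batch, self._buffer = self._buffer, []
        return self._invoke(batch)
```

A production version would also flush on a timer so low-traffic periods do not strand documents in the buffer.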
49. Summary
● Continuously gather labeled data from Mechanical Turk
● Leverage Amazon SageMaker to retrain daily and provide an endpoint for our real-time data pipeline
○ Serverless
○ Provides architecture patterns
● Received positive feedback in trials with numerous customers, especially around sentiment directionality
● Future Work
○ Explore hyperparameter tuning
○ Improve the inclusion of relevance in sentiment analysis