This webinar discusses how to perform sentiment analysis on large datasets using Apache Hive. It provides an overview of sentiment analysis and demonstrates Hive UDFs useful for preprocessing text data and extracting n-grams. It also includes a tutorial that analyzes sentiment around the topic of "mortgage" using the MemeTracker dataset, which contains 90 million records of URLs, timestamps, memes and links in 36GB of JSON data. More advanced, custom sentiment analysis can be built using Hive's extensibility framework.
Basic Sentiment Analysis using Hive
1. Sentiment Analysis using Hive
Secrets From the Pros
We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!
Assess your level and learn new stuff
This webinar is intended for intermediate audiences
(familiar with Apache Hive and Hadoop, but not experts)
2. News Cycle for “Mortgage” 2008-09
[Chart: Mortgage: Crisis, Foreclosures, Fraud. Series: Crisis, Foreclosure, Fraud, each with a linear trend line. Y-axis: -10 to 90. X-axis: dates from 6/12/04 to 5/28/05.]
Table: MemeTracker
36GB of JSON Data
# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links
3. AGENDA
This Webinar provides tips on doing basic sentiment analysis on
large data sets using Hive:
• Overview of Sentiment Analysis (SA)
• Hive UDFs useful for SA
• Demo, Guided Tutorial
• Developing advanced, custom SA Engines
4. Sentiment Analysis
Applications
• Gather Customer Feedback
– Direct: call center logs, emails, chat logs
– Indirect: social media, forums, review websites
• Sentiment Analysis
– Over time, geography
– By customer, market segments
• Use for Decision Making
– Product / service decisions
– Customer support
– Marketing: messaging, offers
– Customer retention, upsell
5. Sentiment Analysis
How to operationalize a Sentiment Analysis App
  1. Crawl, scrape, API calls, collect
  2. Create “Documents”
  3. Pre-process data
  4. Apply language model, extract sentiment
  5. Integrate with marketing automation, CRM, CCA, etc. (OLTP)
  6. Improve product, better CS, targeted offers
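Steps 2 to 4 of this workflow can be sketched in a few lines of Python. This is only a toy illustration, not the webinar's method: the word lists `POSITIVE` and `NEGATIVE` and all function names are hypothetical, and a simple lexicon count stands in for a real language model.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained model or a curated word list.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "crisis", "fraud"}

def preprocess(document):
    """Step 3: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", document.lower())

def extract_sentiment(tokens):
    """Step 4: score = (number of positive tokens) - (number of negative tokens)."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def run_pipeline(documents):
    """Steps 2-4: map each document to a (document, score) pair."""
    return [(doc, extract_sentiment(preprocess(doc))) for doc in documents]
```

The (document, score) pairs are what step 5 would hand to downstream systems.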
6. Pre- and Post-Processing
Hive Built-In Functions

Goal: Tokenization
Input data: “Hello There! How are you?”
Output data: ((“Hello”, “There”), (“How”, “are”, “you”))
Use this Hive UDF: sentences

Goal: Column (array) to rows
Input data: [1, 2, 3]
Output data: three rows: 1, 2, 3
Use this Hive UDF: explode

Goal: Navigating documents, extracting fields
Input data:
{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@xyz.net",
 "owner":"amy"
}
Output data: {"weight":8,"type":"apple"}
Use this Hive UDF: get_json_object(src_json.json, '$.store.fruit[0]')
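To make the three transformations concrete, here is a rough Python analogue of each one. These are sketches of the behavior shown in the table, not Hive's implementations: the regex tokenizer only approximates what `sentences` does, and `get_path` is a hypothetical helper that takes a list of keys instead of a JSON path string.

```python
import json
import re

def sentences(text):
    """Split text into sentences, then each sentence into words."""
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"\w+", p) for p in parts if p.strip()]

def explode(values):
    """Turn one array value into one single-column row per element."""
    return [(v,) for v in values]

def get_path(json_text, keys):
    """Parse JSON text and walk it by a list of object keys and array indexes."""
    node = json.loads(json_text)
    for key in keys:
        node = node[key]
    return node

# The sample document from the slide, on one line.
DOC = ('{"store": {"fruit": [{"weight": 8, "type": "apple"}, '
       '{"weight": 9, "type": "pear"}], '
       '"bicycle": {"price": 19.95, "color": "red"}}, '
       '"email": "amy@xyz.net", "owner": "amy"}')
```

For example, `get_path(DOC, ["store", "fruit", 0])` walks store, then fruit, then the first array element, which is the apple record shown as the output above.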
7. N-Gram
Language Models
Q: What is a language model?
A: A mathematical model that assigns probability to a sequence of m words
Q: What is an “n-gram” model?
A: Probabilistic language model for predicting next word in a sequence of words
Q: What is an n-gram?
A: A contiguous sequence of n items from a given sequence of text
E.g.: “Mary had a little lamb”
Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
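The definitions above can be checked with a short Python sketch. The function names are illustrative, and `bigram_probability` is a plain maximum-likelihood estimate (count of the pair divided by count of the first word as a pair starter), which is one simple way to turn bigram counts into a next-word model.

```python
from collections import Counter

def ngrams(words, n):
    """Return every contiguous run of n words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bigram_probability(words, prev, nxt):
    """Estimate P(nxt | prev) from bigram counts; 0.0 if prev never starts a bigram."""
    pair_counts = Counter(ngrams(words, 2))
    prev_count = sum(c for (first, _), c in pair_counts.items() if first == prev)
    return pair_counts[(prev, nxt)] / prev_count if prev_count else 0.0
```

Calling `ngrams("Mary had a little lamb".split(), 2)` yields the four bi-grams listed above.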
8. N-Gram Language Model
Hive Built-In Functions

Goal: Find important topics using a stop word list, trending topics
Input data: Collection of sentences
Output data: k most frequently occurring n-grams
Use this Hive UDF: ngrams

Goal: Extract intelligence around certain keywords, pre-compute search lookaheads
Input data: Collection of sentences
Output data: k most frequently occurring n-grams around a “context” word. E.g.: “Government shutdown”
Use this Hive UDF: context_ngrams
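A small Python sketch can show what these two functions compute. It uses exact counts on in-memory lists, whereas the Hive functions return estimated frequencies over large tables, so treat it as an illustration of the idea only. The function names are hypothetical, and `None` marks the blank positions in the context, mirroring the nulls used in a Hive context array.

```python
from collections import Counter

def top_ngrams(sentences, n, k):
    """Return the k most frequent n-grams across a collection of tokenized sentences."""
    counts = Counter(
        tuple(words[i:i + n])
        for words in sentences
        for i in range(len(words) - n + 1)
    )
    return counts.most_common(k)

def top_context_ngrams(sentences, context, k):
    """Return the k most frequent fillers for the None slots of a context pattern."""
    size = len(context)
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - size + 1):
            window = words[i:i + size]
            # A window matches when every fixed context word lines up.
            if all(c is None or c == w for c, w in zip(context, window)):
                counts[tuple(w for c, w in zip(context, window) if c is None)] += 1
    return counts.most_common(k)
```

With a context of `["government", None]`, the second function ranks the words that most often follow "government".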
9. Dataset Used: MemeTracker
How MemeTracker.org creates the dataset
• 90 million sources
• 900K news stories / day
• Track 17M memes
Table: MemeTracker
6GB of Data / month
# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links
10. Analyze Sentiment on “Mortgage”
By Tracking How Memes spread, using Hive
What is a Meme?
“Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc
Table: MemeTracker
36GB of JSON Data
# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links
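The kind of aggregation behind this analysis can be sketched in Python. This is a hypothetical illustration, not the tutorial's query: it assumes each record is a dict with the four columns listed above, that timestamps are strings beginning with 'YYYY-MM', and that the tracked terms are the three series named in the "Mortgage" chart.

```python
from collections import Counter

# Terms tracked alongside the topic; taken from the chart legend.
TERMS = ("crisis", "foreclosure", "fraud")

def monthly_term_counts(records, topic="mortgage", terms=TERMS):
    """Count, per (month, term), the memes that mention both the topic and the term."""
    counts = Counter()
    for record in records:
        month = record["timestamp"][:7]  # assumes 'YYYY-MM-DD ...' timestamp strings
        for meme in record["memes"]:
            text = meme.lower()
            if topic in text:
                for term in terms:
                    if term in text:  # substring match, so "foreclosures" also counts
                        counts[(month, term)] += 1
    return counts
```

Plotting these counts by month for each term gives one line per term, similar in spirit to the news-cycle chart.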
12. Hive’s Extensibility Framework
• There are many UDFs built into Hive
• For more advanced users, Hive allows many
ways to extend the language
– SerDes
– UDFs, UDAFs, and UDTFs
– Hive Streaming
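As an illustration of the Hive Streaming option, a script in any language can act as a row transformer: Hive writes rows to the script's standard input as tab-separated lines and reads result rows back from its standard output. The sketch below assumes a hypothetical two-column input (url, text) and emits a word count as a placeholder for a real scoring model.

```python
import sys

def transform(lines):
    """Map each tab-separated input row (url, text) to an output row (url, word_count)."""
    for line in lines:
        # Assumes every row has exactly two tab-separated columns.
        url, text = line.rstrip("\n").split("\t", 1)
        yield f"{url}\t{len(text.split())}"

def main(stream_in, stream_out):
    """Read rows from stream_in and write transformed rows to stream_out."""
    for row in transform(stream_in):
        stream_out.write(row + "\n")

# When saved as a standalone script, call: main(sys.stdin, sys.stdout)
```

A script like this would be invoked from a query through Hive's TRANSFORM ... USING clause; the word count can then be replaced with a call into an existing language model or third-party library.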
13. How to access this Tutorial
• Create a free Qubole Account (www.qubole.com)
• Log in, click “Analyze”, then look for the “Tutorials”
tab at the top of the page
14. Summary
• Pre and post processing
– Use Hive
• Language Models
– Use pre-existing language models codified as Hive UDFs, such as
ngrams and context_ngrams
– UDFs: build your own language model in Java using the Hive UDF
framework
– Hive Streaming: plug in your existing language models or third-party
libraries
• Visualization
– Use a spreadsheet / BI reporting tool
15. THANK YOU
What’s Included?
Managed Cluster, Built-In Connectors, Friendly User Interface, Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud
• Auto-Scaling Cluster. Full Life-cycle Management
• 12+ Connectors to Applications and Data Sources
• 14-Day Free Trial (free account available)
• 24/7 Customer Support
www.qubole.com/try
Editor's notes
Great for modeling clicks and impressions to understand a buyer's intent: intent to purchase or to churn. Quality: banks, call centers,
Information diffusion. Data is already gathered, documents created, memes extracted. A lot of work is already done; the data is ready for you. You can do this on your own with Twitter feeds.
Solutions: many. Framework: pre-processing, applying the model, post-processing. Challenges: scaling.