Sentiment Analysis using Hive
Secrets From the Pros
We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!
Assess your level and learn new stuff
This webinar is intended for intermediate audiences
(familiar with Apache Hive and Hadoop, but not experts)
News Cycle for “Mortgage” 2008-09
Mortgage: Crisis, Foreclosures, Fraud
[Chart: time series for Crisis, Foreclosure, and Fraud, each with a linear trend line]
Table: MemeTracker
36GB of JSON Data
# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links
AGENDA
This Webinar provides tips on doing basic sentiment analysis on
large data sets using Hive:
• Overview of Sentiment Analysis (SA)
• Hive UDFs useful for SA
• Demo, Guided Tutorial
• Developing advanced, custom SA Engines
Sentiment Analysis Applications
Gather Customer Feedback
• Direct: call center logs, emails, chat logs
• Indirect: social media, forums, review websites
• Over time, geography
• By customer, market segments
Sentiment Analysis
Use for Decision Making
• Product / service decisions
• Customer support
• Marketing: messaging, offers
• Customer retention, upsell
Sentiment Analysis
How to operationalize a Sentiment Analysis App
1. Crawl, scrape, API calls, collect
2. Create “Documents”
3. Pre-process data
4. Apply language model, extract sentiment
5. Integrate with marketing automation, CRM, CCA, etc. (OLTP)
6. Improve product, better CS, targeted offers
Pre- and Post-Processing
Hive Built-In Functions

Goal: Tokenization
Input Data: (“Hello There! How are you?”)
Output Data: ((“Hello”, “There”), (“How”, “are”, “you”))
Use this Hive UDF: sentences

Goal: Column (array) to rows
Input Data: [1, 2, 3]
Output Data: 1, 2, 3 (one value per row)
Use this Hive UDF: explode

Goal: Navigating documents, extracting fields
Input Data:
{"store":
{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@xyz.net",
"owner":"amy"
}
Output Data: {"weight":8,"type":"apple"}
Use this Hive UDF: get_json_object(src_json.json, '$.store.fruit[0]')
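To make the input and output shapes in this table concrete, here is a minimal plain-Python sketch that mimics the three transformations. It is not Hive's implementation: the function bodies, the tokenization regex, and the tiny path resolver (which only handles `$.key.key[index]`) are assumptions chosen to reproduce the slide's examples. Note that the sample document nests `fruit` under `store`, so the path used here starts at `$.store`.

```python
import json
import re

def sentences(text):
    """Split text into sentences, then split each sentence into word tokens."""
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"[A-Za-z0-9']+", p) for p in parts if p.strip()]

def explode(values):
    """Yield one single-column row per array element."""
    for v in values:
        yield (v,)

def get_json_object(json_string, path):
    """Resolve a small JSONPath subset of the form $.key.key[index]."""
    node = json.loads(json_string)
    for key, index in re.findall(r"\.(\w+)(?:\[(\d+)\])?", path):
        node = node[key]
        if index:
            node = node[int(index)]
    return node

# Sample document copied from the slide.
DOC = (
    '{"store": {"fruit": [{"weight": 8, "type": "apple"},'
    ' {"weight": 9, "type": "pear"}],'
    ' "bicycle": {"price": 19.95, "color": "red"}},'
    ' "email": "amy@xyz.net", "owner": "amy"}'
)

print(sentences("Hello There! How are you?"))
print(list(explode([1, 2, 3])))
print(get_json_object(DOC, "$.store.fruit[0]"))
```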
N-Gram Language Models
Q: What is a language model?
A: A mathematical model that assigns a probability to a sequence of m words
Q: What is an “n-gram” model?
A: A probabilistic language model for predicting the next word in a sequence of words
Q: What is an n-gram?
A: A contiguous sequence of n items from a given sequence of text
E.g.: “Mary had a little lamb”
Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
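The definitions above can be sketched in a few lines of Python. The helper names are hypothetical, and the next-word estimate is a plain maximum-likelihood bigram ratio, which is one simple way to build an n-gram model rather than the method used in the webinar.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_probability(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev, anything)."""
    bigram_counts = Counter(ngrams(tokens, 2))
    starts = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, word)] / starts if starts else 0.0

tokens = "Mary had a little lamb".split()
print(ngrams(tokens, 2))
print(next_word_probability(tokens, "had", "a"))
```

With a single five-word sentence every bigram occurs once, so each observed next word gets probability 1.0; a real corpus would spread the probability across many continuations.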
N-Gram Language Model
Hive Built-In Functions

Goal: Find important topics using a stop word list, trending topics
Input Data: Collection of sentences
Output Data: k most frequently occurring n-grams
Use this Hive UDF: ngrams

Goal: Extract intelligence around certain keywords, pre-compute search look-aheads
Input Data: Collection of sentences
Output Data: k most frequently occurring n-grams around a “context” word, e.g. “Government shutdown”
Use this Hive UDF: context_ngrams
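As a rough illustration of what these two functions compute, the Python sketch below counts n-grams exactly and in memory. The function names, the use of `None` to mark the blank slot in the context, and the sample sentences are assumptions for illustration; this is not how Hive implements the UDFs.

```python
from collections import Counter

def top_ngrams(sentences, n, k):
    """Return the k most frequent n-grams across tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(k)

def top_context_ngrams(sentences, context, k):
    """Return the k most frequent fillers for the None slots in context."""
    counts = Counter()
    width = len(context)
    for tokens in sentences:
        for i in range(len(tokens) - width + 1):
            window = tokens[i:i + width]
            # A window matches when every fixed context word lines up.
            if all(c is None or c == w for c, w in zip(context, window)):
                filler = tuple(w for c, w in zip(context, window) if c is None)
                counts[filler] += 1
    return counts.most_common(k)

# Hypothetical tokenized sentences.
corpus = [
    ["government", "shutdown", "ends"],
    ["government", "shutdown", "looms"],
    ["government", "shutdown", "ends"],
]

print(top_ngrams(corpus, 2, 1))
print(top_context_ngrams(corpus, ["government", "shutdown", None], 2))
```

The context form answers "which words most often follow this phrase", which is the shape of result a search look-ahead would pre-compute.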
Dataset used: MemeTracker
How MemeTracker.org creates the dataset
90 million sources
900K news stories / day
Track 17M memes
Table: MemeTracker
6GB of Data / month
# of records: 90M/partition
Partitions: Month
Columns: URL, Timestamp, Array of Memes, Links
Analyze Sentiment on “Mortgage”
By Tracking How Memes spread, using Hive
What is a Meme?
“Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc
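One way to picture this analysis is as a monthly count of memes that mention “mortgage” together with a keyword such as crisis, foreclosure, or fraud. The Python sketch below assumes records shaped like the MemeTracker table columns (URL, timestamp, array of memes, links); the field names, the timestamp format, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

KEYWORDS = ("crisis", "foreclosure", "fraud")

def monthly_keyword_counts(records, topic="mortgage", keywords=KEYWORDS):
    """Count, per month, memes containing the topic and each keyword."""
    counts = defaultdict(Counter)
    for rec in records:
        # Assumed timestamp format: "YYYY-MM-DD HH:MM:SS".
        stamp = datetime.strptime(rec["timestamp"], "%Y-%m-%d %H:%M:%S")
        month = stamp.strftime("%Y-%m")
        for meme in rec["memes"]:
            text = meme.lower()
            if topic in text:
                for kw in keywords:
                    if kw in text:
                        counts[month][kw] += 1
    return {month: dict(c) for month, c in counts.items()}

# Hypothetical records for illustration only.
records = [
    {"url": "http://example.com/a", "timestamp": "2008-09-15 10:00:00",
     "memes": ["the mortgage crisis deepens", "green eggs and ham"], "links": []},
    {"url": "http://example.com/b", "timestamp": "2008-09-20 08:30:00",
     "memes": ["mortgage fraud and foreclosure filings"], "links": []},
    {"url": "http://example.com/c", "timestamp": "2008-10-02 12:00:00",
     "memes": ["mortgage crisis"], "links": []},
]

print(monthly_keyword_counts(records))
```

Plotting these per-month counts for each keyword would give a time series of the kind a spreadsheet or BI tool can chart.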
Demo
Hive’s Extensibility Framework
• There are many UDFs built into Hive
• For more advanced users Hive allows many
ways to extend the language
– SERDEs
– UDFs, UDAFs, and UDTFs
– Hive Streaming
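For the Hive Streaming option, Hive can pipe rows through an external script with its `TRANSFORM ... USING` clause; by default the script receives tab-separated columns, one row per line, and writes rows back in the same form. The Python sketch below shows the shape such a script could take. The two-column input (URL and text), the toy word lists, and the scoring rule are all hypothetical placeholders for a real language model or third-party library.

```python
import io

# Hypothetical toy lexicon; a real script would plug in an existing model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "crisis", "fraud"}

def score_text(text):
    """Return (# positive words) - (# negative words) for the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def main(stdin, stdout):
    """Read 'url<TAB>text' rows and write 'url<TAB>score' rows."""
    for line in stdin:
        url, _, text = line.rstrip("\n").partition("\t")
        stdout.write(f"{url}\t{score_text(text)}\n")

# Local demonstration with in-memory streams. In a streaming script the
# call would instead be main(sys.stdin, sys.stdout).
demo_in = io.StringIO("u1\tgreat service love it\nu2\tmortgage fraud crisis\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the row logic in `main(stdin, stdout)` makes the script easy to test locally before it is attached to a query.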
How to access this Tutorial
• Create a free Qubole Account (www.qubole.com)
• Log in → Click on “Analyze” → Look for the “Tutorials” tab at top of page
Summary
• Pre and post processing
– Use Hive
• Language Models
– Use pre-existing language models codified as Hive UDFs such as
ngrams and context_ngrams
– UDFs: build your own language model in Java using the Hive UDF framework
– Hive Streaming: plug in your existing language models or 3rd party libraries
• Visualization
– Use a spreadsheet / BI reporting tool
THANK YOU
What’s Included?
Managed Cluster, Built-In Connectors, Friendly User-Interface, Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud
• Auto-Scaling Cluster. Full Life-cycle Management
• +12 Connectors to Applications and Data Sources
• 14-Day Free Trial (free account available)
• 24/7 Customer Support
www.qubole.com/try

[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Basic Sentiment Analysis using Hive

  • 1. Sentiment Analysis using Hive: Secrets From the Pros. We will be starting at 11:03 PDT. Use the Chat Pane in GoToWebinar to ask questions! Assess your level and learn new stuff. This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts).
  • 2. News Cycle for “Mortgage” 2008-09. Mortgage: Crisis, Foreclosures, Fraud. [Chart: Crisis, Foreclosure, and Fraud series plotted over time, each with a linear trend line.] Table: MemeTracker, 36GB of JSON data. # of records: 90M/partition. Partitions: Month. Columns: URL, Timestamp, Array of Memes, Links.
  • 3. AGENDA: This webinar provides tips on doing basic sentiment analysis on large data sets using Hive: overview of Sentiment Analysis (SA); Hive UDFs useful for SA; demo, guided tutorial; developing advanced, custom SA engines.
  • 4. Sentiment Analysis Applications. Gather customer feedback: direct (call center logs, emails, chat logs) and indirect (social media, forums, review websites). Sentiment analysis: over time and geography; by customer and market segment. Use for decision making: product / service decisions; customer support; marketing (messaging, offers); customer retention, upsell.
  • 5. Sentiment Analysis: How to operationalize a Sentiment Analysis app. 1. Crawl, scrape, API calls, collect. 2. Create “documents”. 3. Pre-process data. 4. Apply language model, extract sentiment. 5. Integrate with Mktg Automation, CRM, CCA, etc. (OLTP). 6. Improve product, better CS, targeted offers.
  • 6. Pre- and Post-processing: Hive Built-In Functions (goal; input data; output data; Hive UDF to use). Tokenization: input (“Hello There! How are you?”); output ((“Hello”, “There”), (“How”, “are”, “you”)); UDF: sentences. Column (array) to rows: input [1, 2, 3]; output the rows 1, 2, 3; UDF: explode. Navigating documents, extracting fields: input {"store": {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@xyz.net", "owner":"amy"}; output {"weight":8,"type":"apple"}; UDF: get_json_object(src_json.json, '$.store.fruit[0]')
  • 7. N-Gram Language Models. Q: What is a language model? A: A mathematical model that assigns a probability to a sequence of m words. Q: What is an “n-gram” model? A: A probabilistic language model for predicting the next word in a sequence of words. Q: What is an n-gram? A: A contiguous sequence of n items from a given sequence of text. E.g., “Mary had a little lamb” has the bi-grams “Mary had”, “had a”, “a little”, “little lamb”.
  • 8. N-Gram Language Model: Hive Built-In Functions (goal; input data; output data; Hive UDF to use). Find important topics using a stop word list, trending topics: input a collection of sentences; output the k most frequently occurring n-grams; UDF: ngrams. Extract intelligence around certain keywords, pre-compute search look-aheads: input a collection of sentences; output the k most frequently occurring n-grams around a “context” word, e.g. “Government shutdown”; UDF: context_ngrams
  • 9. Dataset used: MemeTracker. How MemeTracker.org creates the dataset: 90 million sources; 900K news stories / day; track 17M memes. Table: MemeTracker, 6GB of data / month. # of records: 90M/partition. Partitions: Month. Columns: URL, Timestamp, Array of Memes, Links.
  • 10. Analyze Sentiment on “Mortgage” by Tracking How Memes Spread, using Hive. What is a meme? “Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc. Table: MemeTracker, 36GB of JSON data. # of records: 90M/partition. Partitions: Month. Columns: URL, Timestamp, Array of Memes, Links.
  • 11. Demo
  • 12. Hive’s Extensibility Framework • There are many UDFs built into Hive • For more advanced users Hive allows many ways to extend the language – SERDEs – UDFs, UDAFs, and UDTFs – Hive Streaming
  • 13. How to access this Tutorial • Create a free Qubole Account (www.qubole.com) • Log in → Click on “Analyze” → Look for the “Tutorials” tab at the top of the page
  • 14. Summary • Pre- and post-processing – Use Hive • Language Models – Use pre-existing language models codified as Hive UDFs such as ngrams and context_ngrams – UDFs: Build your own language model in Java using the Hive UDF framework – Hive Streaming: Plug in your existing language models or 3rd party libraries • Visualization – Use a spreadsheet / BI reporting tool
  • 15. THANK YOU Managed Cluster Built-In Connectors Friendly User-Interface Dedicated Support • 100% Managed Hadoop Cluster in the Cloud • Auto-Scaling Cluster. Full Life-cycle Management • +12 Connectors to Applications and Data Sources • 14-Day Free Trial (free account available) • 24/7 Customer Support What’s Included?  www.qubole.com/try 
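
The tokenization and n-gram ideas from slides 6 to 8 can be sketched in a few lines of plain Python. This is only an illustration of what the computations look like, not how Hive implements its sentences, ngrams, or context_ngrams UDFs; the function names merely echo the UDF names, and the regular expressions used for splitting are assumptions made for this sketch.

```python
import re
from collections import Counter


def sentences(text):
    """Split text into sentences, then each sentence into word tokens."""
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"[A-Za-z0-9']+", p) for p in parts if p.strip()]


def ngrams(sents, n):
    """Yield every contiguous run of n tokens inside each sentence."""
    for words in sents:
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def top_k_ngrams(sents, n, k):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(sents, n)).most_common(k)


def context_ngrams(sents, context, k):
    """Count the words that fill the blanks around fixed context words.

    `context` is a tuple in which None marks a blank slot,
    e.g. ("government", None) asks: which words follow "government"?
    """
    counts = Counter()
    for gram in ngrams(sents, len(context)):
        if all(c is None or c == g for c, g in zip(context, gram)):
            fillers = tuple(g for c, g in zip(context, gram) if c is None)
            counts[fillers] += 1
    return counts.most_common(k)
```

For instance, `list(ngrams(sentences("Mary had a little lamb"), 2))` yields the four bi-grams listed on slide 7, and `context_ngrams(sents, ("government", None), 2)` ranks the words that most often follow "government" in a tokenized corpus `sents`.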

Editor's notes

  1. Great to model clicks and impressions and try to understand what a buyer's intent is: intent to purchase or churn. Quality: banks, call centers.
  2. Great to model clicks and impressions and try to understand what a buyer's intent is: intent to purchase or churn. Quality: banks, call centers.
  3. Information diffusion. Data is already gathered, documents created, memes extracted. A lot of work is already done; the data is ready for you. You can do this on your own with Twitter feeds.
  4. Solutions: many. Framework: pre-processing, applying the model, post-processing. Challenges: scaling.
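
The three-stage framework in the last note (pre-processing, applying the model, post-processing) might be sketched as follows. This is a minimal Python illustration under stated assumptions: the JSON field names `timestamp` and `memes` are hypothetical stand-ins for the MemeTracker columns, and the "model" is a crude substring count of the three terms charted on slide 2, not a real language model.

```python
import json
from collections import Counter, defaultdict

# Terms charted on slide 2; matched as lowercase substrings (an assumption).
TRACKED = ("crisis", "foreclosure", "fraud")


def preprocess(raw_line):
    """Stage 1: parse one JSON record and normalize its memes."""
    rec = json.loads(raw_line)
    month = rec["timestamp"][:7]  # "2008-09-14 ..." -> "2008-09"
    memes = [m.lower() for m in rec["memes"]]
    return month, memes


def apply_model(memes, keyword="mortgage"):
    """Stage 2: count tracked terms in memes that mention the keyword."""
    hits = Counter()
    for meme in memes:
        if keyword in meme:
            for term in TRACKED:
                if term in meme:
                    hits[term] += 1
    return hits


def postprocess(lines):
    """Stage 3: aggregate the counts per month, ready for charting."""
    by_month = defaultdict(Counter)
    for line in lines:
        month, memes = preprocess(line)
        by_month[month].update(apply_model(memes))
    return {m: dict(c) for m, c in sorted(by_month.items())}
```

Calling `postprocess` on an iterable of JSON lines returns a month-keyed dictionary of term counts, which could then be loaded into a spreadsheet or BI tool as the summary slide suggests. At MemeTracker scale the same stages would run as Hive queries rather than in a single Python process.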