Sentiment Knowledge Discovery in Twitter Streaming Data
Albert Bifet and Eibe Frank
University of Waikato
Hamilton, New Zealand
Canberra, 7 October 2010
Discovery Science 2010
Twitter: A Massive Data Stream
Web 2.0
Micro-blogging service
Built to discover what is happening at any moment in time,
anywhere in the world.
106 million registered users
600 million search queries per day
3 billion requests a day via its API.
Outline
1 Twitter Streaming Data
2 Twitter Sentiment Classification: Metrics and Methods
3 Empirical results
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Evaluation procedures for Data Streams
Holdout
Interleaved Test-Then-Train ("Prequential" Evaluation)
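The interleaved test-then-train procedure can be sketched as follows. This is a minimal illustration, not MOA's implementation: the `MajorityClassLearner` class, its `predict`/`learn` methods, and the toy stream are hypothetical names introduced here. Each example is first used to test the model and only then to train it, so it is inspected once and a prediction is available at any point.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical toy learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training example has arrived there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it (one pass over the stream)."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test ...
            correct += 1
        learner.learn(x, y)           # ... then train
        total += 1
    return correct / total if total else 0.0


# Toy stream of (features, label) pairs; the features are ignored by this learner.
stream = [("a", "+"), ("b", "+"), ("c", "-"), ("d", "+")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Any learner exposing the same two methods could be dropped in; the evaluation loop itself keeps no copy of the stream.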
Twitter Streaming API
Twitter APIs
Streaming API
Two discrete REST APIs
Real-time access to Tweets in sampled and filtered form
HTTP based: GET, POST, and DELETE requests
Sentiment Analysis on Twitter
Sentiment analysis
Classifying messages into two categories depending on whether they convey positive or negative feelings
Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification
Positive Emoticons    Negative Emoticons
:)                    :(
:-)                   :-(
: )                   : (
:D
=)
Table: List of positive and negative emoticons.
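As an illustration of how the emoticons in the table could define class labels, here is a small sketch. The function names are hypothetical, and two choices are assumptions of this example rather than something stated on the slides: tweets containing both kinds of emoticon are left unlabeled, and emoticons are stripped from the text so a classifier cannot simply read the label off the input.

```python
# Emoticon lists taken from the table above.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def emoticon_label(text):
    """Return 'positive', 'negative', or None if the tweet is unlabeled."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds (assumed ambiguous here).
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(text):
    """Remove the label-defining emoticons and normalize whitespace."""
    for e in POSITIVE + NEGATIVE:
        text = text.replace(e, "")
    return " ".join(text.split())
```

For example, `emoticon_label("great day :)")` yields `"positive"`, and `strip_emoticons` leaves `"great day"` as the training text.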
Streaming Data Evaluation with Unbalanced Classes
                 Predicted Class+   Predicted Class-   Total
Correct Class+   75                 8                  83
Correct Class-   7                  10                 17
Total            82                 18                 100
Table: Simple confusion matrix example
                 Predicted Class+   Predicted Class-   Total
Correct Class+   68.06              14.94              83
Correct Class-   13.94              3.06               17
Total            82                 18                 100
Table: Confusion matrix for chance predictor
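The chance-predictor matrix can be derived from the row and column totals of the observed matrix: each cell is (row total × column total) / n. The sketch below reproduces the second table from the first; the function name is a hypothetical helper introduced for this example.

```python
def chance_confusion_matrix(matrix):
    """Expected cell counts for a predictor that keeps the same marginals
    as `matrix` but assigns predictions independently of the true class."""
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: correct class (+, -); columns: predicted class (+, -).
observed = [[75, 8], [7, 10]]
chance = chance_confusion_matrix(observed)
# Rounded to two decimals: [[68.06, 14.94], [13.94, 3.06]]
```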
Kappa Statistic
p_0: classifier’s prequential accuracy
p_c: probability that a chance classifier makes a correct prediction
κ statistic:
\[ \kappa = \frac{p_0 - p_c}{1 - p_c} \]
κ = 1 if the classifier is always correct
κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier
Forgetting mechanism for estimating prequential kappa
Sliding window of size w with the most recent observations
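A sliding-window estimate of κ could look like the sketch below. The class name and interface are hypothetical, and returning 0.0 when the chance agreement equals 1 (where κ is undefined) is a convention assumed here. A bounded `deque` provides the forgetting mechanism: once w observations are stored, each new one evicts the oldest.

```python
from collections import deque


class SlidingWindowKappa:
    """Kappa computed over the w most recent (true, predicted) label pairs."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest pairs drop out automatically

    def add(self, true_label, predicted_label):
        self.window.append((true_label, predicted_label))

    def kappa(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        labels = {t for t, _ in self.window} | {p for _, p in self.window}
        # p0: observed accuracy inside the window.
        p0 = sum(t == p for t, p in self.window) / n
        # pc: agreement expected from a chance classifier with the same marginals.
        pc = sum(
            (sum(t == c for t, _ in self.window) / n)
            * (sum(p == c for _, p in self.window) / n)
            for c in labels
        )
        if pc == 1.0:
            return 0.0  # assumed convention: kappa is undefined in this case
        return (p0 - pc) / (1.0 - pc)


# Feed in the confusion matrix example: accuracy is 0.85 and pc is 0.7112,
# so kappa comes out at roughly 0.48.
meter = SlidingWindowKappa(window_size=100)
for true, pred, count in [("+", "+", 75), ("+", "-", 8), ("-", "+", 7), ("-", "-", 10)]:
    for _ in range(count):
        meter.add(true, pred)
example_kappa = meter.kappa()
```

The per-call recount is linear in the window size; an implementation that needs constant-time updates would maintain the counts incrementally instead.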
Data Stream Mining Methods
Multinomial Naïve Bayes
Considers a document as a bag-of-words.
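Estimates the probability P(w|c) of observing word w given class c, and the prior probability P(c)
Probability of class c given a test document d, where n_{wd} is the number of times word w occurs in d:
\[ P(c \mid d) = \frac{P(c) \prod_{w \in d} P(w \mid c)^{n_{wd}}}{P(d)} \]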
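The formula above lends itself to an incremental implementation, because only word and class counts have to be stored. The following is a minimal sketch, not the MOA or Weka code: the class name is hypothetical, Laplace smoothing with `alpha` is an assumption of this example, and words never seen in training are simply ignored. Scores are computed in log space, and P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Incremental multinomial Naive Bayes over bag-of-words documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumption)
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def learn(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def log_scores(self, words):
        """log P(c) + sum over words of n_wd * log P(w|c), for every class c."""
        n_docs = sum(self.class_counts.values())
        doc = Counter(words)
        scores = {}
        for c, n_c in self.class_counts.items():
            total = sum(self.word_counts[c].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(n_c / n_docs)
            for w, n_wd in doc.items():
                if w not in self.vocab:
                    continue  # unseen words carry no evidence in this sketch
                p_w_c = (self.word_counts[c][w] + self.alpha) / denom
                score += n_wd * math.log(p_w_c)
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get) if scores else None


nb = MultinomialNB()
nb.learn(["good", "happy"], "+")
nb.learn(["bad", "sad"], "-")
nb.learn(["good", "day"], "+")
```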
Stochastic Gradient Descent
Vanilla stochastic gradient descent with a fixed learning rate
Optimizes the hinge loss with an L2 penalty, as commonly applied to SVMs
Loss function to optimize:
\[ \frac{\lambda}{2} \|w\|^2 + \sum [1 - (y\,xw + b)]_+ \]
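One update step for this objective can be sketched as below. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the class name, default learning rate, and default λ are hypothetical, labels are assumed to be in {-1, +1}, the bias is left unregularized, and the margin is written in the standard hinge form y(x·w + b), which groups the bias slightly differently from the expression on the slide.

```python
class HingeSGD:
    """Linear classifier trained by SGD on hinge loss with an L2 penalty."""

    def __init__(self, n_features, learning_rate=0.1, lam=0.0001):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate   # fixed learning rate
        self.lam = lam            # weight of the L2 penalty

    def margin(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def learn(self, x, y):
        """One SGD step on a single example; y must be -1 or +1."""
        violated = y * self.margin(x) < 1
        # Gradient of (lam / 2) * ||w||^2 is lam * w: shrink the weights.
        self.w = [wi * (1.0 - self.lr * self.lam) for wi in self.w]
        if violated:
            # Subgradient of the hinge term is -y * x (and -y for the bias).
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y


model = HingeSGD(n_features=2)
for _ in range(20):
    model.learn([1.0, 0.0], +1)
    model.learn([0.0, 1.0], -1)
```

After these passes over the two toy examples, the first feature has a positive weight and the second a negative one.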
Hoeffding Tree
Incremental decision tree for data streams.
Strategy based on the Hoeffding bound
\[ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]
A node is expanded by splitting as soon as there is sufficient statistical evidence
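To make "sufficient statistical evidence" concrete, the sketch below evaluates the bound and applies the usual Hoeffding tree test: split when the gap between the best and second-best attribute exceeds ε. The function names and example numbers are illustrative assumptions. R is the range of the split criterion (1 for information gain with two classes), δ is the allowed error probability, and n is the number of examples seen at the node.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split once the observed gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)


# With R = 1, delta = 1e-7 and 1000 examples, epsilon is roughly 0.09.
eps = hoeffding_bound(1.0, 1e-7, 1000)
```

Because ε shrinks as n grows, a node that cannot separate two close attributes yet may still split later, once more examples have arrived.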
What is MOA?
Massive Online Analysis (MOA) is a framework for mining data streams.
Based on experience with Weka and VFML
Focussed on classification trees, but lots of active development: clustering, item set and sequence mining, regression
Easy to extend
Easy to design and run experiments
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
Weka, but also extinct.
Twitter Sentiment Corpora
Twitter Sentiment Corpus
twittersentiment.appspot.com
Alec Go, Richa Bhayani, Karthik Raghunathan, and Lei Huang
Website to research the sentiment for a brand, product, or topic.
Training dataset with messages between April 2009 and June 25, 2009
800,000 tweets with positive emoticons
800,000 tweets with negative emoticons
Test dataset manually annotated
177 negative tweets
182 positive ones
Edinburgh Corpus
http://demeter.inf.ed.ac.uk
Sasa Petrovic, Miles Osborne, and Victor Lavrenko
97 million tweets (14 GB)
Each tweet contains the timestamp of the tweet, an anonymized user name, the tweet’s text, and the posting method that was used
Collected between November 11th 2009 and February 1st 2010, using Twitter’s streaming API.
Twitter Empirical Evaluation
[Figure: Accuracy and Kappa Statistic on twittersentiment corpus. Two panels, Sliding Window Prequential Accuracy (y-axis: Accuracy %) and Sliding Window Kappa Statistic (y-axis: Kappa Statistic), both plotted against Millions of Instances (0.01 to 1.55). Series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution.]
[Figure: Accuracy and Kappa Statistic on Edinburgh corpus. Two panels, Sliding Window Prequential Accuracy (y-axis: Accuracy %) and Sliding Window Kappa Statistic (y-axis: Kappa Statistic), both plotted against Millions of Instances (0.01 to 2.08). Series: NB Multinomial, SGD, Hoeffding Tree, Class Distribution.]
twittersentiment Corpus
Prequential Accuracy and Kappa
                          Accuracy   Kappa    Time
Multinomial Naïve Bayes   75.05%     50.10%   116.62 sec.
SGD                       82.80%     62.60%   219.54 sec.
Hoeffding Tree            73.11%     46.23%   5525.51 sec.
Total prequential accuracy and Kappa measured on the twittersentiment data stream
Edinburgh Corpus
Prequential Accuracy and Kappa
                          Accuracy   Kappa    Time
Multinomial Naïve Bayes   86.11%     36.15%   173.28 sec.
SGD                       86.26%     31.88%   293.98 sec.
Hoeffding Tree            84.76%     20.40%   6151.51 sec.
Total prequential accuracy and Kappa obtained on the Edinburgh corpus data stream.
SGD coefficient variations on the Edinburgh corpus
            Middle of Stream   End of Stream
Tags        Coefficient        Coefficient     Variation
apple        0.3                0.7             0.4
microsoft   -0.4               -0.1             0.3
facebook    -0.3                0.4             0.7
mcdonalds    0.5                0.1            -0.4
google       0.3                0.6             0.3
disney       0.0                0.0             0.0
bmw          0.0               -0.2            -0.2
pepsi        0.1               -0.6            -0.7
dell         0.2                0.0            -0.2
gucci       -0.4                0.6             1.0
amazon      -0.1               -0.4            -0.3
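The Variation column is the end-of-stream coefficient minus the mid-stream coefficient. A short sketch of that arithmetic, using the values from the table (the variable names are introduced only for this example):

```python
# tag -> (coefficient at middle of stream, coefficient at end of stream)
coefficients = {
    "apple": (0.3, 0.7),
    "microsoft": (-0.4, -0.1),
    "facebook": (-0.3, 0.4),
    "mcdonalds": (0.5, 0.1),
    "google": (0.3, 0.6),
    "disney": (0.0, 0.0),
    "bmw": (0.0, -0.2),
    "pepsi": (0.1, -0.6),
    "dell": (0.2, 0.0),
    "gucci": (-0.4, 0.6),
    "amazon": (-0.1, -0.4),
}

# Variation = end - middle, rounded to the table's one-decimal precision.
variation = {tag: round(end - mid, 1) for tag, (mid, end) in coefficients.items()}

# Tags whose coefficient moved the most, in either direction.
largest_shifts = sorted(variation, key=lambda tag: abs(variation[tag]), reverse=True)[:3]
```

Ranking by absolute variation puts gucci first, followed by facebook and pepsi, which moved by the same amount in opposite directions.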
Summary
Twitter is a new “what’s-happening-right-now” tool
Twitter as a stream mining dataset for real-time predictions
Sliding window Kappa statistic
Recommend SGD-based model
twittersentiment Corpus
Hold-out Accuracy and Kappa
                          Accuracy   Kappa
Multinomial Naïve Bayes   82.45%     64.89%
SGD                       78.55%     57.23%
Hoeffding Tree            69.36%     38.73%
Accuracy and Kappa for the test dataset obtained from twittersentiment
Edinburgh Corpus
Hold-out Accuracy and Kappa
                          Accuracy   Kappa
Multinomial Naïve Bayes   73.81%     47.28%
SGD                       67.41%     34.23%
Hoeffding Tree            60.72%     20.59%
Accuracy and Kappa for the test dataset obtained from twittersentiment using the Edinburgh corpus as training data stream.

 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 

Último (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 

Sentiment Knowledge Discovery in Twitter Streaming Data

  • 1. Sentiment Knowledge Discovery in Twitter Streaming Data. Albert Bifet and Eibe Frank, University of Waikato, Hamilton, New Zealand. Discovery Science 2010, Canberra, 7 October 2010.
  • 2. Twitter: A Massive Data Stream
      • Web 2.0 micro-blogging service
      • Built to discover what is happening at any moment in time, anywhere in the world.
      • 106 million registered users
      • 600 million search queries per day
      • 3 billion requests a day via its API.
  • 3. Outline: (1) Twitter Streaming Data; (2) Twitter Sentiment Classification: Metrics and Methods; (3) Empirical results
  • 4. Outline: (1) Twitter Streaming Data; (2) Twitter Sentiment Classification: Metrics and Methods; (3) Empirical results
  • 5. Data stream classification cycle
      1. Process an example at a time, and inspect it only once (at most)
      2. Use a limited amount of memory
      3. Work in a limited amount of time
      4. Be ready to predict at any point
  • 6. Data stream classification cycle: evaluation procedures for data streams
      • Holdout
      • Interleaved Test-Then-Train ("Prequential" evaluation)
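The interleaved test-then-train procedure can be sketched in a few lines of Python. This is a minimal illustration, not the MOA implementation: `MajorityClassLearner` and `prequential_accuracy` are hypothetical names, and the toy learner only stands in for any model that offers `predict` and `learn`.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Ready to predict at any point, even before any training example.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Memory grows with the number of classes only; examples are not stored.
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it; each example is seen once."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0
```

Because every example is used for testing before the model has trained on it, no separate holdout set is needed and the estimate follows the stream as it evolves.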
  • 7. Twitter Streaming API
      • Twitter APIs: Streaming API; two discrete REST APIs
      • Real-time access to Tweets: sampled form; filtered form
      • HTTP based: GET, POST, DELETE
  • 8. Sentiment Analysis on Twitter
      • Sentiment analysis: classifying messages into two categories depending on whether they convey positive or negative feelings
      • Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification

      Table: List of positive and negative emoticons.
      Positive Emoticons   Negative Emoticons
      :)                   :(
      :-)                  :-(
      : )                  : (
      :D
      =)
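A minimal sketch of how the emoticon table could be turned into class labels. The function name `emoticon_label` is hypothetical, and skipping tweets that contain both kinds of emoticon (or none) is an assumption made here, not something the slide states.

```python
# Emoticon lists copied from the table above.
POSITIVE_EMOTICONS = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", ": (")


def emoticon_label(text):
    """Return 'positive', 'negative', or None when the tweet gives no clear label."""
    has_positive = any(e in text for e in POSITIVE_EMOTICONS)
    has_negative = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_positive == has_negative:
        # Either no emoticon at all or conflicting emoticons: leave unlabeled.
        return None
    return "positive" if has_positive else "negative"
```

For example, `emoticon_label("great talk :)")` yields `"positive"`, while a tweet without any listed emoticon yields `None` and would be dropped from the training stream under this assumption.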
  • 9. Outline: (1) Twitter Streaming Data; (2) Twitter Sentiment Classification: Metrics and Methods; (3) Empirical results
  • 10. Streaming Data Evaluation with Unbalanced Classes

      Table: Simple confusion matrix example
                        Predicted Class+   Predicted Class-   Total
      Correct Class+    75                 8                  83
      Correct Class-    7                  10                 17
      Total             82                 18                 100

      Table: Confusion matrix for chance predictor
                        Predicted Class+   Predicted Class-   Total
      Correct Class+    68.06              14.94              83
      Correct Class-    13.94              3.06               17
      Total             82                 18                 100
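The chance-predictor table follows from the row and column totals of the first table: each cell is (row total × column total) / n, for example 83 × 82 / 100 = 68.06. A short sketch, with `chance_matrix` as a hypothetical helper name:

```python
def chance_matrix(matrix):
    """Expected confusion matrix of a chance predictor with the same marginals.

    Rows are correct classes, columns are predicted classes.
    """
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Under independence, each cell is row_total * col_total / n.
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[75, 8], [7, 10]]
expected = chance_matrix(observed)
```

The diagonal of `expected` gives the number of examples a chance classifier would get right, which is what the Kappa statistic uses as its baseline.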
  • 11. Streaming Data Evaluation with Unbalanced Classes: Kappa Statistic
      • $p_0$: classifier’s prequential accuracy
      • $p_c$: probability that a chance classifier makes a correct prediction
      • κ statistic: $\kappa = \frac{p_0 - p_c}{1 - p_c}$
      • κ = 1 if the classifier is always correct
      • κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier
      • Forgetting mechanism for estimating prequential kappa: sliding window of size w with the most recent observations
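A sketch of a sliding-window Kappa estimator, under stated assumptions: the class name `SlidingWindowKappa` is hypothetical, the statistic is recomputed from the window on each call (simple rather than efficient), and returning 0 when $p_c = 1$ is a convention chosen here to avoid division by zero.

```python
from collections import Counter, deque


class SlidingWindowKappa:
    """Kappa statistic over the w most recent (true, predicted) label pairs."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest pair automatically (forgetting).
        self.window = deque(maxlen=window_size)

    def add(self, true_label, predicted_label):
        self.window.append((true_label, predicted_label))

    def kappa(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        true_counts = Counter(t for t, _ in self.window)
        pred_counts = Counter(p for _, p in self.window)
        # p0: observed accuracy inside the window.
        p0 = sum(1 for t, p in self.window if t == p) / n
        # pc: accuracy of a chance predictor with the same label marginals.
        pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
        if pc == 1.0:
            return 0.0  # assumption: degenerate single-class window
        return (p0 - pc) / (1.0 - pc)
```

Fed with the 100 examples of the simple confusion matrix above, this gives $p_0 = 0.85$ and $p_c = 0.7112$, so κ is about 0.48 even though accuracy is 85%.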
  • 12. Data Stream Mining Methods: Multinomial Naïve Bayes
      • Considers a document as a bag-of-words.
      • Estimates $P(w|c)$, the probability of observing word w given class c, and the prior probability $P(c)$.
      • Probability of class c given a test document d: $P(c|d) = \frac{P(c) \prod_{w \in d} P(w|c)^{n_{wd}}}{P(d)}$, where $n_{wd}$ is the number of times word w occurs in document d.
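An incremental multinomial Naïve Bayes can be sketched as follows. This is an illustration only: the class name is hypothetical, Laplace (add-one) smoothing with one extra vocabulary slot for unseen words is an assumption, and scores are computed in log space, dropping the constant $P(d)$ since it does not affect the arg max.

```python
import math
from collections import Counter, defaultdict


class IncrementalMultinomialNB:
    """Bag-of-words multinomial Naive Bayes updated one document at a time."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.class_words = Counter()             # total word count per class
        self.vocab = set()

    def learn(self, words, label):
        self.class_docs[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.class_words[label] += 1
            self.vocab.add(w)

    def predict(self, words):
        if not self.class_docs:
            return None
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab) + 1  # assumption: one extra slot for unseen words
        doc_counts = Counter(words)  # n_wd for each word w in the document
        best_label, best_score = None, -math.inf
        for c, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # log P(c)
            for w, n_wd in doc_counts.items():
                p_w_c = (self.word_counts[c][w] + 1) / (self.class_words[c] + v)
                score += n_wd * math.log(p_w_c)    # n_wd * log P(w|c)
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

Only counts are stored, so each update costs time proportional to the length of the tweet, which fits the stream setting.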
  • 13. Data Stream Mining Methods: Stochastic Gradient Descent
      • Vanilla stochastic gradient descent with a fixed learning rate
      • Optimizing the hinge loss with an L2 penalty, as commonly applied to SVMs
      • Loss function to optimize: $\frac{\lambda}{2} \|w\|^2 + \sum [1 - (y x w + b)]_+$
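A minimal sketch of per-example SGD on an L2-regularized hinge loss with a fixed learning rate. Several things here are assumptions for illustration: the class name `HingeSGD`, the default values of `learning_rate` and `lam`, sparse features passed as a dict, labels in {+1, -1}, and the conventional margin $y(w \cdot x + b)$.

```python
class HingeSGD:
    """Linear model trained one example at a time on hinge loss + L2 penalty."""

    def __init__(self, learning_rate=0.1, lam=0.0001):
        self.lr = learning_rate  # fixed learning rate (hypothetical default)
        self.lam = lam           # regularization strength (hypothetical default)
        self.w = {}              # sparse weight vector: feature -> weight
        self.b = 0.0

    def decision(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items()) + self.b

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def learn(self, x, y):
        margin = y * self.decision(x)
        # Gradient step on the L2 penalty: shrink every weight slightly.
        shrink = 1.0 - self.lr * self.lam
        for f in self.w:
            self.w[f] *= shrink
        # Hinge loss has a non-zero subgradient only when the margin is below 1.
        if margin < 1.0:
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) + self.lr * y * v
            self.b += self.lr * y
```

After training, the sign and size of each entry of `w` indicate how strongly a word pushes the prediction towards the positive or negative class, which is what makes coefficient tracking over the stream possible.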
  • 14. Data Stream Mining Methods: Hoeffding Tree
      • Incremental decision tree for data streams.
      • Strategy based on the Hoeffding bound: $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
      • A node is expanded by splitting as soon as there is sufficient statistical evidence
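The bound can be evaluated directly. In the sketch below, `hoeffding_bound` computes ε from the formula above; `should_split` shows the usual way the bound is applied in Hoeffding trees (split when the best attribute beats the second best by more than ε), which the slide does not spell out, so treat it as an assumption. Both function names are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Assumed criterion: the gap between the two best split scores exceeds epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

Since ε shrinks as n grows, a leaf that cannot yet separate its two best attributes simply waits for more examples before splitting.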
  • 15. Outline: (1) Twitter Streaming Data; (2) Twitter Sentiment Classification: Metrics and Methods; (3) Empirical results
  • 16. What is MOA?
      • Massive Online Analysis (MOA) is a framework for mining data streams.
      • Based on experience with Weka and VFML
      • Focussed on classification trees, but lots of active development: clustering, item set and sequence mining, regression
      • Easy to extend
      • Easy to design and run experiments
  • 17. MOA: the bird. The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
  • 18. Twitter Sentiment Corpora: Twitter Sentiment Corpus
      • twittersentiment.appspot.com
      • Alec Go, Richa Bhayani, Karthik Raghunathan, and Lei Huang
      • Website to research the sentiment for a brand, product, or topic.
      • Training dataset with messages between April 2009 and June 25, 2009: 800,000 tweets with positive emoticons and 800,000 tweets with negative emoticons
      • Test dataset, manually annotated: 177 negative tweets and 182 positive ones
  • 19. Twitter Sentiment Corpora: Edinburgh Corpus
      • http://demeter.inf.ed.ac.uk
      • Sasa Petrovic, Miles Osborne, and Victor Lavrenko
      • 97 million tweets (14 GB)
      • Each tweet contains the timestamp of the tweet, an anonymized user name, the tweet’s text, and the posting method that was used
      • Collected between November 11th 2009 and February 1st 2010, using Twitter’s streaming API.
  • 20. Twitter Empirical Evaluation. [Figure: Sliding Window Prequential Accuracy. Accuracy (%) versus millions of instances for NB Multinomial, SGD, Hoeffding Tree, and the class distribution.] Figure: Accuracy and Kappa Statistic on twittersentiment corpus
  • 21. Twitter Empirical Evaluation. [Figure: Sliding Window Kappa Statistic. Kappa statistic versus millions of instances for NB Multinomial, SGD, Hoeffding Tree, and the class distribution.] Figure: Accuracy and Kappa Statistic on twittersentiment corpus
  • 22. Twitter Empirical Evaluation. [Figure: Sliding Window Prequential Accuracy. Accuracy (%) versus millions of instances for NB Multinomial, SGD, Hoeffding Tree, and the class distribution.] Figure: Accuracy and Kappa Statistic on Edinburgh corpus
  • 23. Twitter Empirical Evaluation. [Figure: Sliding Window Kappa Statistic. Kappa statistic versus millions of instances for NB Multinomial, SGD, Hoeffding Tree, and the class distribution.] Figure: Accuracy and Kappa Statistic on Edinburgh corpus
  • 24. twittersentiment Corpus: Prequential Accuracy and Kappa

                                 Accuracy   Kappa     Time
      Multinomial Naïve Bayes    75.05%     50.10%    116.62 sec.
      SGD                        82.80%     62.60%    219.54 sec.
      Hoeffding Tree             73.11%     46.23%    5525.51 sec.

      Total prequential accuracy and Kappa measured on the twittersentiment data stream.
  • 25. Edinburgh Corpus: Prequential Accuracy and Kappa

                                 Accuracy   Kappa     Time
      Multinomial Naïve Bayes    86.11%     36.15%    173.28 sec.
      SGD                        86.26%     31.88%    293.98 sec.
      Hoeffding Tree             84.76%     20.40%    6151.51 sec.

      Total prequential accuracy and Kappa obtained on the Edinburgh corpus data stream.
  • 26. SGD coefficient variations on the Edinburgh corpus

      Tags         Middle of Stream Coefficient   End of Stream Coefficient   Variation
      apple         0.3                            0.7                         0.4
      microsoft    -0.4                           -0.1                         0.3
      facebook     -0.3                            0.4                         0.7
      mcdonalds     0.5                            0.1                        -0.4
      google        0.3                            0.6                         0.3
      disney        0.0                            0.0                         0.0
      bmw           0.0                           -0.2                        -0.2
      pepsi         0.1                           -0.6                        -0.7
      dell          0.2                            0.0                        -0.2
      gucci        -0.4                            0.6                         1.0
      amazon       -0.1                           -0.4                        -0.3
  • 27. Summary
      • Twitter is a new “what’s-happening-right-now” tool
      • Twitter as a stream mining dataset for real-time predictions
      • Sliding window Kappa statistic
      • Recommend SGD-based model
  • 28. twittersentiment Corpus: Hold-out Accuracy and Kappa

                                 Accuracy   Kappa
      Multinomial Naïve Bayes    82.45%     64.89%
      SGD                        78.55%     57.23%
      Hoeffding Tree             69.36%     38.73%

      Accuracy and Kappa for the test dataset obtained from twittersentiment.
  • 29. Edinburgh Corpus: Hold-out Accuracy and Kappa

                                 Accuracy   Kappa
      Multinomial Naïve Bayes    73.81%     47.28%
      SGD                        67.41%     34.23%
      Hoeffding Tree             60.72%     20.59%

      Accuracy and Kappa for the test dataset obtained from twittersentiment using the Edinburgh corpus as training data stream.