Automatic term extraction of dynamically updated text collections for sentiment classification into three classes
Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems (IIS)
Applied problems that can be solved with sentiment classification
• analysis of consumer reviews of commercial products for businesses;
• recommender systems;
• human-machine interfaces that adapt a computer system's behavior to the person's current emotional state;
• psychological and medical diagnosis;
• safety control by analyzing the behavior of mass gatherings;
• assistance in carrying out investigative measures.
Most common sentiment analysis approaches
• Supervised machine learning
• Dictionaries and rules
• Combined method
Existing corpora
• Review corpora that contain user ratings
• Each review corpus belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
• News corpora (few emotional texts)
Filtration
The following were removed from the corpus:
• texts containing both positive and negative emotions;
• uninformative tweets (fewer than 40 characters long);
• copied texts and retweets.
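The three filters above can be sketched roughly as follows. This is only an illustration: the slides do not say how mixed emotions or retweets were detected, so the emoticon lists, the "RT " prefix check, and the names `keep_tweet` and `filter_tweets` are all assumptions.

```python
# Assumed emotion markers; the slides do not list the ones actually used.
POSITIVE_MARKS = (":)", ":-)", ":D")
NEGATIVE_MARKS = (":(", ":-(")


def keep_tweet(text, seen):
    """Return True if the tweet passes all three filters."""
    stripped = text.strip()
    # Uninformative tweets: fewer than 40 characters.
    if len(stripped) < 40:
        return False
    # Retweets: assumed to start with the conventional "RT " marker.
    if stripped.startswith("RT "):
        return False
    # Texts with both positive and negative emotion markers.
    has_positive = any(mark in stripped for mark in POSITIVE_MARKS)
    has_negative = any(mark in stripped for mark in NEGATIVE_MARKS)
    if has_positive and has_negative:
        return False
    # Copied texts: exact duplicates, ignoring case.
    key = stripped.lower()
    if key in seen:
        return False
    seen.add(key)
    return True


def filter_tweets(tweets):
    """Apply the filters in order, keeping the first copy of each text."""
    seen = set()
    return [tweet for tweet in tweets if keep_tweet(tweet, seen)]
```

Exact-match deduplication is the simplest choice; near-duplicate detection would need a similarity measure that the slides do not describe.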
Corpus of short texts consists of 
114 991 – positive texts 
111 923 – negative texts 
107 990 – neutral texts
Corpus of short texts
Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239
Unique term distribution as a function of the number of tweets
[Figure: number of unique terms (y-axis, 0 to 400 000) plotted against the number of texts (x-axis, 53 to 282 350)]
Uniformity of the collections used
Word frequency distribution
Most common approaches to N-gram extraction
• Manually, using a thesaurus.
• Term extraction based on the significance of a term for a collection.
Data set characteristics
• The entire data set is known
• The entire data set is available
• The entire data set is static (it cannot change during calculation)
When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N²).
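One way to see why every weight needs recalibration is to recompute inverse document frequencies before and after a single document arrives. The sketch below assumes the textbook form idf = log(N / df), which may differ from the exact formula behind the slides; `idf_table` and `changed_terms` are hypothetical helper names.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def changed_terms(docs, new_doc):
    """Return the previously known terms whose IDF changes once new_doc is added."""
    before = idf_table(docs)
    after = idf_table(docs + [new_doc])
    return {term for term in before if not math.isclose(before[term], after[term])}


docs = [["good", "film"], ["bad", "film"], ["good", "day"]]
# The new document shares only "day" with the old ones, yet N changes for every term.
print(changed_terms(docs, ["nice", "day"]))
```

Because N appears in every term's weight, a full recomputation after each of N arriving documents costs on the order of N passes over a growing collection, which is where the quadratic bound comes from.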
Human speech is constantly 
changing => there is a need to 
update emotional dictionaries
Change in vocabulary and topics discussed
Percentage of references to the Olympic theme among all posts
[Figure: bar chart; February 12%, August 0.50%]
Change in vocabulary and topics discussed
Percentage of references to the vacation theme among all posts
[Figure: bar chart; February 0.06%, August 0.12%]
Change in vocabulary and topics discussed
Percentage of posts that use the term "Sebyashka" (Russian for "selfie") among all posts
[Figure: bar chart; February 0.00%, August 0.02%]
Filtration
The following were removed from the texts:
• punctuation: commas, colons, and quotation marks (exclamation marks, question marks, and ellipses were retained);
• references to significant personalities and events;
• proper names;
• numerals.
In addition:
• all links were replaced with the word "Link" and treated as a single token;
• runs of multiple dots were replaced with an ellipsis.
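A minimal sketch of the purely textual part of this filtration is shown below. The regular expressions and the function name `normalize` are assumptions made for illustration; removing proper names and references to personalities or events would need a name list or a named-entity recognizer, which is not shown.

```python
import re

# Assumed patterns; the slides do not give the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
DOTS_RE = re.compile(r"\.{2,}")
NUMERAL_RE = re.compile(r"\d+(?:[.,]\d+)*")
DROP_PUNCT_RE = re.compile(r"[,:\"«»“”]")


def normalize(text):
    """Apply the link, ellipsis, numeral, and punctuation rules in that order."""
    text = URL_RE.sub("Link", text)        # links become the single token "Link"
    text = DOTS_RE.sub("…", text)          # two or more dots become one ellipsis
    text = NUMERAL_RE.sub(" ", text)       # numerals are dropped
    text = DROP_PUNCT_RE.sub(" ", text)    # commas, colons, quotation marks are dropped
    return " ".join(text.split())          # collapse the leftover whitespace


print(normalize('Wow... see http://example.com/a1 now, it costs 100 rubles: "great" deal!'))
```

Links are replaced first so that the colon and dots inside a URL are not touched by the later rules. Exclamation marks, question marks, and ellipses pass through unchanged.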
TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
TF-IDF
tf – the frequency of the term's occurrence in the collection (positive or negative tweets),
T – the total number of messages in the collections,
T(t_i) – the number of messages in the positive and negative collections that contain the term
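The formulas themselves are not preserved in this transcript, only the symbol definitions. The sketch below therefore assumes two commonly used forms, tf · log(T / T(t_i)) for TF-IDF and tf · log(1 + C / cf) for TF-ICF; the author's exact formulas may differ, and the function names are hypothetical. Messages are represented as lists of tokens.

```python
import math


def term_frequency(term, collection):
    """tf: how many times the term occurs in one collection of messages."""
    return sum(message.count(term) for message in collection)


def tf_idf(term, collection, all_messages):
    """Assumed form: tf * log(T / T(t_i))."""
    containing = sum(1 for message in all_messages if term in message)
    if containing == 0:
        return 0.0
    return term_frequency(term, collection) * math.log(len(all_messages) / containing)


def tf_icf(term, collection, categories):
    """Assumed form: tf * log(1 + C / cf)."""
    cf = sum(
        1 for messages in categories.values()
        if any(term in message for message in messages)
    )
    if cf == 0:
        return 0.0
    return term_frequency(term, collection) * math.log(1 + len(categories) / cf)


categories = {
    "positive": [["good", "film"], ["good", "day"]],
    "negative": [["bad", "film"]],
    "neutral": [["film", "today"]],
}
all_messages = [m for messages in categories.values() for m in messages]
for term in ("good", "film"):
    print(term,
          round(tf_idf(term, categories["positive"], all_messages), 3),
          round(tf_icf(term, categories["positive"], categories), 3))
```

Under these assumed forms, the ICF factor of a known term changes only when the term first appears in a category where it was absent, whereas the IDF factor changes for every term whenever T grows.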
Experiments
Corpus of News texts consists of 
46 339 – positive news 
46 337 – negative news 
46 340 – neutral news
The ROMIP mixed collection consists of
reviews of books, movies, and digital cameras from blogs:
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
Short text collection
              TF-IDF         TF-ICF
Accuracy, %   95.5981        95.0664
Precision     0.958092631    0.953112184
Recall        0.955204837    0.94984672
F-measure     0.956646554    0.95147665

News collection
              TF-IDF         TF-ICF
Accuracy, %   69.8619        58.1397
Precision     0.709246342    0.61278022
Recall        0.698624505    0.581402868
F-measure     0.703895355    0.596679322

ROMIP collection
              TF-IDF         TF-ICF
Accuracy, %   53.9773        57.9545
Precision     0.561341047    0.558902611
Recall        0.5311636      0.535790598
F-measure     0.545835539    0.547102625
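The F-measure rows appear to be the harmonic mean of the corresponding precision and recall rows. The short sketch below recomputes them from the reported precision and recall so the two can be compared side by side; the function name `f_measure` and the `REPORTED` table layout are choices made here, not part of the original slides.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the tables above.
REPORTED = {
    ("Short texts", "TF-IDF"): (0.958092631, 0.955204837, 0.956646554),
    ("Short texts", "TF-ICF"): (0.953112184, 0.94984672, 0.95147665),
    ("News", "TF-IDF"): (0.709246342, 0.698624505, 0.703895355),
    ("News", "TF-ICF"): (0.61278022, 0.581402868, 0.596679322),
    ("ROMIP", "TF-IDF"): (0.561341047, 0.5311636, 0.545835539),
    ("ROMIP", "TF-ICF"): (0.558902611, 0.535790598, 0.547102625),
}

for (collection, scheme), (p, r, reported) in REPORTED.items():
    print(f"{collection:11s} {scheme}: recomputed {f_measure(p, r):.6f}, reported {reported:.6f}")
```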
Results
Experimental results in terms of F-measure (%)
              TF-IDF   TF-ICF
Short texts   95.66    95.15
News          70.39    59.68
ROMIP         54.58    54.71
The program module makes it possible to
• dynamically update the unigram dictionary and recalculate term weights depending on the collection a term belongs to;
• take lexical changes in speech over time into account;
• investigate new terms entering the active vocabulary.
Thank you! 
Presentation: http://www.slideshare.net/mokoron 
Yuliya Rubtsova 
yu.rubtsova@gmail.com 
study.mokoron.com

Editor's notes

  1. The unique term counts show that when the document set is small, the number of unique terms keeps climbing as the number of documents increases. However, this growth slows sharply as the number of documents becomes very large. This observation indicates that if the document collection is sufficiently large, adding more documents can be expected to introduce very few new words.
  2. References to significant personalities and events – the attitude towards them may vary over time, but a classifier trained on "old texts" will not be able to adapt quickly;