3. Twitter as a tool for scientific communication in Spain
Relevant network: large user base, free content generation and real-time information.
Advantage: immediacy | Disadvantage: saturation
• It has enormous potential and is beginning to play a leading role, but it must be used
efficiently.
• Twitter is the network most used by science journalists.
• Science communicators increasingly use digital technology and social networks.
• The first details of a scientific or technical scoop are now made public on Twitter.
• Opinions expressed on Twitter are directly linked to national and international scientific news.
4. RQ1 - Can we analyze part of the public data available on the social
network Twitter to learn about attitudes, opinions and sentiments towards
the science communication topics that are shared?
5. Objectives
Main Objective:
Develop and evaluate a classifier for the analysis of sentiment of messages on scientific topics,
in Spanish and in real time, on the social network Twitter using machine learning techniques.
Secondary Objectives:
1. Creation of a specific corpus of texts classified by positive or negative sentiment.
2. Development of a prototype for the analysis of sentiment of scientific messages on Twitter
in real time.
3. Test the prototype.
6. Expected Results
Corpus of texts of scientific topics in Spanish,
labeled with positive or negative sentiment.
Prototype "OPSCIENCE" Spanish version
8. Machine Learning
• Selection
• Preprocessing
• Transformation
• Modeling
• Interpretation
• Evaluation
Data Mining
Patterns in large volumes of data.
Machine Learning
• Supervised: establishes a correspondence between the desired inputs and outputs of
the system.
Natural Language Processing (NLP)
It uses algorithms and statistics to understand, learn and reproduce human language.
• Probabilistic models based on data
Sentiment Analysis
Computational study of sentiments expressed through texts.
• Polarity: positive or negative
9. The goal of supervised machine learning is to create a function that, once trained,
is able to predict the value of a new input element. Here, that function is the
sentiment classifier.
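To make the idea of "training produces a predicting function" concrete, here is a minimal sketch of a word-count Naive Bayes classifier written in plain Python 3. It is not the OPScience code: the `train` helper, the toy Spanish examples and the add-one smoothing are assumptions chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Return a predict(tokens) function learned from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of training documents
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(doc_counts.values())

    def predict(tokens):
        best_label, best_score = None, float("-inf")
        for label in doc_counts:
            # Log prior plus log likelihood of each token (add-one smoothing).
            score = math.log(doc_counts[label] / total_docs)
            total_words = sum(word_counts[label].values())
            for tok in tokens:
                likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict

# Hypothetical toy corpus, already tokenized and labeled by hand.
training = [
    (["gran", "avance", "cientifico"], "pos"),
    (["excelente", "descubrimiento"], "pos"),
    (["fraude", "cientifico"], "neg"),
    (["terrible", "error"], "neg"),
]
predict = train(training)
print(predict(["gran", "descubrimiento"]))
print(predict(["terrible", "fraude"]))
```

The function returned by `train` plays the role of the trained classifier: it maps an unseen token list to one of the labels seen during training.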
10. OPScience classifier
It analyzes the tone of scientific tweets locally and in real time:
- Uses freely available resources such as Python (version 2.7) and the Twitter Application
Programming Interface (API), both REST and Streaming.
- Based on the NLTK and scikit-learn libraries for Python.
- Trains a supervised machine learning model with 6 classification algorithms (Original Naive
Bayes, Naive Bayes for multinomial models, Naive Bayes for multivariate Bernoulli
models, Logistic Regression, Linear Support Vector Classification and Linear classifiers with
stochastic gradient descent -SGD- training).
12. STEP 1:
Creation of a corpus of scientific texts in Spanish
which will serve to train a machine learning model.
STEP 2:
Supervised machine learning model
trained with 6 classification algorithms
STEP 3:
Real-time classifier test
Connecting to the Twitter streaming API
13. STEP 1. Creation of a corpus of scientific texts in Spanish
1.1 Acquisition of the Data
• Downloading data from Twitter
• Creating an app
• Data obtained
• Script for data download
Characteristics of the total dataset
Language: Spanish
Tweets downloaded via the Streaming API: 171,459
Tweets downloaded via the REST API: 37,292
Total tweets downloaded: 208,751
14. STEP 1. Creation of a corpus of scientific texts in Spanish
1.2 Preprocessing of the data:
• Store the tweets as CSV text.
• UTF / ANSI formats
• Spanish language
• Texts in lowercase
• Retweets
• Removal of possible duplicates with R
• Tokenization
• Other preprocessing
• Manual classification of the sentiment of the text
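Several of the preprocessing steps listed above can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the slides do not say how retweets were handled (here they are dropped when they start with "rt @"), the duplicate removal was done in R in the original work, and the `preprocess` function and its regular expression are hypothetical.

```python
import re

def preprocess(tweets):
    """Lowercase, drop retweets and exact duplicates, and tokenize tweet texts."""
    seen = set()
    cleaned = []
    for text in tweets:
        text = text.strip().lower()
        # Assumption: a retweet is recognised by the conventional "rt @" prefix.
        if text.startswith("rt @"):
            continue
        # Skip texts already seen (exact duplicates after lowercasing).
        if text in seen:
            continue
        seen.add(text)
        # Keep words, hashtags and mentions; \w is Unicode-aware in Python 3,
        # so accented Spanish characters stay inside their tokens.
        cleaned.append(re.findall(r"[#@]?\w+", text))
    return cleaned

sample = [
    "La NASA confirma agua en Marte #ciencia",
    "RT @usuario: La NASA confirma agua en Marte #ciencia",
    "la nasa confirma agua en marte #ciencia",
    "Nuevo estudio sobre el clima",
]
result = preprocess(sample)
for tokens in result:
    print(tokens)
```

The manual sentiment labeling step is not automated here; each surviving token list would still need a positive or negative label assigned by a person.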
15. STEP 1. Creation of a corpus of scientific texts in Spanish
Corpus of texts:
10,000 elements
• 5,000 messages labeled as positive
• 5,000 messages labeled as negative
16. STEP 2. Supervised machine learning model
Learning: The classifier is trained with the corpus of positive and negative scientific
tweets in Spanish, split into 70% for training and 30% for test validation.
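A 70/30 split like the one described can be sketched as follows in Python 3. The `split_corpus` helper, the fixed random seed and the placeholder corpus are assumptions for illustration; the slides do not say whether the corpus was shuffled or how the split was drawn.

```python
import random

def split_corpus(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the corpus and split it into training and test parts."""
    shuffled = list(items)
    # A fixed seed (an assumption) keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder corpus with the same size and class balance as the slides report.
corpus = [("tweet %d" % i, "pos" if i % 2 == 0 else "neg") for i in range(10000)]
train_set, test_set = split_corpus(corpus)
print(len(train_set), len(test_set))
```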
6 Algorithms used:
– Original Naive Bayes,
– Naive Bayes for multinomial models,
– Naive Bayes for multivariate Bernoulli models,
– Logistic Regression,
– Linear Support Vector Classification (SVC) and
– Linear classifiers with stochastic gradient descent -SGD- training.
Combination of classification algorithms: voting by feature intervals.
A voting system is created in which each algorithm has one vote, and the class with
the most votes is the one chosen.
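The one-vote-per-algorithm scheme can be sketched in Python 3 as below. The `vote` function and the stand-in classifiers are hypothetical, and treating the share of agreeing votes as a confidence score is an assumption; the slides do not define how confidence is computed.

```python
from collections import Counter

def vote(classifiers, features):
    """Return the majority label and the share of classifiers that chose it."""
    votes = [clf(features) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: confidence is the fraction of votes for the winning label.
    confidence = count / len(votes)
    return label, confidence

# Six stand-in classifiers; real ones would be the trained models.
classifiers = [lambda f: "pos"] * 5 + [lambda f: "neg"]
label, confidence = vote(classifiers, ["gran", "avance"])
print(label, round(confidence, 2))
```

With six voters a 3-3 tie is possible; in this sketch `Counter.most_common` then returns the label that was encountered first, so a real system would need an explicit tie-breaking rule.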
17. STEP 3. Real-time classifier test
Validation of the Model
• Using these predictive models, the classifier connects to the Twitter data stream
in real time (through the available Streaming API),
• filters tweets written in Spanish about science by keyword or hashtag and predicts
the sentiment of each tweet generated,
• and automatically plots, with the Matplotlib library, those classified with high
confidence (> 0.80).
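The filtering logic of this step can be sketched in Python 3 without touching the network. Everything here is a stand-in: the stream is a plain list rather than the Twitter Streaming API, `fake_classify` is a hypothetical replacement for the trained voting classifier, and plotting is left out.

```python
def confident_predictions(stream, keywords, classify, threshold=0.80):
    """Yield (text, label, confidence) for keyword matches above the threshold."""
    keywords = [k.lower() for k in keywords]
    for text in stream:
        lowered = text.lower()
        # Keep only tweets that mention at least one keyword or hashtag.
        if not any(k in lowered for k in keywords):
            continue
        label, confidence = classify(text)
        # Keep only predictions above the 0.80 threshold given on the slide.
        if confidence > threshold:
            yield text, label, confidence

def fake_classify(text):
    """Hypothetical classifier returning a (label, confidence) pair."""
    return ("pos", 0.83) if "avance" in text.lower() else ("neg", 0.5)

stream = [
    "Gran avance en #ciencia",
    "Partido de fútbol esta noche",
    "Dudas sobre la ciencia del clima",
]
hits = list(confident_predictions(stream, ["ciencia"], fake_classify))
print(hits)
```

Writing the filter as a generator means it could consume an open-ended stream one tweet at a time, which matches the real-time use described.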
19. Classifier Results
Accuracy = correct predictions / total predictions
Typical accuracy for this type of model: around 70%.
Example: the TASS project reports around 72% (Cumbreras et al., 2016).
Algorithm                        Accuracy (%)
Original Naive Bayes             72.64
MNB_classifier                   72.24
BernoulliNB_classifier           72.80
LogisticRegression_classifier    71.88
LinearSVC_classifier             70.45
SGDClassifier                    71.15
20. Combination of classifiers
voted_classifier: Accuracy 72.31 %
Confusion Matrix
                 Predicted
                 Pos       Neg
Actual   Pos     TP        FN
         Neg     FP        TN

                 Predicted
                 Pos       Neg
Actual   Pos     <1158>    342
         Neg     465       <1047>
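As a worked check of the accuracy formula, the counts in the matrix can be plugged into a short Python 3 script. The reading of the layout is an assumption: rows are taken as actual classes, columns as predicted classes, and the bracketed values as the correctly classified diagonal. Precision and recall are added for illustration and are not reported on the slides.

```python
# Counts as transcribed from the confusion matrix (layout is an assumption).
tp, fn = 1158, 342   # actual positive: predicted positive / predicted negative
fp, tn = 465, 1047   # actual negative: predicted positive / predicted negative

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # correct predictions / total predictions
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that are found

print("total tweets evaluated:", total)
print("accuracy:  %.4f" % accuracy)
print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
```

Under this reading the counts give an accuracy of roughly 0.732, close to but not identical to the 72.31% reported for the voted classifier, so the transcribed matrix may not correspond exactly to that run.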
22. Conclusions
• Microblogging and Twitter as a communication tool of Science.
• Preparation of a specific corpus of scientific texts in Spanish
• Training of a model: algorithms and parameters used.
• Evaluation of the results obtained: accuracy of 72%.
• Test in real time.
23. Future lines of research
• This study can support the strategies of scientific communication.
• Test and study of individual results of the classification algorithms.
• Enlargement of the corpus and labeling with more classes: positive,
negative and neutral to include the informative messages.
• Measurement of the models at the end of each preprocessing phase, in
order to assess their relative importance.
• Real-time, large-scale studies with distributed computing.
24. Future lines of research
Continue with
RQ1 - Can we analyze part of the public data available on the social network Twitter to
learn about attitudes, opinions and sentiments towards the science communication topics that
are shared?
and move towards the prediction of future trends in science topics.
25. Patricia Sánchez-Holgado
Carlos Arcila-Calderón