Recent years have seen an increasing attention to social aspects of software engineering, including studies of emotions and sentiments experienced and expressed by the software developers. Most of these studies reuse existing sentiment analysis tools such as SentiStrength and NLTK. However, these tools have been trained on product reviews and movie reviews and, therefore, their results might not be applicable in the software engineering domain.
In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and STACK OVERFLOW questions) and different
sentiment analysis tools and observe that the disagreement between the tools can lead to contradictory conclusions.
Post Quantum Cryptography – The Impact on Identity
Sentiment analysis tools for software engineering research cannot be used out of the box
1. On sentiment analysis tools for
software engineering research
Robbert Jongeling Subhajit Datta Alexander Serebrenik
Eindhoven U of
Technology (NL)
Singapore U of Technology
and Design (SG)
Eindhoven U of
Technology (NL)
@jongeling_r @datta_subhajit @aserebrenik
2. E. Guzman, D. Azócar, and Y. Li,
“Sentiment analysis of commit
comments in GitHub: An empirical
study,” MSR 2014
A.-I. Rousinopoulos, G. Robles, and
J. M. González-Barahona, “Sentiment
analysis of Free/Open Source
developers: preliminary findings from
a case study,” Revista Eletrônica de
Sistemas de Informação, 2014
E. Guzman and B. Bruegge, “Towards
emotional awareness in software
development teams,” in Joint Meeting on
Foundations of Software Engineering, 2013
D. Pletea, B. Vasilescu, and A. Serebrenik,
“Security and emotion: Sentiment analysis
of security discussions on GitHub”, MSR
2014
M. Ortu, B. Adams, G. Destefanis, P. Tourani,
M. Marchesi, and R. Tonelli, “Are bullies
more productive? empirical study of
affectiveness vs. issue fixing time,” in MSR
2015
D. Garcia, M. S. Zanetti, and F. Schweitzer,
“The role of emotions in contributors
activity: A case study on the Gentoo
community,” in International Conference on
Cloud and Green Computing, 2013
3. E. Guzman, D. Azócar, and Y. Li,
“Sentiment analysis of commit
comments in GitHub: An empirical
study,” MSR 2014
A.-I. Rousinopoulos, G. Robles, and
J. M. González-Barahona, “Sentiment
analysis of Free/Open Source
developers: preliminary findings from
a case study,” Revista Eletrônica de
Sistemas de Informação, 2014
E. Guzman and B. Bruegge, “Towards
emotional awareness in software
development teams,” in Joint Meeting on
Foundations of Software Engineering, 2013
D. Pletea, B. Vasilescu, and A. Serebrenik,
“Security and emotion: Sentiment analysis
of security discussions on GitHub”, MSR
2014
M. Ortu, B. Adams, G. Destefanis, P. Tourani,
M. Marchesi, and R. Tonelli, “Are bullies
more productive? empirical study of
affectiveness vs. issue fixing time,” in MSR
2015
D. Garcia, M. S. Zanetti, and F. Schweitzer,
“The role of emotions in contributors
activity: A case study on the Gentoo
community,” in International Conference on
Cloud and Green Computing, 2013
NLTK SentiStrength
4. E. Guzman, D. Azócar, and Y. Li,
“Sentiment analysis of commit
comments in GitHub: An empirical
study,” MSR 2014
A.-I. Rousinopoulos, G. Robles, and
J. M. González-Barahona, “Sentiment
analysis of Free/Open Source
developers: preliminary findings from
a case study,” Revista Eletrônica de
Sistemas de Informação, 2014
E. Guzman and B. Bruegge, “Towards
emotional awareness in software
development teams,” in Joint Meeting on
Foundations of Software Engineering, 2013
D. Pletea, B. Vasilescu, and A. Serebrenik,
“Security and emotion: Sentiment analysis
of security discussions on GitHub”, MSR
2014
M. Ortu, B. Adams, G. Destefanis, P. Tourani,
M. Marchesi, and R. Tonelli, “Are bullies
more productive? empirical study of
affectiveness vs. issue fixing time,” in MSR
2015
D. Garcia, M. S. Zanetti, and F. Schweitzer,
“The role of emotions in contributors
activity: A case study on the Gentoo
community,” in International Conference on
Cloud and Green Computing, 2013
NLTK SentiStrength
Trained on movie/product reviews.
Threat: might misidentify (or fail to identify) a
sentiment in a software engineering artefact
5. • RQ1: To what extent do different sentiment analysis
tools agree with emotions of software developers?
• RQ2: To what extent do different sentiment analysis
tools agree with each other?
• RQ3: Do different sentiment analysis tools lead to
contradictory results in a software engineering
study?
6. Murgia et al.
MSR 2014
392 comments x 4 evaluators
joy love surprise anger fearsadness
positive negative
{
{
RQ1
RQ2
7. Murgia et al.
MSR 2014
392 comments x 4 evaluators
joy love surprise anger fearsadness
positive negative
{
{
Consistent:
positive: 3 positive, none negative
negative: 3 negative, none positive
neutral: ≥3 without emotion indication
Alchemy
Stanford NLP
NLTK
SentiStrength
RQ1
Manual
neg neu pos
Tool
neg
neu
pos
RQ2
Tool A
neg neu pos
Tool
B
neg
neu
pos
RQ1
RQ2
8. Murgia et al.
MSR 2014
392 comments x 4 evaluators
joy love surprise anger fearsadness
positive negative
{
{
Consistent:
positive: 3 positive, none negative
negative: 3 negative, none positive
neutral: ≥3 without emotion indication
Alchemy
Stanford NLP
NLTK
SentiStrength
RQ1
Manual
neg neu pos
Tool
neg
neu
pos
54
24
217
0 ≤ Adjusted Rand Index ≤ 1
[Santos, Embrechts, ICANN 2009]
RQ2
Tool A
neg neu pos
Tool
B
neg
neu
pos
RQ1
RQ2
9. Murgia et al.
MSR 2014
392 comments x 4 evaluators
joy love surprise anger fearsadness
positive negative
{
{
Consistent:
positive: 3 positive, none negative
negative: 3 negative, none positive
neutral: ≥3 without emotion indication
Alchemy
Stanford NLP
NLTK
SentiStrength
RQ1
Manual
neg neu pos
Tool
neg
neu
pos
54
24
217
0 ≤ Adjusted Rand Index ≤ 1
[Santos, Embrechts, ICANN 2009]
RQ2
Tool A
neg neu pos
Tool
B
neg
neu
pos
RQ1
RQ2
10. RQ1: To what extent do different sentiment analysis tools
agree with emotions of software developers?
RQ1
Manual
neg neu pos
NLTK
neg 19 51 11
neu 0 138 7
pos 5 28 36
Tool ARI
NLTK 0.239
SentiStrength 0.113
Stanford NLP 0.108
Alchemy 0.079
Tools do not agree with manual evaluation
RQ1
RQ2
11. RQ2: To what extent do different sentiment analysis tools
agree with each other?
RQ2
SentiStrength
neg neu pos
NLTK
neg 17 39 25
neu 15 96 34
pos 6 20 43
Tool A Tool B ARI
NLTK Alchemy 0.104
NLTK SentiStrength 0.090
Tools do not agree with each other
RQ1
RQ2
13. issue tracker
over
text
response
time
Sentiment
Anal. Tool
compare times for
neg, neu, pos
issues/questions
q & a site
NLTK ∩
SentiStrength
issue tracker
over
text
response
time
Sentiment
Anal. Tool
compare times for
neg, neu, pos
issues/questions
q & a site
SentiStrength
RQ3
issue tracker
over
text
response
time
Sentiment
Analysis Tool
compare times for
neg, neu, pos
issues/questions
q & a site
NLTK
Are the results the same?
14. NLTK SentiStrength NLTK ∩ SentiStrength
ASF
descr
neg > neu*** neg > neu***
pos > neu*** pos > neu*** pos > neu***
pos > neg*** pos > neg***
ASF title
neg > neu**
pos > neu*** pos > neu**
pos > neg* pos > neg**
GNOME
descr
neg > neu*** neg > neu*** neg > neu***
pos > neu*** pos > neu*** pos > neu***
pos > neg***
neg > pos***
SO
descr
ø neg > pos* ø
RQ3 RQ3: Do different sentiment analysis tools lead to
contradictory results in a software engineering study?
Choice of the sentiment analysis tool affects results of the
software engineering study
15. Tools do not agree with manual evaluation
Tools do not agree with each other
Choice of the sentiment analysis tool affects results of the
software engineering study
Summary
Sentiment analysis tools are trained on movie/
product reviews.
Threat: might misidentify (or fail to identify) a
sentiment in a software engineering artefact
16. Next steps?
• Train sentiment analysis tools on software
engineering data
• Data of Murgia et al.: first step
• More and better-suited data is needed