1. Harnessing Twitter to Support Serendipitous Learning of Developers
Abhishek Sharma1, Yuan Tian1, Agus Sulistya1, David Lo1 and Aiko Fallas Yamashita2
1School of Information Systems, Singapore Management University
2Oslo and Akershus University, Norway
24th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER 2017)
4. Why Twitter for Learning
• Keeping up to date is a big challenge (Storey et al. TSE'16)
• Twitter is used by software developers to share important information (Tian et al. MSR'12)
• Twitter enables serendipitous (pleasant and undirected) learning for developers (Singer et al. ICSE'14)
8. Challenges
• Finding useful articles is not easy (Singer et al. ICSE'14)
• Developers need to
  – identify many relevant Twitter users to follow
  – sieve through a large number of tweets/URLs
• Too much information can make learning via Twitter an unpleasant experience
9. This Study
• Can we automatically extract popular and relevant URLs from Twitter for developers?
• In this work, we:
  – propose 14 features to characterize a URL
  – evaluate a supervised and an unsupervised approach to recommend URLs harvested from Twitter
11. Methodology (1): Collecting Seed Data
• Get a list of seed Twitter users
  (http://www.noop.nl/2009/02/twitter-top-100-for-softwaredevelopers.htm)
• Get a larger set of people who
  – follow (or are followed by) >= 5 seed users
  – results in 85,171 Twitter users
• Collect tweets generated by these users over a 1-month period (Nov '15)
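The seed-expansion step above can be sketched as a simple filter over follow relationships. This is an illustrative sketch, not the paper's actual pipeline; the user names, edge format, and `expand_from_seeds` helper are all made up for the example.

```python
# Hypothetical sketch: keep only users linked (in either direction)
# to at least 5 of the seed accounts.

def expand_from_seeds(seed_users, follow_edges, min_links=5):
    """follow_edges: iterable of (follower, followee) pairs."""
    seeds = set(seed_users)
    link_count = {}
    for follower, followee in follow_edges:
        # A link to a seed counts whether the user follows the seed
        # or the seed follows the user.
        if followee in seeds:
            link_count[follower] = link_count.get(follower, 0) + 1
        if follower in seeds:
            link_count[followee] = link_count.get(followee, 0) + 1
    return {u for u, n in link_count.items()
            if n >= min_links and u not in seeds}

edges = [("alice", "seed1"), ("alice", "seed2"), ("alice", "seed3"),
         ("alice", "seed4"), ("alice", "seed5"), ("bob", "seed1")]
print(expand_from_seeds({"seed1", "seed2", "seed3", "seed4", "seed5"}, edges))
# alice qualifies (5 links to seeds); bob does not
```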
24. Methodology (3): Feature Extraction
• Content
  – cosine similarity between keyword and
    • tweet text (CosSimT)
    • user profile text (CosSimP)
    • webpage text (CosSimW)
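The three content features above share one computation: cosine similarity between the keyword query and a piece of text. A minimal sketch using plain term-frequency vectors (the paper's exact term weighting may differ, and the example strings are invented):

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """Cosine similarity of two texts under simple term-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

keyword = "java concurrency"
tweet = "great article on java concurrency patterns"
print(round(cosine_sim(keyword, tweet), 3))  # CosSimT for this tweet
```

The same function applied to the user's profile text gives CosSimP, and applied to the linked webpage's text gives CosSimW.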
28. Methodology (3): Feature Extraction
• Network
  – estimate importance of users through
    • centrality scores
    • PageRank
• Popularity
  – number of times the tweets containing the URL were
    • retweeted
    • liked
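The PageRank feature above can be sketched with a small power iteration over the follow graph. This is a hedged illustration, not the paper's implementation; the graph, damping factor, and user names are invented for the example.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = {n for e in edges for n in e}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            targets = out[u] or list(nodes)  # dangling nodes spread evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

# "u follows v" edges: alice is followed by both bob and carol.
follows = [("bob", "alice"), ("carol", "alice"), ("alice", "carol")]
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))  # alice, the most-followed user
```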
33. Methodology (4): Labelling the URLs
• Labelled independently by
  – 2 persons, each with more than 4 years of professional programming experience in Java
  – one a PhD student and the other a Research Engineer
• Both persons sat together to resolve disagreements
• URLs assigned relevance scores from 0 to 3
36. Methodology (5): Recommendation
• Unsupervised – Borda Count
  – assigns ranking points for each feature score of a URL and then combines the scores
• Supervised – Learning to Rank
  – learns a ranking function based on the weighted sum of the features of a URL
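The unsupervised approach above can be sketched as classic Borda-count rank aggregation: each feature ranks all URLs, a URL at position *p* earns *n − 1 − p* points from that feature, and the points are summed. The feature names and scores below are invented for illustration.

```python
def borda_count(feature_scores):
    """feature_scores: {feature_name: {url: score}} — higher score is better.
    Returns URLs sorted by total Borda points, best first."""
    urls = list(next(iter(feature_scores.values())).keys())
    points = {u: 0 for u in urls}
    n = len(urls)
    for scores in feature_scores.values():
        ranked = sorted(urls, key=lambda u: scores[u], reverse=True)
        for position, url in enumerate(ranked):
            points[url] += n - 1 - position  # best URL gets n-1 points
    return sorted(points, key=points.get, reverse=True)

features = {
    "retweets":  {"url_a": 10,  "url_b": 3,   "url_c": 7},
    "likes":     {"url_a": 2,   "url_b": 9,   "url_c": 5},
    "page_rank": {"url_a": 0.4, "url_b": 0.1, "url_c": 0.3},
}
print(borda_count(features))  # aggregated ranking, best first
```

Because only ranks matter, Borda count needs no training data and is insensitive to the scale of the individual features, which is what makes it a natural unsupervised baseline.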
37. RQ1: Effectiveness of Our Approach
• NDCG (Normalized Discounted Cumulative Gain)
  – measures the capability to recommend highly relevant URLs at top ranks
  – scores range from 0 to 1; closer to 1 is better
• [Bar chart] NDCG score by recommendation approach: Supervised = 0.832, Unsupervised = 0.719
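NDCG divides the discounted cumulative gain of the predicted ranking by that of the ideal ranking. A minimal sketch, using one common formulation (linear gain, log2 discount); the relevance labels below reuse the 0-3 scale from the labelling step but are invented for the example:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(predicted_relevances):
    """NDCG of a ranking, given the relevance of each item in ranked order."""
    ideal = sorted(predicted_relevances, reverse=True)
    return dcg(predicted_relevances) / dcg(ideal) if dcg(ideal) else 0.0

# Relevance labels of the URLs in the order one system recommended them:
print(round(ndcg([3, 2, 3, 0, 1]), 3))
```

A perfect ordering scores exactly 1; swapping a highly relevant URL toward the bottom lowers the score, which is why NDCG rewards getting the best URLs into the top ranks.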
42. Threats to Validity
• Subjectivity in the labelling process
  – asked 2 persons to label independently
• Only 1 domain
  – evaluate more domains in future work
• Suitability of evaluation metric
  – used NDCG, which is a standard metric
47. Conclusion and Future Work
• Supervised and unsupervised approaches show promise in recommending URLs
• Future work:
  – automatically categorize the recommended URLs
  – build an automated system to recommend relevant URLs
48. Feedback/Advice
• What additional resources can we consider for mining URLs?
• How can we infer developer interests automatically?
Thank you!