Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

Lies, Damned Lies and Software Analytics:
Why Big Data Needs Thick Data
Margaret-Anne (Peggy) Storey
University of Victoria
@margaretstorey
Presented at UCI, Irvine, April 2016 and
ACM SIGSOFT Webinar, May 4th 2016

Acknowledgements:
Alexey Zagalsky, Daniel German,
Matthieu Foucault (Uvic)
Jacek Czerwonka, Brendan Murphy
(Microsoft Research)
http://www.slideshare.net/mastorey/lies-damned-lies-and-
software-analytics-why-big-data-needs-rich-data
@margaretstorey

My research…
Human and social aspects in software
engineering:
Software visualization
The social programmer and a participatory
culture in software engineering
Qualitative research and mixed methods in
software engineering

Dashboards for developers awareness:
Treude and Storey, “Awareness 2.0: staying aware of projects,
developers and tasks using dashboards and feeds,” ICSE 2010.

1968 1980 1990 2000 20101970
Developer tools…

How developers stay up to date using Twitter
How developers assess each other based
on their development and networking activity
How a crowd of developers document open
source API’s through Stackoverflow
How developers share tacit knowledge on
How developers coordinate which code is
committed and accepted through GitHub

1968 1980 1990 2000 20101970
Telephone
Face2Face
Project
Workbook
Documents
Email
Email Lists
VisualAge
Visual Studio
NetBeans Eclipse
IRC
ICQ Skype
SourceForge
Wikis
Trello
Basecamp
Jazz
Slack
Google
Hangouts
Punchcards TFS
Books Usenet
Stack
Overflow
Twitter
Google
Groups
Podcasts
Blogs
GitHub
Conferences
Societies LinkedIn
Facebook
Slashdot
HackerNews
Nondigital Digital Digital & Socially Enabled
Masterbranch
Coderwall
Meetups
Yammer

1968 1980 1990 2000 20101970
Telephone
Face2Face
Project
Workbook
Documents
Email
Email Lists
VisualAge
Visual Studio
NetBeans Eclipse
IRC
ICQ Skype
SourceForge
Wikis
Trello
Basecamp
Jazz
Slack
Google
Hangouts
Punchcards TFS
Books Usenet
Stack
Overflow
Twitter
Google
Groups
Podcasts
Blogs
GitHub
Conferences
Societies LinkedIn
Facebook
Slashdot
HackerNews
Nondigital Digital Digital & Socially Enabled
Masterbranch
Coderwall
Meetups
Yammer
Surveyed over
2,500 devs

Ecosystem of tools and activities

Learning
CodeHosting
Q&Asites
Websearch

Coordination
CodeHosting
Coordinationtools
Privatechat
Privatediscuss

FacetoFace
Connecting
Microblogging
Privatediscuss
FacetoFace
Codehosting

Social tools facilitate a participatory
development culture in software
engineering, with support for the social
creation and sharing of content, informal
mentorship, and awareness that
contributions matter to one another
Storey, M.-A., L. Singer, F. Figueira Filho, B. Cleary and A. Zagalsky,
The (R)evolutionary Role of Social Media in Software Engineering,
ICSE 2014 Future of Software Engineering.

How to study a participatory culture?

(Competing) concerns in
software engineering…
Code: faster, cheaper, more features,
more reliable/secure
Developers: more productive, more
skilled, happier, better connected
Organizations/communities:
attract/retain contributors, encourage a
participatory culture, increase value

https://www.flickr.com/photos/opensourceway/5755219017
Do the answers lie in here?

“The machine does not isolate us from the great problem
of nature but plunges us more deeply into them.”
Antoine de Saint Exupéry

Talk outline…
History of software analytics in software engineering
Risks of software analytics
Why big data needs thick data
Consider both researchers and practitioners….

Role of data science in
software engineering
Metrics (late 1960’s)
Mining software repositories (mid 2000’s)
Software analytics (early 2010’s)

The dawn of software metrics
“The realization came over me with full force
that a good part of the remainder of my life
was going to be spent in finding errors in my
own programs.” Maurice Wilkes, 1949
“If you can't measure it, you can't manage it”
Tom de Marco, 1982

Why use metrics?
To discover facts about the world
To steer our actions
To modify human behaviour
[DeMarco]
Used by individuals, teams, companies,
external organizations…

Software metrics
Product: KLOC, Complexity measures (cyclomatic
complexity, function points), OO metrics, #defects
Process metrics: Testing, code review,
deployment, agile practices (e.g., #sprints,
burndown rate)
Productivity: KLOC, Mean time to repair, #commits
Developer metrics: Skills, followers, biometrics
Estimation: cost metrics and models

Success in industry?
• Adoption at large, small companies (e.g., HP)
• Integrated in CASE tools
• Initial focus on product rather than process
• Initial poor use of metrics led to the
Goal Question Metric Approach [Basili et al.]

Lines of Code
§ Easy to calculate, to
understand, to visualize
§ Descriptive of the product,
and developer productivity
§ Correlates with complexity
measures and # of bugs

“Measuring programming
progress by lines of code is like
measuring aircraft building
progress by weight.”

Mining software repositories
“We have all this data, the problem is what to
do with it.” [A Software Engineering Researcher]
Mining Software Repositories (MSR) conference
series established in 2004
“Outcroppings of past human behaviour.”
[McGrath]

Data, data, everywhere…
Program data: runtime traces, program logs,
system events, failure logs, performance logs,
continuous deployment,…
User data: usage logs, user surveys, user forums,
A/B testing, Twitter, blogs, …
Development data: source code versions, bug
data, check-in information, test cases and results,
communication between developers, social media

Techniques
Association rules and frequency patterns
Classification
Clustering
Text mining/natural language processing
Searching and mining
Qualitative analysis
See papers from the Mining Software Repositories Conference!

Benefits of mining trace data
Low interference
Low reactivity
Records made by the participants
Data is easy to collect

“Only metric worth counting is defects”
[Demarco, 1997]
Why mine and measure information about bugs?
Personal discovery, evaluation by managers,
understand product status, predict reliability

Bug prediction
• Models to predict bugs show promise
(ownership, churn, tangled code changes)
• Poor replication across organizations!
• Poor actionability (practitioners know
which modules are buggy!)
• The secret life of bugs [Aranda et al.]

Data science movement…
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science

Goals of software analytics
Improve:
quality of the software
experience of the users
developer productivity
Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-
us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf

Data Science Spectrum
Past Present Future
Explore trends alerts forecasting
Analyze summarize compare what-if
Experiment model benchmark simulate
The Art and Science of Analyzing Software Data, by
Bird, Menzies, Zimmermann, Elsevier 2015.

Software Analytics and its role
in Automation
• Scaling to 1000’s of developers —
automation is required! [Jacek Czerwonka]
• Goal is to optimize competing concerns of
quality, time, resources
• Data Scientists manage and measure impacts
of automation and software analytics
[Kim et al., 2016]

Does increasing test code
coverage increase reliability?

No!
Wasting time testing simple code may increase the
presence of bugs!
[Mockus et al.]
Does increasing test code
coverage increase reliability?

Five Risks
1) Data and construct trustworthiness
2) Reliability of the results
3) Ethical concerns
4) Unintended and unexpected
consequences
5) Big data can’t answer big questions

Risk #1:
Trustworthiness of the data
Data representativeness
(construct validity)
Data completeness
Inaccuracies in profiles, exaggerations,
skewed opinions
Treating humans as “rational” animals
[Harper et al.]

Perils from using GitHub data:
A repository is not necessarily a (development) project
Most projects are inactive or have few commits
Most projects are for personal use only
Only 10% of projects use pull requests
History can be rewritten on GitHub
A lot happens outside of GitHub
The Promises and Perils of Mining GitHub, Eirini Kalliamvakou et al., MSR 2014.

Risk #2:
Trustworthiness of the results
Researcher bias [Shepperd et al., 2014]
Confusing correlations with cause and
effect
Big data and small effects [Marcus et al.]
Inappropriate generalization
Conclusion instability [Menzies et al.]

“all models are wrong, but some are useful”
[Box, 1976]
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

Risk #3:
Ethical concerns
Private, public, blurred spaces
Surveillance at the level of the individual
Opaque algorithms, opaque biases
[Tufecki, CSCW Keynote, 2015]

http://www.informationweek.com/big-data/big-data-analytics/data-
scientists-want-big-data-ethics-standards/d/d-id/1315798)

Risk #4:
Unexpected consequences
Negative side effects [Gender studies]
Gaming the gamification
Incentives? handle with care!

Assessing and watching developers
Singer, Filho, Cleary, Treude, Storey, Schneider. Mutual Assessment in the Social Programmer
Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.

Contributing graphs
considered harmful
https://github.com/isaacs/github/issues/627
http://www.hanselman.com/blog/GitHubActi
vityGuiltAndTheCodersFitBit.aspx

Most unwise questions!
Analyze This! 145 Questions for Data Scientists in Software
Engineering Andrew Begel and Thomas Zimmermann

Risk #5:
Big Data can’t answer Big
Questions
Or

Risk #5:
Big Data can’t answer Big
Questions
alone

Examples of big questions?
• What is a good architecture to solve problem x?
[Devanbu]
• What makes a really awesome programmer?
[Software managers]
• How to build a great development team? [Google]
• How is program knowledge distributed? [Naur]
• What is the ideal software engineering process?
[Facebook, Microsoft, IBM,…]
• What tools/practices support a participatory
development process? [Storey et al.]

Talk outline…
History of software analytics in software engineering
Risks of software analytics
Why big data needs thick data, and why thick data
needs big data!
Consider both researchers and practitioners….

Data scientists…
“Typically start with the data, rather
than starting with the problem.”
[Forbes]
“I love data” “I love patterns”
[Kim et al., ICSE 2016]
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/print/

John Snow’s theory about
cholera came from talking to
people [1850’s]

Danger zones…
http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science-
emma-uprichard/
“Most big data is social data –
the analytics need serious interrogation”
Social
Science+
“It doesn’t matter how much or how good our data is if the
approach to modelling social systems is backwards.”

What is “thick” data?
Researcher generated “thick” data
Explanations, motivations, recommendations
Questions rather than answers
Variables for a model
Future challenges
Limitations: Self reporting, researcher bias,
ambiguity in instruments and collected data

Beyond “Mixed Methods”:
Ethnomining
Combines the ethos of ethnography
interleaved with data mining techniques
around behavioral/social data
Storytelling (to support the numbers)
Leverages visualization within tight loops of
eliciting/reporting results
http://ethnographymatters.net/blog/2013/04/02/april-2013-
ethnomining-and-the-combination-of-qualitative-quantitative-data/

Research challenges ahead
Big data! (of trace and thick data!)
Rapid pace of change
(increased automation,
participatory culture)
Studying unstable objects [Rogers]
Poor boundaries of study contexts

Kevin Kelly, Futurist: “You’ll be paid in the future based
on how well you work with robots.”

Key Takeaway:
Big Data needs Thick Data

Future of data science in
software engineering?
Metrics (late 1960’s)
Mining software repositories (mid 2000’s)
Software analytics (early 2010’s)
Big Data meets Thick Data
@margaretstorey

References:
“Mad about Measurement”, De Marco,
http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0818676450.html
Van Solingen, Rini, et al. "Goal question metric (GQM) approach."
Encyclopedia of software engineering (2002).
The Emerging Role of Data Scientists on Software Development Team,
Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, ICSE
May 2016.
Analyze This! 145 Questions for Data Scientists in Software Engineering,
Andrew Begel and Thomas Zimmermann, ICSE June 2014.
Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-
us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
Rules of Data Science in SE, see www.slideshare.net/timmenzies/the-art-and-
science-of-analyzing-software-data
Audris Mockus, Nachiappan Nagappan, Trung T. Dinh-Trong, Test coverage
and post-verification defects: A multiple case study. ESEM 2009: 291-301
Shepperd, Martin, David Bowes, and Tracy Hall. "Researcher bias: The use of
machine learning in software defect prediction." Software Engineering, IEEE
Transactions on 40.6 (2014): 603-616.

M. Storey, The Evolution of the Social Programmer, Mining Software
Repositories (MSR) 2012 Keynote http://www.slideshare.net/mastorey/msr-
2012-keynote-storey-slideshare
M. Storey et al., The (R)evolution of Social Media in Software Engineering,
ICSE Future of Software Engineering 2014,
http://www.slideshare.net/mastorey/icse2014-fose-social-media
H. Jenkins, K. Clinton, R. Purushotma, A. J. Robison, and M. Weigel.
Confronting the challenges of participatory culture: Media education for the
21st century, 2006.
http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-
AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF
L. Singer, F. F. Filho, B. Cleary, C. Treude, M.-A. Storey, K. Schneider.
Mutual Assessment in the Social Programmer Ecosystem: An Empirical
Investigation of Developer Profile Aggregators
Treude, C., and M.-A. Storey, “Awareness 2.0: staying aware of projects,
developers and tasks using dashboards and feeds,” in ICSE’10: Proc. of
the 32nd ACM/IEEE Int. Conference on Software Engineering, ACM.
C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns
in Collaborative Software Development. In IEEE Transactions on Software
Engineering 38, 1 (January/February 2012). pp. 19-34

[Marcus2014] Gary Marcus and Ernest Davis, "Eight (No, Nine!) Problems with Big
Data", New York Times, April 6, 2014
[Harper2013] Richard Harper, Christian Bird, Thomas Zimmermann, and Brendan
Murphy"Dwelling in Software: Aspects of the felt-life of engineers in large software
projects", Proceedings of the 13th European Conference on Computer Supported
Cooperative Work (ECSCW '13), Springer, September 2013.
P. Naur and B. Randell. Software Engineering: Report of a Conference Sponsored
by the NATO Science Committee, Garmisch, Germany, Oct.1968. NATO
Mcgrath, E. "Methodology matters: Doing research in the behavioral and social
sciences." Readings in Human-Computer Interaction: Toward the Year 2000 (2nd
ed. 1995.
Aranda, Jorge, and Gina Venolia. "The secret life of bugs: Going past the errors and
omissions in software repositories." Proceedings of the 31st International
Conference on Software Engineering. IEEE Computer Society, 2009.
Ethno-Mining: Integrating Numbers and Words from the Ground Up:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-125.pdf
How Google builds a really development team, New York Times, 2016.
[Tufekci2015] Zeynep Tufekci, "Algorithms in our Midst: Information, Power
and Choice when Software is Everywhere", Proceedings of the 18th ACM
Conference on Computer Supported Cooperative Work & Social
Computing, pp.1918-1918, ACM 2015.

Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

Recommended

Recommended

More Related Content

What's hot

What's hot (9)

Viewers also liked

Viewers also liked (20)

Similar to Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

Similar to Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data (20)

More from Margaret-Anne Storey

More from Margaret-Anne Storey (8)

Recently uploaded

Recently uploaded (20)

Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data