Lies, Damned Lies and Software Analytics:
Why Big Data Needs Thick Data
Margaret-Anne (Peggy) Storey
University of Victoria
@margaretstorey
Presented at UCI, Irvine, April 2016 and
ACM SIGSOFT Webinar, May 4th 2016
Acknowledgements:
Alexey Zagalsky, Daniel German,
Matthieu Foucault (UVic)
Jacek Czerwonka, Brendan Murphy
(Microsoft Research)
http://www.slideshare.net/mastorey/lies-damned-lies-and-software-analytics-why-big-data-needs-rich-data
My research…
Human and social aspects in software
engineering:
Software visualization
The social programmer and a participatory
culture in software engineering
Qualitative research and mixed methods in
software engineering
Dashboards for developer awareness:
Treude and Storey, “Awareness 2.0: staying aware of projects,
developers and tasks using dashboards and feeds,” ICSE 2010.
1968 1970 1980 1990 2000 2010
Developer tools…
How developers stay up to date using Twitter
How developers assess each other based
on their development and networking activity
How a crowd of developers document open
source APIs through Stack Overflow
How developers share tacit knowledge on
How developers coordinate which code is
committed and accepted through GitHub
[Timeline figure, 1968 to 2010: channels and tools used by developers, grouped as Nondigital, Digital, and Digital & Socially Enabled]
Telephone, Face2Face, Project Workbook, Documents, Email, Email Lists,
VisualAge, Visual Studio, NetBeans, Eclipse, IRC, ICQ, Skype, SourceForge,
Wikis, Trello, Basecamp, Jazz, Slack, Google Hangouts, Punchcards, TFS,
Books, Usenet, Stack Overflow, Twitter, Google Groups, Podcasts, Blogs,
GitHub, Conferences, Societies, LinkedIn, Facebook, Slashdot, HackerNews,
Masterbranch, Coderwall, Meetups, Yammer
Surveyed over
2,500 devs
Ecosystem of tools and activities
Learning: code hosting, Q&A sites, web search
Coordination: code hosting, coordination tools, private chat, private discussion
Connecting: face to face, microblogging, private discussion, code hosting
Social tools facilitate a participatory
development culture in software
engineering, with support for the social
creation and sharing of content, informal
mentorship, and awareness that
contributions matter to one another
Storey, M.-A., L. Singer, F. Figueira Filho, B. Cleary and A. Zagalsky,
The (R)evolutionary Role of Social Media in Software Engineering,
ICSE 2014 Future of Software Engineering.
How to study a participatory culture?
(Competing) concerns in
software engineering…
Code: faster, cheaper, more features,
more reliable/secure
Developers: more productive, more
skilled, happier, better connected
Organizations/communities:
attract/retain contributors, encourage a
participatory culture, increase value
https://www.flickr.com/photos/opensourceway/5755219017
Do the answers lie in here?
“The machine does not isolate us from the great problems
of nature but plunges us more deeply into them.”
Antoine de Saint-Exupéry
Thick data…
Talk outline…
History of software analytics in software engineering
Risks of software analytics
Why big data needs thick data
Consider both researchers and practitioners….
Role of data science in
software engineering
Metrics (late 1960s)
Mining software repositories (mid 2000s)
Software analytics (early 2010s)
The dawn of software metrics
“The realization came over me with full force
that a good part of the remainder of my life
was going to be spent in finding errors in my
own programs.” Maurice Wilkes, 1949
“If you can't measure it, you can't manage it”
Tom DeMarco, 1982
Why use metrics?
To discover facts about the world
To steer our actions
To modify human behaviour
[DeMarco]
Used by individuals, teams, companies,
external organizations…
Software metrics
Product: KLOC, Complexity measures (cyclomatic
complexity, function points), OO metrics, #defects
Process metrics: Testing, code review,
deployment, agile practices (e.g., #sprints,
burndown rate)
Productivity: KLOC, Mean time to repair, #commits
Developer metrics: Skills, followers, biometrics
Estimation: cost metrics and models
Research success?
Success in industry?
• Adoption at large, small companies (e.g., HP)
• Integrated in CASE tools
• Initial focus on product rather than process
• Initial poor use of metrics led to the
Goal Question Metric Approach [Basili et al.]
Lines of Code
• Easy to calculate, to
understand, to visualize
• Descriptive of the product,
and developer productivity
• Correlates with complexity
measures and # of bugs
“Measuring programming
progress by lines of code is like
measuring aircraft building
progress by weight.”
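A minimal sketch of why lines of code are so easy to calculate, and why the count says little about progress. The function name `count_loc` and the two sample snippets are illustrative assumptions, not part of the talk; the counter only recognizes `#` line comments.

```python
def count_loc(source: str, comment_prefix: str = "#") -> dict:
    """Count total, blank, comment-only and code lines in a source string."""
    counts = {"total": 0, "blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        stripped = line.strip()
        counts["total"] += 1
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


# Two hypothetical implementations with identical behaviour.
terse = "\n".join([
    "def add(a, b):",
    "    return a + b",
])
verbose = "\n".join([
    "def add(a, b):",
    "    result = a",
    "    result = result + b",
    "    return result",
])

for label, snippet in (("terse", terse), ("verbose", verbose)):
    print(label, count_loc(snippet)["code"])
```

The verbose version scores twice the code lines of the terse one while delivering the same feature, which is the weight-of-the-aircraft problem in miniature.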
Mining software repositories
“We have all this data, the problem is what to
do with it.” [A Software Engineering Researcher]
Mining Software Repositories (MSR) conference
series established in 2004
“Outcroppings of past human behaviour.”
[McGrath]
Data, data, everywhere…
Program data: runtime traces, program logs,
system events, failure logs, performance logs,
continuous deployment,…
User data: usage logs, user surveys, user forums,
A/B testing, Twitter, blogs, …
Development data: source code versions, bug
data, check-in information, test cases and results,
communication between developers, social media
Techniques
Association rules and frequency patterns
Classification
Clustering
Text mining/natural language processing
Searching and mining
Qualitative analysis
See papers from the Mining Software Repositories Conference!
Benefits of mining trace data
Low interference
Low reactivity
Records made by the participants
Data is easy to collect
“Only metric worth counting is defects”
[DeMarco, 1997]
Why mine and measure information about bugs?
Personal discovery, evaluation by managers,
understand product status, predict reliability
Bug prediction
• Models to predict bugs show promise
(ownership, churn, tangled code changes)
• Poor replication across organizations!
• Poor actionability (practitioners know
which modules are buggy!)
• The secret life of bugs [Aranda et al.]
Data science movement…
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science
Goals of software analytics
Improve:
quality of the software
experience of the users
developer productivity
Dongmei Zhang & Tao Xie,
http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
Data Science Spectrum
            Past        Present     Future
Explore     trends      alerts      forecasting
Analyze     summarize   compare     what-if
Experiment  model       benchmark   simulate
The Art and Science of Analyzing Software Data, by
Bird, Menzies, Zimmermann, Elsevier 2015.
Software Analytics and its role
in Automation
• Scaling to 1000s of developers:
automation is required! [Jacek Czerwonka]
• Goal is to optimize competing concerns of
quality, time, resources
• Data Scientists manage and measure impacts
of automation and software analytics
[Kim et al., 2016]
Does increasing test code
coverage increase reliability?
No!
Wasting time testing simple code may increase the
presence of bugs!
[Mockus et al.]
Talk outline…
History of software analytics in software engineering
Risks of software analytics
Why big data needs thick data
Consider both researchers and practitioners….
Five Risks
1) Data and construct trustworthiness
2) Reliability of the results
3) Ethical concerns
4) Unintended and unexpected
consequences
5) Big data can’t answer big questions
Risk #1:
Trustworthiness of the data
Data representativeness
(construct validity)
Data completeness
Inaccuracies in profiles, exaggerations,
skewed opinions
Treating humans as “rational” animals
[Harper et al.]
Perils from using GitHub data:
A repository is not necessarily a (development) project
Most projects are inactive or have few commits
Most projects are for personal use only
Only 10% of projects use pull requests
History can be rewritten on GitHub
A lot happens outside of GitHub
The Promises and Perils of Mining GitHub, Eirini Kalliamvakou et al., MSR 2014.
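Some of these perils can be partly screened for before analysis. The sketch below is one hedged way to do that: the `Repo` record, its field names, and the thresholds are all hypothetical and are not taken from the paper or from any GitHub API.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical, pre-collected summary of one repository.
    name: str
    commits: int
    committers: int
    is_fork: bool


def looks_like_project(repo: Repo, min_commits: int = 50, min_committers: int = 2) -> bool:
    """Heuristic screen: drop forks, near-empty repositories and single-person repositories.

    The thresholds are arbitrary illustration values and should be justified per study.
    """
    return (
        not repo.is_fork
        and repo.commits >= min_commits
        and repo.committers >= min_committers
    )


repos = [
    Repo("active-lib", commits=420, committers=12, is_fork=False),
    Repo("dotfiles", commits=35, committers=1, is_fork=False),
    Repo("active-lib-fork", commits=421, committers=13, is_fork=True),
    Repo("homework-1", commits=3, committers=1, is_fork=False),
]

kept = [r for r in repos if looks_like_project(r)]
print([r.name for r in kept])
```

A filter like this cannot see rewritten history or work that happens outside GitHub; those perils need other sources of evidence, such as asking the developers.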
Risk #2:
Trustworthiness of the results
Researcher bias [Shepperd et al., 2014]
Confusing correlations with cause and
effect
Big data and small effects [Marcus et al.]
Inappropriate generalization
Conclusion instability [Menzies et al.]
“all models are wrong, but some are useful”
[Box, 1976]
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
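The "big data and small effects" risk can be illustrated with a short calculation. This is a sketch under stated assumptions: the correlation value and sample sizes are invented, and the p-value uses a normal approximation to the t distribution, which is reasonable for the sample sizes shown.

```python
import math


def correlation_t_and_p(r: float, n: int) -> tuple:
    """Return the t statistic for a Pearson correlation r observed on n samples,
    and a two-sided p-value from the normal approximation."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


r = 0.01  # hypothetical, very weak correlation
for n in (100, 10_000, 1_000_000):
    t, p = correlation_t_and_p(r, n)
    print(f"n={n:>9,}  r={r}  variance explained={r * r:.4%}  t={t:6.2f}  p~{p:.3g}")
```

The variance explained stays at 0.01% in every row; only the sample size changes. With enough data the same negligible effect becomes "statistically significant", which is why significance alone says nothing about whether a result matters.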
Risk #3:
Ethical concerns
Private, public, blurred spaces
Surveillance at the level of the individual
Opaque algorithms, opaque biases
[Tufekci, CSCW Keynote, 2015]
http://www.informationweek.com/big-data/big-data-analytics/data-scientists-want-big-data-ethics-standards/d/d-id/1315798
Risk #4:
Unexpected consequences
Negative side effects [Gender studies]
Gaming the gamification
Incentives? Handle with care!
Assessing and watching developers
Singer, Filho, Cleary, Treude, Storey, Schneider. Mutual Assessment in the Social Programmer
Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.
Contribution graphs
considered harmful
https://github.com/isaacs/github/issues/627
http://www.hanselman.com/blog/GitHubActivityGuiltAndTheCodersFitBit.aspx
Most unwise questions!
Analyze This! 145 Questions for Data Scientists in Software
Engineering, Andrew Begel and Thomas Zimmermann
Risk #5:
Big Data can’t answer Big
Questions alone
Examples of big questions?
• What is a good architecture to solve problem x?
[Devanbu]
• What makes a really awesome programmer?
[Software managers]
• How to build a great development team? [Google]
• How is program knowledge distributed? [Naur]
• What is the ideal software engineering process?
[Facebook, Microsoft, IBM,…]
• What tools/practices support a participatory
development process? [Storey et al.]
Talk outline…
History of software analytics in software engineering
Risks of software analytics
Why big data needs thick data, and why thick data
needs big data!
Consider both researchers and practitioners….
Data scientists…
“Typically start with the data, rather
than starting with the problem.”
[Forbes]
“I love data” “I love patterns”
[Kim et al., ICSE 2016]
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/print/
John Snow’s theory about
cholera came from talking to
people [1850s]
Danger zones…
http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science-emma-uprichard/
“Most big data is social data –
the analytics need serious interrogation”
Social
Science+
“It doesn’t matter how much or how good our data is if the
approach to modelling social systems is backwards.”
What is “thick” data?
Researcher generated “thick” data
Explanations, motivations, recommendations
Questions rather than answers
Variables for a model
Future challenges
Limitations: Self reporting, researcher bias,
ambiguity in instruments and collected data
Beyond “Mixed Methods”:
Ethnomining
Combines the ethos of ethnography
interleaved with data mining techniques
around behavioral/social data
Storytelling (to support the numbers)
Leverages visualization within tight loops of
eliciting/reporting results
http://ethnographymatters.net/blog/2013/04/02/april-2013-ethnomining-and-the-combination-of-qualitative-quantitative-data/
Tagging work items in
ConcernLines
Research challenges ahead
Big data! (of trace and thick data!)
Rapid pace of change
(increased automation,
participatory culture)
Studying unstable objects [Rogers]
Poor boundaries of study contexts
Kevin Kelly, Futurist: “You’ll be paid in the future based
on how well you work with robots.”
Key Takeaway:
Big Data needs Thick Data
Future of data science in
software engineering?
Metrics (late 1960s)
Mining software repositories (mid 2000s)
Software analytics (early 2010s)
Big Data meets Thick Data
@margaretstorey
References:
“Mad about Measurement”, DeMarco,
http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0818676450.html
Van Solingen, Rini, et al. "Goal question metric (GQM) approach."
Encyclopedia of software engineering (2002).
The Emerging Role of Data Scientists on Software Development Teams,
Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, ICSE
May 2016.
Analyze This! 145 Questions for Data Scientists in Software Engineering,
Andrew Begel and Thomas Zimmermann, ICSE June 2014.
Dongmei Zhang & Tao Xie,
http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
Rules of Data Science in SE, see
www.slideshare.net/timmenzies/the-art-and-science-of-analyzing-software-data
Audris Mockus, Nachiappan Nagappan, Trung T. Dinh-Trong, Test coverage
and post-verification defects: A multiple case study. ESEM 2009: 291-301
Shepperd, Martin, David Bowes, and Tracy Hall. "Researcher bias: The use of
machine learning in software defect prediction." Software Engineering, IEEE
Transactions on 40.6 (2014): 603-616.
M. Storey, The Evolution of the Social Programmer, Mining Software
Repositories (MSR) 2012 Keynote,
http://www.slideshare.net/mastorey/msr-2012-keynote-storey-slideshare
M. Storey et al., The (R)evolution of Social Media in Software Engineering,
ICSE Future of Software Engineering 2014,
http://www.slideshare.net/mastorey/icse2014-fose-social-media
H. Jenkins, K. Clinton, R. Purushotma, A. J. Robison, and M. Weigel.
Confronting the challenges of participatory culture: Media education for the
21st century, 2006.
http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF
L. Singer, F. F. Filho, B. Cleary, C. Treude, M.-A. Storey, K. Schneider.
Mutual Assessment in the Social Programmer Ecosystem: An Empirical
Investigation of Developer Profile Aggregators, CSCW 2013.
Treude, C., and M.-A. Storey, “Awareness 2.0: staying aware of projects,
developers and tasks using dashboards and feeds,” in ICSE’10: Proc. of
the 32nd ACM/IEEE Int. Conference on Software Engineering, ACM.
C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns
in Collaborative Software Development. In IEEE Transactions on Software
Engineering 38, 1 (January/February 2012). pp. 19-34
[Marcus2014] Gary Marcus and Ernest Davis, "Eight (No, Nine!) Problems with Big
Data", New York Times, April 6, 2014
[Harper2013] Richard Harper, Christian Bird, Thomas Zimmermann, and Brendan
Murphy, "Dwelling in Software: Aspects of the felt-life of engineers in large software
projects", Proceedings of the 13th European Conference on Computer Supported
Cooperative Work (ECSCW '13), Springer, September 2013.
P. Naur and B. Randell. Software Engineering: Report of a Conference Sponsored
by the NATO Science Committee, Garmisch, Germany, Oct. 1968. NATO
McGrath, J. E. "Methodology matters: Doing research in the behavioral and social
sciences." Readings in Human-Computer Interaction: Toward the Year 2000 (2nd
ed.), 1995.
Aranda, Jorge, and Gina Venolia. "The secret life of bugs: Going past the errors and
omissions in software repositories." Proceedings of the 31st International
Conference on Software Engineering. IEEE Computer Society, 2009.
Ethno-Mining: Integrating Numbers and Words from the Ground Up:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-125.pdf
How Google builds a really development team, New York Times, 2016.
[Tufekci2015] Zeynep Tufekci, "Algorithms in our Midst: Information, Power
and Choice when Software is Everywhere", Proceedings of the 18th ACM
Conference on Computer Supported Cooperative Work & Social
Computing, pp.1918-1918, ACM 2015.

More Related Content

What's hot

After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...Margaret-Anne Storey
 
The (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software EngineeringThe (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software EngineeringMargaret-Anne Storey
 
The Elusive Nature of Software Documentation
The Elusive Nature of Software DocumentationThe Elusive Nature of Software Documentation
The Elusive Nature of Software DocumentationMargaret-Anne Storey
 
How Developers Stay Current Using Twitter
How Developers Stay Current Using TwitterHow Developers Stay Current Using Twitter
How Developers Stay Current Using TwitterMargaret-Anne Storey
 
To Bot or Not: How Bots can Support Collaboration in Software Engineering (I...
To Bot or Not:  How Bots can Support Collaboration in Software Engineering (I...To Bot or Not:  How Bots can Support Collaboration in Software Engineering (I...
To Bot or Not: How Bots can Support Collaboration in Software Engineering (I...Margaret-Anne Storey
 
Data excellence: Better data for better AI
Data excellence: Better data for better AIData excellence: Better data for better AI
Data excellence: Better data for better AILora Aroyo
 
Automated metadata creation - Possibilities and pitfalls
Automated metadata creation - Possibilities and pitfallsAutomated metadata creation - Possibilities and pitfalls
Automated metadata creation - Possibilities and pitfallsNASIG
 
Presentation of the InVID tools for image forensics analysis
Presentation of the InVID tools for image forensics analysisPresentation of the InVID tools for image forensics analysis
Presentation of the InVID tools for image forensics analysisInVID Project
 

What's hot (9)

After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
 
The (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software EngineeringThe (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software Engineering
 
The Elusive Nature of Software Documentation
The Elusive Nature of Software DocumentationThe Elusive Nature of Software Documentation
The Elusive Nature of Software Documentation
 
How Developers Stay Current Using Twitter
How Developers Stay Current Using TwitterHow Developers Stay Current Using Twitter
How Developers Stay Current Using Twitter
 
To Bot or Not: How Bots can Support Collaboration in Software Engineering (I...
To Bot or Not:  How Bots can Support Collaboration in Software Engineering (I...To Bot or Not:  How Bots can Support Collaboration in Software Engineering (I...
To Bot or Not: How Bots can Support Collaboration in Software Engineering (I...
 
Data excellence: Better data for better AI
Data excellence: Better data for better AIData excellence: Better data for better AI
Data excellence: Better data for better AI
 
Lopez
LopezLopez
Lopez
 
Automated metadata creation - Possibilities and pitfalls
Automated metadata creation - Possibilities and pitfallsAutomated metadata creation - Possibilities and pitfalls
Automated metadata creation - Possibilities and pitfalls
 
Presentation of the InVID tools for image forensics analysis
Presentation of the InVID tools for image forensics analysisPresentation of the InVID tools for image forensics analysis
Presentation of the InVID tools for image forensics analysis
 

Viewers also liked

Benevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolutionBenevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolutionMargaret-Anne Storey
 
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data Canary in the Coalmine: How Social Media Can Prepare Us for Big Data
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data Susan Etlinger
 
Fun Facts about Big Data
Fun Facts about Big DataFun Facts about Big Data
Fun Facts about Big DataCrayon Data
 
Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Nicolas Bettenburg
 
ICSE 2011: Research industry panel
ICSE 2011: Research industry panelICSE 2011: Research industry panel
ICSE 2011: Research industry panelMargaret-Anne Storey
 
Big Data can be fun!
Big Data can be fun!Big Data can be fun!
Big Data can be fun!Bruno Aziza
 
Mining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better SoftwareMining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better SoftwareMarat Akhin
 
MSR 2009
MSR 2009MSR 2009
MSR 2009swy351
 
ICSME2014
ICSME2014ICSME2014
ICSME2014swy351
 
ICPE2015
ICPE2015ICPE2015
ICPE2015swy351
 
WCRE2011
WCRE2011WCRE2011
WCRE2011swy351
 
ICSE2013
ICSE2013ICSE2013
ICSE2013swy351
 
Msr2016 tarek
Msr2016 tarek Msr2016 tarek
Msr2016 tarek swy351
 
ICSE2014
ICSE2014ICSE2014
ICSE2014swy351
 
Mining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software RepositoriesMining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software RepositoriesMarco Aurelio Gerosa
 
MSR End of Internship Talk
MSR End of Internship TalkMSR End of Internship Talk
MSR End of Internship TalkRay Buse
 
ASE2010
ASE2010ASE2010
ASE2010swy351
 
Empirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchEmpirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchThomas Zimmermann
 
A Metric for Code Readability
A Metric for Code ReadabilityA Metric for Code Readability
A Metric for Code ReadabilityRay Buse
 

Viewers also liked (20)

Benevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolutionBenevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolution
 
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data Canary in the Coalmine: How Social Media Can Prepare Us for Big Data
Canary in the Coalmine: How Social Media Can Prepare Us for Big Data
 
Fun Facts about Big Data
Fun Facts about Big DataFun Facts about Big Data
Fun Facts about Big Data
 
Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...
 
ICSE 2011: Research industry panel
ICSE 2011: Research industry panelICSE 2011: Research industry panel
ICSE 2011: Research industry panel
 
Icpc 2011 storey
Icpc 2011 storeyIcpc 2011 storey
Icpc 2011 storey
 
Big Data can be fun!
Big Data can be fun!Big Data can be fun!
Big Data can be fun!
 
Mining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better SoftwareMining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better Software
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
 
ICSME2014
ICSME2014ICSME2014
ICSME2014
 
ICPE2015
ICPE2015ICPE2015
ICPE2015
 
WCRE2011
WCRE2011WCRE2011
WCRE2011
 
ICSE2013
ICSE2013ICSE2013
ICSE2013
 
Msr2016 tarek
Msr2016 tarek Msr2016 tarek
Msr2016 tarek
 
ICSE2014
ICSE2014ICSE2014
ICSE2014
 
Mining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software RepositoriesMining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software Repositories
 
MSR End of Internship Talk
MSR End of Internship TalkMSR End of Internship Talk
MSR End of Internship Talk
 
ASE2010
ASE2010ASE2010
ASE2010
 
Empirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchEmpirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft Research
 
A Metric for Code Readability
A Metric for Code ReadabilityA Metric for Code Readability
A Metric for Code Readability
 

Similar to Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)HPCC Systems
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest linkCS, NcState
 
AudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdfAudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdfTapajitDey1
 
Data_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdfData_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdfassadabbas22
 
Big Data meets Big Social: Social Machines and the Semantic Web
Big Data meets Big Social: Social Machines and the Semantic WebBig Data meets Big Social: Social Machines and the Semantic Web
Big Data meets Big Social: Social Machines and the Semantic WebDavid De Roure
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)Tao Xie
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October IssueJIMS Rohini Sector 5
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringTao Xie
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataCS, NcState
 
Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software MetadataTowards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadatadgarijo
 
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Chetan Khatri
 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docpraveena06
 
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining TechniquesImprovement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining Techniquesijdmtaiir
 
IOT-2016 7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016  7-9 Septermber, 2016, Stuttgart, GermanyIOT-2016  7-9 Septermber, 2016, Stuttgart, Germany
IOT-2016 7-9 Septermber, 2016, Stuttgart, GermanyCharith Perera
 
Black Box Learning Analytics? Beyond Algorithmic Transparency
Black Box Learning Analytics? Beyond Algorithmic TransparencyBlack Box Learning Analytics? Beyond Algorithmic Transparency
Black Box Learning Analytics? Beyond Algorithmic TransparencySimon Buckingham Shum
 
The materiality of code: Towards an understanding of socio-technical relations
The materiality of code: Towards an understanding of socio-technical relationsThe materiality of code: Towards an understanding of socio-technical relations
The materiality of code: Towards an understanding of socio-technical relationsAarhus University
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Softwaredgarijo
 
Better Software, Better Research
Better Software, Better ResearchBetter Software, Better Research
Better Software, Better ResearchCarole Goble
 
Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...Mining Software Repositories for Security: Data Quality Issues Lessons from T...
Mining Software Repositories for Security: Data Quality Issues Lessons from T...CREST @ University of Adelaide
 

Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

  • 1. Lies, Damned Lies and Software Analytics: Why Big Data Needs Thick Data Margaret-Anne (Peggy) Storey University of Victoria @margaretstorey Presented at UCI, Irvine, April 2016 and ACM SIGSOFT Webinar, May 4th 2016
  • 2. Acknowledgements: Alexey Zagalsky, Daniel German, Matthieu Foucault (UVic); Jacek Czerwonka, Brendan Murphy (Microsoft Research) http://www.slideshare.net/mastorey/lies-damned-lies-and-software-analytics-why-big-data-needs-rich-data @margaretstorey
  • 3. My research… Human and social aspects in software engineering: Software visualization The social programmer and a participatory culture in software engineering Qualitative research and mixed methods in software engineering
  • 4. Dashboards for developer awareness: Treude and Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” ICSE 2010.
  • 5. Developer tools… [timeline figure; axis: 1968, 1970, 1980, 1990, 2000, 2010]
  • 6. How developers stay up to date using Twitter. How developers assess each other based on their development and networking activity. How a crowd of developers document open source APIs through Stack Overflow. How developers share tacit knowledge on […]. How developers coordinate which code is committed and accepted through GitHub.
  • 7. [Timeline figure; axis: 1968, 1970, 1980, 1990, 2000, 2010; legend: Nondigital, Digital, Digital & Socially Enabled] Telephone, Face2Face, Project Workbook, Documents, Email, Email Lists, VisualAge, Visual Studio, NetBeans, Eclipse, IRC, ICQ, Skype, SourceForge, Wikis, Trello, Basecamp, Jazz, Slack, Google Hangouts, Punchcards, TFS, Books, Usenet, Stack Overflow, Twitter, Google Groups, Podcasts, Blogs, GitHub, Conferences, Societies, LinkedIn, Facebook, Slashdot, HackerNews, Masterbranch, Coderwall, Meetups, Yammer
  • 8. [Same timeline figure as slide 7, with the annotation:] Surveyed over 2,500 devs
  • 9. Ecosystem of tools and activities
  • 13. Social tools facilitate a participatory development culture in software engineering, with support for the social creation and sharing of content, informal mentorship, and awareness that contributions matter to one another. Storey, M.-A., L. Singer, F. Figueira Filho, B. Cleary and A. Zagalsky, The (R)evolution of Social Media in Software Engineering, ICSE 2014 Future of Software Engineering.
  • 14. How to study a participatory culture?
  • 15. (Competing) concerns in software engineering… Code: faster, cheaper, more features, more reliable/secure Developers: more productive, more skilled, happier, better connected Organizations/communities: attract/retain contributors, encourage a participatory culture, increase value
  • 17. “The machine does not isolate us from the great problems of nature but plunges us more deeply into them.” Antoine de Saint-Exupéry
  • 19. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data Consider both researchers and practitioners….
  • 21. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  • 23. The dawn of software metrics “The realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.” Maurice Wilkes, 1949 “If you can't measure it, you can't manage it” Tom DeMarco, 1982
  • 24. Why use metrics? To discover facts about the world To steer our actions To modify human behaviour [DeMarco] Used by individuals, teams, companies, external organizations…
  • 25. Software metrics Product: KLOC, Complexity measures (cyclomatic complexity, function points), OO metrics, #defects Process metrics: Testing, code review, deployment, agile practices (e.g., #sprints, burndown rate) Productivity: KLOC, Mean time to repair, #commits Developer metrics: Skills, followers, biometrics Estimation: cost metrics and models
  • 27. Success in industry? • Adoption at large, small companies (e.g., HP) • Integrated in CASE tools • Initial focus on product rather than process • Initial poor use of metrics led to the Goal Question Metric Approach [Basili et al.]
  • 28. Lines of Code • Easy to calculate, to understand, to visualize • Descriptive of the product, and developer productivity • Correlates with complexity measures and # of bugs
  • 29. “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”
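The metrics slides above name lines of code and cyclomatic complexity as product metrics, and warn against reading progress off size alone. A minimal sketch of how the two numbers can diverge, assuming Python source as input and a simplified McCabe count (decision points plus one); the function names and the sample snippet are hypothetical and not taken from the talk:

```python
import ast

# Node types treated as decision points. This choice is an assumption of the
# sketch; real metric tools differ in what they count.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def count_sloc(source: str) -> int:
    """Count non-blank lines that are not pure comment lines."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity for a whole snippet: decisions + 1."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two extra paths
            decisions += len(node.values) - 1
    return decisions + 1


SAMPLE = '''
def classify(n):
    # toy function used only for illustration
    if n < 0 and n % 2 == 0:
        return "negative even"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "other"
'''

# Same logic, padded with 20 straight-line assignments.
PADDED = SAMPLE + "\n\n".join(f"x{i} = {i}" for i in range(20))

if __name__ == "__main__":
    for label, src in (("sample", SAMPLE), ("padded", PADDED)):
        print(label, count_sloc(src), cyclomatic_complexity(src))
```

The padded version has many more lines but the same number of decision points, which is the kind of gap the aircraft-weight quote points at.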
  • 30. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  • 31. Mining software repositories “We have all this data, the problem is what to do with it.” [A Software Engineering Researcher] Mining Software Repositories (MSR) conference series established in 2004 “Outcroppings of past human behaviour.” [McGrath]
  • 32. Data, data, everywhere… Program data: runtime traces, program logs, system events, failure logs, performance logs, continuous deployment,… User data: usage logs, user surveys, user forums, A/B testing, Twitter, blogs, … Development data: source code versions, bug data, check-in information, test cases and results, communication between developers, social media
  • 33. Techniques Association rules and frequency patterns Classification Clustering Text mining/natural language processing Searching and mining Qualitative analysis See papers from the Mining Software Repositories Conference!
  • 34. Benefits of mining trace data Low interference Low reactivity Records made by the participants Data is easy to collect
  • 35. “Only metric worth counting is defects” [DeMarco, 1997] Why mine and measure information about bugs? Personal discovery, evaluation by managers, understand product status, predict reliability
  • 36. Bug prediction • Models to predict bugs show promise (ownership, churn, tangled code changes) • Poor replication across organizations! • Poor actionability (practitioners know which modules are buggy!) • The secret life of bugs [Aranda et al.]
  • 37. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  • 41. Goals of software analytics. Improve: quality of the software; experience of the users; developer productivity. Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
  • 42. Data Science Spectrum (columns: Past / Present / Future). Explore: trends / alerts / forecasting. Analyze: summarize / compare / what-if. Experiment: model / benchmark / simulate. The Art and Science of Analyzing Software Data, by Bird, Menzies, Zimmermann, Elsevier 2015.
  • 43. Software Analytics and its role in Automation • Scaling to 1000’s of developers — automation is required! [Jacek Czerwonka] • Goal is to optimize competing concerns of quality, time, resources • Data Scientists manage and measure impacts of automation and software analytics [Kim et al., 2016]
  • 45. Does increasing test code coverage increase reliability?
  • 46. Does increasing test code coverage increase reliability? No! Wasting time testing simple code may increase the presence of bugs! [Mockus et al.]
  • 47. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  • 48. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data Consider both researchers and practitioners….
  • 49. Five Risks 1) Data and construct trustworthiness 2) Reliability of the results 3) Ethical concerns 4) Unintended and unexpected consequences 5) Big data can’t answer big questions
  • 50. Risk #1: Trustworthiness of the data Data representativeness (construct validity) Data completeness Inaccuracies in profiles, exaggerations, skewed opinions Treating humans as “rational” animals [Harper et al.]
  • 51. Perils from using GitHub data: A repository is not necessarily a (development) project Most projects are inactive or have few commits Most projects are for personal use only Only 10% of projects use pull requests History can be rewritten on GitHub A lot happens outside of GitHub The Promises and Perils of Mining GitHub, Eirini Kalliamvakou et al., MSR 2014.
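The perils listed on slide 51 suggest screening repositories before treating them as development projects. A minimal sketch of such a screen over already-collected records, assuming hypothetical field names, sample data, and thresholds (none of these come from the talk or the cited paper, and no GitHub API is called):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RepoRecord:
    """Hypothetical summary of one mined repository."""
    name: str
    commits: int
    committers: int
    last_commit: date
    is_fork: bool


def looks_like_active_project(repo: RepoRecord, today: date,
                              min_commits: int = 50,
                              min_committers: int = 2,
                              max_idle_days: int = 180) -> bool:
    """Heuristic screen; every threshold is an illustrative assumption."""
    idle_days = (today - repo.last_commit).days
    return (not repo.is_fork
            and repo.commits >= min_commits
            and repo.committers >= min_committers
            and idle_days <= max_idle_days)


TODAY = date(2016, 5, 4)
REPOS = [
    RepoRecord("dotfiles", 12, 1, date(2016, 4, 30), False),          # personal use
    RepoRecord("abandoned-lib", 400, 5, date(2013, 1, 10), False),    # inactive
    RepoRecord("team-service", 900, 8, date(2016, 5, 1), False),
    RepoRecord("team-service-fork", 900, 8, date(2016, 5, 1), True),  # not a project
]

SELECTED = [r.name for r in REPOS if looks_like_active_project(r, TODAY)]

if __name__ == "__main__":
    print(SELECTED)
```

A screen like this only addresses what is visible in the repository; it cannot recover rewritten history or activity that happens outside GitHub, so it complements rather than replaces asking the people involved.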
  • 52. Risk #2: Trustworthiness of the results Researcher bias [Shepperd et al., 2014] Confusing correlations with cause and effect Big data and small effects [Marcus et al.] Inappropriate generalization Conclusion instability [Menzies et al.]
  • 53. “all models are wrong, but some are useful” [Box, 1976] http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
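The "big data and small effects" risk on slide 52 can be made concrete with the standard t statistic for testing whether a Pearson correlation is zero, t = r·sqrt((n − 2)/(1 − r²)). A minimal sketch, assuming a fixed, hypothetical correlation of r = 0.01 and a large-sample threshold of 1.96; the numbers are illustrative and not from any study cited in the talk:

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for the null hypothesis that a Pearson correlation is 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


R = 0.01          # hypothetical effect: explains r^2 = 0.01% of the variance
CRITICAL = 1.96   # approximate two-sided 5% threshold for large samples

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        t = t_statistic(R, n)
        verdict = "significant" if abs(t) > CRITICAL else "not significant"
        print(f"n={n:>9,}  r={R}  r^2={R * R:.4%}  t={t:6.2f}  {verdict}")
```

The effect size never changes, only the sample size does, so a "significant" result at a million data points says nothing about whether the effect matters in practice.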
  • 54. Risk #3: Ethical concerns Private, public, blurred spaces Surveillance at the level of the individual Opaque algorithms, opaque biases [Tufekci, CSCW Keynote, 2015]
  • 56. Risk #4: Unexpected consequences Negative side effects [Gender studies] Gaming the gamification Incentives? Handle with care!
  • 57. Assessing and watching developers Singer, Filho, Cleary, Treude, Storey, Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.
  • 59. Most unwise questions! Analyze This! 145 Questions for Data Scientists in Software Engineering Andrew Begel and Thomas Zimmermann
  • 60. Risk #5: Big Data can’t answer Big Questions Or
  • 62. Risk #5: Big Data can’t answer Big Questions alone
  • 63. Examples of big questions? • What is a good architecture to solve problem x? [Devanbu] • What makes a really awesome programmer? [Software managers] • How to build a great development team? [Google] • How is program knowledge distributed? [Naur] • What is the ideal software engineering process? [Facebook, Microsoft, IBM,…] • What tools/practices support a participatory development process? [Storey et al.]
  • 64. Five Risks 1) Data and construct trustworthiness 2) Reliability of the results 3) Ethical concerns 4) Unintended and unexpected consequences 5) Big data can’t answer big questions
  • 65. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data, and why thick data needs big data! Consider both researchers and practitioners….
  • 66. Data scientists… “Typically start with the data, rather than starting with the problem.” [Forbes] “I love data” “I love patterns” [Kim et al., ICSE 2016] http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/print/
  • 67. John Snow’s theory about cholera came from talking to people [1850’s]
  • 68. Danger zones… http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science-emma-uprichard/ “Most big data is social data – the analytics need serious interrogation” Social Science+ “It doesn’t matter how much or how good our data is if the approach to modelling social systems is backwards.”
  • 69. What is “thick” data? Researcher generated “thick” data Explanations, motivations, recommendations Questions rather than answers Variables for a model Future challenges Limitations: Self reporting, researcher bias, ambiguity in instruments and collected data
  • 70. Beyond “Mixed Methods”: Ethnomining Combines the ethos of ethnography interleaved with data mining techniques around behavioral/social data Storytelling (to support the numbers) Leverages visualization within tight loops of eliciting/reporting results http://ethnographymatters.net/blog/2013/04/02/april-2013-ethnomining-and-the-combination-of-qualitative-quantitative-data/
  • 73. Research challenges ahead Big data! (of trace and thick data!) Rapid pace of change (increased automation, participatory culture) Studying unstable objects [Rogers] Poor boundaries of study contexts
  • 74. Kevin Kelly, Futurist: “You’ll be paid in the future based on how well you work with robots.”
  • 75. Key Takeaway: Big Data needs Thick Data
  • 76. Future of data science in software engineering? Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s) Big Data meets Thick Data @margaretstorey
  • 77. References:
    “Mad about Measurement”, DeMarco, http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0818676450.html
    Van Solingen, Rini, et al. "Goal question metric (GQM) approach." Encyclopedia of Software Engineering (2002).
    The Emerging Role of Data Scientists on Software Development Teams, Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, ICSE May 2016.
    Analyze This! 145 Questions for Data Scientists in Software Engineering, Andrew Begel and Thomas Zimmermann, ICSE June 2014.
    Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
    Rules of Data Science in SE, see www.slideshare.net/timmenzies/the-art-and-science-of-analyzing-software-data
    Audris Mockus, Nachiappan Nagappan, Trung T. Dinh-Trong, Test coverage and post-verification defects: A multiple case study. ESEM 2009: 291-301.
    Shepperd, Martin, David Bowes, and Tracy Hall. "Researcher bias: The use of machine learning in software defect prediction." Software Engineering, IEEE Transactions on 40.6 (2014): 603-616.
  • 78. M. Storey, The Evolution of the Social Programmer, Mining Software Repositories (MSR) 2012 Keynote, http://www.slideshare.net/mastorey/msr-2012-keynote-storey-slideshare. M. Storey et al., The (R)evolution of Social Media in Software Engineering, ICSE Future of Software Engineering 2014, http://www.slideshare.net/mastorey/icse2014-fose-social-media. H. Jenkins, K. Clinton, R. Purushotma, A. J. Robison, and M. Weigel. Confronting the challenges of participatory culture: Media education for the 21st century, 2006. http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF. L. Singer, F. F. Filho, B. Cleary, C. Treude, M.-A. Storey, K. Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators. C. Treude and M.-A. Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” in ICSE’10: Proc. of the 32nd ACM/IEEE Int. Conference on Software Engineering, ACM. C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns in Collaborative Software Development. IEEE Transactions on Software Engineering 38, 1 (January/February 2012), pp. 19-34.
  • 79. [Marcus2014] Gary Marcus and Ernest Davis, "Eight (No, Nine!) Problems with Big Data", New York Times, April 6, 2014. [Harper2013] Richard Harper, Christian Bird, Thomas Zimmermann, and Brendan Murphy, "Dwelling in Software: Aspects of the felt-life of engineers in large software projects", Proceedings of the 13th European Conference on Computer Supported Cooperative Work (ECSCW '13), Springer, September 2013. P. Naur and B. Randell. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, Oct. 1968. NATO. McGrath, J. E. "Methodology matters: Doing research in the behavioral and social sciences." Readings in Human-Computer Interaction: Toward the Year 2000 (2nd ed.), 1995. Aranda, Jorge, and Gina Venolia. "The secret life of bugs: Going past the errors and omissions in software repositories." Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009. Ethno-Mining: Integrating Numbers and Words from the Ground Up: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-125.pdf. How Google builds a really development team, New York Times, 2016. [Tufekci2015] Zeynep Tufekci, "Algorithms in our Midst: Information, Power and Choice when Software is Everywhere", Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 1918-1918, ACM, 2015.