1. Back to the future: Empirical
Revolution(s) in Software
Engineering
Audris Mockus
University of Tennessee
audris@utk.edu
MSR’22 [2022-05-19 Thu]
2. How to use history
“. . . my history will . . . be judged useful . . . as an
aid to the interpretation of the future, which in the
course of human things must resemble if it does
not reflect it.” The History of the Peloponnesian
War, Thucydides
I Fit model on past data and predict future
I This talk is about best features/predictors to
use
3. Modes of Production and Capability to
Measure
I Modes of software production
I Computers: direct/assembly
I Programming languages, tools, styles
I Development process/coordination
I Widespread collaboration
I Science also needs:
I Increased ability to measure
I Reduction of measurement costs
I Improved measurement accuracy
4. Examples from the past
I First languages: McCabe, Halstead
I OO programming: OO complexity/smells
I Large teams: coordination:
I What do developers do?
I Defects, development effort, delays
I CMM/Software Factories as the apotheosis of
process
I Open Source Software: Collaboration
I Increased ability to measure
I No cost: operational data in VCS/ITS/Mailing lists
I Accuracy: problematic accuracy
5. Present mode of production
I Massive collaboration, reuse, knowledge
transfer
I 60M contributors (43M aliased)
I 173M projects (107M deforked)
I 22 percent copied (to 14 projects on average)
I Largest communities
I Collaboration: 6M projects+contributors (75M
clique)
I Code reuse: 4M projects (46M clique), 5M
contributors (17M clique)
6. Conceptual models
I Software supply chains (SSCs)
I Key objectives: SSC security
I Vulnerabilities
I Upstream, orphan (copied), bad practice
I Sustainability
I Basic maintenance
I Response to user needs
7. Why think of supply chains
I Traditional SC
I Product, service, information flow
I SSC
I Dependencies, copying, knowledge
I The same character
I Complex Interdependencies
I Decentralized decision making
I Need to manage risks
8. SSC of the 1st kind
I Technical dependencies among projects with
change effort as product flow
I Primary risks: unknown vulnerabilities,
breaking changes, lack of maintenance, lack of
popularity
Examples of SSC of the first kind
I Python: import re
I Java: import java.util.Collection;
I JavaScript: package.json
9. SSC of the 2nd kind
I Copying of the source code from project to
project as product flow
I Primary risks: license compliance, unfixed
vulnerabilities/bugs, missing updated
functionality
Examples of SSC of the second kind
I Implementation of a complex algorithm
I Useful template
I Build configuration
10. SSC of the 3rd kind
I Knowledge (product) flow through code
changes as developers learn from and impart
their knowledge to the source code
I Primary risks: developers may leave, companies
may discontinue support
Examples of SSC of the third kind
I Developers gaining skills with
tools/packages/practices
I Developers spreading practices, e.g., testing
frameworks
11. SSCs Measurement: Key Needs
I Completeness: the entirety of OSS
I Autocuration: address data quality at scale
I Cross-referencing: to make analysis run in
minutes not months
12. What does SSC-based production mean?
I Sw. project is a small part of its supply chains
I e.g., activity, practices, quality, sustainability, and
popularity
I Reconstruct global dynamic properties of the
entire public source code
I Proper project/Author/Code sampling.
13. How to conduct NextGen SW research?
I World of Code Infrastructure[2, 1]:
I Complete, Current, Curated, Cross-referenced
I “Research Ready” for Supply Chain Research
I How to use: github.com/woc-hack/tutorial,
worldofcode.org
I May 21-22: https://github.com/woc-hack/
hackathon-pittsburgh-2022
I 170+M projects; 60+M Authors; 3.1B Commits;
12B Blobs and Trees
I Specialized database without redundancies: 300TB
14. Summary
I Revolutions are predicted by
I New modes of production (SSC)
I Suitable Measurement for SSCs (e.g., WoC) need:
I Completeness
I Low Cost
I High Accuracy
15. Bio
Audris Mockus worked at AT&T, then Lucent Bell Labs and Avaya Labs for 21 years.
Now he is the Ericsson-Harlan D. Mills Chair professor in the Department of Electrical
Engineering and Computer Science of the University of Tennessee.
He specializes in the recovery, documentation, and analysis of digital remains left as
traces of collective and individual activity. He would like to reconstruct and improve
the reality from these projections via methods that contextualize, correct, and
augment these digital traces, modeling techniques that present and affect the behavior
of teams and individuals, and statistical models and optimization techniques that help
understand the nature of individual and collective behavior. His work has improved the
understanding of how teams of software engineers interact and how to measure their
productivity.
Dr. Mockus received a B.S. and an M.S. in Applied Mathematics from Moscow
Institute of Physics and Technology in 1988. In 1991 he received an M.S. and in 1994
he received a Ph.D. in Statistics from Carnegie Mellon University.
16. Abstract
The desire to better understand software development lead to numerous attempts to
quantify it. Easy-to-measure artifacts, such as source code, could provide only the
most basic understanding of the entire development process and the attempts to
directly measure quality and effort were cost-prohibitive, error-prone, and rarely shared
with researchers or made public. The rise of open source not only provided a reliable
software infrastructure but also the rich data source for the software engineering
community to finally measure aspects of software development never seen before.
Over time and with increased use of open source software, actual developers
increasingly need to deal not just with their own project but with projects upstream,
downstream, or sideways in the huge open source software supply chain. Measures
derived from the entire software supply chain are now likely to bring about the next
software engineering revolution.
17. References
Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus.
World of code: An infrastructure for mining the universe of open source vcs data.
In IEEE Working Conference on Mining Software Repositories, May 26 2019.
Audris Mockus.
Amassing and indexing a large sample of version control systems: towards the census of public source code
history.
In 6th IEEE Working Conference on Mining Software Repositories, May 16–17 2009.