Talk given at SUITE 2010
Abstract:
The state of the art in mining software repositories stores software artifacts from various sources in monolithic relational databases. This puts a lot of querying power in the hands of software miners; however, it comes at the cost of enclosing the data and hampering cross-application reuse. In this paper we discuss four problem scenarios to illustrate that Semantic Web technology is able to overcome these limitations. However, it requires that the software engineering research community agree on two prerequisites: (a) a common vocabulary to talk about software repositories -- an ontology; (b) a strategy for generating unique and stable references to all software artifacts inside such a repository -- a Uniform Resource Identifier (URI).
Fostering Synergies - How Semantic Web Technology could influence Software Repositories
1. Fostering Synergies - How Semantic Web Technology could influence Software Repositories
Michael Würsch, Gerald Reif, Serge Demeyer, Harald Gall
University of Zurich, Switzerland
Department of Informatics, software evolution & architecture lab
2. Developer’s Information Needs
‣ Who has changed this code and why?
‣ How can I persist data in Spring?
‣ What are the subclasses of JComponent?
‣ What class in my project had the most bugs prior to the last release?
‣ ...
3. Information Silos
Bugzilla
Mailing lists
CVS
Atlassian Jira
Subversion
Wikis
‣ Limited search capabilities
‣ No unified data model
‣ No references across silo boundaries
‣ No cross-domain queries
4. Leveraging Information: State of the Art
[Diagram: sources such as Bugzilla and CVS are preprocessed and mirrored into tool-specific databases]
www.google.com/codesearch
www.koders.com
sourcerer.ics.uci.edu
www.evolizer.org
5. Again, Silos...
www.google.com/codesearch
www.koders.com
sourcerer.ics.uci.edu
www.evolizer.org
‣ Database schemas are not portable
‣ Relations are local
‣ There is no consistent way of getting the meaning of a relation
6. Release your Data!
‣ Use a common vocabulary to describe software artifacts and their relationships
‣ Expose unique identifiers for software artifacts on the web
7. The Semantic Web/The Web of Data
‣ Graph-based data model described by S-P-O triples
‣ URIs to reference Resources
‣ Ontologies to formalize a common understanding of a domain
‣ SPARQL to search by matching graph-patterns
8. Example: Building an RDF Graph
Subject: http://myProject.org/bugs/nr124
Predicate: http://evolizer.org/bugOntology/affects
Object: http://sourcerer.ics.uci.edu/myProject/Foo.java
9. Research Agenda
Come up with a strategy for generating URIs for software artifacts
Develop an ontology of software artifacts and their relationships
10. Existing Ontologies
EvoOnt
http://www.ifi.uzh.ch/ddis/evo/
SEON - Software Engineering Ontology
http://evolizer.org
Baetle - Bug And Enhancement Tracking Language
http://code.google.com/p/baetle/
DOAP - Description of a Project
http://trac.usefulinc.com/doap
11. Release your Data!
‣ Formalize a common vocabulary to describe software artifacts and their relationships
‣ Devise strategies to generate URIs for software artifacts
‣ Expose these URIs on the Web
The Semantic Web/The Web of Data
‣ Graph-based data model described by S-P-O triples
‣ URIs to reference Resources
‣ Ontologies to formalize a common understanding of a domain
‣ SPARQL to search by matching graph-patterns
Existing Ontologies
EvoOnt
http://www.ifi.uzh.ch/ddis/evo/
SEON - Software Engineering Ontology
http://evolizer.org
Baetle - Bug And Enhancement Tracking Language
http://code.google.com/p/baetle/
DOAP - Description of a Project
http://trac.usefulinc.com/doap
Research Agenda
Come up with a strategy for generating URIs for software artifacts
Develop an ontology of software artifacts and their relationships
Editor's notes
Search-Driven Software Engineering is all about fulfilling information needs of developers or maintainers. These information needs can be expressed in terms of questions, such as...(Read some of the questions above)
The data needed to answer such questions is often locked away in data silos, such as Bug Tracking Systems, Version Control Systems, Mailing lists, etc. I say locked away, because many of these tools are not made for querying. Further, many information needs span more than one domain. This is where limitations of the existing systems are apparent. To summarize them (read examples).
(continue) we usually parse, for example, CVS logs or XML exports of bug reports and use some heuristics to establish links between them. Or we build richer source code models by parsing or partially compiling source files. Then we more or less mirror all the information in a relational database and provide a query interface on top of it. Examples are...(name the examples)
From the point of view of other researchers and tool builders, we are again building silos that are barely useful for purposes other than the ones originally envisioned.
There are three main reasons for that:
First, in theory, db schemas should be exchangeable thanks to DDLs; in practice, it is still a painful undertaking.
Second, relations are local - you cannot simply reference an entity stored within another database from your own database. You basically enclose your data in the db.
Third, there is no consistent way to get the meaning of a relation - a query can join tables on any columns that match by data type, without any check on semantics.
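To make the third point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two tables and their columns are invented for illustration and do not come from any of the tools mentioned. The database accepts a join between bug ids and revision numbers simply because both are integers.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bug (id INTEGER PRIMARY KEY, summary TEXT);
    CREATE TABLE revision (number INTEGER PRIMARY KEY, author TEXT);
    INSERT INTO bug VALUES (124, 'NullPointerException in Foo');
    INSERT INTO revision VALUES (124, 'alice');
""")

# Bug ids and revision numbers are unrelated concepts, yet the join is
# accepted because the column types are compatible.
rows = conn.execute(
    "SELECT bug.summary, revision.author "
    "FROM bug JOIN revision ON bug.id = revision.number"
).fetchall()
print(rows)
```

The query returns a row pairing the bug with an author who may have nothing to do with it. Nothing in the schema says whether the relation is meaningful.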
We believe that we should overcome these limitations by defining a common data schema, meaning a common syntax and vocabulary to describe software artifacts and the relationships between them. This would, for example, give us the possibility to try out different search engines on different data repositories.
We should further come up with ways to expose unique identifiers for software artifacts on the web. This enables two things: first, we can reference information across these silo boundaries, and second, we could then potentially run distributed queries without all the preprocessing and mirroring effort I have mentioned before.
We believe that the Semantic Web provides the tools for this. Forget about all the A.I. magic that you might associate with the Semantic Web. It is just a convenient yet simple framework for describing and working with data.
It provides a graph-based data model, described by simple subject-predicate-object triples, and URIs to reference resources. Vocabulary is described by ontologies. You can search in such information graphs with SPARQL, the query language of the Semantic Web.
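As a rough illustration of what "searching by matching graph patterns" means, the sketch below matches a single triple pattern against a small in-memory set of triples in plain Python. It is not SPARQL and not a real triple store. The first bug and file URIs and the 'affects' property are taken from the example slide, while the second bug and file are made up.

```python
# Triples are (subject, predicate, object) tuples of URI strings.
AFFECTS = "http://evolizer.org/bugOntology/affects"
FOO = "http://sourcerer.ics.uci.edu/myProject/Foo.java"
graph = {
    ("http://myProject.org/bugs/nr124", AFFECTS, FOO),
    # Assumed second statement, added only so the pattern has to filter.
    ("http://myProject.org/bugs/nr125", AFFECTS,
     "http://sourcerer.ics.uci.edu/myProject/Bar.java"),
}

def match(graph, pattern):
    """Yield variable bindings for one triple pattern.

    Pattern terms starting with '?' are variables; every other term
    must equal the corresponding term of the triple.
    """
    for triple in graph:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings

# Roughly corresponds to the SPARQL query:
#   SELECT ?bug WHERE { ?bug <.../bugOntology/affects> <.../myProject/Foo.java> }
bugs = sorted(b["?bug"] for b in match(graph, ("?bug", AFFECTS, FOO)))
print(bugs)
```

A real SPARQL engine joins several such patterns and can evaluate them over remote endpoints. The sketch only shows the core idea of binding variables to the parts of the graph that fit the pattern.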
Given two repositories, one that stores bug reports and one that stores a full-fledged source code model, we can then, in a third place, make a statement about a particular bug and a particular Java class. This is an s-p-o triple.
Dereferencing the URIs leads to the resources or, in the case of the ‘affects’ property, to the ontology definition.
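The statement from the example slide can be written down in N-Triples, a line-based RDF syntax in which each URI is wrapped in angle brackets and the statement ends with a dot. The helper below is a minimal sketch with a hypothetical name. It assumes all three terms are already valid URIs and does no escaping or literal handling.

```python
def to_ntriples(subject, predicate, obj):
    """Serialize one triple of URI references as an N-Triples line.

    Assumes the inputs are valid URIs; no escaping is performed.
    """
    return f"<{subject}> <{predicate}> <{obj}> ."

line = to_ntriples(
    "http://myProject.org/bugs/nr124",                  # the bug report
    "http://evolizer.org/bugOntology/affects",          # the property
    "http://sourcerer.ics.uci.edu/myProject/Foo.java",  # the Java class
)
print(line)
```

Because the line only contains URIs, it can be stored anywhere, independently of the bug repository and the source code repository it refers to.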
We need unique and stable identifiers for software engineering artifacts. It’s easy to come up with such URIs for some artifacts, but not so straightforward for others.
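To illustrate the difference, here is one possible URI scheme, sketched in Python. It is purely hypothetical: the base URI, the path layout, and the function names are assumptions and not a scheme proposed in the talk. A file gets a URI from its project and path, while a method needs its full signature so that overloads stay distinct.

```python
from urllib.parse import quote

# Hypothetical base URI under which artifacts would be published.
BASE = "http://example.org/artifacts"

def file_uri(project, path):
    """Mint a URI for a source file from its project name and path."""
    return f"{BASE}/{quote(project, safe='')}/{quote(path)}"

def method_uri(project, path, class_name, method, param_types):
    """Mint a URI for a method; parameter types disambiguate overloads."""
    signature = f"{class_name}.{method}({','.join(param_types)})"
    return f"{file_uri(project, path)}#{quote(signature, safe='')}"

print(file_uri("myProject", "src/Foo.java"))
print(method_uri("myProject", "src/Foo.java", "Foo", "bar", ["int", "String"]))
```

Even this small sketch shows why stability is the hard part: renaming or moving Foo.java changes every URI derived from its path, so a real strategy would also have to say how identifiers survive refactorings and new revisions.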
We need to agree on a common vocabulary (data schema) for software engineering concepts. A source code visualization tool should not need to care whether it works with data retrieved from Evolizer, Google Code Search, Koders, or Sourcerer. These tasks are clearly a community effort.
No need to start from scratch - take existing ontologies and consolidate.