6. The Semantic Scout
• A framework for search, presentation, and analysis of entities and
their associated knowledge
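• Employs Semantic Web (SW), Linked Open Data (LOD), natural language
processing (NLP), and information retrieval (IR) techniques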
• Scientific work goes back to 2006, first presented at ISWC2007
• An evolving prototype for requirements of the EU IP IKS: semantic
search, hybrid IR/SW identity management, automatic document
classification (against DBpedia)
• 2009 requirements from the technology transfer office of CNR for the
NetwOrK initiative
7. The CNR
• CNR is the largest research institution in Italy
– about 8000 permanent researchers (+14000)
– 7 departments focused on the main scientific
research areas
– 108 institutes spread all over Italy
• Subdivided into research units, labs, etc.
8. The CNR data sources
[Diagram: CNR data sources, held in several databases (DB) and a file system]
• Data groups: organizational data, activity-related data,
personnel-related data, financial data
• Labelled sources: administration; frameworks, programmes, workpackages;
departments, institutes, central admin; documentation, publications;
curricula; permanent employees; other research employees; accounting,
contracts, invoicing; externally funded projects
• Only partly available as open data!
9. The CNR tasks
• Strategic objective: matching the research
demand to the research supply
• Requirements
– Semantic interoperability between heterogeneous
data sources
– Expert finding based on competence
– Monitoring funding and evolution of different
research areas and units
– Browsing and reporting capabilities
14. Sources and lifting
• The situation is usually not as clean as having a
single CMS for most organizational tasks
• DB (e.g. SQL Server) + a lot of textual
records + HTML Web Site + textual corpus +
linked open data
• DB + interaction schemata (XML templates
and HTML scraping, needed because of
schemata degradation and user perspective
evolution)
15. Ontology design
• Starting from XML templates as module/pattern drafts
• Reengineering XML and scraped templates
• Reengineering DB schemata (system engineer
involved)
• Obtained modular, pattern-based, task-based ontology
• Textual DB records with identity: precondition for
hybridizing IR and SW (see later)
• Alignments to FOAF, SIOC, SKOS, WordNet ontologies
• Used patterns: situation, place, transitive reduction
17. Data design
• Triplifiers based on SQL rules (automatic
scripting on JDBC drivers not enough because
of legacy degradation of physical schemata)
– Cf. also: Semion reengineering tool
• Inferences: OWL (Pellet, HermiT), SPARQL
CONSTRUCT
• Extraction tool: Semiosearch, categorizer over
Wikipedia categories
– Next: deep parsing approach (facts, relations, entities)
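
A minimal sketch of a rule-based triplifier, assuming each rule pairs a SQL query with a triple template. The table, columns, and the example.org namespace are invented for illustration; they are not the CNR schemata, and SQLite stands in for the actual database.

```python
import sqlite3

BASE = "http://example.org/cnr/"  # hypothetical namespace, not the real CNR URIs

# Each rule pairs a SQL query with a template that maps one row to one triple.
# Table and column names are invented for illustration.
RULES = [
    ("SELECT id, name FROM researcher",
     lambda r: (f"<{BASE}person/{r[0]}>",
                "<http://xmlns.com/foaf/0.1/name>",
                f'"{r[1]}"')),  # note: no escaping of quotes in literals here
    ("SELECT id, institute_id FROM researcher",
     lambda r: (f"<{BASE}person/{r[0]}>",
                f"<{BASE}ontology/worksAt>",
                f"<{BASE}institute/{r[1]}>")),
]

def triplify(conn, rules):
    """Run every rule and return the resulting triples as N-Triples lines."""
    lines = []
    for query, template in rules:
        for row in conn.execute(query):
            s, p, o = template(row)
            lines.append(f"{s} {p} {o} .")
    return lines

def demo_db():
    """Build a one-row in-memory database to exercise the rules."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE researcher (id INTEGER, name TEXT, institute_id TEXT)")
    conn.execute("INSERT INTO researcher VALUES (1, 'Ada Example', 'istc')")
    return conn

if __name__ == "__main__":
    for line in triplify(demo_db(), RULES):
        print(line)
```

Writing the rules by hand, rather than deriving them from the physical schema, is what lets them work around degraded legacy columns.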
18. Publishing and hybridizing
• Publishing OWL-RDF datasets
– linked data approach (persistent URIs, triple stores for RDF dataset management,
linking to common vocabularies: FOAF, DBpedia, Geonames, Bibo, ...)
– OWL ontologies for dataset generation, querying, inference (new enriched
datasets)
• Subgraph extraction through SNA
• Virtual semantic corpus
– IRW to distinguish information and non-information resources
– SPARQL rules to generate virtual texts associated with entities
• Indexing
– Lucene+LSA indexing of semantic corpus
– “Semantic” Lucene extension to produce tight coupling of virtual texts with
entities
– Multilinguality
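
A minimal sketch of the virtual-corpus idea, assuming a virtual text is an entity's own label joined with the labels of its neighbours. Plain Python stands in for the SPARQL rules and a toy inverted index stands in for Lucene+LSA; all identifiers are invented.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all identifiers are invented.
TRIPLES = [
    ("ex:p1", "label", "Ada Example"),
    ("ex:p1", "memberOf", "ex:proj1"),
    ("ex:proj1", "label", "Reputation in social networks"),
    ("ex:p2", "label", "Bob Sample"),
    ("ex:p2", "memberOf", "ex:proj2"),
    ("ex:proj2", "label", "Ontology design patterns"),
]

def labels(triples):
    """Map each entity to its label."""
    return {s: o for s, p, o in triples if p == "label"}

def virtual_texts(triples):
    """Build one text per entity from its own label and its neighbours' labels."""
    lab = labels(triples)
    texts = defaultdict(list)
    for s, p, o in triples:
        if p == "label":
            texts[s].append(o)
        elif o in lab:
            texts[s].append(lab[o])
    return {entity: " ".join(parts) for entity, parts in texts.items()}

def build_index(texts):
    """Inverted index from lower-cased term to the entities whose text has it."""
    index = defaultdict(set)
    for entity, text in texts.items():
        for term in text.lower().split():
            index[term].add(entity)
    return index

def search(index, keyword):
    """Return the entities matching a single keyword, sorted for stable output."""
    return sorted(index.get(keyword.lower(), set()))

if __name__ == "__main__":
    index = build_index(virtual_texts(TRIPLES))
    print(search(index, "reputation"))
```

Because the index stores entity identifiers rather than documents, a keyword hit leads straight back to a node in the RDF graph, which is the tight coupling the slide refers to.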
19. Consuming
• SPARQL endpoint, with interface enhancement
• Keyword-based search
– Semantic browsing with SPARQL-based AJAX DHTML, RDF
relation browser, or XML-based relation browser
• Category-based search
– Keyword-based result focusing
23. Expert finding: Task-based testing
• It is based on the ability to materialize on
demand a contextual network of relevant
information.
• It is performed with a combination of tools in the
toolkit to:
– Identify the main topics of research
– Recursively search the CNR data cloud
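
A minimal sketch of searching outward from an entry point, assuming the data cloud can be viewed as a graph of linked resources. A breadth-first walk with a hop limit stands in for the recursive search; the graph and identifiers are invented.

```python
from collections import deque

# Toy undirected links between resources; all identifiers are invented.
LINKS = {
    "programme:A": ["project:X", "person:1"],
    "project:X": ["programme:A", "person:2", "person:3"],
    "person:1": ["programme:A"],
    "person:2": ["project:X", "institute:I"],
    "person:3": ["project:X"],
    "institute:I": ["person:2", "person:4"],
    "person:4": ["institute:I"],
}

def people_near(links, entry, max_hops):
    """Breadth-first walk from `entry`; return (person, distance) pairs
    for every person reachable within `max_hops` links."""
    seen = {entry}
    queue = deque([(entry, 0)])
    found = []
    while queue:
        node, dist = queue.popleft()
        if node.startswith("person:") and node != entry:
            found.append((node, dist))
        if dist < max_hops:
            for nxt in links.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return found

if __name__ == "__main__":
    print(people_near(LINKS, "programme:A", 2))
```

The hop limit controls how large the materialized contextual network becomes: a small limit keeps only people directly tied to the entry point, a larger one pulls in colleagues of colleagues.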
24. Identifying the main topics of research: project description
• “Reputation is a social knowledge, on which a number of social decisions are
accomplished. Regulating society from the morning of mankind becomes more
crucial with the pace of development of ICT technologies, dramatically
enlarging the range of interaction and generating new types of aggregation.
Despite its critical role, reputation generation, transmission and use are
unclear. The project aims to an interdisciplinary theory of reputation and to
modeling the interplay between direct evaluations and meta-evaluations in
three types of decisions, epistemic (whether to form a given evaluation),
strategic (whether and how interact with target), and memetic (whether and
which evaluation to transmit).”
– Project About: Social Knowledge for e-Governance.
– Topics can be manually annotated, or automatically induced,
e.g.: ethics, sociology, collaboration, social network,
reputation
25. Identifying the main topics of research: text categorization
• Query: “ethics, sociology, collaboration, social network, reputation”
26. Search the CNR data cloud: identify an entry point
• “Commessa” (programme): “Il Circuito dell’Integrazione: Mente, Relazioni
e Reti Sociali. Simulazione Sociale e Strumenti di Governance”
27. Search the CNR data cloud: identify key people
• Ing. Jordi Sabater: Cognitive Science;
• Dott. Mario Paolucci: Sociology, Psychology;
• Gennaro di Tosto: Artificial Intelligence;
• Walter Quattrociocchi: Interdisciplinary Fields;
• Giuseppe Castaldi: Ethics;
• Aldo Gangemi: Semantic Web, Knowledge representation.
28. Expert Finding: Results
• The description of the “eRep” project was adopted as a
gold standard to evaluate the results when testing the
Semantic Scout.
• 6 out of 10 CNR researchers were correctly retrieved,
plus a project member affiliated with another
institution.
– Project Coordinator: Dott. Mario Paolucci
– External Member: Jordi Sabater Mir
29. Functional evaluation of Semantic Scout (example)
• Expert finding accuracy
– All 6 retrieved people ranked among the top 10 results
returned by the search engine.
• Benefit of integrated data cloud
– The user judged an “activity” to be relevant to his goal and
used it as an entry point to the CNR network of resources.
30. Functional evaluation of Semantic Scout
• Accessibility and Interaction
– Multiple user interfaces give users a level of interaction
adapted to each specific type of required information
• Completeness of retrieval
– 4 people were not included in our result set:
– Antonietta Di Salvatore scored below the first 10 people in the
list (+1).
– Giulia Andrighetto was not listed among the people relevant to
the query, but belongs to the social network of Dr. Rosaria
Conte (+1).
– Marco Capenni and Stefano Picascia have a technician profile,
hence they are neither reported among the people relevant to
the search query, nor do they belong to the network of any of the
other researchers.
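
A minimal sketch of how the completeness figure can be read as recall at a cutoff, assuming the 10 CNR project members form the gold standard. The identifiers and rank positions below are placeholders, not the actual result list; only the counts (6 of 10 within the first 10 results, one more member ranked lower) come from the slides.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold-standard items found in the first k ranked results."""
    gold = set(gold)
    hits = gold & set(ranked[:k])
    return len(hits) / len(gold)

# Placeholder identifiers: r1..r10 are gold-standard members,
# x1..x5 are other people returned by the search. Positions are invented.
GOLD = [f"r{i}" for i in range(1, 11)]
RANKED = ["r1", "x1", "r2", "r3", "x2", "r4",
          "r5", "x3", "r6", "x4", "x5", "r7"]

if __name__ == "__main__":
    print(recall_at_k(RANKED, GOLD, 10))  # 6 of 10 members in the first 10
    print(recall_at_k(RANKED, GOLD, 12))  # a longer cutoff picks up one more
```

Reporting recall together with its cutoff makes it explicit that a member ranked just below the first 10 is a ranking miss rather than a retrieval miss.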
31. Ongoing work
• More data linking (e.g. DBLP,
Georeferencing)
• Synchronization with data sources
• More interaction paradigms
• Privacy issues interlaced with hierarchical
and idiosyncratic practices
32. Conclusions
• Hybridizing several semantic and retrieval
technologies provides added value to a
research organization
• Scalability works for CNR figures
• Interaction is a core selling point
• Try it at http://bit.ly/semanticscout
• @data_cnr_it, @semanticscout,
@aldogangemi