Recent Trends in Semantic Search Technologies
1. Semantic Search on the Rise
Peter Mika | Yahoo! Research, Spain
pmika@yahoo-inc.com
Thanh Tran | Semsolute, Germany
Tran@semsolute.com
2. About the speakers
Peter Mika
Senior Research Scientist
Head of Semantic Search group at
Yahoo! Labs
Expertise: Semantic Search, Web
Object Retrieval, Natural Language
Processing
Tran Duc Thanh
CEO of Semsolute, Semantic Search
Technologies Company
Served as Assistant Professor for
Karlsruhe Institute of Technology and
Stanford University
Expertise: Semantic Search,
Semantic / Linked Data Management
3. Agenda
Why Semantic Search
What is Semantic Search
Innovative Semantic Search Applications
Behind the Scene
Questions
5. Why Semantic Search? I.
“We are at the beginning of search.” (Marissa Mayer)
Solved large classes of queries, e.g. navigational
Remaining queries are hard, not solvable by brute
force, require deep understanding of the world and
human cognition, e.g.
Ambiguous searches: paris hilton
Imprecise or overly precise searches
Searches for descriptions: 34 year old computer scientist
living in barcelona
Background knowledge and metadata can help to
address poorly solved queries
Many of these queries would not be asked by users, who learned
over time what search technology can and cannot do.
6. Why Semantic Search? II.
The Semantic Web is now a reality
Large amounts of data published in RDF
Linked Data
Metadata in HTML
Facebook's Open Graph Protocol
Schema.org
Casual users
Don't know SPARQL
Unaware of the schema of the data
Searching data instead of, or in addition to, searching
documents
Enable innovative search applications / tasks
8. Semantic Search: Using Semantic Models for
Search
Semantic search is a retrieval paradigm that
Exploits the semantics of the data or explicit background
knowledge to understand user intent and the meaning of
content
Incorporates the intent of the query and the meaning of
content into the search process (semantic models)
9. Semantic Search: Different Kinds / Different
Uses of Semantic Models
Wide range of semantic search systems
Employ different semantic models, possibly at
different steps of the search process and in order to
support different tasks
Query formulation
Query processing / understanding
Ranking
Result presentation
Result / query refinement
10. Semantic models
Semantics is concerned with the meaning of the
resources made available for search
Various representations of meaning
Word-level models: models of relationships among
words
Taxonomies, thesauri, dictionaries of entity names
Inference along linguistic relations, e.g. broader/narrower
terms
Concept-level models: models of relationships
among objects
Ontologies capture entities in the world and their
relationships
Inference along domain-specific relations
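As a rough illustration of inference along a linguistic relation, the sketch below expands a query term with its narrower terms from a thesaurus. The thesaurus contents, the function name `expand_query_term`, and the `max_depth` parameter are assumptions made up for this example, not part of any system described in the talk.

```python
from collections import deque

# Hypothetical thesaurus: term -> directly narrower terms.
NARROWER = {
    "vehicle": ["car", "truck"],
    "car": ["sedan", "convertible"],
}


def expand_query_term(term, narrower, max_depth=2):
    """Return the term plus all narrower terms within max_depth hops."""
    expanded = [term]
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow the relation any further
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                expanded.append(child)
                queue.append((child, depth + 1))
    return expanded


print(expand_query_term("vehicle", NARROWER))
```

A search for "vehicle" could then also match documents that only mention "sedan". Limiting the depth is one simple way to keep the expansion from drifting too far from the original term.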
11. Graph-based Conceptual Models
Core of W3C standards for knowledge representation
and data exchange: RDF, OWL
Large amount of data / knowledge on the Web
available as graphs
Linked Data: hundreds of interconnected datasets
capturing domain-independent and domain-specific
knowledge
Metadata in HTML
RDFa, microdata, Facebook's OGP
Private graphs
Google's Knowledge Graph
Facebook Graph
Yahoo's Knowledge Base (talk yesterday)
Microsoft's Satori
13. Where can you find Linked Data?
Downloads
DBpedia data dumps
SPARQL access
LOD cache by OpenLink: 51 billion triples
Keyword search
Sindice by SindiceTech
14. Google Knowledge Graph
Started with Freebase's database, which had 12 million
entities
As of June 2012, Knowledge Graph has 500 million
entities and over 3.5 billion relationships between
those entities
Prioritize properties based on what users were most
15. Facebook's Open Graph Protocol
The 'Like' button provides publishers with a way to
promote their content on Facebook and build
communities
Shows up in profiles and news feed
Site owners can later reach users who have liked an
object
Facebook Graph API allows 3rd party developers to
access the data
Open Graph Protocol is an RDFa-based format that
lets publishers describe the object that the user 'Likes'
16. Facebook's Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa
Simplify the work of developers by restricting the freedom in RDFa
Activities, Businesses, Groups, Organizations, People, Places,
Products and Entertainment
Only HTML <head> accepted
http://opengraphprotocol.org/
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
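To make the consumer side concrete, here is a minimal sketch of how a crawler might read Open Graph properties out of a page head like the one above, using only Python's standard `html.parser`. The class name `OpenGraphParser`, the helper `extract_open_graph`, and the trimmed `SAMPLE` page are assumptions for illustration; this is not how Facebook or any search engine actually processes the markup.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


def extract_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Trimmed version of the slide's example page (image tag omitted).
SAMPLE = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>
"""

og = extract_open_graph(SAMPLE)
print(og["og:title"], og["og:type"])
```

Because the protocol restricts markup to `<meta>` tags in the head, a simple tag scan like this is enough; a general RDFa parser would need to handle far more cases.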
17. Semantic Web markup: schema.org
Agreement on a shared set of schemas for common types
of web content
Use a single format to communicate the same information to all
participating search engines
Bing, Google, and Yahoo! (June 2011); Yandex joined in November 2011
Microdata and RDFa support
Schemas for most common web content
Business listings, images/video, recipes, reviews, products, jobs…
Community
public-vocabs@w3.org
19. Current state of metadata on the Web
Analysis of the Bing/Yahoo! Search Crawl
US crawl, January, 2012
31% of webpages, 5% of domains contain some metadata
P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
LDOW 2012
WebDataCommons.org
Data extracted from a public crawl (commoncrawl.org)
February, 2012 results show 11% of URLs with metadata
compared to 5% in 2009/2010 data
7.3 billion triples available for download
H. Mühleisen, C. Bizer. Web Data Commons - Extracting
Structured Data from Two Large Web Corpora, LDOW 2012
Large increase in RDFa and microdata adoption compared
to microformats
20. Where can you find HTML metadata?
Web Data Commons
Glimmer: glimmer.research.yahoo.com
Online index of the schema.org data in Web Data
Commons
22. Innovative Semantic Search Applications
Entity search: entity/entities as results
Factual search: direct answers, facts (about entities)
Relational search: complex relationships between entities
Semantic auto-completion: suggesting queries based on
the intent of the provided inputs
Results aggregation / analysis / prediction: apply
computational models
Semantic log analysis: understanding user behavior in
terms of objects
Semantic profiling: recommendations based on particular
interests
Semantic context: contextual model of users / interests
Support for complex tasks, e.g. booking a vacation using a
combination of services
Conversational search
31. Contextual (pervasive, ambient) search
Yahoo! Connected TV: widget engine embedded into the TV
Yahoo! IntoNow: recognize audio and show related content
32. Interactive Voice Search
Siri
Question-Answering
Variety of backend sources
including Wolfram Alpha and
various Yahoo! services
Task completion
E.g. schedule an event
34. Conversational Search
Parlance EU project
Complex dialogs around a set of objects
Restaurant
Area
Price range
Type of cuisine
Complete system
Automatic Speech Recognition (ASR)
Spoken Language Understanding (SLU)
Interaction Management
Knowledge Base
Natural Language Generation (NLG)
Text-to-Speech (TTS)
Video
Commercial alternatives from Nuance
36. Main Technological Building Blocks
Query Interpretation
Spelling Correction
Query Segmentation
Entity Recognition
Query Intent Interpretation for Semantic Auto-Completion
Ranking
Entity Ranking
Relationship Ranking
Aggregation
Result Fusion
Rank / Score Aggregation
Result Presentation
Summary Generation
Visualization
37. Semsolute's Building Blocks - Keyword / Key Phrase
Interpretation
Figure: keyword query “address company san francisco”; level shown: Entity
Semantic entity index
Inverted index for entities /
triples
Return entities / entities'
relationships as results to
keys
Semantic entity ranking
Structured language model:
one language model for every
attribute
Returns entities' LMs that
most likely generate the
keywords, i.e. the entity
descriptions that best match
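The structured language model idea can be sketched as follows: each attribute of an entity gets its own unigram language model, and an entity is scored by how likely a mixture of its attribute models is to generate the query keywords. Everything specific here is an assumption for illustration: the toy entities, the uniform attribute weights, the smoothing weight `lam`, and the add-one smoothed background model. The slides do not state which weights or smoothing Semsolute uses.

```python
import math
from collections import Counter

# Hypothetical toy entities: attribute name -> text value.
ENTITIES = {
    "acme": {"type": "company", "name": "acme corp",
             "address": "1 market street san francisco"},
    "globex": {"type": "company", "name": "globex",
               "address": "5 main street new york"},
    "sf": {"type": "city", "name": "san francisco"},
}


def tokenize(text):
    return text.lower().split()


def build_models(entities):
    """Build one term-count model per attribute, plus collection counts."""
    models = {}
    collection = Counter()
    for entity_id, attrs in entities.items():
        models[entity_id] = {a: Counter(tokenize(v)) for a, v in attrs.items()}
        for counts in models[entity_id].values():
            collection.update(counts)
    return models, collection


def score(query, attr_models, collection, lam=0.8):
    """Query log-likelihood under a uniform mixture of attribute LMs,
    interpolated with an add-one smoothed collection model (assumed)."""
    total = sum(collection.values())
    vocab = len(collection)
    weight = 1.0 / len(attr_models)  # assumption: all attributes equally important
    log_p = 0.0
    for term in tokenize(query):
        p_background = (collection[term] + 1) / (total + vocab)
        p_mixture = sum(weight * counts[term] / sum(counts.values())
                        for counts in attr_models.values())
        log_p += math.log(lam * p_mixture + (1 - lam) * p_background)
    return log_p


def rank(query, entities):
    models, collection = build_models(entities)
    scored = [(score(query, m, collection), entity_id)
              for entity_id, m in models.items()]
    return [entity_id for _, entity_id in sorted(scored, reverse=True)]


ranking = rank("address company san francisco", ENTITIES)
print(ranking)
```

In this sketch only attribute values are indexed, so the keyword "address" matches nothing and contributes the same background probability to every entity. A fuller system would presumably also match keywords against attribute and class names.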
38. Semsolute's Building Blocks – Semantic Graph
Construction
Figure: keyword query “address company san francisco”; levels shown: Entity, Relationships / Structure
Offline component: query-
independent schema graph
Reuse schema
Pseudo-schema construction:
all possible connections
between classes of entities,
e.g. friendships between users
Online component: query-
specific keyword matching
elements
Connect keyword matching
elements / entities to the
classes they belong to
39. Semsolute's Building Blocks – Graph Exploration
Figure: keyword query “address company san francisco”; levels shown: Entity, Relationships / Structure
Top-k graph exploration
Shortest-path based algorithm
that finds top-k graphs
connecting keyword matching
elements
Top-k graph ranking
Language model based
Aggregated model that
combines the LMs of entities
matching the keywords
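A much-simplified sketch of shortest-path based exploration: run a breadth-first search from every keyword-matching element, treat each node reachable from all of them as a candidate connecting node, and keep the k candidates with the smallest total path length. The schema graph, node names, and the cost function (sum of hop counts) are assumptions for illustration; the actual top-k procedure and its language-model based ranking are not specified in the slides.

```python
import heapq
from collections import deque

# Hypothetical schema graph as an undirected adjacency list.
GRAPH = {
    "Company": ["address", "locatedIn", "Person"],
    "address": ["Company"],
    "locatedIn": ["Company", "City"],
    "City": ["locatedIn", "San Francisco"],
    "San Francisco": ["City"],
    "Person": ["Company"],
}


def bfs(graph, start):
    """Return hop distances and shortest-path parents from start."""
    dist = {start: 0}
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parent[nxt] = node
                queue.append(nxt)
    return dist, parent


def path_to(parent, node):
    """Nodes on the shortest path from node back to the BFS start."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def top_k_subgraphs(graph, keyword_elements, k=3):
    """Return up to k (cost, connecting node, subgraph nodes) tuples."""
    searches = [bfs(graph, element) for element in keyword_elements]
    candidates = []
    for root in graph:
        if all(root in dist for dist, _ in searches):
            cost = sum(dist[root] for dist, _ in searches)
            nodes = set()
            for _, parent in searches:
                nodes.update(path_to(parent, root))
            candidates.append((cost, root, sorted(nodes)))
    return heapq.nsmallest(k, candidates)


results = top_k_subgraphs(GRAPH, ["address", "Company", "San Francisco"], k=2)
print(results[0])
```

Summing hop counts double-counts edges shared by several paths, which is acceptable for a sketch but is one of the places where a real ranking function would differ.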
40. Semsolute's Building Blocks – Query Generation &
Processing
Figure: keyword query “address company san francisco”; levels shown: Entity, Relationships / Structure, Triple; generated question: “Address of companies located in San Francisco?”
Graph to query mapping
Translation rules that map top
ranked graphs to structured
queries (SQL, SPARQL)
Translation rules that map
structured queries to natural
language questions
Graph matching
Triple index: cover index
supporting different triple
patterns
Various join implementations
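One common way to support different triple patterns is to keep several permutations of the same triples, so that any combination of bound positions can be answered by a lookup. The sketch below keeps SPO, POS, and OSP maps and answers the example question with a nested-loop join. The class `TripleIndex`, the sample triples, and the choice of three permutations are assumptions for illustration, not a description of Semsolute's index.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory triple index with three access paths."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        if s is not None and p is not None:
            objs = self.spo.get(s, {}).get(p, set())
            return sorted((s, p, x) for x in objs if o is None or x == o)
        if p is not None and o is not None:
            return sorted((x, p, o) for x in self.pos.get(p, {}).get(o, set()))
        if s is not None and o is not None:
            return sorted((s, x, o) for x in self.osp.get(o, {}).get(s, set()))
        if s is not None:
            return sorted((s, p2, o2) for p2, objs in self.spo.get(s, {}).items()
                          for o2 in objs)
        if p is not None:
            return sorted((s2, p, o2) for o2, subs in self.pos.get(p, {}).items()
                          for s2 in subs)
        if o is not None:
            return sorted((s2, p2, o) for s2, preds in self.osp.get(o, {}).items()
                          for p2 in preds)
        return sorted(self.triples)


def addresses_of_companies_in(index, city):
    """Nested-loop join of (?c locatedIn city) with (?c address ?a)."""
    results = []
    for company, _, _ in index.match(p="locatedIn", o=city):
        for _, _, address in index.match(s=company, p="address"):
            results.append((company, address))
    return results


# Hypothetical sample data.
index = TripleIndex()
index.add("acme", "type", "Company")
index.add("acme", "locatedIn", "San_Francisco")
index.add("acme", "address", "1 Market Street")
index.add("globex", "locatedIn", "New_York")
index.add("globex", "address", "5 Main Street")

print(addresses_of_companies_in(index, "San_Francisco"))
```

The nested-loop join is the simplest of the possible join implementations; a merge or hash join over sorted index ranges would be the usual alternatives for larger data.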
41. Yahoo! Spark: Entity Recommendation in
Search
Different use cases in Web Search
Some users are short on time
Need direct answers
Query expansion, question-answering, information boxes, rich
results…
Other users want to explore
Long term interests such as sports, celebrities, movies and music
Long running tasks such as travel planning
Spark is a search assistance tool for exploration
Recommend related entities given the user's current
query
Based on explicit relations in a Knowledge Base
46. Spark challenges
Interpretation and disambiguation
Obama and Toyota are places in Japan, but maybe
the user is not looking for them
The popularity of “obama” is not a sign of the
popularity of a Japanese town
Ranking
“Release Me” by Engelbert Humperdinck should
rank higher than “Lesbian Seagull”, which only
appeared on the soundtrack of a Beavis and
Butt-Head episode
Editorial relevance vs. what people click
Large-scale data processing and ML
Knowledge Base built from Wikipedia, Yahoo!
data, Web extraction
Feature extraction from query logs, Flickr and Twitter
data
Figure (pipeline diagram); components shown: entity graph, data preprocessing, feature extraction, model learning, feature sources, editorial judgements, datapack, ranking model, ranking and disambiguation, entity data, features
47. Contact
Peter Mika
pmika@yahoo-inc.com
@pmika
Tran Duc Thanh
thanh.tran@semsolute.com
49. Resources
Detailed information
Peter Mika. Entity Search on the Web, Keynote at Web of
Linked Entities WS
Peter Mika, Thanh Tran. Semantic search tutorial
SemTech2012
Books
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
Survey papers
Thanh Tran, Peter Mika. Survey of Semantic Search
Approaches. Under submission, 2012.
Conferences and workshops
ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
Semantic Search workshop series
Exploiting Semantic Annotations in Information Retrieval
(ESAIR)
Entity-oriented Search (EOS) workshop
Web of Linked Entities (WoLE) workshop
Editor's notes
Mobile: Google interactive voice search (conversation), Siri (Peter)
Facebook’s Graph Search (Thanh)
Knowledge Graph (infoboxes)... entity search (“tom cruise actor”) to list/category queries (“tom cruise spouses”) to question-answering (“tom cruise height”) (Thanh)
Spark (Yahoo!): related entity recommendation (Peter)
Thanh’s search engine: auto-complete based on the schema/data, entity search to relational search using Yago data (Thanh)
Glimmer: RDF search engine (Peter)
Semantic search can be seen as a retrieval paradigm that is centered on the use of semantics. If a system incorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
Facebook invited, but continues to pursue OGP
We implemented the search paradigms and integrated them as separate search modules into a demonstrator system of the Information Workbench, which has been developed as a showcase for interaction with the Web of data. In particular, keyword search is implemented according to the design and technologies employed by standard Semantic Web search engines. Like Sindice and FalconS, we use an inverted index to store and retrieve RDF resources based on terms. Faceted search also uses the inverted index and is implemented based on the techniques discussed in [25]. Result completion is based on recent work discussed for the TASTIER system [8]. For computing join graphs, we use the top-k procedure elaborated in [9]. This technique is also used for computing top-k interpretations, i.e. to support query completion. We choose to display the top-6 queries and the top-25 results respectively.