From Queries to Answers in the Web

Roi Blanco, Sr. Research Scientist
Yahoo Labs

Answers arrive even before finishing the query!

Mobile shift
            Desktop  Tablet  Mobile
Avg. words  2.73     2.88    3.05
Avg. chars  17.44    18.02   18.93
Song, Ma, Wang, Wang. Exploring and exploiting user search behavior on mobile and tablet devices to improve search relevance. WWW 2013
Mobile query categories are less skewed (Image 42%, Adult 23.5%, Navigational 15%) than desktop ones (Navigational 37%, Image 19.9%, Commerce 7.7%)
There’s also a difference between top-level domains:
Mobile             Desktop
youtube.com        facebook.com
wikipedia.org      yahoo.com
answers.yahoo.com  wikipedia.org
ehow.com           youtube.com
imdb.com           walmart.com

Q&A in search engines?
Fagni, Perego, Silvestri, Orlando. Caching and prefetching query results by exploiting historical usage data. TOIS 2006

The web search perspective
 Web search today is really fast, without necessarily being intelligent
› A search engine without any understanding
 Trends
› Convergence of search and online media
• End of the 10 blue links
› Personal, social search
• Search over my world
• Search using my profile
› New interfaces
• Contextual, interactive
› Search that anticipates
› Solve tasks not queries

Search is really fast, without necessarily being intelligent
Could Watson explain why the answer is Toronto?

We came to bury the 10 blue links
Meaningless query

We came to bury the 10 blue links
Meaningful query

Facebook is a search engine

Personalized search
Yahoo news feed is a personalized search engine

Search that anticipates
 Google Now
 Star Trek computer
• Jason Douglas: Structured Data at Google, SemTechBiz SF 2013

Interactive Voice Search
 Apple’s Siri
› Question-Answering
• Variety of backend sources including Wolfram Alpha and various Yahoo! services
› Task completion
• E.g. schedule an event
 Google Now
 Facebook’s M

Facebook’s M mobile assistant

Web search by 2009
 Large classes of queries are solved to perfection
 Improvements in web search are harder and harder to come by
› Relevance models, hyperlink structure and interaction data
› Combination of features using machine learning
› Heavy investment in computational power
• real-time indexing, instant search, datacenters and edge services
 Search ranking features
› Text matching (including anchor text)
› Page authority (PageRank)
› User behavior signals
› Other features: context, history (still not very well understood)

Poorly solved information needs remain
 Language issues
› Multiple interpretations
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
 Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.

Semantic Search: a definition
 Semantic search is a retrieval paradigm where
› User intent and resources are represented using semantic models
• Not just symbolic representations
› Semantic models are exploited in the matching and ranking of resources
 Often a hybrid of document and data retrieval
› Documents with metadata
• Metadata may be embedded inside the document
• I’m looking for documents that mention countries in Africa.
› Data retrieval
• Structured data, but searchable text fields
• I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.
 Wide range of semantic search systems
› Employ different semantic models, possibly at different steps of the search process and in order to support different tasks

Semantic Search – a process view
Query Construction
• Keywords
• Forms
• NL
• Formal language
Query Processing
• IR-style matching & ranking
• DB-style precise matching
• KB-style matching & inferences
Result Presentation
• Query visualization
• Document and data presentation
• Summarization
Query Refinement
• Implicit feedback
• Explicit feedback
• Incentives
Document Representation
Knowledge Representation
Semantic Models
Resources
Documents

Yahoo’s Knowledge Graph
[Figure: example knowledge graph]
› Entities: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers
› Offer node: 10% off tickets
› Relations: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by
Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)

The role of Information Extraction in Semantic Search
 Making sense of
› Content
• Web, News, Twitter, email, etc.
› User behavior
• Not just queries, also interaction
› NER, NEC, NEL, Time expressions, topic, event and relation extraction
 Mapping to an abstract representation
› Linguistic models
• Taxonomies, thesauri, dictionaries of entity names
• Natural language structures extracted from text, e.g. using dependency parsing
• Inference along linguistic relations, e.g. broader/narrower terms, textual entailment
› Conceptual models
• Ontologies capture entities in the world and their relationships
• Words and phrases in text or records in a database are identified as representations of ontological elements
• Inference along ontological relations, e.g. logical entailment

Linguistic Representations of Text
Pablo Picasso was born in Málaga, Spain.
› IR: text as a bag of words (Pablo, Picasso, was, born, Málaga, Spain)
› Part-of-Speech tagging: Pablo/NNP Picasso/NNP was/VBD born/VBN in/IN Málaga/NNP Spain/NNP
› Dependency parsing: tree with “born” as root
› Word-sense disambiguation: born S#2: (v) give birth, deliver, bear, birth, have (cause to be born) "My wife had twins yesterday!"

Conceptual Representations of Text
Pablo Picasso was born in Málaga, Spain.
› IR: text as a bag of words (Pablo, Picasso, was, born, Málaga, Spain)
› NER: Pablo Picasso/PER, Málaga/LOC, Spain/LOC
› Mapping to ontology (NED): relations born-in and city-in
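A minimal sketch of the conceptual view, assuming a tiny hand-written gazetteer in place of a real NER/NED component. The `ex:` identifiers, the `find_mentions` helper, and the direction of the two triples are illustrative assumptions; only the PER/LOC types and the relation names born-in and city-in come from the slide.

```python
# Hypothetical gazetteer: surface form -> (entity id, NER type).
GAZETTEER = {
    "Pablo Picasso": ("ex:Pablo_Picasso", "PER"),
    "Málaga": ("ex:Malaga", "LOC"),
    "Spain": ("ex:Spain", "LOC"),
}


def find_mentions(text):
    """Return (surface form, entity id, type) for each gazetteer hit, in text order."""
    hits = []
    for surface, (entity_id, ner_type) in GAZETTEER.items():
        position = text.find(surface)
        if position >= 0:
            hits.append((position, surface, entity_id, ner_type))
    return [hit[1:] for hit in sorted(hits)]


sentence = "Pablo Picasso was born in Málaga, Spain."
mentions = find_mentions(sentence)

# Assumed reading of the slide's relations as subject-predicate-object triples.
triples = [
    ("ex:Pablo_Picasso", "born-in", "ex:Malaga"),
    ("ex:Malaga", "city-in", "ex:Spain"),
]
```

Once text is mapped to identifiers like these, matching and inference can run over the triples instead of over opaque strings.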

Document processing
 Goal
› Provide a higher level representation of information in some conceptual space
› Conceptual space is different for Semantic Web and NLP based search engines
 Limited document understanding in traditional search
› Page structure such as fields, templates
› Understanding of anchors, other HTML elements
› Limited NLP
 In Semantic Search, more advanced text processing and/or reliance on explicit metadata
› Information sources are not only text but also databases and web services

Example: microformats and RDFa
Microformat (hCard)
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
RDFa
<p typeof="contact:Info" about="http://example.org/staff/jo">
  <span property="contact:fn">Jo Smith</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org"> Example.org </a>.
  You can contact me <a rel="contact:email" href="mailto:jo@example.org"> via email </a>.
</p> ...
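To show how a crawler might read the hCard markup, here is a rough sketch using only Python's standard `html.parser`. The `HCardParser` class and `parse_hcard` helper are hypothetical names, and the parser only handles the four properties in the example (it is not a full microformats implementation and assumes every opened tag is closed).

```python
from html.parser import HTMLParser

HCARD = """<div class="vcard">
<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>"""


class HCardParser(HTMLParser):
    """Collects a few hCard properties from class attributes (illustrative only)."""

    PROPS = {"fn", "email", "tel", "title"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one list of pending property names per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        props = [c for c in classes if c in self.PROPS]
        href = attrs.get("href") or ""
        # The email value lives in the mailto: link, not in the element text.
        if "email" in props and href.startswith("mailto:"):
            self.card["email"] = href[len("mailto:"):]
            props.remove("email")
        self._open.append(props)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            for prop in self._open[-1]:
                self.card.setdefault(prop, text)


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.card


card = parse_hcard(HCARD)
```

The same idea applies to RDFa, except that the property names come from `property` and `rel` attributes rather than from `class`.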

schema.org
 Agreement on a shared set of schemas for common types of web content
› Bing, Google, and Yahoo! as initial founders (June, 2011)
• Yandex joins schema.org in Nov, 2011
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
 schema.org covers areas of interest to all search engines
› Business listings (local), creative works (video), recipes, reviews and more
› Microdata, RDFa, JSON-LD syntax
 Collaborative effort
› Growing number of 3rd party contributions
› schema.org discussions at public-vocabs@w3.org
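As an illustration of the JSON-LD syntax, the sketch below builds a schema.org business listing and wraps it in the `<script>` element used to embed JSON-LD in a page. The business details and the `to_jsonld_script` helper are made up for the example; the types and properties (`LocalBusiness`, `PostalAddress`, `telephone`, `addressLocality`, `addressCountry`) are schema.org terms.

```python
import json

# Hypothetical listing; only the vocabulary terms are real schema.org names.
listing = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Cafe",
    "telephone": "+1-555-0100",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Barcelona",
        "addressCountry": "ES",
    },
}


def to_jsonld_script(data):
    """Wrap a dict in the script element used to embed JSON-LD in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = to_jsonld_script(listing)
```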

Summary
 If we want to…
› Answer queries, not just show links
› Personalize search
› Take context into account
› Anticipate user needs
 … we need to understand users, content and the world at large!
 Search engines have changed considerably
› Queries have changed
• Users seek more information
• Vertical search (travel, local, images, videos, news, etc.)
• Will move towards a more task-oriented scenario (mobile context shift)
 Semantics help tail queries
› Head queries solved mostly by clickthrough data

Search over graph data
 Unstructured or hybrid search over RDF/graph data
› Supporting end-users
• Users who cannot express their need in SPARQL
› Dealing with large-scale data
• Giving up query expressivity for scale
› Dealing with heterogeneity
• Users who are unaware of the schema of the data
• No single schema to the data
– Example: 2.6m classes and 33k properties in Billion Triples 2009
 Entity search
› Queries where the user is looking for a single entity named or described in the query
› e.g. kaz vaporizer, hospice of cincinnati, mst3000
 Elbassuoni, Blanco. Keyword Search over RDF graphs. CIKM 2011
 Blanco, Mika, Vigna. Effective and Efficient entity search in RDF data. ISWC 2011
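A toy sketch of schema-agnostic keyword search over triples: every literal is tokenized into one inverted index regardless of its predicate, so the user needs neither SPARQL nor knowledge of the schema. The data, the `ex:` identifiers, and the term-overlap scoring are assumptions for illustration; this is not the ranking model of the cited papers.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, literal object).
TRIPLES = [
    ("ex:e1", "rdfs:label", "Kaz vaporizer"),
    ("ex:e1", "ex:category", "humidifier"),
    ("ex:e2", "rdfs:label", "Hospice of Cincinnati"),
    ("ex:e2", "ex:location", "Cincinnati Ohio"),
]


def build_index(triples):
    """Map each lowercased token to the subjects whose literals contain it."""
    index = defaultdict(set)
    for subject, _predicate, literal in triples:
        for token in literal.lower().split():
            index[token].add(subject)
    return index


def search(index, query):
    """Rank subjects by the number of query tokens they match."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for subject in index.get(token, ()):
            scores[subject] += 1
    return sorted(scores, key=lambda s: (-scores[s], s))


index = build_index(TRIPLES)
results = search(index, "hospice of cincinnati")
```

Ignoring the predicate is the trade-off named above: query expressivity is given up in exchange for scale and robustness to heterogeneous schemas.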

Application: entity displays in web search
 Entity-seeking queries make up 40-50% of the query volume
› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012
› Show a summary of the most likely information-needs
› Including related entities for navigation
› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013

Semantic understanding of queries
 Entities play an important role
› [Pound et al, WWW 2010], [Lin et al WWW 2012]
› ~70% of queries contain a named entity (entity mention queries)
• brad pitt height
› ~50% of queries have an entity focus (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities
• brad pitt movies
 Entity mention query = <entity> {+ <intent>}
› Intent is typically an additional word or phrase to
• Disambiguate, most often by type e.g. brad pitt actor
• Specify action or aspect e.g. brad pitt net worth, toy story trailer
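The pattern `<entity> {+ <intent>}` can be sketched as a longest-match lookup against an alias dictionary, with whatever remains treated as the intent. The `ALIASES` table and the `split_entity_intent` helper are hypothetical; a real system would use a much larger dictionary and a scoring model.

```python
# Hypothetical alias dictionary: alias -> entity type.
ALIASES = {"brad pitt": "Actor", "toy story": "Movie", "moneyball": "Movie"}


def split_entity_intent(query):
    """Return (entity alias, entity type, intent words) using the longest alias match."""
    tokens = query.lower().split()
    for length in range(len(tokens), 0, -1):  # longest spans first
        for start in range(len(tokens) - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in ALIASES:
                rest = tokens[:start] + tokens[start + length:]
                return span, ALIASES[span], " ".join(rest)
    return None, None, query.lower()


parsed = split_entity_intent("brad pitt net worth")
```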

Patterns in logs are hard to see
 Sample of sessions from June, 2011 containing the term “moneyball”
› What are users trying to do?
oakland as bradd pitt movie moneyball movies.yahoo.com oakland as wikipedia.org
captain america movies.yahoo.com moneyball trailer movies.yahoo.com
money moneyball movies.yahoo.com
moneyball movies.yahoo.com movies.yahoo.com en.wikipedia.org movies.yahoo.com peter brand
peter brand oakland nymag.com moneyball the movie www.imdb.com
moneyball trailer movies.yahoo.com moneyball trailer
brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar
www.imdb.com
relay for life calvert ocunty www.relayforlife.org trailer for moneyball movies.yahoo.com
moneyball.movie-trailer.com
moneyball en.wikipedia.org movies.yahoo.com map of africa www.africaguide.com
money ball movie www.imdb.com money ball movie trailer moneyball.movie-trailer.com
brad pitt new www.zimbio.com www.usaweekend.com www.ivillage.com www.ivillage.com brad pitt
news news.search.yahoo.com moneyball trailer moneyball trailer www.imdb.com www.imdb.com

Semantic annotations help to generalize…
Session: oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
Annotations: Sports team, Movie, Actor

… and understand user needs
Query: moneyball trailer
› moneyball: object of the query (Movie)
› trailer: what the user wants to do with it

Semantic analysis of query logs
 Multiple approaches
› Dictionary tagging
• Match entities in a fixed dictionary
• Scalable, high recall, not very precise
› Entity retrieval
• Retrieve from an index of the knowledge base
› Post-retrieval methods
• Annotate a document corpus with entities
• Retrieve documents and aggregate annotations
 Applications
› Usage mining
• L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic Analysis. WWW 2013
› Related-entity recommendations
• R. Blanco, B. Cambazoglu, P. Mika, N. Torzec: Entity Recommendations in Web Search. ISWC 2013

Usage mining
 Site owners would like to find usage patterns
› Reducing abandonment
› Competitive analysis
 Problem: patterns are lost in the data
› 64% of queries are unique within a year
› Even the most frequent patterns have low support

Solving the sparseness problem through annotations
 Frequent patterns of annotations are more general and less noisy
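A small sketch of why annotations help with sparseness: replacing entity mentions by their type lets distinct queries collapse into one pattern with higher support. The `ENTITY_TYPES` table and the sample queries are hypothetical, and the naive substring replacement stands in for a real entity linker.

```python
from collections import Counter

# Hypothetical dictionary: entity alias -> type.
ENTITY_TYPES = {"moneyball": "Movie", "captain america": "Movie", "brad pitt": "Actor"}


def to_pattern(query):
    """Replace known entity mentions with their type (naive substring matching)."""
    pattern = query.lower()
    for alias in sorted(ENTITY_TYPES, key=len, reverse=True):
        pattern = pattern.replace(alias, "<" + ENTITY_TYPES[alias] + ">")
    return pattern


queries = ["moneyball trailer", "captain america trailer", "brad pitt news", "moneyball"]
raw_counts = Counter(queries)
pattern_counts = Counter(to_pattern(q) for q in queries)
```

In this toy log every raw query is unique, while the annotated pattern `<Movie> trailer` already has a support of two.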

Two matching approaches
 Match by keywords
› Closer to text retrieval
• Match individual keywords
• Score and aggregate
• https://github.com/yahoo/Glimmer/
 Match by aliases
› Closer to entity linking
• Find potential mentions of entities (spots) in query
• Score candidates for each spot
 Example: brad pitt
› brad: (actor) (boxer) (city)
› pitt: (actor) (boxer) (lake)
› brad pitt: (actor) (boxer)
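The two approaches can be contrasted on a toy knowledge base. This is only a sketch under stated assumptions: the `ENTITIES` table, both function names, and the count-based keyword score are hypothetical, and this is not how Glimmer scores results.

```python
from collections import Counter

# Hypothetical toy knowledge base: entity -> known names/aliases.
ENTITIES = {
    "Brad Pitt (actor)": ["brad pitt"],
    "Brad Pitt (boxer)": ["brad pitt"],
    "Brad (city)": ["brad"],
    "Pitt (lake)": ["pitt lake", "pitt"],
}


def match_by_keywords(query):
    """Text-retrieval style: count the query keywords found in each entity's names."""
    scores = Counter()
    for entity, names in ENTITIES.items():
        vocabulary = {token for name in names for token in name.split()}
        for keyword in query.lower().split():
            if keyword in vocabulary:
                scores[entity] += 1
    return scores


def match_by_aliases(query):
    """Entity-linking style: find spots (spans equal to an alias) and their candidates."""
    tokens = query.lower().split()
    spots = {}
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            span = " ".join(tokens[start:end])
            candidates = sorted(e for e, names in ENTITIES.items() if span in names)
            if candidates:
                spots[span] = candidates
    return spots


keyword_scores = match_by_keywords("brad pitt")
spots = match_by_aliases("brad pitt")
```

Keyword matching gives every entity sharing a word some score, whereas alias matching keeps candidates attached to the exact span that mentions them, which is what the later scoring step needs.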

… back to query understanding
Query: moneyball trailer
› moneyball: object of the query (Movie)
› trailer: what the user wants to do with it

Fast Entity Linking in Queries
 Use aliases to “entity pages” (Wikipedia, IMDB, local, etc.) as source of information for entity-query aliases
 Chunk the query into the most likely segmentation
 Be fast by avoiding entity to entity decisions when scoring
 Add context externally using semantic relatedness of keywords and entities
 Compression:
› Minimal perfect hashes + Golomb coding
› All Wiki + 1 year of query logs of aliases + 1 year of query sessions w2v model < 3GB
Blanco, Ottaviano, Meij. Fast and space-efficient entity linking for queries. WSDM 2015
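For readers unfamiliar with Golomb coding, the sketch below shows Rice coding, its power-of-two special case: a number is split into a quotient written in unary and a remainder written in `k` binary digits. How the paper applies the code to its data structures is not described on the slide, so the functions and the bit-string representation here are purely illustrative.

```python
def rice_encode(n, k):
    """Encode a non-negative integer as unary quotient + k-bit remainder (bit string)."""
    quotient, remainder = n >> k, n & ((1 << k) - 1)
    suffix = format(remainder, "0{}b".format(k)) if k else ""
    return "1" * quotient + "0" + suffix


def rice_decode(bits, k):
    """Invert rice_encode for a single code word."""
    quotient = bits.index("0")  # number of leading 1s
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder


code = rice_encode(9, 2)
```

Small values get short codes, which suits skewed data such as gaps between sorted identifiers or quantized counts.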

Problem definition
 Given
› Query q consisting of an ordered list of tokens t_i
› Segment s, taken from one segmentation out of all possible segmentations S_q of the query
› Entity e, taken from a set of candidate entities drawn from the complete set E
 Find
› For all possible segmentations and candidate entities
› Select best entity for segment independently of other segments

Intuitions
Assume: also given annotated collections c_i with segments of text linked to entities from E.
 Keyphraseness
› How likely is a segment to be an entity mention?
› e.g. how common is “in” (unlinked) vs. “in” (linked) in the text
 Commonness
› How likely is it that a linked segment refers to a particular entity?
› e.g. how often does “brad pitt” refer to Brad Pitt (actor) vs. Brad Pitt (boxer)
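Both intuitions reduce to simple ratios over an annotated collection. The sketch below assumes a made-up list of (segment, linked entity or `None`) occurrences; multiplying the two estimates in `score` is one simple way to combine them and is an assumption, not the paper's exact ranking function.

```python
from collections import Counter

# Hypothetical annotated collection: (segment, linked entity or None if unlinked).
OCCURRENCES = [
    ("brad pitt", "Brad Pitt (actor)"),
    ("brad pitt", "Brad Pitt (actor)"),
    ("brad pitt", "Brad Pitt (actor)"),
    ("brad pitt", "Brad Pitt (boxer)"),
    ("in", None),
    ("in", None),
    ("in", None),
    ("in", "Indiana"),
]

seen = Counter(segment for segment, _ in OCCURRENCES)
linked = Counter(segment for segment, entity in OCCURRENCES if entity is not None)
linked_to = Counter(pair for pair in OCCURRENCES if pair[1] is not None)


def keyphraseness(segment):
    """Fraction of the segment's occurrences that are linked to some entity."""
    return linked[segment] / seen[segment] if seen[segment] else 0.0


def commonness(entity, segment):
    """Fraction of the segment's linked occurrences that point to this entity."""
    return linked_to[(segment, entity)] / linked[segment] if linked[segment] else 0.0


def score(entity, segment):
    return keyphraseness(segment) * commonness(entity, segment)
```

With this toy data, “in” is rarely a mention (keyphraseness 0.25), while “brad pitt” always is and usually refers to the actor (commonness 0.75).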

Ranking function
[Formula on the slide, annotated with its three factors]
› Probability of the segment generated by a given collection
› Commonness
› Keyphraseness

Context-aware extension
› Estimated by word2vec representation
› Assume segment and query are independent of each other given the entity
› Assume segment and query are independent of each other
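A rough sketch of adding context from embeddings: candidates for a segment are re-ranked by how similar their vectors are to the remaining query words. The three-dimensional vectors, the averaged cosine score, and the function names are assumptions for illustration; the probabilistic formulation on the slide is not reproduced here.

```python
import math

# Hypothetical low-dimensional embeddings for words and entities.
VECTORS = {
    "boxing": [0.9, 0.1, 0.0],
    "movie": [0.0, 0.9, 0.2],
    "Brad Pitt (actor)": [0.1, 0.95, 0.1],
    "Brad Pitt (boxer)": [0.95, 0.05, 0.1],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_score(entity, context_words):
    """Average cosine similarity between the entity and the known context words."""
    sims = [cosine(VECTORS[entity], VECTORS[w]) for w in context_words if w in VECTORS]
    return sum(sims) / len(sims) if sims else 0.0


def rerank(candidates, context_words):
    return sorted(candidates, key=lambda e: context_score(e, context_words), reverse=True)


candidates = ["Brad Pitt (actor)", "Brad Pitt (boxer)"]
ranked = rerank(candidates, ["boxing"])
```

Since the context score depends only on the entity and the other query words, it can be computed separately from the alias-based score and combined afterwards.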

Results: effectiveness
 Significant improvement over external baselines and internal system
› Measured on public Webscope dataset Yahoo Search Query Log to Entities
› Systems compared
• Search over Bing, top Wikipedia result
• State-of-the-art in literature
• A trivial search engine over Wikipedia
• Our method: Fast Entity Linker (FEL)
• FEL + context

Results: efficiency
 Two orders of magnitude faster than state-of-the-art
› Simplifying assumptions at scoring time
› Adding context independently
› Dynamic pruning
 Small memory footprint
› Compression techniques, e.g. 10x reduction in word2vec storage

Mobile search challenges and opportunities
 Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
 Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com

Task completion
 We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› We need to understand what the available actions are
 Modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email etc.
Schema.org v1.2, including Actions, published April 16, 2014
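A sketch of how a page could advertise an available action using schema.org's `potentialAction` property, and how a consumer might list the actions it finds. The movie record, the target URL, and the `available_actions` helper are hypothetical; `Movie`, `WatchAction`, `potentialAction`, and `target` are schema.org terms.

```python
# Hypothetical JSON-LD-style record describing a thing and one action on it.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Moneyball",
    "potentialAction": {
        "@type": "WatchAction",
        "target": "https://example.org/watch/moneyball",
    },
}


def available_actions(thing):
    """Return (action type, target) pairs declared on a thing."""
    actions = thing.get("potentialAction", [])
    if isinstance(actions, dict):  # a single action may appear without a list
        actions = [actions]
    return [(action.get("@type"), action.get("target")) for action in actions]


actions = available_actions(movie)
```

With such markup, a query like “moneyball trailer” could be mapped to an action on the Movie entity rather than to a list of links.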

The end
 Many thanks to the Semantic Search team in London
› Peter Mika
› Edgar Meij
› Hugues Bouchard
 Joint work with many collaborators: Sebastiano Vigna, Laura Hollink, Giuseppe Ottaviano, Nicolas Torzec, among others.
 roi@yahoo-inc.com
