How to build the next 1000 search engines?!

Arjen P. de Vries
arjen@acm.org
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.

Search is everywhere
 Yet it only works well on the web…

Complications
 Heterogeneous data sources
 WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
 Varying result types
 “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
 Multiple dimensions of relevance
 Topicality, recency, reading level, …

Complications
 Many search tasks require a mix within
these dimensions:
 News and patents
 Companies and their CEOs
 Recent and on topic
 Many search tasks also require a mix
across these dimensions:
 Patents assigned to our top 3 competitors in
market segments mentioned in the recent
press releases issued by our top 10 clients

 System's internal information representation
 Linguistic annotations
 Named entities, sentiment, dependencies, …
 Knowledge resources
 Wikipedia, Freebase, ICD9, IPTC, …
 Links to related documents
 Citations, urls
 Anchors that describe the URI
 Anchor text
 Queries that lead to clicks on the URI
 Session, user, dwell-time, …
 Tweets that mention the URI
 Time, location, user, …
 Other social media that describe the URI
 User, rating
 Tag, organisation of 'folksonomy'
+ UNCERTAINTY ALL OVER!

What goes in the black box?
[Diagram: the user and a document collection (anchors, entity types, sentiment, tweets, cited documents, …) feed a black box that returns a ranked list of answers. Candidates for the box: BM25, BM25F, LM, RM, VSM, DFR, QIR?, learning to rank?]

Context

ECIR / CIKM / SIGIR / ICTIR / WSDM papers!

Rarely & scarcely addressed…

Student: How do I build it?
Professor: Who will build it for
me?

Last session of the conference…

Parameterised Search System

Cornacchia, De Vries, ECIR 2007
A Parameterised Search System

Parameterised Search System

Can't we 'remove' this IR engineer (or scientist!) from the loop, just as DBMS software removes the data engineer from the loop?

And three (four?) children, a startup and 5 years later, a PhD defense!

Search by Strategy
 Visually construct search strategies by
connecting building blocks

 Each block describes either data or actions
upon that data
 Connection points (“pins”) are typed:
doc / sec / term / ne (named entity) / tuple
 Actions are expressed as scripts (more on this later)
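A minimal sketch of how typed pins could restrict which blocks may be connected. Only the pin types (doc / sec / term / ne / tuple) come from the slide; the `Block` class, the `connect` function and the example block names are hypothetical and are not Spinque's API.

```python
from dataclasses import dataclass, field

# Pin types listed on the slide.
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}


@dataclass
class Block:
    """A building block: either a data source or an action on data (hypothetical)."""
    name: str
    inputs: dict          # input pin name -> pin type
    output: str           # type of the single output pin
    sources: dict = field(default_factory=dict)  # input pin name -> upstream Block

    def __post_init__(self):
        for pin_type in list(self.inputs.values()) + [self.output]:
            if pin_type not in PIN_TYPES:
                raise ValueError(f"unknown pin type: {pin_type}")


def connect(upstream: Block, downstream: Block, pin: str) -> None:
    """Wire upstream's output into one input pin, rejecting type mismatches."""
    expected = downstream.inputs[pin]
    if upstream.output != expected:
        raise TypeError(
            f"{downstream.name}.{pin} expects '{expected}', "
            f"but {upstream.name} produces '{upstream.output}'"
        )
    downstream.sources[pin] = upstream


patents = Block("patents", inputs={}, output="doc")           # data block
query_terms = Block("query_terms", inputs={}, output="term")  # data block
rank = Block("rank_by_text", inputs={"docs": "doc", "query": "term"}, output="doc")

connect(patents, rank, "docs")
connect(query_terms, rank, "query")
```

Connecting `query_terms` to the `docs` pin would raise a `TypeError`, which is the kind of check a visual editor could run while the user drags a connection.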

Generate Search Engine!

Or, really, generate a REST API from the strategy specification!

Demo
(Showed demo of children's search engine)

How Strategies Help
 Strategies improve communication between
search intermediary and user
 Encapsulate domain expert knowledge
 Abstract representation of search expert knowledge
 Analyze information seeking process at any stage
 Strategies facilitate knowledge management
 Store / share / publish / refine
 Strategies mix exact (DB) and ranked (IR)
searches
 Avoid the need for “human (probabilistic) joins”

Search Intermediaries
 Travel agency
 Real estate agents
 Recruiters
 Librarians
 Archivists
 Digital forensics detectives
 Patent information specialists
[Arrow label alongside the list: Task complexity]

Exploratory Search
 Search & (Faceted) Browsing
 Help discover schema, ontology, etc.
 Help discover the relevant sources
 Within-collection (by year/location, by type, …)
 Across multiple collections (by source)

Probabilistic faceted browsing
Both variants show the same facets:
• Price: 100K - 200K, 200K - 300K, 300K - 400K
• Rooms: 3, 4, 5
• Size: 100 - 150 m2, 150 - 200 m2, 200 - 250 m2

Traditional (boolean filters)
• Good when user knows exactly which filters to apply
• Will see perfect-match results
• Won't see “interesting” results

Probabilistic
• Good for exploratory search
• Will see perfect-match results
• Will also see “interesting” results
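The contrast between the two facet styles can be sketched as follows. The linear decay in `soft_match`, the `tolerance` parameter and the sample houses are assumptions made for illustration; the slides do not say how the probabilistic scores are computed.

```python
def boolean_filter(items, facet, low, high):
    """Traditional facet: keep only items whose value falls inside the range."""
    return [item for item in items if low <= item[facet] <= high]


def soft_match(value, low, high, tolerance):
    """Score 1.0 inside [low, high], decaying linearly to 0.0 at `tolerance` outside."""
    if low <= value <= high:
        return 1.0
    distance = (low - value) if value < low else (value - high)
    return max(0.0, 1.0 - distance / tolerance)


def probabilistic_filter(items, facet, low, high, tolerance):
    """Probabilistic facet: rank every item that has a non-zero score."""
    scored = [(soft_match(item[facet], low, high, tolerance), item) for item in items]
    scored = [(score, item) for score, item in scored if score > 0.0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


houses = [
    {"id": "h1", "price": 250_000},
    {"id": "h2", "price": 310_000},   # just above the selected range
    {"id": "h3", "price": 900_000},
]

exact = boolean_filter(houses, "price", 200_000, 300_000)
ranked = probabilistic_filter(houses, "price", 200_000, 300_000, tolerance=50_000)
```

Here `exact` contains only `h1`, while `ranked` also keeps `h2` with a lower score: a near miss that a boolean filter would hide.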

Dynamic facets

Both variants show the same facets:
• Price: 100K - 200K, 200K - 300K, 300K - 400K
• Rooms: 3, 4, 5
• Size: 100 - 150 m2, 150 - 200 m2, 200 - 250 m2

Pre-indexed
• Pre-defined ad-hoc indices intersected with result set
• Challenge: many indices to maintain
• Challenge: heavy concurrent queries to DB

Dynamic
• Facets decided from result set
• Challenge: dynamically adapt granularity
• Different price ranges for villa/garage!
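One way to adapt facet granularity to the result set is to cut the observed values into equal-count ranges. This is only a sketch of that idea; the function name, the bucket count and the sample prices are hypothetical, and the slides do not state which binning method is used.

```python
def dynamic_ranges(values, buckets=3):
    """Split the values of the current result set into equal-count ranges.

    Returns a list of (low, high, count) tuples. The ranges follow the data,
    so garages and villas end up with different price ranges.
    """
    ordered = sorted(values)
    if not ordered:
        return []
    buckets = min(buckets, len(ordered))
    ranges = []
    for i in range(buckets):
        start = i * len(ordered) // buckets
        end = (i + 1) * len(ordered) // buckets
        chunk = ordered[start:end]
        ranges.append((chunk[0], chunk[-1], len(chunk)))
    return ranges


garage_prices = [8_000, 9_500, 12_000, 15_000, 18_000, 25_000]
villa_prices = [450_000, 600_000, 750_000, 900_000, 1_200_000, 2_000_000]

garage_facet = dynamic_ranges(garage_prices)
villa_facet = dynamic_ranges(villa_prices)
```

Because the ranges are derived per query, no per-facet index has to be maintained, at the cost of computing them over each result set.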

Demo
(Showed Spinque's real-estate search demo)

Limitations Search & Browse
 Faceted exploration does not include joins
 Cannot construct new data sources from
existing ones!
 Only the pre-defined paths through the
information space can actually be traversed

Who needs a Join?
 You!!!
… whenever 'relevance cues' are typed:
 People (e.g., inventors)
 Companies (e.g., assignees)
 Categories (e.g., IPTC)
 Time (e.g., expiry date)
 Location (e.g., country)
… or whenever multiple sources are to be
combined
 E.g., patents & news, patents & Wikipedia, …
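A join over a typed relevance cue can be sketched as below: ranked patents are joined with ranked companies through the assignee. The data and function name are hypothetical, and multiplying the probabilities assumes the two rankings are independent.

```python
def probabilistic_join(left, right, left_key, right_key):
    """Join two lists of (tuple, prob) rows on one attribute position.

    Probabilities are multiplied, which assumes the two events are independent.
    """
    index = {}
    for r_row, r_prob in right:
        index.setdefault(r_row[right_key], []).append((r_row, r_prob))
    joined = []
    for l_row, l_prob in left:
        for r_row, r_prob in index.get(l_row[left_key], []):
            joined.append((l_row + r_row, l_prob * r_prob))
    return sorted(joined, key=lambda row: row[1], reverse=True)


# Hypothetical data: (patent, assignee) pairs ranked for a topic ...
patents_on_x = [(("p1", "acme"), 0.9), (("p2", "globex"), 0.6), (("p3", "acme"), 0.4)]
# ... and companies ranked by how likely they are to be a competitor.
competitors = [(("acme",), 0.5), (("globex",), 0.9)]

result = probabilistic_join(patents_on_x, competitors, left_key=1, right_key=0)
```

The less topical patent `p2` ends up first because its assignee is the more likely competitor, an ordering that neither ranking would produce alone.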

Patents on X by Y(y)


Interactive Information Access

 Feedback:
 Interaction improves information
representation
 Faceted Browsing:
 Interaction can let user take over where
machine would fail
 Search by Strategy:
 Interaction can let user take over where
system designer would fail

Conclusion
 “No idealized one-shot search engine”
 Empower the user!

From Strategies to DB Queries
[Diagram: three layers, top to bottom]

• Strategy: data flow (Spinque: strategy)
  BB1(in1, in2, in3, u1, u2) → out
  BB2(in1) → out

• Query: strategy made operational (Spinque: PRA)
  CREATE VIEW a AS SELECT ..
  CREATE VIEW b AS SELECT ..
  CREATE VIEW c AS SELECT ..

• Database: relational DB (Spinque: RDBMS (MonetDB))
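The idea of turning a data flow of building blocks into a sequence of view definitions can be sketched with Python's built-in `sqlite3` module standing in for the RDBMS. The block names, the SELECT bodies and the `compile_strategy` helper are hypothetical; the actual system goes through PRA and targets MonetDB.

```python
import sqlite3

# Hypothetical strategy: each building block maps to one view definition.
# Block order follows the data flow, so later views may read earlier ones.
strategy = [
    ("docs_with_term", "SELECT d, f AS score FROM term_doc WHERE t = 't0'"),
    ("frequent_docs", "SELECT d, score FROM docs_with_term WHERE score > 1"),
]


def compile_strategy(blocks):
    """Turn a list of (block name, SELECT body) pairs into CREATE VIEW statements."""
    return [f"CREATE VIEW {name} AS {body};" for name, body in blocks]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_doc (t TEXT, d TEXT, f INTEGER)")
conn.executemany(
    "INSERT INTO term_doc VALUES (?, ?, ?)",
    [("t0", "d3", 3), ("t0", "d5", 10), ("t1", "d2", 4)],
)
for statement in compile_strategy(strategy):
    conn.execute(statement)

# Running the strategy means querying the last view in the chain.
rows = conn.execute(
    "SELECT d, score FROM frequent_docs ORDER BY score DESC"
).fetchall()
```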

Probabilistic Relational Algebra
[Diagram: Strategy → PRA → SQL → Relational DB]

• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
  x = Project DISTINCT [$1,$3](y);

• SQL: explicit probabilities
  CREATE VIEW x AS
  SELECT a1, a3,
         1-prod(1-prob) AS prob
  FROM y
  GROUP BY a1, a3;
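The aggregate `1-prod(1-prob)` merges duplicates that collapse onto the same projected tuple as if they were independent events. A small sketch of that projection in plain Python; the relation `y` and its values are made up for illustration.

```python
from collections import defaultdict


def project_distinct(rows, columns):
    """PRA-style Project DISTINCT over rows of (tuple, prob).

    Duplicates that collapse onto the same projected tuple are merged with
    1 - prod(1 - prob), i.e. treated as independent events.
    """
    not_any = defaultdict(lambda: 1.0)   # running product of (1 - prob)
    for values, prob in rows:
        key = tuple(values[i] for i in columns)
        not_any[key] *= (1.0 - prob)
    return {key: 1.0 - miss for key, miss in not_any.items()}


# Hypothetical relation y with attributes ($1, $2, $3).
y = [
    (("d1", "title", "nl"), 0.5),
    (("d1", "body", "nl"), 0.5),
    (("d2", "body", "en"), 0.3),
]

# x = Project DISTINCT [$1,$3](y); positions are 0-based here.
x = project_distinct(y, columns=(0, 2))
```

The two `d1` rows combine to 1 - (0.5 × 0.5) = 0.75, while the single `d2` row keeps its probability of 0.3.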

What's in the DB?
 Text-based ranking
 term-doc-freq relations (inverted file)
 One per language, stemming, section
 Domain-independent, click and index

  T    D    f
  t0   d3   3
  t0   d5   10
  t1   d2   4

 Entity ranking
 Probabilistic triples
 Domain-aware
 Needs supervised indexing

  subj     pred/attr   obj/value   p
  Arjen    speaks_to   you         0.95
  you      follow      Arjen       0.5
  speech   minutes     45          0.8

 Content-based (MM) retrieval
 Feature vectors, click and index

  Img_id   f1     …   fN
  0        0.12   …   0.84
  1        0.54   …   0.31
  2        0.23   …   0.1

VIEWS and TABLES
CREATE VIEW → TABLE a AS SELECT … FROM term-doc … ;          [term-doc: stored relation]
CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;            [u1: user parameter]
CREATE VIEW → TABLE c AS SELECT … FROM a WHERE a.x = 42 ;    [no user parameter; a: pre-computable relation]
CREATE VIEW d AS SELECT … FROM b … ;

 BB content: sequence of VIEW definitions
 A VIEW is pre-computable when
 All the relations addressed are pre-computable / stored
 No dependency on user parameters
 Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
 Query-independent computations are performed only once, then read from TABLEs at each query
 Recognition of these patterns is fully automatic
 Extends MonetDB's per-session caching to across-sessions caching
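The pre-computability rule can be sketched as a single pass over the view definitions. The dictionary encoding of the four views, the `term_doc` spelling and the function name are assumptions for illustration; how the real system analyses its views is not shown on the slides.

```python
def precomputable_views(views, stored):
    """Return the names of views that can be materialised ahead of query time.

    `views` maps a view name to (relations it reads, user parameters it uses);
    `stored` is the set of base tables. Views are assumed to be listed in
    dependency order, as in a building block's sequence of definitions.
    """
    ready = set(stored)
    result = []
    for name, (reads, user_params) in views.items():
        if not user_params and all(rel in ready for rel in reads):
            ready.add(name)
            result.append(name)
    return result


# The four definitions from the slide, reduced to their dependencies.
views = {
    "a": (["term_doc"], []),
    "b": (["a"], ["u1"]),     # depends on a user parameter
    "c": (["a"], []),         # constant 42, no user parameter
    "d": (["b"], []),         # reads b, which is not pre-computable
}

tables = precomputable_views(views, stored={"term_doc"})
```

The result names `a` and `c`, the two definitions that the slide turns from VIEWs into TABLEs.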

Current Situation
 index ;       [Schema definition]
 repeat {
 specify ;
 retrieve      [Search & explore]
 } until 

Traditional Indexing

 Preprocessing determines to a large extent how search requests will be processed
 Especially regarding tokenization, stemming, etc.
 Fast and scalable, but inflexible
 E.g., entity search hard-coded on top of engine,
advertisements matched on different data, etc.

Search by Strategy

 Flexible: generate arbitrary engine on the fly
 Not as fast as highly optimized, well-engineered systems based on inverted files

Desirable Situation
 repeat {      [Mixed Initiative: Schema definition + Search & explore]
 index ;
 specify ;
 retrieve
 } until 

Non-Indexed Search

 Grep
 Very flexible
 Use it all the time on my mh mail folders when gmail
fails me!
 Not scalable, little or no structure

Minimal Indexing

 How to reduce pre-processing necessary to
create a search engine over a new collection?
 Can we do without a keyword index?
 Can we avoid hardwired decisions for tokenization, language detection, stemming, …?

Suffix Array
 Pros:
 provides many core search functions: term
statistics, keyword search, phrase search.
 no upfront tokenization needed (access at
character level)
 no upfront language detection needed
 Cons:
 difficult to build for large corpora
 expensive w.r.t. disk space
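A toy suffix array shows why no tokenization is needed up front: keyword search, phrase search and occurrence counts all reduce to a binary search over sorted suffixes. This is a naive sketch for small texts only; the construction below sorts whole suffixes, and the lookup materialises the truncated prefixes, neither of which scales to large corpora.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive construction: sort all suffix start positions (small texts only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Return sorted start positions of `pattern` using binary search."""
    # Sorted suffixes stay sorted when truncated to the pattern length.
    # A real implementation would compare lazily instead of building this list.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])


text = "search engines search the web"
sa = build_suffix_array(text)

hits = find(text, sa, "search")           # keyword search at character level
phrase = find(text, sa, "search the")     # phrase search, same mechanism
count = len(find(text, sa, "e"))          # occurrence statistics, no tokenizer
```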

Demo
(Showed patent search demo)

PRA
s__STRATEGY___filter_DOC_with_NE_nes =
  Project [$2,$3](
    Join [$1 = $2](
      s__STRATEGY___clef_ip_patents_DATA_result,
      Project [$1,$3](
        Select [$2 = "ipcr-classification"](
          s__STRATEGY___clef_ip_patents_DATA_ne_doc
        )
      )
    )
  );

CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
SELECT
  tmp_1814091754.a2 AS a1,
  tmp_1814091754.a3 AS a2,
  tmp_1814091754.prob AS prob
FROM
  (
    SELECT
      s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
      tmp__1652836708.a1 AS a2,
      tmp__1652836708.a2 AS a3,
      s__STRATEGY___clef_ip_patents_DATA_result.prob
        * tmp__1652836708.prob AS prob
    FROM
      s__STRATEGY___clef_ip_patents_DATA_result,
      (
        SELECT
          tmp_1444787941.a1 AS a1,
          tmp_1444787941.a3 AS a2,
          tmp_1444787941.prob AS prob
        FROM
          (
            SELECT
              s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
              s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
              s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
              s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
            FROM
              s__STRATEGY___clef_ip_patents_DATA_ne_doc
            WHERE
              s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2
                = 'ipcr-classification'
          ) AS tmp_1444787941
      ) AS tmp__1652836708
    WHERE
      s__STRATEGY___clef_ip_patents_DATA_result.a1
        = tmp__1652836708.a2
  ) AS tmp_1814091754
ORDER BY a1
WITH DATA;

info@spinque.com
www.spinque.com
facebook.com/spinque
