How to build the next 1000 search engines?!
1. How to build the next 1000
search engines?!
Arjen P. de Vries
arjen@acm.org
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
5. Complications
Many search tasks require a mix within
these dimensions:
News and patents
Companies and their CEOs
Recent and on topic
Many search tasks also require a mix
across these dimensions:
Patents assigned to our top 3 competitors in
market segments mentioned in the recent
press releases issued by our top 10 clients
6. System's internal information representation
Linguistic annotations
Named entities, sentiment, dependencies, …
Knowledge resources
Wikipedia, Freebase, ICD9, IPTC, …
Links to related documents
Citations, urls
Anchors that describe the URI
Anchor text
Queries that lead to clicks on the URI
Session, user, dwell-time, …
Tweets that mention the URI
Time, location, user, …
Other social media that describe the URI
User, rating
Tag, organisation of 'folksonomy'
+ UNCERTAINTY ALL OVER!
7. What goes in the black box?
Inputs:
Document collection: anchors, entity types, sentiment, tweets, cited documents, …
User
Context
Black box: BM25, BM25F, LM, RM, VSM, DFR, QIR? Learning to rank?
Output: ranked list of answers
ECIR / CIKM / SIGIR / ICTIR / WSDM papers!
8. Rarely & scarcely addressed…
Student: How do I build it?
Professor: Who will build it for
me?
Last session of the conference…
11. Parameterised Search System
Can we not 'remove' this IR engineer (or scientist!) from the loop,
like DBMS software removes the data engineer from the loop?
Cornacchia, De Vries, ECIR 2007
A Parametrised Search System
And three (four?) children, a startup and 5 years later, a PhD defense!
12. Search by Strategy
Visually construct search strategies by
connecting building blocks
14. Search by Strategy
Visually construct search strategies by
connecting building blocks
Each block describes either data or actions
upon that data
Connection points (“pins”) are typed:
doc / sec / term / ne (named entity) / tuple
Actions are expressed as scripts (later more)
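One way to picture typed pins is the minimal sketch below: each block declares the types of its input and output pins, and a connection is accepted only when the types agree. The names `Block`, `connect` and the three example blocks are hypothetical illustrations, not the actual Spinque implementation.

```python
from dataclasses import dataclass, field

# Pin types as listed on the slide.
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}


@dataclass
class Block:
    """Hypothetical building block: either data (no inputs) or an action."""
    name: str
    inputs: dict   # input pin name -> pin type
    output: str    # type of the single output pin
    sources: dict = field(default_factory=dict)  # input pin name -> upstream Block

    def __post_init__(self):
        for pin_type in list(self.inputs.values()) + [self.output]:
            if pin_type not in PIN_TYPES:
                raise ValueError(f"unknown pin type: {pin_type}")


def connect(upstream: Block, downstream: Block, pin: str) -> None:
    """Wire upstream's output to one input pin, rejecting type mismatches."""
    expected = downstream.inputs[pin]
    if upstream.output != expected:
        raise TypeError(
            f"{upstream.name} produces '{upstream.output}', "
            f"but pin '{pin}' of {downstream.name} expects '{expected}'"
        )
    downstream.sources[pin] = upstream


# Two data blocks and one action block (all made up for illustration).
patents = Block("patents", inputs={}, output="doc")
query = Block("query terms", inputs={}, output="term")
rank = Block("rank by text", inputs={"docs": "doc", "query": "term"}, output="doc")

connect(patents, rank, "docs")
connect(query, rank, "query")
```

Connecting `query` to the `docs` pin would raise a `TypeError`, which is the kind of mistake a visual editor can prevent before any query is run.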
20. How Strategies Help
Strategies improve communication between
search intermediary and user
Encapsulate domain expert knowledge
Abstract representation of search expert knowledge
Analyze information seeking process at any stage
Strategies facilitate knowledge management
Store / share / publish / refine
Strategies mix exact (DB) and ranked (IR)
searches
Avoid the need for “human (probabilistic) joins”
22. Search Intermediaries
(Axis label: task complexity)
Travel agency
Real estate agents
Recruiters
Librarians
Archivists
Digital forensics detectives
Patent information specialists
23. Exploratory Search
Search & (Faceted) Browsing
Help discover schema, ontology, etc.
Help discover the relevant sources
Within-collection (by year/location, by type, …)
Across multiple collections (by source)
24. Probabilistic faceted browsing
Both variants offer the same facets:
Price
• 100K - 200K
• 200K - 300K
• 300K - 400K
Rooms
• 3
• 4
• 5
Size
• 100 - 150 m2
• 150 - 200 m2
• 200 - 250 m2
Traditional (boolean filters)
• Good when user knows exactly which filters to apply
• Will see perfect-match results
• Won’t see “interesting” results
Probabilistic
• Good for exploratory search
• Will see perfect-match results
• Will also see “interesting” results
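The contrast can be sketched in a few lines. The listings, the tolerance values and the scoring rule (a match score that decays linearly outside the selected range, multiplied across facets as if they were independent) are assumptions made for illustration only; the slides do not specify how the probabilistic variant scores results.

```python
# Hypothetical real-estate listings.
listings = [
    {"id": "A", "price": 180_000, "rooms": 4, "size": 120},
    {"id": "B", "price": 205_000, "rooms": 4, "size": 140},
    {"id": "C", "price": 390_000, "rooms": 3, "size": 110},
]

# Selected facet values: inclusive (low, high) ranges.
filters = {"price": (100_000, 200_000), "rooms": (4, 4), "size": (100, 150)}

# Assumed distance outside a range at which the match score reaches zero.
TOLERANCE = {"price": 50_000, "rooms": 2, "size": 50}


def boolean_filter(items, filters):
    """Traditional facets: keep only items that satisfy every filter."""
    return [
        item["id"]
        for item in items
        if all(lo <= item[key] <= hi for key, (lo, hi) in filters.items())
    ]


def soft_match(value, lo, hi, tolerance):
    """1.0 inside [lo, hi], decaying linearly to 0.0 at `tolerance` outside."""
    if lo <= value <= hi:
        return 1.0
    distance = lo - value if value < lo else value - hi
    return max(0.0, 1.0 - distance / tolerance)


def probabilistic_rank(items, filters):
    """Probabilistic facets: score every item, return (score, id) best first."""
    scored = []
    for item in items:
        score = 1.0
        for key, (lo, hi) in filters.items():
            score *= soft_match(item[key], lo, hi, TOLERANCE[key])
        if score > 0.0:
            scored.append((score, item["id"]))
    return sorted(scored, reverse=True)


exact = boolean_filter(listings, filters)
ranked = probabilistic_rank(listings, filters)
```

With this data the boolean filter returns only listing A, while the ranked list keeps A on top and still shows B, which is just above the price range: a perfect match first, followed by an "interesting" near miss.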
25. Dynamic facets
Both variants offer the same facets:
Price
• 100K - 200K
• 200K - 300K
• 300K - 400K
Rooms
• 3
• 4
• 5
Size
• 100 - 150 m2
• 150 - 200 m2
• 200 - 250 m2
Pre-indexed
• Pre-defined ad-hoc indices intersected with result set
• Challenge: many indices to maintain
• Challenge: heavy concurrent queries to DB
Dynamic
• Facets decided from result set
• Challenge: dynamically adapt granularity
• Different price ranges for villa/garage!
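A minimal sketch of the dynamic variant: the facet ranges are derived from the values in the current result set rather than from a fixed index. Equal-width bucketing and the example prices are assumptions; a real system might use quantiles or rounded boundaries instead.

```python
def dynamic_facets(values, n_buckets=3):
    """Split [min, max] of the result-set values into equal-width buckets.

    Returns a list of ((low, high), count) pairs.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return [((lo, hi), len(values))]
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for value in values:
        # The maximum value falls into the last bucket.
        index = min(int((value - lo) / width), n_buckets - 1)
        counts[index] += 1
    return [
        ((lo + i * width, lo + (i + 1) * width), counts[i])
        for i in range(n_buckets)
    ]


# Hypothetical result sets for two different queries.
villa_prices = [450_000, 600_000, 750_000, 900_000]
garage_prices = [10_000, 15_000, 25_000, 40_000]

villa_facets = dynamic_facets(villa_prices)
garage_facets = dynamic_facets(garage_prices)
```

The same code yields price ranges in the hundreds of thousands for villas and in the tens of thousands for garages, which is the granularity adaptation that fixed, pre-indexed facets cannot provide.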
27. Limitations Search & Browse
Faceted exploration does not include joins
Cannot construct new data sources from
existing ones!
Only the pre-defined paths through the
information space can actually be traversed
28. Who needs a Join?
You!!!
… whenever 'relevance cues' are typed:
People (e.g., inventors)
Companies (e.g., assignees)
Categories (e.g., IPTC)
Time (e.g., expiry date)
Location (e.g., country)
… or whenever multiple sources are to be
combined
E.g., patents & news, patents & Wikipedia, …
30. Interactive Information Access
Feedback:
Interaction improves information
representation
Faceted Browsing:
Interaction can let user take over where
machine would fail
Search by Strategy:
Interaction can let user take over where
system designer would fail
33. From Strategies to DB Queries
Strategy (Spinque: strategy)
• Data flow
    in1, in2, in3 → BB1(in1,in2,in3, u1,u2) → out
    in1 → BB2(in1) → out
Query (Spinque: PRA)
• Query: strategy made operational
    CREATE VIEW a AS SELECT ..
    CREATE VIEW b AS SELECT ..
    CREATE VIEW c AS SELECT ..
Database (Spinque: RDBMS (MonetDB))
• Relational DB
34. Probabilistic Relational Algebra
Strategy
• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
    x = Project DISTINCT [$1,$3](y);
• SQL: explicit probabilities
    CREATE VIEW x AS
    SELECT a1, a3, 1-prod(1-prob) AS prob
    FROM y
    GROUP BY a1, a3;
Relational DB
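The projection above can be tried out with Python's built-in `sqlite3` module, as a rough sketch. SQLite has no `prod` aggregate, so one is registered by hand here; the table contents are made up. The expression `1 - prod(1 - prob)` merges duplicate tuples as if they were independent events.

```python
import sqlite3


class Prod:
    """User-defined aggregate: product of all input values in a group."""

    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value


conn = sqlite3.connect(":memory:")
conn.create_aggregate("prod", 1, Prod)

# Relation y with an explicit probability column (illustrative data).
conn.execute("CREATE TABLE y (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL)")
conn.executemany(
    "INSERT INTO y VALUES (?, ?, ?, ?)",
    [
        ("d1", "p", "e1", 0.5),
        ("d1", "q", "e1", 0.5),  # collapses with the row above after projection
        ("d2", "p", "e2", 0.8),
    ],
)

# x = Project DISTINCT [$1,$3](y), written as SQL with explicit probabilities.
conn.execute(
    """
    CREATE VIEW x AS
    SELECT a1, a3, 1 - prod(1 - prob) AS prob
    FROM y
    GROUP BY a1, a3
    """
)

rows = conn.execute("SELECT a1, a3, prob FROM x ORDER BY a1").fetchall()
```

The two `(d1, e1)` rows with probability 0.5 each merge into a single tuple with probability 1 - 0.5 * 0.5 = 0.75, while the single `(d2, e2)` row keeps its probability of 0.8 (up to floating-point rounding).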
35. What's in the DB?
Text-based ranking
term-doc-freq relations (inverted file)
One per language, stemming, section
Domain-independent, click and index
    T     D     f
    t0    d3    3
    t0    d5    10
    t1    d2    4
Entity ranking
Probabilistic triples
Domain-aware
Needs supervised indexing
    subj      pred/attr    obj/value    p
    Arjen     speaks_to    you          0.95
    you       follow       Arjen        0.5
    speech    minutes      45           0.8
Content-based (MM) retrieval
Feature vectors, click and index
    Img_id    f1      …    fN
    0         0.12    …    0.84
    1         0.54    …    0.31
    2         0.23    …    0.1
36. VIEWS and TABLES
CREATE VIEW a AS SELECT … FROM term-doc … ;          (term-doc: stored relation)
CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;    (u1: user parameter)
CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ;    (no user parameter)
CREATE VIEW d AS SELECT … FROM b … ;
a and c are pre-computable relations: their CREATE VIEW becomes CREATE TABLE
BB content: sequence of VIEW definitions
A VIEW is pre-computable when
All the relations addressed are pre-computable / stored
No dependency on user parameters
Pre-computable VIEWs can become TABLEs (or MATERIALIZED
VIEWs)
Query-independent computations are performed only once, then
read from TABLEs at each query
Recognition of these patterns is fully automatic
Extends MonetDB's per-session caching to across-sessions caching
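The pre-computability rule can be sketched as a single pass over the view definitions. The representation used here (each view listed with the relations it reads and the user parameters it uses) and the function name `precomputable_views` are assumptions for illustration; how the actual system detects these patterns is not shown on the slides.

```python
def precomputable_views(views, stored):
    """Return the views that could be materialised as TABLEs.

    views:  dict mapping view name -> (relations read, user parameters used),
            listed in definition order (each view is defined before it is used).
    stored: set of stored base relations.
    """
    known = set(stored)  # relations known to be stored or pre-computable
    result = []
    for name, (reads, params) in views.items():
        # Pre-computable: no user parameters, and every input is itself
        # stored or pre-computable.
        if not params and reads <= known:
            known.add(name)
            result.append(name)
    return result


# The four views from the slide.
views = {
    "a": ({"term-doc"}, set()),
    "b": ({"a"}, {"u1"}),  # depends on user parameter u1
    "c": ({"a"}, set()),   # constant 42 is not a user parameter
    "d": ({"b"}, set()),   # reads b, so inherits the dependency on u1
}

tables = precomputable_views(views, stored={"term-doc"})
```

For the slide's example this yields `a` and `c`; `d` uses no parameter itself but reads `b`, so it has to be evaluated at query time.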
38. Current Situation
index ;            (schema definition)
repeat {           (search & explore)
    specify ;
    retrieve
} until
39. Traditional Indexing
Preprocessing determines to a large extent how
search requests will be processed
Especially regarding tokenization, stemming, etc.
Fast and scalable, but inflexible
E.g., entity search hard-coded on top of engine,
advertisements matched on different data, etc.
40. Search by Strategy
Flexible: generate arbitrary engine on the fly
Not as fast as highly optimized and very well
engineered inverted file based systems
42. Non-Indexed Search
Grep
Very flexible
Use it all the time on my mh mail folders when gmail
fails me!
Not scalable, little or no structure
43. Minimal Indexing
How to reduce pre-processing necessary to
create a search engine over a new collection?
Can we do without a keyword index?
Can we avoid hardwired decisions for tokenization,
language detection, stemming, …
44. Suffix Array
Pros:
provides many core search functions: term statistics, keyword search, phrase search
no upfront tokenization needed (access at character level)
no upfront language detection needed
Cons:
difficult to build for large corpora
expensive w.r.t. disk space
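A toy suffix array shows how the three functions fall out of one structure. This is a naive sketch: sorting whole suffixes is fine for a short string but is exactly the construction cost that becomes a problem for large corpora. The function names and the sample text are illustrative assumptions.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text, sa, pattern, upper):
    """Binary search over suffixes, comparing only their first len(pattern) chars.

    Returns the first index whose prefix is >= pattern (upper=False)
    or > pattern (upper=True).
    """
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find(text, sa, pattern):
    """Sorted start positions of every occurrence of pattern in text."""
    start = _boundary(text, sa, pattern, upper=False)
    end = _boundary(text, sa, pattern, upper=True)
    return sorted(sa[start:end])


text = "to be or not to be"
sa = build_suffix_array(text)

keyword_hits = find(text, sa, "be")      # keyword search
phrase_hits = find(text, sa, "to be")    # phrase search: same call
term_count = len(find(text, sa, "to"))   # term statistics
```

Keywords and phrases go through the same lookup because matching happens at character level, so no tokenizer or language detector is involved. The flip side is that a query such as `be` would also match inside a longer word.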
50.
CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
SELECT
tmp_1814091754.a2 AS a1,
tmp_1814091754.a3 AS a2,
tmp_1814091754.prob AS prob
FROM
(
SELECT
s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
tmp__1652836708.a1 AS a2,
tmp__1652836708.a2 AS a3,
s__STRATEGY___clef_ip_patents_DATA_result.prob
* tmp__1652836708.prob AS prob
FROM
s__STRATEGY___clef_ip_patents_DATA_result,
(
SELECT
tmp_1444787941.a1 AS a1,
tmp_1444787941.a3 AS a2,
tmp_1444787941.prob AS prob
FROM
(
SELECT
s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
FROM
s__STRATEGY___clef_ip_patents_DATA_ne_doc
WHERE
s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2
= 'ipcr-classification'
) AS tmp_1444787941
) AS tmp__1652836708
WHERE
s__STRATEGY___clef_ip_patents_DATA_result.a1
= tmp__1652836708.a2
) AS tmp_1814091754
ORDER BY a1
WITH DATA;