As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
Turning search upside down
2. Turning Search Upside Down
using Lucene for very fast stored queries
Charlie Hull & Alan Woodward
Managing Director & Principal Developer at Flax
www.flax.co.uk
@FlaxSearch
3. Who are Flax?
Search engine specialists with decades of experience
Developers, innovators and strategists
Based in Cambridge, UK
Technology agnostic – but open source exponents
UK Authorized Partner of LucidWorks
Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture, University of Cambridge...
We also run a Meetup group.
4-5. Who are we?
Charlie Hull
– Managing Director and co-founder at Flax
– I've been working in search since 1999
– When I'm not running Flax I'm sometimes a stiltwalker & juggler
Alan Woodward
– Principal developer at Flax
– I used to run a 500m document FAST ESP index for Proquest
– Lucene/Solr committer
– When I'm not hacking I sing in the Cambridge University Musical Society Chorus
6. Media monitoring
Standard search – few queries over many documents
Monitoring search – many queries over each document
Customers' interests are manually turned into queries
Humans probably still have the final say on relevance
Eventual result is a list of articles emailed - or even printed
Exact position of a match is important to the customer
7. Media monitoring - parameters
Tens of thousands of stored expressions or 'Keywords'
– can't rewrite these so must use same syntax!
Hundreds of thousands of articles to monitor every day
Source data can sometimes be scanned & OCR'd
False positives cost human operator time: false negatives cost customers!
Traditional approach is brute force using standard search engine software
8. Media monitoring – a Keyword
("PALM BEACH COUNTY"; W/48 (("TOURIS*"; OR "TOUR"; OR "TOURS"; OR "TRAVEL*"; OR
"HOLIDAY*"; OR "HOL"; OR "HOLS"; OR "HOTEL*"; OR "VISIT*"; OR "TRIP"; OR "TRIPS";
OR "DAYTRIP*"; OR "BEACH"; OR "!BEACHES"; OR "COAST"; OR "!COASTLINE*"; OR "ABTA";
OR "DAY TRIP*"; OR "SUITE"; OR "SUITES"; OR "A%CCOMMODATION"; OR "BED AND !
BREAKFAST"; OR "B&B"; OR "BED & !BREAKFAST"; OR "!BREAKFAST AND BOARD"; OR
"FULL BOARD"; OR "HALF BOARD"; OR "ALL !INCLUSIVE"; OR "THINGS TO DO"; OR "HOSP?
TALITY"; OR "SHORT BREAK*"; OR "!WEEKEND BREAK"; OR "CITY BREAK*"; OR "!SIGHTSEE*";
OR "!VACATION*"; OR "E%XCURSION*"; OR "FLY* WITH"; OR "FLY* THERE"; OR "FLY*
DRIVE"; OR "!GETAWAY"; OR "!BACKPACK*"; OR "BACK PACK*"; OR "!ECOTOURIS*"; OR "!
WATERSPORT*"; OR "WATER SPORT*"; OR "FESTIVAL*"; OR "RESORT* & SPA"; OR
"RESORT* AND SPA"; OR "WHALE WATCH*"; OR "GET THERE"; OR "WHERE TO STAY"; OR
"GETTING THERE"; OR "STAYCATION*"; OR "VILLA"; OR "VILLAS"; OR "AIRPORT*"; OR
"SPA"; OR "SPAS"; OR "OUTDOOR EVENT*"; OR "OUTDOOR ADVENTURE*"; OR "OUTDOOR
PURSUIT*"; OR "OUTDOOR ACTIVIT*"; OR "CLIMBING WALL*"; OR "CLIMBING CENTRE*"; OR
"ROCK CLIMB*"; OR "WHITE WATER RAFTING";) OR ("PLACES"; W/4 ("TO STAY"; OR "TO
SEE"; OR "TO EAT";)) OR (("FLIGHT*"; OR "FLY"; OR "FLYING"; OR "CRUISE*";) W/4
("OFFER"; OR "AVAILABLE"; OR "DEPART*"; OR "FROM"; OR "TRANSFER*";))))
9. Media monitoring – another Keyword
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR
";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE
COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR
";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O";
OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR
";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR
";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*";
OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !
WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR
";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR
";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*";
OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*";
OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!
EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR
";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR
";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MISSELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR
";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR
";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS";
OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR
";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6
(";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
10. Media monitoring – an article
<?xml version="1.0" encoding="UTF-8"?>
<document>
<attributes>
<field name="ARTICLEID">2277480</field>
<field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
<field name="SOURCEID" type="STRING">2</field>
<field name="ARTAREA" type="TEXT">161.82</field>
<field name="PAGESUM" type="STRING">1</field>
<field name="SDATE" type="STRING">2011-08-02</field>
<field name="PAGE" type="STRING">A03 NEW NAA</field>
<field name="CDATE" type="DATE">2011-08-02 01:08:16</field>
<field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field>
<field name="ARTAREAPERCENT" type="STRING">7.66</field>
<field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister
Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch
with his local community.</field>
<field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor
leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy
and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond,
was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports
in 1977. He held several portfolios in the Hawke government...</field>
<field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field>
<field name="SUBTITLE" type="STRING"></field>
</attributes>
</document>
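Before matching, an article in this format has to be flattened into a set of named fields. A minimal sketch using Python's standard XML parser (the `parse_article` helper and the trimmed sample below are illustrative, not part of the actual pipeline, which is Java/Lucene):

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the article example above.
ARTICLE = """<document>
  <attributes>
    <field name="ARTICLEID">2277480</field>
    <field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
    <field name="SDATE" type="STRING">2011-08-02</field>
  </attributes>
</document>"""

def parse_article(xml_text):
    """Flatten the <field> elements into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "") for f in root.iter("field")}

fields = parse_article(ARTICLE)
print(fields["HEADLINE"])  # Holding a man of integrity, honour
```

Each field can then be added to the in-memory index separately, so hit positions can be reported per field.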
11. So how does it work?
Incoming article is added to a Lucene MemoryIndex
We parse the original Keyword syntax into a store of Lucene queries
Stored queries are run against this single-document index
Matches from each query are gathered and emitted
Simple!
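Stripped of Lucene, the naive version of this process is just a loop: index one article, then run every stored query against it. A toy sketch (the whitespace tokenizer and set-of-terms "queries" stand in for Lucene analysis and real Query objects):

```python
def tokenize(text):
    # Stand-in for Lucene analysis: lowercase whitespace tokens.
    return text.lower().split()

def matches(query_terms, doc_terms):
    # Toy "query": a set of terms that must all appear (a conjunction).
    return query_terms <= doc_terms

def run_all(stored_queries, article_text):
    """Brute force: run every stored query against the one-document index."""
    doc_terms = set(tokenize(article_text))
    return [qid for qid, terms in stored_queries.items()
            if matches(terms, doc_terms)]

queries = {"beach-tourism": {"palm", "beach"}, "telecoms": {"vodafone"}}
print(run_all(queries, "Palm Beach tourism is booming"))  # ['beach-tourism']
```

With tens of thousands of real Keywords, this loop is exactly what becomes too slow, which motivates the next two slides.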
12-13. It can't be that easy?
Problems:
Doesn't scale very well with increasing numbers of queries
Complex queries need to be rewritten for every document.
Solution:
Cut down the number of queries run over each document
Important not to get false negatives
14. “Turning search upside down”
Narrow down our query space by selecting queries likely to match a given document.
Sounds a bit like... a search?
Create an index of queries, using terms extracted from the query objects themselves.
Incoming documents are converted to a disjunction query using the MemoryIndex's terms list.
A specialised Collector then runs each matching query against the document.
Massively reduces the number of queries run against a given document.
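The inversion above can be sketched as an ordinary inverted index whose "documents" are queries. The article's own terms become a big OR that retrieves candidate queries, and only those candidates are then executed in full. A toy model (real code indexes Lucene Query objects, not term sets):

```python
from collections import defaultdict

def build_query_index(stored_queries):
    """Invert: map each extracted term to the queries it came from."""
    index = defaultdict(set)
    for qid, terms in stored_queries.items():
        for term in terms:
            index[term].add(qid)
    return index

def candidate_queries(index, doc_terms):
    """The document becomes a disjunction over its own terms: any query
    sharing at least one indexed term is a candidate; all others are
    safely skipped without ever being run."""
    found = set()
    for term in doc_terms:
        found |= index.get(term, set())
    return found

queries = {"q1": {"palm", "beach"}, "q2": {"vodafone"}, "q3": {"beach"}}
index = build_query_index(queries)
doc_terms = {"palm", "beach", "holiday"}
print(sorted(candidate_queries(index, doc_terms)))  # ['q1', 'q3']
```

The correctness requirement is one-sided: a candidate set that is too large only costs time, but a candidate set that misses a matching query is a false negative, which is why the term-extraction rules on the next slide are conservative.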
15. Indexing Queries
TermQueries are simple – just extract the term
For Disjunction queries, we extract terms from all subclauses
For Conjunction queries, we can just use a single subclause
Positional queries are treated as conjunctions
Simple Numeric queries treated as terms
Empty term sets are indexed with special terms to match all docs
NumericRange queries are ignored
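These rules can be expressed as a recursive term extractor. A toy sketch over hypothetical query-tree tuples (the real implementation walks Lucene Query objects):

```python
def extract_terms(query):
    """Pick index terms per the rules above. Toy query trees:
    ('term', t), ('or', [subs]), ('and', [subs]), ('near', [subs])."""
    kind = query[0]
    if kind == "term":
        return {query[1]}
    if kind == "or":
        # Disjunction: a match may come via any clause, so index all of them.
        terms = set()
        for sub in query[1]:
            terms |= extract_terms(sub)
        return terms
    if kind in ("and", "near"):
        # Conjunction (positional queries treated the same): every clause
        # must match, so any single clause is a safe filter on its own.
        return extract_terms(query[1][0])
    # Unknown or empty queries: a match-all sentinel avoids false negatives.
    return {"__ANY__"}

q = ("and", [("term", "palm"),
             ("or", [("term", "beach"), ("term", "coast")])])
print(extract_terms(q))  # {'palm'}
```

A production version would pick the most selective conjunction clause rather than simply the first; either choice is safe because a conjunction can only match when all of its clauses do.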
16. Indexing Queries
RegexpQueries are a bit more complex:
Extract the longest exact substring from the regex and index that
Create ngrams from the MemoryIndex termlist when building the document disjunction query
Reduces the effectiveness of the query filtering, but still allows significant speedups.
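The idea can be sketched as follows: pull the longest literal run out of a (simple) regex, then check it against ngrams of the document's terms. Both helpers below are illustrative assumptions, not the actual Lucene-side implementation:

```python
import re

def longest_literal(pattern):
    """Longest run of plain characters in a simple regex,
    splitting on common metacharacters."""
    pieces = re.split(r"[\.\*\+\?\[\]\(\)\|\\]+", pattern)
    return max(pieces, key=len)

def ngrams(term, n):
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def could_match(pattern, doc_terms):
    """A regex query is a candidate if its longest literal appears as a
    substring (via ngrams) of some document term. Over-selection is fine;
    the full RegexpQuery still runs to confirm the match."""
    lit = longest_literal(pattern)
    return any(lit in ngrams(t, len(lit)) for t in doc_terms)

print(longest_literal("touris.*"))  # touris
```

Short literals select many candidates, which is why this weakens the filtering relative to exact terms while still avoiding most of the regex executions.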
17. Hacking Lucene & Solr
To allow MemoryIndex to expose positions, we fixed for Lucene 4.x:
LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPE
LUCENE-3827 - Make term offsets work in MemoryIndex
Contributed to the work on adding interval iterators to Lucene Scorers:
LUCENE-2878 - Allow Scorer to expose positions and payloads aka. nuke spans
...and then we rebuilt Solr on top of these modifications.
18-19. Why not use Elasticsearch percolators?
Need to emit exact hit positions using Interval Iterators
Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible.
The fastest way to run a query is not to run it at all!
20. Introducing 'Luwak'
A library for running many stored queries, fast.
Features
Works with standard Lucene queries (TermQuery, BooleanQuery, RegexpQuery...)
Can be extended to work with custom query types
Run multiple instances of it to scale for document volume
Based on our fork of Lucene (should be merged into trunk soon though...)
Available Real Soon Now on Github
21. Using Luwak
Monitor monitor = new Monitor();
monitor.addQueries(querylist);
InputDocument doc = source.getNextDocument();
DocumentMatches matches = monitor.match(doc);
The DocumentMatches object contains exact hit positions per-field for each query that matched the InputDocument, as well as some metrics (querycount, preptime, querytime).
23. How fast is Luwak?
For simple keywords (<20 terms):
70,000 keywords applied per second to an article
Tested on a MacBook
20 times faster than our own previous implementation (using Xapian)
For more complex keywords (some run to three pages!)
20,000 keywords applied in 0.5 seconds
Processes approximately 2000 articles/hour
24. Building Luwak into a monitoring system
Monitors are packaged up into a service using Dropwizard
We run multiple instances of the monitoring system to handle the required article throughput
Queries are held in a central database, and updates are pushed out to each monitor instance using notification queues
Articles are supplied in NewsML format, either on disk or via message brokers
We build a separate Solr index of articles – a searchable archive – for testing new Keywords and for end customers to use
Everything uses RESTful Web Service APIs
25-28. What next?
3 clients are live with one in progress
Replacing systems built on dtSearch, Verity & Autonomy
Can we create query parsers for other legacy systems?
Improve Luwak performance by sharding the Keyword set
We know of cases of 1.5m keywords, hundreds of articles/sec
Get Lucene intervals code into trunk, so Luwak isn't based on a fork
Find other uses for Luwak?
Add metadata to items based on their content
Classification into a taxonomy, nodes defined by Keywords
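One plausible way to shard the Keyword set is to hash each query id onto a shard, so every monitor instance holds only a fraction of the queries and the shards process the article stream in parallel. A minimal sketch (`shard_queries` is a hypothetical helper, not Luwak API):

```python
import hashlib

def shard_queries(stored_queries, num_shards):
    """Partition the keyword set across shards by a stable hash of the
    query id; results from all shards are merged per article."""
    shards = [dict() for _ in range(num_shards)]
    for qid, query in stored_queries.items():
        # A stable digest (rather than Python's process-seeded hash())
        # keeps each keyword on the same shard across restarts.
        h = int(hashlib.md5(qid.encode("utf-8")).hexdigest(), 16)
        shards[h % num_shards][qid] = query
    return shards

queries = {f"kw{i}": f"stored query {i}" for i in range(1000)}
shards = shard_queries(queries, 4)
print(sum(len(s) for s in shards))  # 1000
```

Because every keyword lands on exactly one shard, the merged result set is identical to running the full keyword set on one monitor, only faster.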