As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
Turning search upside down
2. Turning Search Upside Down
using Lucene for very fast stored queries
Charlie Hull & Alan Woodward
Managing Director & Principal Developer at Flax
www.flax.co.uk
@FlaxSearch
3. Who are Flax?
Search engine specialists with decades of experience
Developers, innovators and strategists
Based in Cambridge, UK
Technology agnostic – but open source exponents
UK Authorized Partner of LucidWorks
Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture, University of Cambridge...
We also run a Meetup group.
4-5. Who are we?
Charlie Hull
– Managing Director and co-founder at Flax
– I've been working in search since 1999
– When I'm not running Flax I'm sometimes a stiltwalker & juggler
Alan Woodward
– Principal developer at Flax
– I used to run a 500m document FAST ESP index for Proquest
– Lucene/Solr committer
– When I'm not hacking I sing in the Cambridge University Musical Society Chorus
6. Media monitoring
Standard search – few queries over many documents
Monitoring search – many queries over each document
Customers' interests are manually turned into queries
Humans probably still have the final say on relevance
Eventual result is a list of articles emailed - or even printed
Exact position of a match is important to the customer
7. Media monitoring - parameters
Tens of thousands of stored expressions or 'Keywords'
– can't rewrite these so must use same syntax!
Hundreds of thousands of articles to monitor every day
Source data can sometimes be scanned & OCR'd
False positives cost human operator time: false negatives cost customers!
Traditional approach is brute force using standard search engine software
8. Media monitoring – a Keyword
("PALM BEACH COUNTY"; W/48 (("TOURIS*"; OR "TOUR"; OR "TOURS"; OR "TRAVEL*"; OR
"HOLIDAY*"; OR "HOL"; OR "HOLS"; OR "HOTEL*"; OR "VISIT*"; OR "TRIP"; OR "TRIPS";
OR "DAYTRIP*"; OR "BEACH"; OR "!BEACHES"; OR "COAST"; OR "!COASTLINE*"; OR "ABTA";
OR "DAY TRIP*"; OR "SUITE"; OR "SUITES"; OR "A%CCOMMODATION"; OR "BED AND !
BREAKFAST"; OR "B&B"; OR "BED & !BREAKFAST"; OR "!BREAKFAST AND BOARD"; OR
"FULL BOARD"; OR "HALF BOARD"; OR "ALL !INCLUSIVE"; OR "THINGS TO DO"; OR "HOSP?
TALITY"; OR "SHORT BREAK*"; OR "!WEEKEND BREAK"; OR "CITY BREAK*"; OR "!SIGHTSEE*";
OR "!VACATION*"; OR "E%XCURSION*"; OR "FLY* WITH"; OR "FLY* THERE"; OR "FLY*
DRIVE"; OR "!GETAWAY"; OR "!BACKPACK*"; OR "BACK PACK*"; OR "!ECOTOURIS*"; OR "!
WATERSPORT*"; OR "WATER SPORT*"; OR "FESTIVAL*"; OR "RESORT* & SPA"; OR
"RESORT* AND SPA"; OR "WHALE WATCH*"; OR "GET THERE"; OR "WHERE TO STAY"; OR
"GETTING THERE"; OR "STAYCATION*"; OR "VILLA"; OR "VILLAS"; OR "AIRPORT*"; OR
"SPA"; OR "SPAS"; OR "OUTDOOR EVENT*"; OR "OUTDOOR ADVENTURE*"; OR "OUTDOOR
PURSUIT*"; OR "OUTDOOR ACTIVIT*"; OR "CLIMBING WALL*"; OR "CLIMBING CENTRE*"; OR
"ROCK CLIMB*"; OR "WHITE WATER RAFTING";) OR ("PLACES"; W/4 ("TO STAY"; OR "TO
SEE"; OR "TO EAT";)) OR (("FLIGHT*"; OR "FLY"; OR "FLYING"; OR "CRUISE*";) W/4
("OFFER"; OR "AVAILABLE"; OR "DEPART*"; OR "FROM"; OR "TRANSFER*";))))
9. Media monitoring – another Keyword
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR
";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE
COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR
";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O";
OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR
";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR
";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*";
OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !
WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR
";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR
";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*";
OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*";
OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!
EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR
";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR
";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MISSELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR
";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR
";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS";
OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR
";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6
(";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
10. Media monitoring – an article
<?xml version="1.0" encoding="UTF-8"?>
<document>
<attributes>
<field name="ARTICLEID">2277480</field>
<field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
<field name="SOURCEID" type="STRING">2</field>
<field name="ARTAREA" type="TEXT">161.82</field>
<field name="PAGESUM" type="STRING">1</field>
<field name="SDATE" type="STRING">2011-08-02</field>
<field name="PAGE" type="STRING">A03 NEW NAA</field>
<field name="CDATE" type="DATE">2011-08-02 01:08:16</field>
<field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field>
<field name="ARTAREAPERCENT" type="STRING">7.66</field>
<field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister
Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch
with his local community.</field>
<field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor
leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy
and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond,
was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports
in 1977. He held several portfolios in the Hawke government...</field>
<field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field>
<field name="SUBTITLE" type="STRING"></field>
</attributes>
</document>
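Before matching, an article in this format has to be flattened into a set of named fields. A minimal sketch using Python's standard XML parser (the `parse_article` helper and the trimmed sample below are illustrative, not part of the actual pipeline, which is Java/Lucene):

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the article example above.
ARTICLE = """<document>
  <attributes>
    <field name="ARTICLEID">2277480</field>
    <field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
    <field name="SDATE" type="STRING">2011-08-02</field>
  </attributes>
</document>"""

def parse_article(xml_text):
    """Flatten the <field> elements into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "") for f in root.iter("field")}

fields = parse_article(ARTICLE)
print(fields["HEADLINE"])  # Holding a man of integrity, honour
```

Each field can then be added to the in-memory index separately, so hit positions can be reported per field.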
11. So how does it work?
Incoming article is added to a Lucene MemoryIndex
We parse the original Keyword syntax into a store of Lucene queries
Stored queries are run against this single-document index
Matches from each query are gathered and emitted
Simple!
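Stripped of Lucene, the naive version of this process is just a loop: index one article, then run every stored query against it. A toy sketch (the whitespace tokenizer and set-of-terms "queries" stand in for Lucene analysis and real Query objects):

```python
def tokenize(text):
    # Stand-in for Lucene analysis: lowercase whitespace tokens.
    return text.lower().split()

def matches(query_terms, doc_terms):
    # Toy "query": a set of terms that must all appear (a conjunction).
    return query_terms <= doc_terms

def run_all(stored_queries, article_text):
    """Brute force: run every stored query against the one-document index."""
    doc_terms = set(tokenize(article_text))
    return [qid for qid, terms in stored_queries.items()
            if matches(terms, doc_terms)]

queries = {"beach-tourism": {"palm", "beach"}, "telecoms": {"vodafone"}}
print(run_all(queries, "Palm Beach tourism is booming"))  # ['beach-tourism']
```

With tens of thousands of real Keywords, this loop is exactly what becomes too slow, which motivates the next two slides.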
12-13. It can't be that easy?
Problems:
Doesn't scale very well with increasing numbers of queries
Complex queries need to be rewritten for every document.
Solution:
Cut down the number of queries run over each document
Important not to get false negatives
14. “Turning search upside down”
Narrow down our query space by selecting queries likely to match a given document.
Sounds a bit like... a search?
Create an index of queries, using terms extracted from the query objects themselves.
Incoming documents are converted to a disjunction query using the MemoryIndex's terms list.
A specialised Collector then runs each matching query against the document.
Massively reduces the number of queries run against a given document.
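The inversion above can be sketched as an ordinary inverted index whose "documents" are queries. The article's own terms become a big OR that retrieves candidate queries, and only those candidates are then executed in full. A toy model (real code indexes Lucene Query objects, not term sets):

```python
from collections import defaultdict

def build_query_index(stored_queries):
    """Invert: map each extracted term to the queries it came from."""
    index = defaultdict(set)
    for qid, terms in stored_queries.items():
        for term in terms:
            index[term].add(qid)
    return index

def candidate_queries(index, doc_terms):
    """The document becomes a disjunction over its own terms: any query
    sharing at least one indexed term is a candidate; all others are
    safely skipped without ever being run."""
    found = set()
    for term in doc_terms:
        found |= index.get(term, set())
    return found

queries = {"q1": {"palm", "beach"}, "q2": {"vodafone"}, "q3": {"beach"}}
index = build_query_index(queries)
doc_terms = {"palm", "beach", "holiday"}
print(sorted(candidate_queries(index, doc_terms)))  # ['q1', 'q3']
```

The correctness requirement is one-sided: a candidate set that is too large only costs time, but a candidate set that misses a matching query is a false negative, which is why the term-extraction rules on the next slide are conservative.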
15. Indexing Queries
TermQueries are simple – just extract the term
For Disjunction queries, we extract terms from all subclauses
For Conjunction queries, we can just use a single subclause
Positional queries are treated as conjunctions
Simple Numeric queries treated as terms
Empty term sets are indexed with special terms to match all docs
NumericRange queries are ignored
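These rules can be expressed as a recursive term extractor. A toy sketch over hypothetical query-tree tuples (the real implementation walks Lucene Query objects):

```python
def extract_terms(query):
    """Pick index terms per the rules above. Toy query trees:
    ('term', t), ('or', [subs]), ('and', [subs]), ('near', [subs])."""
    kind = query[0]
    if kind == "term":
        return {query[1]}
    if kind == "or":
        # Disjunction: a match may come via any clause, so index all of them.
        terms = set()
        for sub in query[1]:
            terms |= extract_terms(sub)
        return terms
    if kind in ("and", "near"):
        # Conjunction (positional queries treated the same): every clause
        # must match, so any single clause is a safe filter on its own.
        return extract_terms(query[1][0])
    # Unknown or empty queries: a match-all sentinel avoids false negatives.
    return {"__ANY__"}

q = ("and", [("term", "palm"),
             ("or", [("term", "beach"), ("term", "coast")])])
print(extract_terms(q))  # {'palm'}
```

A production version would pick the most selective conjunction clause rather than simply the first; either choice is safe because a conjunction can only match when all of its clauses do.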
16. Indexing Queries
RegexpQueries are a bit more complex:
Extract the longest exact substring from the regex and index that
Create ngrams from the MemoryIndex termlist when building the document disjunction query
Reduces the effectiveness of the query filtering, but still allows significant speedups.
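The idea can be sketched as follows: pull the longest literal run out of a (simple) regex, then check it against ngrams of the document's terms. Both helpers below are illustrative assumptions, not the actual Lucene-side implementation:

```python
import re

def longest_literal(pattern):
    """Longest run of plain characters in a simple regex,
    splitting on common metacharacters."""
    pieces = re.split(r"[\.\*\+\?\[\]\(\)\|\\]+", pattern)
    return max(pieces, key=len)

def ngrams(term, n):
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def could_match(pattern, doc_terms):
    """A regex query is a candidate if its longest literal appears as a
    substring (via ngrams) of some document term. Over-selection is fine;
    the full RegexpQuery still runs to confirm the match."""
    lit = longest_literal(pattern)
    return any(lit in ngrams(t, len(lit)) for t in doc_terms)

print(longest_literal("touris.*"))  # touris
```

Short literals select many candidates, which is why this weakens the filtering relative to exact terms while still avoiding most of the regex executions.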
17. Hacking Lucene & Solr
To allow MemoryIndex to expose positions, we fixed for Lucene 4.x:
LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPE
LUCENE-3827 - Make term offsets work in MemoryIndex
Contributed to the work on adding interval iterators to Lucene Scorers:
LUCENE-2878 - Allow Scorer to expose positions and payloads aka. nuke spans
...and then we rebuilt Solr on top of these modifications.
18-19. Why not use Elasticsearch percolators?
Need to emit exact hit positions using Interval Iterators
Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible.
The fastest way to run a query is not to run it at all!
20. Introducing 'Luwak'
A library for running many stored queries, fast.
Features
Works with standard Lucene queries (TermQuery, BooleanQuery, RegexpQuery...)
Can be extended to work with custom query types
Run multiple instances of it to scale for document volume
Based on our fork of Lucene (should be merged into trunk soon though...)
Available Real Soon Now on Github
21. Using Luwak
Monitor monitor = new Monitor();
monitor.addQueries(querylist);
InputDocument doc = source.getNextDocument();
DocumentMatches matches = monitor.match(doc);
The DocumentMatches object contains exact hit positions per-field for each query that matched the InputDocument, as well as some metrics (querycount, preptime, querytime).
23. How fast is Luwak?
For simple keywords (<20 terms):
70,000 keywords applied per second to an article
Tested on a MacBook
20 times faster than our own previous implementation (using Xapian)
For more complex keywords (some run to three pages!)
20,000 keywords applied in 0.5 seconds
Processes approximately 2000 articles/hour
24. Building Luwak into a monitoring system
Monitors are packaged up into a service using Dropwizard
We run multiple instances of the monitoring system to handle the required article throughput
Queries are held in a central database, and updates are pushed out to each monitor instance using notification queues
Articles are supplied in NewsML format, either on disk or via message brokers
We build a separate Solr index of articles – a searchable archive – for testing new Keywords and for end customers to use
Everything uses RESTful Web Service APIs
25-28. What next?
3 clients are live with one in progress
Replacing systems built on dtSearch, Verity & Autonomy
Can we create query parsers for other legacy systems?
Improve Luwak performance by sharding the Keyword set
We know of cases of 1.5m keywords, hundreds of articles/sec
Get Lucene intervals code into trunk, so Luwak isn't based on a fork
Find other uses for Luwak?
Add metadata to items based on their content
Classification into a taxonomy, nodes defined by Keywords
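One plausible way to shard the Keyword set is to hash each query id onto a shard, so every monitor instance holds only a fraction of the queries and the shards process the article stream in parallel. A minimal sketch (`shard_queries` is a hypothetical helper, not Luwak API):

```python
import hashlib

def shard_queries(stored_queries, num_shards):
    """Partition the keyword set across shards by a stable hash of the
    query id; results from all shards are merged per article."""
    shards = [dict() for _ in range(num_shards)]
    for qid, query in stored_queries.items():
        # A stable digest (rather than Python's process-seeded hash())
        # keeps each keyword on the same shard across restarts.
        h = int(hashlib.md5(qid.encode("utf-8")).hexdigest(), 16)
        shards[h % num_shards][qid] = query
    return shards

queries = {f"kw{i}": f"stored query {i}" for i in range(1000)}
shards = shard_queries(queries, 4)
print(sum(len(s) for s in shards))  # 1000
```

Because every keyword lands on exactly one shard, the merged result set is identical to running the full keyword set on one monitor, only faster.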