Turning Search Upside Down
using Lucene for very fast stored queries

Charlie Hull & Alan Woodward

Managing Director & Principal Developer at Flax

www.flax.co.uk

@FlaxSearch
Who are Flax?
Search engine specialists with decades of experience

Developers, innovators and strategists

Based in Cambridge, UK

Technology agnostic – but open source exponents

UK Authorized Partner of LucidWorks

Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial
Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture,
University of Cambridge...

We also run a Meetup group:

Who are we?


Charlie Hull
– Managing Director and co-founder at Flax
– I've been working in search since 1999
– When I'm not running Flax I'm sometimes a stiltwalker &
juggler
Alan Woodward
– Principal developer at Flax
– I used to run a 500m document FAST ESP index for Proquest
– Lucene/Solr committer
– When I'm not hacking I sing in the Cambridge University
Musical Society Chorus
Media monitoring


Standard search – few queries over many documents



Monitoring search – many queries over each document



Customers' interests manually turned into queries



Humans probably still have the final say on relevance



Eventual result is a list of articles emailed - or even printed



Exact position of a match is important to the customer
Media monitoring - parameters


Tens of thousands of stored expressions or 'Keywords'
– can't rewrite these so must use same syntax!



Hundreds of thousands of articles to monitor every day



Source data can sometimes be scanned & OCR'd



False positives cost human operator time: false negatives cost customers!



Traditional approach is brute force using standard search engine software
Media monitoring – a Keyword
("PALM BEACH COUNTY"; W/48 (("TOURIS*"; OR "TOUR"; OR "TOURS"; OR "TRAVEL*"; OR
"HOLIDAY*"; OR "HOL"; OR "HOLS"; OR "HOTEL*"; OR "VISIT*"; OR "TRIP"; OR "TRIPS";
OR "DAYTRIP*"; OR "BEACH"; OR "!BEACHES"; OR "COAST"; OR "!COASTLINE*"; OR "ABTA";
OR "DAY TRIP*"; OR "SUITE"; OR "SUITES"; OR "A%CCOMMODATION"; OR "BED AND !
BREAKFAST"; OR "B&B"; OR "BED & !BREAKFAST"; OR "!BREAKFAST AND BOARD"; OR
"FULL BOARD"; OR "HALF BOARD"; OR "ALL !INCLUSIVE"; OR "THINGS TO DO"; OR "HOSP?
TALITY"; OR "SHORT BREAK*"; OR "!WEEKEND BREAK"; OR "CITY BREAK*"; OR "!SIGHTSEE*";
OR "!VACATION*"; OR "E%XCURSION*"; OR "FLY* WITH"; OR "FLY* THERE"; OR "FLY*
DRIVE"; OR "!GETAWAY"; OR "!BACKPACK*"; OR "BACK PACK*"; OR "!ECOTOURIS*"; OR "!
WATERSPORT*"; OR "WATER SPORT*"; OR "FESTIVAL*"; OR "RESORT* & SPA"; OR
"RESORT* AND SPA"; OR "WHALE WATCH*"; OR "GET THERE"; OR "WHERE TO STAY"; OR
"GETTING THERE"; OR "STAYCATION*"; OR "VILLA"; OR "VILLAS"; OR "AIRPORT*"; OR
"SPA"; OR "SPAS"; OR "OUTDOOR EVENT*"; OR "OUTDOOR ADVENTURE*"; OR "OUTDOOR
PURSUIT*"; OR "OUTDOOR ACTIVIT*"; OR "CLIMBING WALL*"; OR "CLIMBING CENTRE*"; OR
"ROCK CLIMB*"; OR "WHITE WATER RAFTING";) OR ("PLACES"; W/4 ("TO STAY"; OR "TO
SEE"; OR "TO EAT";)) OR (("FLIGHT*"; OR "FLY"; OR "FLYING"; OR "CRUISE*";) W/4
("OFFER"; OR "AVAILABLE"; OR "DEPART*"; OR "FROM"; OR "TRANSFER*";))))
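The slides do not define the `W/48` operator. Assuming it follows the dtSearch-style convention that `A W/n B` matches when A and B occur within n words of each other, the check can be sketched as below. `ProximitySketch` and its methods are hypothetical names for illustration only; the sketch handles single-word terms and ignores the phrase, wildcard and `!`/`%` markers seen in the Keyword above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library: a simplified "A W/n B" check.
class ProximitySketch {

    // Word positions at which `term` occurs in the tokenised text (case-insensitive).
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) {
                out.add(i);
            }
        }
        return out;
    }

    // True when some occurrence of `a` lies within `n` words of some occurrence of `b`.
    // Assumption: distance is measured in whitespace-separated words, in either direction.
    static boolean within(String text, String a, String b, int n) {
        String[] tokens = text.trim().split("\\s+");
        for (int pa : positions(tokens, a)) {
            for (int pb : positions(tokens, b)) {
                if (Math.abs(pa - pb) <= n) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Because the customer cares about exact match positions, a real implementation would also need to report which positions satisfied the constraint, not just a yes/no answer.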
Media monitoring – another Keyword
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR
";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE
COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR
";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O";
OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR
";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR
";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*";
OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !
WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR
";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR
";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*";
OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*";
OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!
EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR
";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR
";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MISSELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR
";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR
";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS";
OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR
";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6
(";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
Media monitoring – an article
<?xml version="1.0" encoding="UTF-8"?>
<document>
<attributes>
<field name="ARTICLEID">2277480</field>
<field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
<field name="SOURCEID" type="STRING">2</field>
<field name="ARTAREA" type="TEXT">161.82</field>
<field name="PAGESUM" type="STRING">1</field>
<field name="SDATE" type="STRING">2011-08-02</field>
<field name="PAGE" type="STRING">A03 NEW NAA</field>
<field name="CDATE" type="DATE">2011-08-02 01:08:16</field>
<field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field>
<field name="ARTAREAPERCENT" type="STRING">7.66</field>
<field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister
Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch
with his local community.</field>
<field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor
leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy
and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond,
was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports
in 1977. He held several portfolios in the Hawke government...</field>
<field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field>
<field name="SUBTITLE" type="STRING"></field>
</attributes>
</document>
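As an illustration of how an article in this shape might be read before matching, the sketch below collects every `<field>` element into a name-to-value map using the JDK's built-in DOM parser. `ArticleFieldsSketch` is a hypothetical name; the slides do not show how the real system parses its input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the <document><attributes><field .../> layout shown above.
class ArticleFieldsSketch {

    // Returns field name -> text content, in document order.
    static Map<String, String> readFields(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = dom.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent());
            }
            return fields;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get one unchecked failure type.
            throw new IllegalArgumentException("Could not parse article XML", e);
        }
    }
}
```

The `TEXT` and `HEADLINE` entries of the resulting map are the ones a monitor would most plausibly index; that choice is an assumption here.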
So how does it work?


Incoming article is added to a Lucene MemoryIndex



We parse the original Keyword syntax into a store of Lucene queries



Stored queries are run against this single-document index



Matches from each query are gathered and emitted



Simple!
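The four steps above can be sketched in plain Java without any Lucene dependency. This is a toy model under stated assumptions, not the Flax implementation: a `Map` from term to word positions stands in for the single-document MemoryIndex, a `Predicate` stands in for a parsed Lucene query, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

// Toy brute-force monitor: every stored query is run against every article.
class BruteForceMonitorSketch {

    // Stand-in for a single-document index: term -> word positions in the article.
    static Map<String, List<Integer>> indexDocument(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return index;
    }

    // Stand-in for a term query: matches when the term occurs anywhere in the article.
    static Predicate<Map<String, List<Integer>>> term(String t) {
        String wanted = t.toLowerCase(Locale.ROOT);
        return index -> index.containsKey(wanted);
    }

    // The query store: query id -> parsed query.
    private final Map<String, Predicate<Map<String, List<Integer>>>> queries = new LinkedHashMap<>();

    void addQuery(String id, Predicate<Map<String, List<Integer>>> query) {
        queries.put(id, query);
    }

    // Index the article once, then run every stored query against it and gather the ids that match.
    List<String> match(String articleText) {
        Map<String, List<Integer>> index = indexDocument(articleText);
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, List<Integer>>>> e : queries.entrySet()) {
            if (e.getValue().test(index)) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }
}
```

The loop in `match` visits every stored query for every article, so the cost per article grows linearly with the size of the query store.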
It can't be that easy?
Problems:



Doesn't scale very well with increasing numbers of queries
Complex queries need to be rewritten for every document.

Solution:



Cut down the number of queries run over each document
Important not to get false negatives
“Turning search upside down”










Narrow down our query space by selecting queries likely to match a given
document.
Sounds a bit like... a search?
Create an index of queries, using terms extracted from the query objects
themselves.
Incoming documents are converted to a disjunction query using the
MemoryIndex's terms list.
Specialised Collector then runs each matching query against the document.
Massively reduces the number of queries run against a given document.
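A minimal sketch of this inversion, again in plain Java rather than Lucene: stored queries are indexed under the terms extracted from them, and each incoming document is treated as one large OR over its own terms to select candidate queries. `QueryIndexSketch` and its methods are hypothetical; only the returned candidates would then be run in full against the document.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy "index of queries": selects the stored queries that could match a document.
class QueryIndexSketch {

    // Inverted index over the stored queries: term -> ids of queries indexed under it.
    private final Map<String, Set<String>> termToQueryIds = new HashMap<>();

    // Index one stored query under the terms extracted from it.
    void register(String queryId, Collection<String> extractedTerms) {
        for (String term : extractedTerms) {
            termToQueryIds
                    .computeIfAbsent(term.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                    .add(queryId);
        }
    }

    // Treat the document's terms as a disjunction over the query index:
    // any query indexed under at least one document term is a candidate.
    Set<String> candidates(String documentText) {
        Set<String> ids = new TreeSet<>();
        for (String term : documentText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Set<String> hit = termToQueryIds.get(term);
            if (hit != null) {
                ids.addAll(hit);
            }
        }
        return ids;
    }
}
```

A query that shares no indexed term with the document is never a candidate, so it is never run. The filter may still return queries that fail the full match, which is acceptable; what it must not do is drop a query that would have matched.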
Indexing Queries








TermQueries are simple – just extract the term
For Disjunction queries, we extract terms from all subclauses
For Conjunction queries, we can just use a single subclause
Positional queries are treated as conjunctions
Simple Numeric queries treated as terms
Empty term sets are indexed with special terms to match all docs
NumericRange queries are ignored
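The first three rules, plus the special match-all term, can be sketched over a toy query tree. Everything here is hypothetical: the `Term`/`Or`/`And` classes are stand-ins for Lucene query types, `MATCH_ALL` is an invented marker name, and picking the conjunction subclause with the fewest terms is just one plausible heuristic (the slides say only that a single subclause is enough).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy term extraction for indexing stored queries.
class TermExtractionSketch {

    // Hypothetical marker indexed when no usable term can be extracted.
    static final String MATCH_ALL = "__match_all__";

    interface Query {}

    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Or implements Query {
        final List<Query> clauses;
        Or(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class And implements Query {
        final List<Query> clauses;
        And(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static Set<String> extract(Query q) {
        Set<String> terms = new TreeSet<>();
        if (q instanceof Term) {
            // Term query: the term itself.
            terms.add(((Term) q).text);
        } else if (q instanceof Or) {
            // Disjunction: any subclause may match, so terms from all of them are needed.
            for (Query clause : ((Or) q).clauses) {
                terms.addAll(extract(clause));
            }
        } else if (q instanceof And) {
            // Conjunction: every subclause must match, so one subclause's terms suffice.
            // Assumption: choose the subclause with the fewest extracted terms.
            Set<String> best = null;
            for (Query clause : ((And) q).clauses) {
                Set<String> candidate = extract(clause);
                if (best == null || candidate.size() < best.size()) {
                    best = candidate;
                }
            }
            if (best != null) {
                terms.addAll(best);
            }
        }
        // Nothing extractable: index under the marker so the query is always a candidate.
        if (terms.isEmpty()) {
            terms.add(MATCH_ALL);
        }
        return terms;
    }
}
```

Using only one subclause of a conjunction is safe because a document that lacks that subclause's terms cannot satisfy the conjunction as a whole.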
Indexing Queries


RegexpQueries are a bit more complex:





Extract the longest exact substring from the regex and index that
Create ngrams from the MemoryIndex termlist when building the document
disjunction query
Reduces the effectiveness of the query filtering, but still allows significant
speedups.
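Both halves of this idea can be sketched as follows, under loud assumptions. The pattern dialect is deliberately tiny (letters, digits, `.`, and the quantifiers `*`, `?`, `+`), which is a Java-regex-like subset and not the Keyword syntax shown earlier; anything else returns an empty string, meaning "no usable substring". Generating every substring of at least a minimum length is one guess at what "ngrams" means here. All names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy helpers for filtering regular-expression queries.
class RegexFilterSketch {

    // Longest run of literal characters that every match of the pattern must contain.
    // Supported syntax (assumption): letters, digits, '.', and '*', '?', '+' quantifiers.
    static String longestLiteral(String pattern) {
        String best = "";
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                run.append(c);
                continue;
            }
            if (c == '*' || c == '?') {
                // The preceding character is optional, so it cannot be part of the run.
                if (run.length() > 0) {
                    run.setLength(run.length() - 1);
                }
            } else if (c != '.' && c != '+') {
                return ""; // unsupported syntax: give up rather than risk a false negative
            }
            // Any metacharacter ends the current literal run.
            if (run.length() > best.length()) {
                best = run.toString();
            }
            run.setLength(0);
        }
        if (run.length() > best.length()) {
            best = run.toString();
        }
        return best;
    }

    // All substrings of `term` with at least `minLength` characters.
    static Set<String> ngrams(String term, int minLength) {
        int min = Math.max(1, minLength);
        Set<String> grams = new TreeSet<>();
        for (int start = 0; start < term.length(); start++) {
            for (int end = start + min; end <= term.length(); end++) {
                grams.add(term.substring(start, end));
            }
        }
        return grams;
    }
}
```

With this sketch, the pattern `hosp.tality` would be indexed under `tality`, and the document term `hospitality` would produce `tality` among its ngrams, so the query is selected as a candidate. The ngram expansion adds many extra terms to the document query, which is why the filtering is less selective.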
Hacking Lucene & Solr
To allow MemoryIndex to expose positions, we fixed two issues for Lucene 4.x:
LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPE
LUCENE-3827 - Make term offsets work in MemoryIndex

Contributed to the work on adding interval iterators to Lucene Scorers:
LUCENE-2878 - Allow Scorer to expose positions and payloads aka. nuke spans

...and then we rebuilt Solr on top of these modifications.
Why not use Elasticsearch percolators?


Need to emit exact hit positions using Interval Iterators

Percolator doesn't allow such efficient optimisations based on document-specific
filters – we need to cut down the number of queries we run as much as possible.


The fastest way to run a query is not to run it at all!
Introducing 'Luwak'
A library for running many stored queries, fast.
Features

Works with standard Lucene queries (TermQuery, BooleanQuery,
RegexpQuery...)

Can be extended to work with custom query types

Run multiple instances of it to scale for document volume

Based on our fork of Lucene (should be merged into trunk soon though...)

Available Real Soon Now on Github
Using Luwak
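// Create a monitor and register the stored queries
Monitor monitor = new Monitor();
monitor.addQueries(querylist);
// Fetch the next incoming article and match it against the registered queries
InputDocument doc = source.getNextDocument();
DocumentMatches matches = monitor.match(doc);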
DocumentMatches object contains exact hit positions per-field for each query that
matched the InputDocument, as well as some metrics (querycount, preptime,
querytime).
Luwak output
{
"stats": {"preptime":3,"querytime":23,"querycount":1},
"id":"C:TESTBMA20130417e10a0737.xml",
"uniqueId":"e10a0737",
"Content":"… lots of xml …",
"QueryMatches":[
{"QueryId":"1",
"FieldHighlights":[
{"Highlights":[
{"WordOffset":11,"Offset":80,"Length":7,"Word":"quibble"},
{"WordOffset":115,"Offset":739,"Length":7,"Word":"quibble"},
{"WordOffset":117,"Offset":749,"Length":7,"Word":"quibble"}],
"Field":"bodytext"}]}]}
How fast is Luwak?
For simple keywords (<20 terms):

70,000 keywords applied per second to an article

Tested on a Macbook

20 times faster than our own previous implementation (using Xapian)
For more complex keywords (some run to three pages!)

20,000 keywords applied in 0.5 seconds

Processes approximately 2000 articles/hour
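As a rough reading of the complex-keyword figures, assuming they describe a single instance running continuously, the arithmetic works out as:

```latex
\begin{align*}
\text{complex-keyword rate} &= \frac{20{,}000\ \text{keywords}}{0.5\ \text{s}} = 40{,}000\ \text{keywords/s} \\
\text{time per article} &= \frac{3600\ \text{s/hour}}{2000\ \text{articles/hour}} = 1.8\ \text{s/article} \\
\text{articles per day per instance} &= 2000\ \text{articles/hour} \times 24\ \text{hours} = 48{,}000
\end{align*}
```

The slides do not say whether the 0.5 seconds and the 2000 articles/hour come from the same run or keyword set, so the gap between 0.5 s and 1.8 s per article should not be over-interpreted. At 48,000 articles per day, one instance falls short of the hundreds of thousands of articles to be monitored daily.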
Building Luwak into a monitoring system
Monitors are packaged up into a service using Dropwizard

We run multiple instances of the monitoring system to handle the required article
throughput

Queries held in a central database, and updates pushed out to each monitor
instance using notification queues

Articles are supplied in NewsML format, either on disk or via message brokers.

We build a separate Solr index of articles – a searchable archive – for testing
new Keywords and for end customers to use

Everything uses RESTful Web Service APIs

What next?


3 clients are live with one in progress

Replacing systems built on dtSearch, Verity & Autonomy

Can we create query parsers for other legacy systems?
Improve Luwak performance by sharding the Keyword set

We know of cases of 1.5m keywords, hundreds of articles/sec
Get Lucene intervals code into trunk, so Luwak isn't based on a fork
Find other uses for Luwak?

Add metadata to items based on their content

Classification into a taxonomy, nodes defined by Keywords
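The slides name the idea of sharding the Keyword set without saying how it would be done. One possible shape, sketched here purely as an assumption, is to assign each query to a shard by hashing its id, match the document against every shard, and merge the results. Queries are reduced to a single required term to keep the sketch short, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy sharded monitor: the keyword set is split across shards, results are merged.
class ShardedMonitorSketch {

    // Each shard holds a slice of the keyword set: query id -> required term (toy query).
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardedMonitorSketch(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new LinkedHashMap<>());
        }
    }

    // Assumption: shard assignment by hash of the query id.
    void addQuery(String id, String requiredTerm) {
        int shard = Math.floorMod(id.hashCode(), shards.size());
        shards.get(shard).put(id, requiredTerm.toLowerCase(Locale.ROOT));
    }

    int totalQueries() {
        int total = 0;
        for (Map<String, String> shard : shards) {
            total += shard.size();
        }
        return total;
    }

    // Match the document against each shard in turn and merge the matching ids.
    // In a real deployment the shards could run in parallel or on separate machines.
    Set<String> match(String text) {
        Set<String> docTerms = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        Set<String> matched = new TreeSet<>();
        for (Map<String, String> shard : shards) {
            for (Map.Entry<String, String> e : shard.entrySet()) {
                if (docTerms.contains(e.getValue())) {
                    matched.add(e.getKey());
                }
            }
        }
        return matched;
    }
}
```

Because each query lives in exactly one shard, the merged result is the same as matching against the unsharded set; only the work per shard shrinks.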
Thank you!
Any questions?
charlie@flax.co.uk
@FlaxSearch
alan@flax.co.uk
@romseygeek
www.flax.co.uk/blog
+44 (0) 8700 118334

What's New in Teams Calling, Meetings and Devices March 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Turning search upside down

  • 2. Turning Search Upside Down using Lucene for very fast stored queries Charlie Hull & Alan Woodward Managing Director & Principal Developer at Flax www.flax.co.uk @FlaxSearch
  • 3. Who are Flax? Search engine specialists with decades of experience  Developers, innovators and strategists  Based in Cambridge, UK  Technology agnostic – but open source exponents  UK Authorized Partner of LucidWorks  Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture, University of Cambridge...  We also run a Meetup group: 
  • 5. Who are we?   Charlie Hull – Managing Director and co-founder at Flax – I've been working in search since 1999 – When I'm not running Flax I'm sometimes a stiltwalker & juggler Alan Woodward – Principal developer at Flax – I used to run a 500m document FAST ESP index for Proquest – Lucene/Solr committer – When I'm not hacking I sing in the Cambridge University Musical Society Chorus
  • 6. Media monitoring  Standard search – few queries over many documents  Monitoring search – many queries over each document  Customers' interests manually turned into queries  Humans probably still have the final say on relevance  Eventual result is a list of articles emailed - or even printed  Exact position of a match is important to the customer
  • 7. Media monitoring - parameters  Tens of thousands of stored expressions or 'Keywords' – can't rewrite these so must use same syntax!  Hundreds of thousands of articles to monitor every day  Source data can sometimes be scanned & OCR'd  False positives cost human operator time: false negatives cost customers!  Traditional approach is brute force using standard search engine software
  • 8. Media monitoring – a Keyword ("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR "HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR "TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*" OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD" OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR "HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR "!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE" OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR "!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT* & SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR "AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR "OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR "CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR "TRANSFER*"))))
  • 9. Media monitoring – another Keyword ((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MISSELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))
  • 10. Media monitoring – an article <?xml version="1.0" encoding="UTF-8"?> <document> <attributes> <field name="ARTICLEID">2277480</field> <field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field> <field name="SOURCEID" type="STRING">2</field> <field name="ARTAREA" type="TEXT">161.82</field> <field name="PAGESUM" type="STRING">1</field> <field name="SDATE" type="STRING">2011-08-02</field> <field name="PAGE" type="STRING">A03 NEW NAA</field> <field name="CDATE" type="DATE">2011-08-02 01:08:16</field> <field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field> <field name="ARTAREAPERCENT" type="STRING">7.66</field> <field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community.</field> <field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond, was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports in 1977. He held several portfolios in the Hawke government...</field> <field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field> <field name="SUBTITLE" type="STRING"></field> </attributes> </document>
  • 11. So how does it work?  Incoming article is added to a Lucene MemoryIndex  We parse the original Keyword syntax into a store of Lucene queries  Stored queries are run against this single-document index  Matches from each query are gathered and emitted  Simple!
  • 13. It can't be that easy? Problems: Doesn't scale very well with increasing numbers of queries. Complex queries need to be rewritten for every document. Solution: Cut down the number of queries run over each document. Important not to get false negatives.
  • 14. “Turning search upside down”       Narrow down our query space by selecting queries likely to match a given document. Sounds a bit like... a search? Create an index of queries, using terms extracted from the query objects themselves. Incoming documents are converted to a disjunction query using the MemoryIndex's terms list. Specialised Collector then runs each matching query against the document. Massively reduces the number of queries run against a given document.
  • 15. Indexing Queries: TermQueries are simple – just extract the term. For Disjunction queries, we extract terms from all subclauses. For Conjunction queries, we can just use a single subclause. Positional queries are treated as conjunctions. Simple Numeric queries are treated as terms. Empty term sets are indexed with special terms to match all docs. NumericRange queries are ignored.
  • 16. Indexing Queries: RegexpQueries are a bit more complex. Extract the longest exact substring from the regex and index that. Create ngrams from the MemoryIndex termlist when building the document disjunction query. Reduces the effectiveness of the query filtering, but still allows significant speedups.
  • 17. Hacking Lucene & Solr: To allow MemoryIndex to expose positions we fixed for Lucene 4.x: LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPE; LUCENE-3827 - Make term offsets work in MemoryIndex. Contributed to the work on adding interval iterators to Lucene Scorers: LUCENE-2878 - Allow Scorer to expose positions and payloads aka. nuke spans ...and then we rebuilt Solr on top of these modifications.
  • 19. Why not use Elasticsearch percolators? Need to emit exact hit positions using Interval Iterators. Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible. The fastest way to run a query is not to run it at all!
  • 20. Introducing 'Luwak' A library for running many stored queries, fast. Features  Works with standard Lucene queries (TermQuery, BooleanQuery, RegexpQuery...)  Can be extended to work with custom query types  Run multiple instances of it to scale for document volume  Based on our fork of Lucene (should be merged into trunk soon though...)  Available Real Soon Now on Github
  • 21. Using Luwak Monitor monitor = new Monitor(); monitor.addQueries(querylist); InputDocument doc = source.getNextDocument(); DocumentMatches matches = monitor.match(doc); DocumentMatches object contains exact hit positions per-field for each query that matched the InputDocument, as well as some metrics (querycount, preptime, querytime).
  • 22. Luwak output { "stats": {"preptime":3,"querytime":23,"querycount":1}, "id":"C:TESTBMA20130417e10a0737.xml", "uniqueId":"e10a0737", "Content":"… lots of xml …", "QueryMatches":[ {"QueryId":"1", "FieldHighlights":[ {"Highlights":[ {"WordOffset":11,"Offset":80,"Length":7,"Word":"quibble"}, {"WordOffset":115,"Offset":739,"Length":7,"Word":"quibble"}, {"WordOffset":117,"Offset":749,"Length":7,"Word":"quibble"}], "Field":"bodytext"}]}]}
  • 23. How fast is Luwak? For simple keywords (<20 terms):  70,000 keywords applied per second to an article  Tested on a Macbook  20 times faster than our own previous implementation (using Xapian) For more complex keywords (some run to three pages!)  20,000 keywords applied in 0.5 seconds  Processes approximately 2000 articles/hour
  • 24. Building Luwak into a monitoring system Monitors are packaged up into a service using Dropwizard  We run multiple instances of the monitoring system to handle the required article throughput  Queries held in a central database, and updates pushed out to each monitor instance using notification queues  Articles are supplied in NewsML format, either on disk or via message brokers.  We build a separate Solr index of articles – a searchable archive – for testing new Keywords and for end customers to use  Everything uses RESTful Web Service APIs 
  • 28. What next? 3 clients are live with one in progress. Replacing systems built on dtSearch, Verity & Autonomy. Can we create query parsers for other legacy systems? Improve Luwak performance by sharding the Keyword set: we know of cases of 1.5m keywords, hundreds of articles/sec. Get Lucene intervals code into trunk, so Luwak isn't based on a fork. Find other uses for Luwak? Add metadata to items based on their content. Classification into a taxonomy, nodes defined by Keywords.
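
The query-filtering idea from slides 14 and 15 can be sketched in plain Java. This is a minimal illustration, not Luwak's or Lucene's API: ordinary collections stand in for the query index and for the MemoryIndex terms list, and every name here (QueryIndexSketch, StoredQuery, anyOf, allOf, candidates) is hypothetical. It assumes lower-cased query terms, and it leaves out positional queries, regular expressions and the special match-all term for empty term sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative only: plain Java collections stand in for Lucene's index and MemoryIndex.
class QueryIndexSketch {

    /** A stored query: its id, the terms it is indexed under, and its full matching logic. */
    static final class StoredQuery {
        final String id;
        final Set<String> indexTerms;
        final Predicate<Set<String>> matcher;

        StoredQuery(String id, Set<String> indexTerms, Predicate<Set<String>> matcher) {
            this.id = id;
            this.indexTerms = indexTerms;
            this.matcher = matcher;
        }
    }

    /** Disjunction: index under every term, since any one of them can cause a match. */
    static StoredQuery anyOf(String id, String... terms) {
        Set<String> all = new HashSet<>(Arrays.asList(terms));
        return new StoredQuery(id, all, doc -> !Collections.disjoint(doc, all));
    }

    /** Conjunction (needs at least one term): indexing a single subclause is enough. */
    static StoredQuery allOf(String id, String... terms) {
        Set<String> all = new HashSet<>(Arrays.asList(terms));
        return new StoredQuery(id, Collections.singleton(terms[0]), doc -> doc.containsAll(all));
    }

    private final Map<String, List<StoredQuery>> queriesByTerm = new HashMap<>();

    /** Adds a query to the index under each of its index terms. */
    void add(StoredQuery query) {
        for (String term : query.indexTerms) {
            queriesByTerm.computeIfAbsent(term, t -> new ArrayList<>()).add(query);
        }
    }

    /** Lower-cased word set of a document; a stand-in for the MemoryIndex terms list. */
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    /** The document acts as a disjunction over its terms to select candidate queries. */
    Set<StoredQuery> candidates(Set<String> docTerms) {
        Set<StoredQuery> selected = new LinkedHashSet<>();
        for (String term : docTerms) {
            selected.addAll(queriesByTerm.getOrDefault(term, Collections.emptyList()));
        }
        return selected;
    }

    /** Runs only the candidate queries against the document; returns matching ids, sorted. */
    List<String> match(String text) {
        Set<String> docTerms = terms(text);
        List<String> hits = new ArrayList<>();
        for (StoredQuery query : candidates(docTerms)) {
            if (query.matcher.test(docTerms)) {
                hits.add(query.id);
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        QueryIndexSketch index = new QueryIndexSketch();
        index.add(anyOf("q1", "hotel", "villa"));
        index.add(allOf("q2", "flight", "offer"));
        index.add(allOf("q3", "beach", "county"));
        String article = "Cheap hotel near the beach in Palm Beach County";
        // q2 is never run: its index term "flight" does not occur in the article.
        System.out.println(index.candidates(terms(article)).size()
                + " candidates, hits " + index.match(article));
        // prints "2 candidates, hits [q1, q3]"
    }
}
```

Indexing a conjunction under one subclause cannot cause a false negative, because a document that lacks that term cannot satisfy the conjunction anyway. It can select a candidate that then fails the full query, which is why each candidate is still run against the document.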