Lots of facets, fast

Anne Veling, BeyondTrees
anne@beyondtrees.com, May 26th 2011

introduction
 Anne Veling
• Freelance Search Architect
• Lucene Trainer

 Proquest
 New York Times

visualization
 data
• 1851 up to 2006: almost 60k newspapers
 How to give semantic overview
• Context, where am I
• Detail
 Exploration and Discovery

zoom
 Present all newspapers on one canvas
 Dynamic zooming and panning
 Search interface
• for discovery

 Front-end by Q42
• HTML5 app
• iPad app

 Not yet live

architecture

[Architecture diagram. Labelled components: Tile Generator (images in, tiles out), Web Server, client, Indexer (text in, Solr index out), Solr server with facet plugin.]

tiling
 Newspaper images, old ones scanned
• TIFF format
• Wrinkles, coffee stains
 Tile generator
• Convert to JPEG
• One virtual canvas of 512 Gpixel
• Multiple layers: 3M tiles, ~100 GB in 11 levels

search
 25,072,989 articles
 867M Solr index
 DataImportHandler
• Memory issue: it loads all XML URLs into memory first
• Solved by indexing in batches
 Special
• Nothing stored, not even IDs
• We need nothing returned from search…

results facets
[Diagram: query results and facets drawn as sets over document ids 0 … maxDoc]

faceting memory
 Store each facet as BitSet over 25M articles
• 58k facets x 25M docs x 1 bit = 169 GB (in memory!)
 So we use DocSet from Solr
• Sparse bit array -> now fits in 1 GB of memory
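The memory figures can be checked with a little arithmetic. The sketch below reproduces the dense estimate from the slide and compares it with a sparse layout. The sparse numbers rest on assumptions that are not in the slides: each article appears in exactly one of the 58k day facets (the deck states elsewhere that each doc has only one day), and each entry costs one 4-byte doc id.

```scala
object FacetMemory {
  val facets = 58000L    // "58k facets" from the slide
  val docs   = 25000000L // "25M docs" from the slide

  // Dense layout: one bit for every (facet, doc) pair.
  val denseBytes: Long = facets * docs / 8
  val denseGiB: Double = denseBytes.toDouble / (1L << 30)

  // Sparse layout (assumption): every doc is listed once, as a 4-byte id,
  // in the single day facet it belongs to.
  val bytesPerDocId = 4L
  val sparseBytes: Long = docs * bytesPerDocId

  def main(args: Array[String]): Unit = {
    println(f"dense:  $denseGiB%.1f GiB")
    println(f"sparse: ${sparseBytes / 1e6}%.0f MB")
  }
}
```

Under the same assumption, adding decade, year and month facets would multiply the sparse figure by four, which still stays well under the 1 GB quoted on the slide.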

faceting performance
 Facet initialization
• Takes ~1.5 minutes
• Cached

 Facet evaluation
• Runtime!
• #docs x #facets

performance
 Facet initialization/creation
 Runtime faceting

 Solr LRU cache
 Creation of all facets ~72s
 Runtime evaluation out of the box: 71 seconds…
/select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
&facet.date=thedate
&facet.date.start=1850-01-01T00:00:00Z
&facet.date.end=2007-01-01T00:00:00Z
&facet.date.gap=%2B1DAY
&facet=true

 Client-side bottleneck vs Server-side

<filterCache class="solr.FastLRUCache" size="70000"
initialSize="512" autowarmCount="0"/>
 Improved performance to ~300 ms for the “Amsterdam” [1825] query!
• 2.3 MB output…
<requestHandler name="/zoomr"
class="com.proquest.zoom.ZoomrRequestHandler">
</requestHandler>
 Custom JSON output
• Base 36 encoded heatmap
01111111111111111122111222777986878768885568855899beddbce
bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
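A base-36 heatmap of this kind packs one bucket into one character (0-9, then a-z). The sketch below shows one way to produce and read such a string. How the actual handler scales counts is not stated in the slides, so the linear scaling to the largest bucket is an assumption, and `Base36Heatmap`, `encode` and `decode` are illustrative names.

```scala
object Base36Heatmap {
  // Assumption: counts are non-negative and scaled linearly so that the
  // largest bucket maps to level 35, which is written as 'z'.
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else (c.toLong * 35 / max).toInt
      Character.forDigit(level, 36) // 0-9 then a-z
    }.mkString
  }

  // One character per bucket, decoded back to a level in 0..35.
  def decode(s: String): Seq[Int] = s.map(ch => Character.digit(ch, 36))
}
```

One character per bucket keeps the payload far smaller than a JSON array of numbers, at the cost of reducing each count to 36 intensity levels.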

runtime facet optimization

16 decades
160 years
1,920 months
58,560 days

 60,656 facets in total
 Worst case: number of DocSet.exists(doc) calls
• Originally: 25M x 60k = 1.5E12 checks, 60k per doc
• Now: on average half the facets at each level = 34.5 per doc
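The check counts above can be reproduced with a short calculation. The per-level facet counts come from the slide. The fan-out of 31 days per month is an assumption, chosen because it reproduces the 34.5 figure.

```scala
object FacetChecks {
  // Facets per level, from the slide.
  val decades = 16
  val years   = 160
  val months  = 1920
  val days    = 58560
  val totalFacets: Int = decades + years + months + days

  // Flat evaluation: every doc is tested against every facet.
  def flatChecks(docs: Long): Long = docs * totalFacets

  // Top-down evaluation: at each level only the children of the matched
  // parent are tested, and on average half of them before the hit.
  // Fan-out: 16 decades, 10 years per decade, 12 months per year,
  // 31 days per month (assumption).
  val fanOut: Seq[Int] = Seq(16, 10, 12, 31)
  val avgChecksPerDoc: Double = fanOut.map(_ * 0.5).sum
}
```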

optimization
 Custom facet runtime Collector
• Break if facet matched
 single value per doc per facet
 each doc has only 1 day
• Top-down facet selection
 decade – year – month – day
 Performance for 1850 docs and 60k docs improved from 300 ms to 10 ms
 Custom optimized heatmap JSON
 Bottleneck now in the client/canvas/JS
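The two ideas on this slide, stopping at the first matching facet and descending decade, year, month, day, can be illustrated with a small simulation. This is not the project's collector: `Facet` is a hypothetical stand-in for a facet backed by a Solr DocSet, with a plain `Set[Int]` playing the role of `DocSet.exists(doc)`.

```scala
object TopDownFacets {
  // Hypothetical stand-in for a facet and its DocSet; only membership matters here.
  final case class Facet(label: String, docs: Set[Int],
                         children: Vector[Facet] = Vector.empty)

  // Returns (hits per facet label, number of membership checks performed).
  def count(roots: Vector[Facet], matching: Iterable[Int]): (Map[String, Int], Long) = {
    val hits = scala.collection.mutable.Map.empty[String, Int]
    var checks = 0L
    for (doc <- matching) {
      var level = roots
      while (level.nonEmpty) {
        // `find` stops at the first facet containing the doc ("break if matched").
        val found = level.find { f => checks += 1; f.docs.contains(doc) }
        found match {
          case Some(f) =>
            hits(f.label) = hits.getOrElse(f.label, 0) + 1
            level = f.children // descend: only this facet's children can match
          case None =>
            level = Vector.empty
        }
      }
    }
    (hits.toMap, checks)
  }
}
```

Breaking at the first hit is only valid because each document has a single value per level. With multi-valued facets, every sibling would still have to be checked.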

show us or it didn’t happen
 Web Application
 iPad App

facet heatmap

“television”

“inflation”

conclusions
 Great exploratory UI
 Use domain knowledge to optimize for performance
• If you can
 Next
• Bring it live on the Web and in App Store
• Using it for 1.2M books/CDs/DVDs of Belgium
• More search options
• Multipage

enhancement suggestions
 Lucene Collector
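• def collect(doc: Int): Boolean
import org.apache.lucene.index.IndexReader
import org.apache.lucene.search.{Collector, Scorer}

// Suggested API: collect returns a Boolean, and false asks the search to stop.
class ExistsCollector extends Collector {
  var exists = false

  // The first collected doc proves a match exists, so no further docs are needed.
  def collect(doc: Int) = {
    exists = true
    false
  }

  def acceptsDocsOutOfOrder() = true
  def setNextReader(reader: IndexReader, base: Int) {}
  def setScorer(scorer: Scorer) {}
}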

 Solr SingleValueFacet
 Break after first find
 Automatic order based on #counts?
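To make the suggestion concrete, the sketch below shows how a search loop could honour a Boolean returned from `collect`. It does not use Lucene: `StoppableCollector` and `drive` are hypothetical names, and the iterator of doc ids stands in for the matches a real searcher would produce.

```scala
object EarlyTermination {
  // Hypothetical collector contract: return false to stop the search.
  trait StoppableCollector {
    def collect(doc: Int): Boolean
  }

  // Only needs to know whether any document matches at all.
  final class ExistsCollector extends StoppableCollector {
    var exists = false
    def collect(doc: Int): Boolean = {
      exists = true
      false // one hit is enough
    }
  }

  // Stand-in for the searcher's loop; returns how many docs were visited.
  def drive(matches: Iterator[Int], collector: StoppableCollector): Int = {
    var visited = 0
    var keepGoing = true
    while (keepGoing && matches.hasNext) {
      visited += 1
      keepGoing = collector.collect(matches.next())
    }
    visited
  }
}
```

With this contract an existence check touches one matching document instead of all of them, which is the same "break after first find" idea proposed for a single-value facet.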

lessons learned
 Java Graphics has limitations for large fonts (>26,000)
 Handling large data sets is tricky
• Indexing
• Copying
 There’s technology and there are corporate agendas
 You can always make things 10x faster
• Lucene is ridiculously fast
 If you configure it well
• Using domain knowledge can get you far

thank you

anne@beyondtrees.com
@anneveling
