We built a web application for a well-known US newspaper: a maps-like zooming interface on top of its 60,000 newspapers since 1850, with Solr searching the 28,000,000 articles to draw an interactive heatmap over them. Using domain knowledge, we optimized the out-of-the-box faceting solution by an order of magnitude, which gave us a great visual way of exploring trends in historical newspapers.
3. visualization
Data
• 1851 up to 2006: almost 60k newspapers
How to give a semantic overview
• Context, where am I
• Detail
Exploration and Discovery
4. zoom
Present all newspapers on one canvas
Dynamic zooming and panning
Search interface
• for discovery
Front-end by Q42
• HTML5 app
• iPad app
Not yet live
5. architecture
Diagram components (labels recovered from the figure)
• Images: Tile Generator, tiles, Web Server, client
• Text: Solr Indexer, index, Solr server with facet plugin
6. tiling
Newspaper images, old ones scanned
• TIFF format
• Wrinkles, coffee stains
Tile generator
• Convert to JPEG
• One virtual canvas of 512 Gpixel
• Multiple layers: 3M tiles, ~100 GB in 11 levels
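The tile count above is roughly what a standard image pyramid produces. A minimal sketch of that arithmetic follows. The tile size (512 x 512 pixels), the canvas shape (2^20 x 2^19 pixels, which multiplies out to 512 Gpixel) and the halving of both dimensions per level are assumptions for illustration; the slides do not state them.

```scala
object TilePyramid {
  // Assumed values, not taken from the slides: tile size and canvas shape.
  val tileSize = 512L
  val canvasWidth = 1L << 20   // 1,048,576 px
  val canvasHeight = 1L << 19  // 524,288 px; width * height = 512 Gpixel

  // Number of tiles needed to cover `pixels` along one axis (rounded up).
  private def tilesAlong(pixels: Long): Long = (pixels + tileSize - 1) / tileSize

  // Tiles at a level where each axis is scaled down by a factor of 2^shrink.
  def tilesAtLevel(shrink: Int): Long = {
    val w = math.max(1L, canvasWidth >> shrink)
    val h = math.max(1L, canvasHeight >> shrink)
    tilesAlong(w) * tilesAlong(h)
  }

  // Total over all levels, from full resolution down to the smallest overview.
  def totalTiles(levels: Int): Long = (0 until levels).map(tilesAtLevel).sum
}
```

Under these assumptions `TilePyramid.totalTiles(11)` is 2,796,202, which is in line with the roughly 3M tiles quoted above.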
7. search
25,072,989 articles
867 MB Solr index
DataImportHandler
• Memory issue: it loads all XML URLs into memory first
• Solved by indexing in batches
Special
• Nothing stored, not even IDs
• We need nothing returned from search…
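The batching fix can be sketched as follows. This is not the DataImportHandler configuration used in the project: `indexBatch` is a hypothetical callback standing in for whatever sends one batch to Solr, and the batch size is arbitrary.

```scala
object BatchIndexing {
  // Walks the URLs lazily and hands them over in fixed-size batches, so only
  // one batch is held in memory at a time. Returns the number of batches.
  def indexInBatches(urls: Iterator[String], batchSize: Int)
                    (indexBatch: Seq[String] => Unit): Int = {
    var batches = 0
    urls.grouped(batchSize).foreach { batch =>
      indexBatch(batch) // hypothetical: index these documents, then commit
      batches += 1
    }
    batches
  }
}
```

The point of the design is that the URL source is an `Iterator`, so the full list never has to be materialized before indexing starts.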
9. faceting memory
Store each facet as a BitSet over 25M articles
• 58k facets x 25M docs x 1 bit = 169 GB (memory!)
So we use DocSet from Solr
• Sparse bit array -> now fits in 1 GB of memory
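The memory figures above can be checked with a few lines of arithmetic. The dense estimate follows directly from the slide. The sparse estimate is an assumption for illustration only: it supposes each document ID is stored once per level (decade, year, month, day) as a 4-byte int, which is not necessarily how Solr's DocSet lays out its data.

```scala
object FacetMemory {
  val docs = 25000000L   // ~25M articles
  val facets = 58000L    // ~58k facets
  val GiB = 1024.0 * 1024 * 1024

  // Dense: one bit per (facet, doc) pair.
  val denseBytes: Long = facets * docs / 8

  // Sparse (assumed layout): every doc ID appears once per level as a 4-byte int.
  def sparseBytes(levels: Int): Long = levels * docs * 4
}
```

`FacetMemory.denseBytes / FacetMemory.GiB` comes to about 169, matching the slide, while `FacetMemory.sparseBytes(4)` is 400,000,000 bytes, comfortably under 1 GB.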
13. runtime facet optimization
16 decades
160 years
1,920 months
58,560 days
= 60,656 facets in total
Worst-case number of DocSet.exists(doc) checks
• Originally: 25M x 60k = 1.5E12 checks, 60k per doc
• Now: on average half of the facets at each level = 34.5 per doc
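The 34.5 figure can be reproduced as a worked calculation. The sketch below assumes that a top-down search scans, on average, half of the children at each level before it hits the match, and that a month has at most 31 days; the object and value names are made up for illustration.

```scala
object FacetChecks {
  // Children to scan per level: decades, years per decade,
  // months per year, days per month (31 assumed as the upper bound).
  val fanOut = Seq(16, 10, 12, 31)

  // Flat scan: in the worst case every facet is checked for every doc.
  val flatFacets: Int = 16 + 160 + 1920 + 58560

  // Top-down with early exit: half of each level on average.
  val averageChecksPerDoc: Double = fanOut.map(_ * 0.5).sum
}
```

Here `FacetChecks.flatFacets` is 60,656 and `FacetChecks.averageChecksPerDoc` is 8 + 5 + 6 + 15.5 = 34.5.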
14. optimization
Custom facet runtime Collector
• Break if facet matched: single value per doc per facet (each doc has only 1 day)
• Top-down facet selection: decade – year – month – day
Performance for 1850 docs and 60k docs improved from 300 ms to 10 ms
Custom optimized heatmap JSON
Bottleneck now in the client/canvas/JS
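The two ideas above, breaking on the first match and descending top-down, can be sketched without Solr. This is an illustration, not the project's Collector: `Facet` is a hypothetical stand-in for a facet backed by a Solr DocSet, where only the membership test matters.

```scala
import scala.collection.mutable

object TopDownFacets {
  // Hypothetical stand-in for a facet and its DocSet.
  final case class Facet(label: String, docs: Set[Int], children: Seq[Facet] = Nil)

  // Counts matched docs per facet. At each level the scan stops at the first
  // facet containing the doc (single value per doc), then descends into it.
  def count(roots: Seq[Facet], matched: Iterable[Int]): Map[String, Int] = {
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    for (doc <- matched) {
      var level = roots
      while (level.nonEmpty) {
        level.find(_.docs.contains(doc)) match { // find breaks on the first hit
          case Some(facet) =>
            counts(facet.label) += 1
            level = facet.children // only this subtree can still match
          case None =>
            level = Nil
        }
      }
    }
    counts.toMap
  }
}
```

Because a doc belongs to exactly one decade, one year, one month and one day, skipping all sibling subtrees after a match loses nothing.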
15. show us or it didn’t happen
Web Application
iPad App
18. conclusions
Great exploratory UI
Use domain knowledge to optimize for performance
• If you can
Next
• Bring it live on the Web and in App Store
• Using it for 1.2M books/CDs/DVDs of Belgium
• More search options
• Multipage
19. enhancement suggestions
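Lucene Collector
• def collect(doc: Int): Boolean
    // Written against the proposed signature, not Lucene's current Collector.
    // Returning false would presumably tell the searcher to stop collecting.
    class ExistsCollector extends Collector {
      var exists = false
      def collect(doc: Int): Boolean = {
        exists = true
        false // one hit is enough
      }
      def acceptsDocsOutOfOrder() = true
      def setNextReader(reader: IndexReader, base: Int): Unit = {}
      def setScorer(scorer: Scorer): Unit = {}
    }
Solr SingleValueFacet
• Break after first find
• Automatic order based on #counts?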
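To show what the suggested Boolean return would buy, here is a self-contained sketch that does not use Lucene at all. `StoppableCollector` and `run` are hypothetical names: the trait mirrors the proposed signature, and `run` stands in for the search loop that feeds matching docs to a collector.

```scala
object EarlyStop {
  // Hypothetical trait mirroring the proposed signature; not Lucene's Collector.
  trait StoppableCollector {
    def collect(doc: Int): Boolean // false means: stop collecting
  }

  final class ExistsCollector extends StoppableCollector {
    var exists = false
    def collect(doc: Int): Boolean = {
      exists = true
      false // the first hit already answers the question
    }
  }

  // Stand-in for the search loop; returns how many docs were offered.
  def run(docs: Iterator[Int], collector: StoppableCollector): Int = {
    var offered = 0
    var keepGoing = true
    while (keepGoing && docs.hasNext) {
      offered += 1
      keepGoing = collector.collect(docs.next())
    }
    offered
  }
}
```

With three matching docs, `run` offers only the first one to an `ExistsCollector` and then stops, which is the "break after first find" behaviour.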
20. lessons learned
Java Graphics has limitations for large fonts (>26,000)
Handling large data sets is tricky
• Indexing
• Copying
There's technology, and there are corporate agendas
You can always make things 10x faster
• Lucene is ridiculously fast, if you configure it well
• Using domain knowledge can get you far