SlideShare una empresa de Scribd logo
1 de 21
lots of facets, fast

     Anne Veling, BeyondTrees
anne@beyondtrees.com, May 26th 2011
introduction
 Anne Veling
  • Freelance Search Architect
  • Lucene Trainer


 Proquest
 New York Times




                                 3
visualization
 data
  • 1851 up to 2006: almost 60k newspapers
 How to give semantic overview
  • Context, where am I
  • Detail
 Exploration and Discovery




                                             4
zoom
 Present all newspapers on one canvas
 Dynamic zooming and panning
 Search interface
   • for discovery


 Front-end by Q42
   • HTML5 app
   • iPad app


 Not yet live

                                         5
architecture




           Tile                   Web
images                   tiles
         Generator               Server

                                              client

 text                     solr    solr
          Indexer
                         index   server
                                      facet
                                     plugin




                                                       6
tiling
 Newspaper images, old ones scanned
  • TIFF form
  • Wrinkles, coffee stains
 Tile generator
  • Convert to jpg
  • One virtual canvas of 512Gpixel
  • Multilayers 3M tiles: ~100Gb in 11 levels




                                                7
search
 25,072,989 articles
 867M solr index
 DataImportHandler
  • Issue with memory: load all XML URLs in
    memory first
  • Solved by indexing in batches
 Special
  • Nothing stored, not even IDs
  • We need nothing returned from search…



                                              8
results   facets
             0

query




                           …




        maxDoc
                  4   2


                               9
faceting memory
 Store each facet as BitSet over 25M articles
  • 58k facets x 25M docs x 1 bit = 169Gb (memory!)
 So we use DocSet from Solr
  • Scarce bitarray -> now fits in 1Gb memory




                                                      10
faceting performance
query
                     Facet initialization
                       • Takes ~1.5minute
                       • Cached


                     Facet evaluation
                       • Runtime!
                       • #docs x #facets




                                             11
performance
 Facet initialization/creation
 Runtime faceting

 Solr LRU cache
 Creation of all facets ~72s
 Runtime evaluation ootb: 71 seconds…
  /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
  &facet.date=thedate
  &facet.date.start=1850-01-01T00:00:00Z
  &facet.date.end=2007-01-01T00:00:00Z
  &facet.date.gap=%2B1DAY
  &facet=true


 Client-side bottleneck vs Server-side
                                                               12
<filterCache class="solr.FastLRUCache" size="70000"
initialSize="512" autowarmCount="0"/>
 Improved performance to ~300ms for
  “Amsterdam” [1825] query!
   • 2.3Mb output…
<requestHandler name="/zoomr"
class="com.proquest.zoom.ZoomrRequestHandler">
</requestHandler>
 Custom json output
   • Base 36 encoded heatmap
                          01111111111111111122111222777986878768885568855899beddbce
                          bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
                          mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000




                                                                                      13
runtime facet optimization

                     16 decades

               160 years

      1,920 months

58,560 days

   60,656 facets
   Worst case facet #DocSet.exists(doc)
      • Originally: 25M x 60k = 1.5E12 checks, 60k per
        doc
      • Now: average 0.5x for each level = 34.5 per doc
                                                          14
optimization
 Custom facet runtime Collector
  • Break if facet matched
      single value per doc per facet
      each doc has only 1 day
  • Top-down facet selection
      decade – year – month – day
 Performance for 1850 docs and 60k docs
  improved from 300ms to 10ms
 Custom optimized heatmap json
 Bottleneck now in the client/canvas/js


                                           15
show us or it didn’t happen
 Web Application
 iPad App




                                 16
zooming




          17
facet heatmap

        “television”




                       “inflation”




                                     18
conclusions
 Great exploratory UI
 Use domain knowledge to optimize for
  performance
  • If you can
 Next
  •   Bring it live on the Web and in App Store
  •   Using it for 1.2M books/CDs/DVDs of Belgium
  •   More search options
  •   Multipage



                                                    19
enhancement suggestions
 Lucene Collector
  • def collect(doc: Int):Boolean
                           class ExistsCollector extends Collector {
                             var exists = false

                               def collect(doc: Int) = {
                                 exists = true
                                 false
                               }

                               def acceptsDocsOutOfOrder() = true
                               def setNextReader(reader: IndexReader, base: Int) {}
                               def setScorer(scorer: Scorer) {}
                           }



 Solr SingleValueFacet
      Break after first find
      Automatic order based on #counts?


                                                                                  20
lessons learned
 Java Graphics has limitations for large fonts
  (>26,000)
 Handling large data sets is tricky
  • Indexing
  • Copying
 There’s technology and there’s corporate
  agendas
 You can always make things 10x faster
  • Lucene is ridiculously fast
      If you configure it well
  • Using domain knowledge can get you far
                                                  21
thank you




      anne@beyondtrees.com
              @anneveling



                             22

Más contenido relacionado

La actualidad más candente

Staying friendly with the gc
Staying friendly with the gcStaying friendly with the gc
Staying friendly with the gcOren Eini
 
To Cloud or Not To Cloud?
To Cloud or Not To Cloud?To Cloud or Not To Cloud?
To Cloud or Not To Cloud?Greg Lindahl
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to YouAmazon Web Services
 
The Wix Microservice Stack
The Wix Microservice StackThe Wix Microservice Stack
The Wix Microservice StackTomer Gabel
 
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...Amazon Web Services
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)emiltamas
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreAmazon Web Services
 
MongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combinationMongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combinationSteven Francia
 
Accelerating NoSQL
Accelerating NoSQLAccelerating NoSQL
Accelerating NoSQLsunnygleason
 
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...Amazon Web Services
 
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)Amazon Web Services
 
NOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the CloudNOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the Cloudboorad
 
Rebooting design in RavenDB
Rebooting design in RavenDBRebooting design in RavenDB
Rebooting design in RavenDBOren Eini
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)Amazon Web Services
 
Running Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows AzureRunning Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows AzureSimon Evans
 
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech TalksDeep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech TalksAmazon Web Services
 
Odnoklassniki.ru Architecture
Odnoklassniki.ru ArchitectureOdnoklassniki.ru Architecture
Odnoklassniki.ru ArchitectureDmitry Buzdin
 
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...Malin Weiss
 
Spring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using CouchbaseSpring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using CouchbaseIntae Kim
 

La actualidad más candente (20)

Staying friendly with the gc
Staying friendly with the gcStaying friendly with the gc
Staying friendly with the gc
 
To Cloud or Not To Cloud?
To Cloud or Not To Cloud?To Cloud or Not To Cloud?
To Cloud or Not To Cloud?
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
The Wix Microservice Stack
The Wix Microservice StackThe Wix Microservice Stack
The Wix Microservice Stack
 
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block Store
 
MongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combinationMongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combination
 
Accelerating NoSQL
Accelerating NoSQLAccelerating NoSQL
Accelerating NoSQL
 
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
 
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
 
NOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the CloudNOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the Cloud
 
Rebooting design in RavenDB
Rebooting design in RavenDBRebooting design in RavenDB
Rebooting design in RavenDB
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
 
Running Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows AzureRunning Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows Azure
 
ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기
 
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech TalksDeep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
 
Odnoklassniki.ru Architecture
Odnoklassniki.ru ArchitectureOdnoklassniki.ru Architecture
Odnoklassniki.ru Architecture
 
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
 
Spring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using CouchbaseSpring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using Couchbase
 

Similar a Lots of facets, fast

A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?DATAVERSITY
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼NAVER D2
 
Restful web services with nodejs
Restful web services with nodejsRestful web services with nodejs
Restful web services with nodejsAspenware
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Lucidworks
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Eureka Moment UKLUG
Eureka Moment UKLUGEureka Moment UKLUG
Eureka Moment UKLUGPaul Withers
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?VMware Tanzu
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...Amit Gupta
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's ArchitectureTony Tam
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Javier Cantón Ferrero
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0Plain Concepts
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSBrett McLain
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 

Similar a Lots of facets, fast (20)

Loom promises: be there!
Loom promises: be there!Loom promises: be there!
Loom promises: be there!
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Restful web services with nodejs
Restful web services with nodejsRestful web services with nodejs
Restful web services with nodejs
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Eureka Moment UKLUG
Eureka Moment UKLUGEureka Moment UKLUG
Eureka Moment UKLUG
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 

Último

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Lots of facets, fast

  • 1. lots of facets, fast Anne Veling, BeyondTrees anne@beyondtrees.com, May 26th 2011
  • 2. introduction  Anne Veling • Freelance Search Architect • Lucene Trainer  Proquest  New York Times 3
  • 3. visualization  data • 1851 up to 2006: almost 60k newspapers  How to give semantic overview • Context, where am I • Detail  Exploration and Discovery 4
  • 4. zoom  Present all newspapers on one canvas  Dynamic zooming and panning  Search interface • for discovery  Front-end by Q42 • HTML5 app • iPad app  Not yet live 5
  • 5. architecture Tile Web images tiles Generator Server client text solr solr Indexer index server facet plugin 6
  • 6. tiling  Newspaper images, old ones scanned • TIFF form • Wrinkles, coffee stains  Tile generator • Convert to jpg • One virtual canvas of 512Gpixel • Multilayers 3M tiles: ~100Gb in 11 levels 7
  • 7. search  25,072,989 articles  867M solr index  DataImportHandler • Issue with memory: load all XML URLs in memory first • Solved by indexing in batches  Special • Nothing stored, not even IDs • We need nothing returned from search… 8
  • 8. results facets 0 query … maxDoc 4 2 9
  • 9. faceting memory  Store each facet as BitSet over 25M articles • 58k facets x 25M docs x 1 bit = 169Gb (memory!)  So we use DocSet from Solr • Scarce bitarray -> now fits in 1Gb memory 10
  • 10. faceting performance query  Facet initialization • Takes ~1.5minute • Cached  Facet evaluation • Runtime! • #docs x #facets 11
  • 11. performance  Facet initialization/creation  Runtime faceting  Solr LRU cache  Creation of all facets ~72s  Runtime evaluation ootb: 71 seconds… /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on &facet.date=thedate &facet.date.start=1850-01-01T00:00:00Z &facet.date.end=2007-01-01T00:00:00Z &facet.date.gap=%2B1DAY &facet=true  Client-side bottleneck vs Server-side 12
  • 12. <filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>  Improved performance to ~300ms for “Amsterdam” [1825] query! • 2.3Mb output… <requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"> </requestHandler>  Custom json output • Base 36 encoded heatmap 01111111111111111122111222777986878768885568855899beddbce bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000 13
  • 13. runtime facet optimization 16 decades 160 years 1,920 months 58,560 days  60,656 facets  Worst case facet #DocSet.exists(doc) • Originally: 25M x 60k = 1.5E12 checks, 60k per doc • Now: average 0.5x for each level = 34.5 per doc 14
  • 14. optimization  Custom facet runtime Collector • Break if facet matched  single value per doc per facet  each doc has only 1 day • Top-down facet selection  decade – year – month – day  Performance for 1850 docs and 60k docs improved from 300ms to 10ms  Custom optimized heatmap json  Bottleneck now in the client/canvas/js 15
  • 15. show us or it didn’t happen  Web Application  iPad App 16
  • 16. zooming 17
  • 17. facet heatmap “television” “inflation” 18
  • 18. conclusions  Great exploratory UI  Use domain knowledge to optimize for performance • If you can  Next • Bring it live on the Web and in App Store • Using it for 1.2M books/CDs/DVDs of Belgium • More search options • Multipage 19
  • 19. enhancement suggestions  Lucene Collector • def collect(doc: Int):Boolean class ExistsCollector extends Collector { var exists = false def collect(doc: Int) = { exists = true false } def acceptsDocsOutOfOrder() = true def setNextReader(reader: IndexReader, base: Int) {} def setScorer(scorer: Scorer) {} }  Solr SingleValueFacet  Break after first find  Automatic order based on #counts? 20
  • 20. lessons learned  Java Graphics has limitations for large fonts (>26,000)  Handling large data sets is tricky • Indexing • Copying  There’s technology and there’s corporate agendas  You can always make things 10x faster • Lucene is ridiculously fast  If you configure it well • Using domain knowledge can get you far 21
  • 21. thank you anne@beyondtrees.com @anneveling 22