SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
APACHE SLING & FRIENDS TECH MEETUP
   BERLIN, 26-28 SEPTEMBER 2012




  Oak / Solr integration
    Tommaso Teofili
Agenda


         §  Why
         §  Search on Oak with Solr
         §  Solr based QueryIndex
         §  Solr based MK
         §  Benchmarks




adaptTo() 2012
Why


         §  Common need:
             §  Once you have content
             §  You (usually also) want search
             §  Both need to be:
                 §  Fast	
  
                 §  Scalable	
  
                 §  Faul	
  tolerant	
  
         §  Less common need:
             §  Substitute / decorate internal default query
                 engine for enhanced performance / expressivity

adaptTo() 2012
Why




                                  Solr	
  Search	
  
                                    Engine	
  




                 Solr	
  MK	
  




adaptTo() 2012
Apache Solr 101


         §  Apache project
         §  Enterprise search server
         §  Based on Apache Lucene
         §  HTTP API
         §  Easy and quick setup
         §  Scaling architectures




adaptTo() 2012
Solr simplest architecture




                                                         HTTP	
  API	
  


                 Container	
                                  Solr	
  
                                 C1	
       C2	
  

                                                       Solr	
  Core	
  API	
  




                                   I1	
       I2	
  

                                                         Lucene	
  API	
  

adaptTo() 2012
Solr sharded architecture


          §  Split a collection in N shards when indexing
          §  Search on both with distributed search:
              §  http://.../select?q=...&shards=10.1.1.10/solr/
                  c1a,10.1.1.11/solr/c1b
       docY	
                                     docX	
  
                     C1a	
     C2a	
                          C1b	
      C2b	
  
          docW	
                                   docZ	
  


                     Solr	
  @	
  10.1.1.10	
                 Solr	
  @	
  10.1.1.11	
  




adaptTo() 2012
Solr replicated architecture

                                              RR	
  Load	
  balancer	
  


                  C1	
      C2	
                                            C1	
      C2	
  



                 Solr	
  @	
  10.1.1.20	
                                  Solr	
  @	
  10.1.1.21	
  




                                                C1	
      C2	
  



                                               Solr	
  @	
  10.1.1.22	
  



adaptTo() 2012
SolrCloud*




                      *: Sorry I was too lazy to re picture this from scratch




adaptTo() 2012
Search on Oak with Solr


         §  Content is created inside Oak
         §  Oak repository gets synchronized with Solr
         §  Solr handles search requests
             §  Separation of concerns
             §  Scaling independently:
                 §  Oak	
  can	
  scale	
  for	
  CRUD	
  operaIons	
  on	
  repository	
  
                 §  Solr	
  can	
  scale	
  for	
  indexing	
  /	
  search	
  




adaptTo() 2012
Synchronizing Oak and Solr


         §  Known options
             §  Push from Oak
             §  Pull with Solr (use Oak HTTP API with Solr DIH)
             §  Apache ManifoldCF ?
             §  ...




adaptTo() 2012
Push Oak commits to Solr


         §  CommitHook API
             §  evaluate repository status before and after a
                 specific commit
             §  create a “diff”
             §  do something with the diff
                 §  -­‐>	
  send	
  adds	
  /	
  deletes	
  to	
  Solr	
  
         §  The CommitHook is executed right before
             content is actually persisted
                 §  See	
  hOps://github.com/apache/jackrabbit-­‐oak/blob/
                     trunk/doc/nodestate.md	
  

adaptTo() 2012
Push Oak commits to Solr

         // add the SolrCommitHook to the node store
         store.setHook(new SolrCommitHook());
         // obtain the node store root for a specific workspace
         Root root = new RootImpl(store, “solr-test”, ...);
         // add a node with path ‘doc1’ and a text property
         root.getTree("/").addChild("doc1").setProperty("text",
         valueFactory.createValue("hit that hot hat tattoo"));
         // commit the change to the node store
         r.commit(DefaultConflictHandler.OURS);




adaptTo() 2012
Push Oak commits to Solr




adaptTo() 2012
Search on Oak with Solr - Scaling


         §  SoftCommitHook uses SolrJ’s SolrServer
             §  SolrServer solrServer = ...
                 §  new	
  H-pSolrServer(“h-p://..:8983/solr”)	
  ;	
  //	
  standalone	
  
                 §  new	
  LBH-pSolrServer(“h-p://..:8983/solr”,	
  “h-p://..:
                     7574/solr”);	
  //	
  replicaIon	
  –	
  sw	
  load	
  balancing	
  
                 §  new	
  CloudSolrServer(“100.10.13.14:9983”);	
  //	
  SolrCloud	
  




adaptTo() 2012
What if commit fails?


         §  Could lead to not consistent status
             §  Oak commits not going through
             §  Related data indexed on Solr
         §  Use Observer
             §  contentChanged(NodeStore store, NodeState
                 before, NodeState after);
                 §  Node	
  state	
  is	
  just	
  read	
  
                 §  Only	
  triggered	
  on	
  successfully	
  persisted	
  commits	
  




adaptTo() 2012
Solr based QueryIndex


         §  QueryIndex API
             §  Evaluate the query (Filter) cost
             §  Eventually view the query “plan”
             §  Execute the query against a specific revision
                 and root node state




adaptTo() 2012
Solr based QueryIndex


         §  getCost(Filter);
             §  Property restrictions:
                 §  Each	
  property	
  is	
  mapped	
  as	
  a	
  field	
  
                 §  Can	
  use	
  term	
  queries	
  for	
  simple	
  value	
  matching	
  
                 §  Can	
  use	
  range	
  queries	
  (with	
  trie	
  fields)	
  for	
  “first	
  to	
  last”	
  




adaptTo() 2012
Solr based QueryIndex


         §  getCost(Filter);
             §  Path restrictions
                 §    Indexed	
  as	
  strings	
  	
  
                 §    And	
  as	
  paths	
  with	
  PathHierarchyTokenizerFactory	
  
                 §    Ancestor	
  /	
  descendant	
  is	
  as	
  fast	
  as	
  exact	
  match	
  
                 §    Direct	
  children	
  needs	
  special	
  handling	
  




adaptTo() 2012
Solr based QueryIndex


         §  ancestors




adaptTo() 2012
Solr based QueryIndex


         §  descendants




adaptTo() 2012
Solr based QueryIndex


         §  getCost(Filter);
             §  Full text conditions
                 §  Easiest	
  use	
  case	
  
                 §  Can	
  use	
  (E)DisMax	
  Solr	
  query	
  parser	
  
                    –  Q=+term1	
  +term2	
  term3	
  
                    –  And	
  fields	
  are	
  defined	
  by	
  Solr	
  configuraIon	
  




adaptTo() 2012
Solr based QueryIndex


         §  query(Filter filter, String revisionId,
             NodeState root);
             §  filter gets transformed in a SolrQuery
             §  root and revisionId can be used to map Oak
                 revisions to Lucene / Solr commit points
             §  the resulting Cursor wraps the Solr query
                 response




adaptTo() 2012
Custom indexes configuration


         §  Discussion is going on at :
                 §  hOp://markmail.org/message/bdgi77dd6wy2hkbp	
  
         §  Something like:
           /data [jcr:mixinTypes = oak:indexed]
                    /oak:indexes
                        /solr [jcr:primaryType = oak:solrIndex, oak:nodeType = nt:file, url = ...]




adaptTo() 2012
Extending default query syntax


         §  Combining JCR-SQL2 / XPath queries with
             full text search on Solr
         §  i.e.: select * from parent as p inner join
             child as c on issamenode(p, c, ['/a/b/c'])
             and solr(‘(+hit +tattoo^2) OR “hot
             tattoo”~2’)
             §  First part can be handled in Oak as usual
             §  Second part can be run on Solr
             §  Need to extend the Oak AST

adaptTo() 2012
Solr based MK


         §  MicroKernel API
             §  MVCC model
             §  Json based data model
             §  Possibly use some retention policy
             §  Basically noSQL


         §  Can Solr fit ?



adaptTo() 2012
Solr based MK


         §  MVCC
             §  Solr
                 §    SVCC	
  
                 §    per	
  document	
  opImisIc	
  lock	
  –	
  free	
  mechanism	
  
                 §    related	
  issues	
  :	
  SOLR-­‐3178,	
  SOLR-­‐3173	
  
                 §    SoluIons:	
  
                       –  map	
  Oak	
  revisions	
  to	
  Lucene	
  commit	
  points	
  
                       –  store	
  revisions	
  inside	
  Solr	
  documents	
  

         §  Json based data model
             §  already available in Solr

adaptTo() 2012
Solr based MK


         §  Retention policy
             §  Already available in Solr using Lucene commit
                 points for revisions
             §  Leveraging IndexDeletionPolicy API




adaptTo() 2012
Solr based MK


         §  noSQL
             §  Solr
                 §    not	
  relaIonal,	
  avoid	
  joins	
  
                 §    no	
  fixed	
  schema	
  (	
  !=	
  schema.xml	
  )	
  
                 §    scalability	
  
                 §    See	
  	
  
                       –  hOp://searchhub.org/dev/2010/04/30/nosql-­‐lucene-­‐and-­‐solr/	
  
                       –  hOp://searchhub.org/dev/2010/04/29/for-­‐the-­‐guardian-­‐solr-­‐is-­‐
                            the-­‐new-­‐database/	
  
                       	
  




adaptTo() 2012
Solr based MK


         §  Still a prototype implementation
         §  Pros
             §  Most features exist out of the box
             §  Scaling nicely
         §  Cons
             §  MVCC handling is not straightforward




adaptTo() 2012
Benchmarking


         §  SolrCommitHook with commit():
             §  CreateManyChildNodes : some impact (1.4x)
             §  DescendantSearch : almost no impact
             §  SetProperty : huge impact (3x)
             §  SmallFileWrite : some impact (1.3x)
             §  UpdateManyChildNodes : almost no impact




adaptTo() 2012
Benchmarking


         §  SolrCommitHook with autoSoftCommit:
             §  CreateManyChildNodes : some impact (1.2x)
             §  DescendantSearch : almost no impact
             §  SetProperty : no impact
             §  SmallFileWrite : almost no impact
             §  UpdateManyChildNodes : almost no impact




adaptTo() 2012
Let’s improve it


§  It’s work in progress !
   §  Oak/Solr integration Jira:
      §  hOps://issues.apache.org/jira/browse/OAK-­‐307	
  
   §  Oak Github fork:
      §  hOps://github.com/Oeofili/jackrabbit-­‐oak	
  




adaptTo() 2012

Más contenido relacionado

La actualidad más candente

Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solrKnoldus Inc.
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrLucidworks (Archived)
 
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...Lucidworks
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0Erik Hatcher
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conferenceErik Hatcher
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6DEEPAK KHETAWAT
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query ParsingErik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 

La actualidad más candente (20)

Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
Apache SolrCloud
Apache SolrCloudApache SolrCloud
Apache SolrCloud
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache Solr
 
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...
Faster Data Analytics with Apache Spark using Apache Solr - Kiran Chitturi, L...
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak SearchEVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 

Similar a Oak / Solr integration

BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2longkeyy
 
ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataOpenSource Connections
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesRahul Singh
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesAnant Corporation
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
Taking eZ Find beyond full-text search
Taking eZ Find beyond  full-text searchTaking eZ Find beyond  full-text search
Taking eZ Find beyond full-text searchPaul Borgermans
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr CloudCominvent AS
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdutionXuan-Chao Huang
 
Oracle RAC One Node 12c Overview
Oracle RAC One Node 12c OverviewOracle RAC One Node 12c Overview
Oracle RAC One Node 12c OverviewMarkus Michalewicz
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyCominvent AS
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...harvraja
 
CQ5 QueryBuilder - .adaptTo(Berlin) 2011
CQ5 QueryBuilder - .adaptTo(Berlin) 2011CQ5 QueryBuilder - .adaptTo(Berlin) 2011
CQ5 QueryBuilder - .adaptTo(Berlin) 2011Alexander Klimetschek
 
GlassFish REST Administration Backend at JavaOne India 2012
GlassFish REST Administration Backend at JavaOne India 2012GlassFish REST Administration Backend at JavaOne India 2012
GlassFish REST Administration Backend at JavaOne India 2012Arun Gupta
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 

Similar a Oak / Solr integration (20)

BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
 
Big Search with Big Data Principles
Big Search with Big Data PrinciplesBig Search with Big Data Principles
Big Search with Big Data Principles
 
Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2
 
ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big Data
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taking eZ Find beyond full-text search
Taking eZ Find beyond  full-text searchTaking eZ Find beyond  full-text search
Taking eZ Find beyond full-text search
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
 
Oracle RAC One Node 12c Overview
Oracle RAC One Node 12c OverviewOracle RAC One Node 12c Overview
Oracle RAC One Node 12c Overview
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoy
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...
JavaOne-2013: Save Scarce Resources by Managing Terabytes of Objects off-heap...
 
CQ5 QueryBuilder - .adaptTo(Berlin) 2011
CQ5 QueryBuilder - .adaptTo(Berlin) 2011CQ5 QueryBuilder - .adaptTo(Berlin) 2011
CQ5 QueryBuilder - .adaptTo(Berlin) 2011
 
GlassFish REST Administration Backend at JavaOne India 2012
GlassFish REST Administration Backend at JavaOne India 2012GlassFish REST Administration Backend at JavaOne India 2012
GlassFish REST Administration Backend at JavaOne India 2012
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 

Más de Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

Más de Tommaso Teofili (16)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Último

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Oak / Solr integration

  • 1. APACHE SLING & FRIENDS TECH MEETUP BERLIN, 26-28 SEPTEMBER 2012 Oak / Solr integration Tommaso Teofili
  • 2. Agenda §  Why §  Search on Oak with Solr §  Solr based QueryIndex §  Solr based MK §  Benchmarks adaptTo() 2012
  • 3. Why §  Common need: §  Once you have content §  You (usually also) want search §  Both need to be: §  Fast   §  Scalable   §  Faul  tolerant   §  Less common need: §  Substitute / decorate internal default query engine for enhanced performance / expressivity adaptTo() 2012
  • 4. Why Solr  Search   Engine   Solr  MK   adaptTo() 2012
  • 5. Apache Solr 101 §  Apache project §  Enterprise search server §  Based on Apache Lucene §  HTTP API §  Easy and quick setup §  Scaling architectures adaptTo() 2012
  • 6. Solr simplest architecture HTTP  API   Container   Solr   C1   C2   Solr  Core  API   I1   I2   Lucene  API   adaptTo() 2012
  • 7. Solr sharded architecture §  Split a collection in N shards when indexing §  Search on both with distributed search: §  http://.../select?q=...&shards=10.1.1.10/solr/ c1a,10.1.1.11/solr/c1b docY   docX   C1a   C2a   C1b   C2b   docW   docZ   Solr  @  10.1.1.10   Solr  @  10.1.1.11   adaptTo() 2012
  • 8. Solr replicated architecture RR  Load  balancer   C1   C2   C1   C2   Solr  @  10.1.1.20   Solr  @  10.1.1.21   C1   C2   Solr  @  10.1.1.22   adaptTo() 2012
  • 9. SolrCloud* *: Sorry I was too lazy to re picture this from scratch adaptTo() 2012
  • 10. Search on Oak with Solr §  Content is created inside Oak §  Oak repository gets synchronized with Solr §  Solr handles search requests §  Separation of concerns §  Scaling independently: §  Oak  can  scale  for  CRUD  operaIons  on  repository   §  Solr  can  scale  for  indexing  /  search   adaptTo() 2012
  • 11. Synchronizing Oak and Solr §  Known options §  Push from Oak §  Pull with Solr (use Oak HTTP API with Solr DIH) §  Apache ManifoldCF ? §  ... adaptTo() 2012
  • 12. Push Oak commits to Solr §  CommitHook API §  evaluate repository status before and after a specific commit §  create a “diff” §  do something with the diff §  -­‐>  send  adds  /  deletes  to  Solr   §  The CommitHook is executed right before content is actually persisted §  See  hOps://github.com/apache/jackrabbit-­‐oak/blob/ trunk/doc/nodestate.md   adaptTo() 2012
  • 13. Push Oak commits to Solr // add the SolrCommitHook to the node store store.setHook(new SolrCommitHook()); // obtain the node store root for a specific workspace Root root = new RootImpl(store, “solr-test”, ...); // add a node with path ‘doc1’ and a text property root.getTree("/").addChild("doc1").setProperty("text", valueFactory.createValue("hit that hot hat tattoo")); // commit the change to the node store r.commit(DefaultConflictHandler.OURS); adaptTo() 2012
  • 14. Push Oak commits to Solr adaptTo() 2012
  • 15. Search on Oak with Solr - Scaling §  SoftCommitHook uses SolrJ’s SolrServer §  SolrServer solrServer = ... §  new  H-pSolrServer(“h-p://..:8983/solr”)  ;  //  standalone   §  new  LBH-pSolrServer(“h-p://..:8983/solr”,  “h-p://..: 7574/solr”);  //  replicaIon  –  sw  load  balancing   §  new  CloudSolrServer(“100.10.13.14:9983”);  //  SolrCloud   adaptTo() 2012
  • 16. What if commit fails? §  Could lead to not consistent status §  Oak commits not going through §  Related data indexed on Solr §  Use Observer §  contentChanged(NodeStore store, NodeState before, NodeState after); §  Node  state  is  just  read   §  Only  triggered  on  successfully  persisted  commits   adaptTo() 2012
  • 17. Solr based QueryIndex §  QueryIndex API §  Evaluate the query (Filter) cost §  Eventually view the query “plan” §  Execute the query against a specific revision and root node state adaptTo() 2012
  • 18. Solr based QueryIndex §  getCost(Filter); §  Property restrictions: §  Each  property  is  mapped  as  a  field   §  Can  use  term  queries  for  simple  value  matching   §  Can  use  range  queries  (with  trie  fields)  for  “first  to  last”   adaptTo() 2012
  • 19. Solr based QueryIndex §  getCost(Filter); §  Path restrictions §  Indexed  as  strings     §  And  as  paths  with  PathHierarchyTokenizerFactory   §  Ancestor  /  descendant  is  as  fast  as  exact  match   §  Direct  children  needs  special  handling   adaptTo() 2012
  • 20. Solr based QueryIndex §  ancestors adaptTo() 2012
  • 21. Solr based QueryIndex §  descendants adaptTo() 2012
  • 22. Solr based QueryIndex §  getCost(Filter); §  Full text conditions §  Easiest  use  case   §  Can  use  (E)DisMax  Solr  query  parser   –  Q=+term1  +term2  term3   –  And  fields  are  defined  by  Solr  configuraIon   adaptTo() 2012
  • 23. Solr based QueryIndex §  query(Filter filter, String revisionId, NodeState root); §  filter gets transformed in a SolrQuery §  root and revisionId can be used to map Oak revisions to Lucene / Solr commit points §  the resulting Cursor wraps the Solr query response adaptTo() 2012
  • 24. Custom indexes configuration §  Discussion is going on at : §  hOp://markmail.org/message/bdgi77dd6wy2hkbp   §  Something like:   /data [jcr:mixinTypes = oak:indexed]            /oak:indexes                /solr [jcr:primaryType = oak:solrIndex, oak:nodeType = nt:file, url = ...] adaptTo() 2012
  • 25. Extending default query syntax §  Combining JCR-SQL2 / XPath queries with full text search on Solr §  i.e.: select * from parent as p inner join child as c on issamenode(p, c, ['/a/b/c']) and solr(‘(+hit +tattoo^2) OR “hot tattoo”~2’) §  First part can be handled in Oak as usual §  Second part can be run on Solr §  Need to extend the Oak AST adaptTo() 2012
  • 26. Solr based MK §  MicroKernel API §  MVCC model §  Json based data model §  Possibly use some retention policy §  Basically noSQL §  Can Solr fit ? adaptTo() 2012
  • 27. Solr based MK §  MVCC §  Solr §  SVCC   §  per  document  opImisIc  lock  –  free  mechanism   §  related  issues  :  SOLR-­‐3178,  SOLR-­‐3173   §  SoluIons:   –  map  Oak  revisions  to  Lucene  commit  points   –  store  revisions  inside  Solr  documents   §  Json based data model §  already available in Solr adaptTo() 2012
  • 28. Solr based MK §  Retention policy §  Already available in Solr using Lucene commit points for revisions §  Leveraging IndexDeletionPolicy API adaptTo() 2012
  • 29. Solr based MK §  noSQL §  Solr §  not  relaIonal,  avoid  joins   §  no  fixed  schema  (  !=  schema.xml  )   §  scalability   §  See     –  hOp://searchhub.org/dev/2010/04/30/nosql-­‐lucene-­‐and-­‐solr/   –  hOp://searchhub.org/dev/2010/04/29/for-­‐the-­‐guardian-­‐solr-­‐is-­‐ the-­‐new-­‐database/     adaptTo() 2012
  • 30. Solr based MK §  Still a prototype implementation §  Pros §  Most features exist out of the box §  Scaling nicely §  Cons §  MVCC handling is not straightforward adaptTo() 2012
  • 31. Benchmarking §  SolrCommitHook with commit(): §  CreateManyChildNodes : some impact (1.4x) §  DescendantSearch : almost no impact §  SetProperty : huge impact (3x) §  SmallFileWrite : some impact (1.3x) §  UpdateManyChildNodes : almost no impact adaptTo() 2012
  • 32. Benchmarking §  SolrCommitHook with autoSoftCommit: §  CreateManyChildNodes : some impact (1.2x) §  DescendantSearch : almost no impact §  SetProperty : no impact §  SmallFileWrite : almost no impact §  UpdateManyChildNodes : almost no impact adaptTo() 2012
  • 33. Let’s improve it §  It’s work in progress ! §  Oak/Solr integration Jira: §  hOps://issues.apache.org/jira/browse/OAK-­‐307   §  Oak Github fork: §  hOps://github.com/Oeofili/jackrabbit-­‐oak   adaptTo() 2012