Taking eZ Find beyond
    full-text search
        Paul Borgermans




    eZ Summer Conference

    London, June 16-17, 2011




                               © 2011 Paul Borgermans
About me
●   10 years in the eZ ecosystem
         –   eZ Lucene → eZ Solr → eZ Find
         –   3.5 years with eZ Systems (2007-2010)
         –   Independent consultant since 2011
●   Fancying
         –   Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ..)
         –   NoSQL (Not only SQL) and scalable architectures
         –   eZ Publish & CMS systems in general
         –   Semantic web




Large sites?




Lots of traffic?




Per user navigation needs?




Complex pages?

Slow attribute filters?




Need to integrate data from other
            sources?



    ERP


              DB



eZ Find is your friend!




Although sometimes more like a rough diamond




Prelude




 Meet the beast ….




eZ Find

     RESTful




Solr in a nutshell
●   State of the art, advanced full text search and
    information retrieval engine
●   Fast, scalable with native replication features
●   Flexible configuration
●   Extensible
●   Document oriented storage
●   Geospatial search (Solr 3.1+)
●   Native cloud features*
    * under active development, almost complete (Solr 4.0)
Solr
[Figure: Solr architecture diagram. An HTTP request servlet and an update servlet sit on top. Below them: the admin interface; the standard, disjunction max and custom request handlers; the XML/PHP/JSON/... response writers; the XML update interface. These rest on the Solr core (config, schema, analysis, caching, concurrency, update handler, replication), which is built on Lucene.]
Figure credit: Yonik Seeley
Performance!
●   The backend Solr employs intelligent caches
        –   filters
        –   queries
        –   internal indexes
●   Optimized for search/retrieval
        –   Slower writing
●   When updates are done, caches are
    reconstructed on the fly in the background
●   Horizontal & vertical scaling

Using eZ Find/Solr beyond
         search




eZ Find alter egos
●   eZ Find/Solr as a scalable IR engine/layer
        –   Remove the burden on your DB
        –   Significant speedups also for regular content
        –   Clustering built-in
●   eZ Find/Solr as a content and integration
    engine
        –   Document oriented storage system
             (hello NoSQL)
        –   Archive use-case
        –   External content
eZ Find alter egos (...)
●   Alternate navigation interfaces
        –   Facets, filtering, sorting
        –   Function queries (!)

●   Document clustering
        –   More Like This
        –   Tag based
              (and more semantic stuff coming up)
        –   Carrot2 based


Provisions in eZ Find
●   Attribute storage (serialized content)
        – Fewer DB queries
●   Multi-core setup
●   Distributed search in fetch( ezfind, search )
        –   Query parameters
        –   Filter parameters
        –   Fields to return (for rendering)
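
The three kinds of parameters above map onto standard Solr request parameters: `q` for the query, `fq` for filters, `fl` for the fields to return, and `shards` for distributed search. A minimal sketch of how such a request URL could be assembled; the base URL, core names and the `build_select_url` helper are assumptions for illustration and not part of eZ Find:

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filters=(), fields=(), shards=()):
    """Assemble a Solr /select URL from query, filter, field list and shards."""
    params = [("q", query)]
    # Each filter is sent as its own fq parameter, so Solr can cache it on its own.
    params += [("fq", f) for f in filters]
    if fields:
        # fl limits the stored fields that come back (enough for rendering).
        params.append(("fl", ",".join(fields)))
    if shards:
        # shards lists the cores/servers taking part in a distributed search.
        params.append(("shards", ",".join(shards)))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, cores and field names.
url = build_select_url(
    "http://localhost:8983/solr",
    "london",
    filters=["tags_lk:2011"],
    fields=["id", "tags_lk"],
    shards=["localhost:8983/solr/core0", "localhost:8983/solr/core1"],
)
print(url)
```

Keeping filters in `fq` rather than folding them into `q` is what lets the filter cache mentioned on the performance slide do its work.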



Getting external data into Solr




Tools

●   Solr Data Import Handler (DIH)
●   Apache ManifoldCF (Manifold Connector Framework)
●   Using APIs
       –   eZ Find
       –   Zeta Components Search




Integrating external data: Solr DIH
    http://wiki.apache.org/solr/DataImportHandler
●   Goals
       –   Read data residing in relational databases and
            XML files
       –   Build Solr documents according to configuration
            (joins, views, ...)
       –   Update Solr with such documents
       –   Provide ability to do full imports ..
       –   .. as well as delta imports


Configuring DIH
●   Needs a more complete Solr: add the DIH jars
●   solrconfig.xml:
    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/username/data-config.xml</str>
      </lst>
    </requestHandler>




●   Configure data sources (RDBMS, XML files)
         –   data-config.xml with connection and schema
              information

Using DIH
●   Send commands to DIH request handler
            http://<host>:<port>/solr/dataimport?command=<command>



       –   full-import
       –   delta-import
       –   status
●   You can use the eZ Find raw Solr request API
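
The command URLs above can be sketched as a small helper. This is only an illustration: the function name `dih_url` is hypothetical, and the handler path assumes the `/dataimport` name registered in solrconfig.xml.

```python
from urllib.parse import urlencode

# The three commands named on the slide.
DIH_COMMANDS = {"full-import", "delta-import", "status"}


def dih_url(host, port, command, handler="dataimport"):
    """Build the URL that sends one command to the DIH request handler."""
    if command not in DIH_COMMANDS:
        raise ValueError("unsupported DIH command: %s" % command)
    query = urlencode({"command": command})
    return "http://%s:%s/solr/%s?%s" % (host, port, handler, query)


print(dih_url("localhost", 8983, "full-import"))
print(dih_url("localhost", 8983, "status"))
```

A plain HTTP GET on such a URL is enough to trigger the import; polling the `status` URL afterwards shows whether it is still running.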




Apache ManifoldCF
    http://incubator.apache.org/connectors/
●   ManifoldCF is a crawler framework
●   Supports:
       –   File System, Windows Shares
       –   JDBC, RSS
       –   Web, LiveLink (OpenText)
       –   Documentum (EMC)
       –   SharePoint (MSFT)
       –   Meridio (Autonomy)
       –   FileNet (IBM)
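With eZ Find API...

<?php
// Connect to a local Solr instance
$solr = new eZSolrBase('http://localhost:8983/solr');

// One array per document; multi-valued fields are nested arrays
$documents = array( array( 'id' => '1135',...
                           'tags_lk' =>
                              array('London','2011')));

// Add the documents one by one ...
foreach ($documents as $doc) {
    ezfSolrUtils::addDocument($solr, $doc);
}

// ... and commit once at the end so they become searchable
$solr->commit();

?>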



Or with Zeta Components Search
●   http://incubator.apache.org/zetacomponents/
       <?php
       require_once 'tutorial_autoload.php';
       // on localhost with the default port
       $handler = new ezcSearchSolrHandler;
       // on another host with a different port
       $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
       ?>




Indexing workflow
●   Assemble documents in the correct XML format
●   Send one or more documents at a time
●   Commit => it becomes searchable
●   Optional parameters
       –   Boosting at the document level
       –   Boosting at the field level
       –   Auto-commit heartbeat interval
              (commitWithin, in milliseconds)
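
The workflow above can be sketched by assembling a Solr XML update message by hand. The helper name `build_add_xml` and the sample values are assumptions for illustration. The `boost` attributes follow the XML update format of the Solr releases current at the time of this talk, so treat them as version dependent.

```python
import xml.etree.ElementTree as ET


def build_add_xml(documents, commit_within_ms=None, doc_boost=None, field_boosts=None):
    """Assemble an <add> message holding one or more documents."""
    add = ET.Element("add")
    if commit_within_ms is not None:
        # Ask Solr to commit by itself within this many milliseconds.
        add.set("commitWithin", str(commit_within_ms))
    for document in documents:
        doc = ET.SubElement(add, "doc")
        if doc_boost is not None:
            doc.set("boost", str(doc_boost))  # document-level boost
        for name, value in document.items():
            # Multi-valued fields are written as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                if field_boosts and name in field_boosts:
                    field.set("boost", str(field_boosts[name]))  # field-level boost
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document, reusing the id and tags from the eZ Find API slide.
xml_message = build_add_xml(
    [{"id": "1135", "tags_lk": ["London", "2011"]}],
    commit_within_ms=5000,
    doc_boost=2.0,
    field_boosts={"tags_lk": 1.5},
)
print(xml_message)
```

The resulting string is what gets POSTed to the update handler; without `commitWithin`, a separate commit is still needed before the documents show up in search results.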


Indexing workflow:
           important properties (...)
●   Update = Add with same global id
●   Deleting
       –   An individual document (id)
       –   A collection of documents (using a Solr query
             expression)
       –   Needs a commit() to really disappear from
            search results
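
The two delete variants, plus the commit that makes them effective, could look like this as Solr XML update messages. The helper names and the sample id and query are hypothetical.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Message that deletes one document by its global id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Message that deletes every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# Deleted documents only disappear from search results after a commit.
COMMIT_MESSAGE = "<commit/>"

messages = [delete_by_id("1135"), delete_by_query("tags_lk:London"), COMMIT_MESSAGE]
for message in messages:
    print(message)
```

An update needs no message of its own: adding a document with an id that already exists replaces the old one.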




Indexing: performance
                 considerations
●   Commits can become expensive
       –   Use them wisely: in batches where you can
       –   Delay options
               ●   cron job
               ●   commitWithin parameter
●   From time to time, you also need an optimize()
    command
       –   Deletes leave “holes”
       –   File fragmentation with adding/updating
       –   Daily, or weekly for very large indexes (multi-GB)
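
A minimal sketch of the "commit in batches" advice: documents are sent in chunks and a single commit follows at the end. The `send` and `commit` callables stand in for whatever client is used (eZ Find, Zeta Components or raw HTTP) and are assumptions, as is the batch size.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_batches(documents, send, commit, batch_size=100):
    """Send documents chunk by chunk and commit once, after the last chunk."""
    sent = 0
    for chunk in batches(documents, batch_size):
        send(chunk)
        sent += len(chunk)
    if sent:
        commit()  # one expensive commit instead of one per document
    return sent


# Stand-ins that only record what would have been sent to Solr.
sent_chunks = []
commits = []
total = index_in_batches(range(250), sent_chunks.append, lambda: commits.append(1))
print(total, [len(c) for c in sent_chunks], len(commits))
```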
But you will also need to configure Solr




Field definitions: schema.xml
●   Field types
        –   text
        –   numerical
        –   dates
        –   location
        –   … (about 25 in total)
●   Actual fields (name, definition, properties)
●   Dynamic fields
●   Copy fields (as aggregators)
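
Dynamic fields map a name pattern with a leading or trailing wildcard to a field type, so new fields need no schema change. The sketch below is a simplified illustration of that lookup. The field names, patterns and type names are hypothetical, and trying longer patterns first is an assumption about the matching order, not a statement about Solr internals.

```python
from fnmatch import fnmatchcase

# Hypothetical schema excerpt: explicit fields and dynamic field patterns.
EXPLICIT_FIELDS = {"id": "string"}
DYNAMIC_FIELDS = {"*_lk": "keyword_list", "*_dt": "tdate", "attr_*": "text"}


def resolve_field_type(name):
    """Return the field type for a field name, or None if nothing matches."""
    # Explicitly declared fields win over dynamic patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Assumption: try longer, more specific patterns first.
    for pattern in sorted(DYNAMIC_FIELDS, key=len, reverse=True):
        if fnmatchcase(name, pattern):
            return DYNAMIC_FIELDS[pattern]
    return None


for field_name in ("id", "tags_lk", "published_dt", "title"):
    print(field_name, resolve_field_type(field_name))
```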

schema.xml: simple field type examples
    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range
         queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="6"
               positionIncrementGap="0"/>

    <!-- A text field that only splits on whitespace for exact matching
         of words -->
    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>



Analysis
●   Solr does not really search your text, but rather
    the terms that result from the analysis of text
●   Typically a chain of
        –   Character filter(s)
        –   Tokenisation
        –   Filter A
        –   Filter B
        –   …
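
The chain above can be mimicked with a toy pipeline: one character filter, a whitespace tokenizer and two token filters. This is only a sketch of the idea and not Solr's implementation; the stop word list and function names are made up.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})  # hypothetical stop word list


def strip_markup(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Filter A: fold every token to lower case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Filter B: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the whole chain and return the terms that would be indexed."""
    tokens = whitespace_tokenize(strip_markup(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


terms = analyze("<b>The</b> Tower of London")
print(terms)
```

A search matches only if the query, run through a compatible chain, produces one of these terms: "TOWER" finds the document, while "The" no longer does.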



Solr comes with many tokenizers and
                   filters

●   Some are language specific
●   Others are very specialised
●   It is very important to get this right

    otherwise, you may not get what you expect!


       Best practice: do as eZ Find does and provide multiple incarnations of a field to
       suit facet, filter and search needs




Semantic aspects




Semantic aspects:
       using an annotation engine
●   Main use cases for CMS systems
       –   Suggest tags for editors to use
       –   Enhance search engine relevancy
       –   Enhance clustering (related content)
●   Based on
       –   Domain specific ontologies
       –   Publicly available databases and (RESTful)
            services



Annotation engine: “open”
       databases




eZ Publish / eZ Find integration
●   Personal initiative
        –   Joined an EC funded project as “early adopter”
●   Initial goals:
        –   eZ Find relevancy optimisation
        –   Annotation suggestions from public data
●   More ambitious
        –   eZ Publish based, domain specific ontology
             definition
        –   TBD, as Apache Stanbol evolves

Something extra ...




The eZ Publish content model
●   One of the main strengths
●   But
          –   Do you need versioning in all cases?
          –   Translations: quite tightly coupled
          –   Difficult to have workflows independent of the
                published version
          –   Variability in objects: sometimes too rigid
          –   Want traveling objects (UUID)
          –   ...
●   And of course: scalability of the implementation
    is limited too
So, a call for participation ….




A new content repository project
●   Provide a very powerful content model
        –   adaptable to various scenarios and use-cases
●   Expose a rich service layer, including an
    optional security model
        –   Role / policy based
●   Expose its content in a variety of ways
        –   Simple-to-use PHP API
        –   REST-style
        –   Later: various standards (PHPCR, CMIS)

A new content repository ...
●   Builds on top of an IR (information retrieval)
    layer
        –   initially Solr based
●   Pluggable persistence layer
        –   Traditional RDBMS
        –   Highly scalable NoSQL stores (HBase,
             MongoDB, CouchDB, ..)




Connects to eZ Publish through ..
●   eZ Find
●   Dedicated modules

    and after refactoring of the kernel

●   Use it as a content store for eZ Publish itself




Thank you!

   Questions?


http://joind.in/3443

paul.borgermans@gmail.com
@paulborgermans





Más contenido relacionado

La actualidad más candente

Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Hibernate ORM: Tips, Tricks, and Performance Techniques
Hibernate ORM: Tips, Tricks, and Performance TechniquesHibernate ORM: Tips, Tricks, and Performance Techniques
Hibernate ORM: Tips, Tricks, and Performance TechniquesBrett Meyer
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Orm and hibernate
Orm and hibernateOrm and hibernate
Orm and hibernates4al_com
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013Roy Russo
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Serialization and performance in Java
Serialization and performance in JavaSerialization and performance in Java
Serialization and performance in JavaStrannik_2013
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
Stardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF DatabaseStardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF Databasekendallclark
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
PostgreSQL Advanced Queries
PostgreSQL Advanced QueriesPostgreSQL Advanced Queries
PostgreSQL Advanced QueriesNur Hidayat
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of PigKorea Sdec
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchRafał Kuć
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into SparkAshish kumar
 

La actualidad más candente (20)

Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Hibernate ORM: Tips, Tricks, and Performance Techniques
Hibernate ORM: Tips, Tricks, and Performance TechniquesHibernate ORM: Tips, Tricks, and Performance Techniques
Hibernate ORM: Tips, Tricks, and Performance Techniques
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Orm and hibernate
Orm and hibernateOrm and hibernate
Orm and hibernate
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Php
PhpPhp
Php
 
Serialization and performance in Java
Serialization and performance in JavaSerialization and performance in Java
Serialization and performance in Java
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Stardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF DatabaseStardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF Database
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Html5v1
Html5v1Html5v1
Html5v1
 
PostgreSQL Advanced Queries
PostgreSQL Advanced QueriesPostgreSQL Advanced Queries
PostgreSQL Advanced Queries
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
 
Avik_RailsTutorial
Avik_RailsTutorialAvik_RailsTutorial
Avik_RailsTutorial
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
Content Modeling Behavior
Content Modeling BehaviorContent Modeling Behavior
Content Modeling Behavior
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into Spark
 

Similar a Taking eZ Find beyond full-text search

MongoDB at Sailthru: Scaling and Schema Design
MongoDB at Sailthru: Scaling and Schema DesignMongoDB at Sailthru: Scaling and Schema Design
MongoDB at Sailthru: Scaling and Schema DesignDATAVERSITY
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunktdthomassld
 
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Michael McIntosh
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02TNR Global
 
Expertezed 2012 Webcast - XML DB Use Cases
Expertezed 2012 Webcast - XML DB Use CasesExpertezed 2012 Webcast - XML DB Use Cases
Expertezed 2012 Webcast - XML DB Use CasesMarco Gralike
 
Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrierFlytxt
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecturedrewz lin
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecturemysqlops
 
Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01jgregory1234
 
Facebook的架构
Facebook的架构Facebook的架构
Facebook的架构yiditushe
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHPConFoo
 

Similar a Taking eZ Find beyond full-text search (20)

MongoDB at Sailthru: Scaling and Schema Design
MongoDB at Sailthru: Scaling and Schema DesignMongoDB at Sailthru: Scaling and Schema Design
MongoDB at Sailthru: Scaling and Schema Design
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunk
 
Hands on-solr
Hands on-solrHands on-solr
Hands on-solr
 
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
 
Core os dna_automacon
Core os dna_automaconCore os dna_automacon
Core os dna_automacon
 
Solr -
Solr - Solr -
Solr -
 
Expertezed 2012 Webcast - XML DB Use Cases
Expertezed 2012 Webcast - XML DB Use CasesExpertezed 2012 Webcast - XML DB Use Cases
Expertezed 2012 Webcast - XML DB Use Cases
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
 
Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrier
 
Otago vre-overview
Otago vre-overviewOtago vre-overview
Otago vre-overview
 
SOLR
SOLRSOLR
SOLR
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecture
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecture
 
Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01
 
Facebook的架构
Facebook的架构Facebook的架构
Facebook的架构
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHP
 

Último

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Taking eZ Find beyond full-text search

  • 1. Taking eZ Find beyond full-text search Paul Borgermans eZ Summer Conference London, June 16-17, 2011 © 2011 Paul Borgermans
  • 2. About me
      ● 10 years in the eZ ecosystem
          – eZ Lucene → eZ Solr → eZ Find
          – 3.5 years with eZ Systems (2007-2010)
          – Independent consultant since 2011
      ● Fancying
          – Apache projects (mainly Solr, Hadoop, Stanbol, Zeta, ..)
          – NoSQL (Not only SQL) and scalable architectures
          – eZ Publish & CMS systems in general
          – Semantic web
  • 3. Large sites?
  • 4. Lots of traffic?
  • 5. Per user navigation needs?
  • 6. Complex pages? Slow attribute filters?
  • 7. Need to integrate data from other sources? ERP DB
  • 8. eZ Find is your friend! Although sometimes more like a rough diamond
  • 9. Preludium Meet the beast ….
  • 10. eZ Find RESTful
  • 11. Solr in a nutshell
      ● State of the art, advanced full text search and information retrieval engine
      ● Fast, scalable with native replication features
      ● Flexible configuration
      ● Extensible
      ● Document oriented storage
      ● Geospatial search (Solr 3.1+)
      ● Native cloud features*
      * under active development, almost complete (Solr 4.0)
  • 12. Solr [Architecture diagram: HTTP Request Servlet and Update Servlet; Admin Interface; Standard, Disjunction Max and Custom Request Handlers; XML/PHP/JSON/... Response Writer; XML Update Interface; Solr Core with Config, Schema, Caching, Update Handler, Analysis, Concurrency and Replication; Lucene. Figure credit: Yonik Seeley]
  • 13. Performance!
      ● The backend Solr employs intelligent caches
          – filters
          – queries
          – internal indexes
      ● Optimized for search/retrieval
          – Slower writing
      ● When updates are done, caches are reconstructed on the fly in the background
      ● Horizontal & vertical scaling
  • 14. Using eZ Find/Solr beyond search
  • 15. eZ Find alter egos
      ● eZ Find/Solr as a scalable IR engine/layer
          – Remove the burden on your DB
          – Significant speedups also for regular content
          – Clustering built-in
      ● eZ Find/Solr as a content and integration engine
          – Document oriented storage system (hello NoSQL)
          – Archive use-case
          – External content
  • 16. eZ Find alter egos (...)
      ● Alternate navigation interfaces
          – Facets, filtering, sorting
          – Function queries (!)
      ● Document clustering
          – More Like This
          – Tag based (and more semantic stuff coming up)
          – Carrot2 based
  • 17. Provisions in eZ Find
      ● Attribute storage (serialized content)
          – Fewer DB queries
      ● Multi-core setup
      ● Distributed search in fetch(ezfind, search)
          – Query parameters
          – Filter parameters
          – Fields to return (for rendering)
  • 18.
  • 19. Getting external data into Solr
  • 20. Tools
      ● Solr Data Import Handler (DIH)
      ● Apache Manifold Connector framework
      ● Using APIs
          – eZ Find
          – Zeta Components Search
  • 21. Integrating external data: Solr DIH
      http://wiki.apache.org/solr/DataImportHandler
      ● Goals
          – Read data residing in relational databases and XML files
          – Build Solr documents according to configuration (joins, views, ...)
          – Update Solr with such documents
          – Provide ability to do full imports ..
          – .. as well as delta imports
  • 22. Configuring DIH
      ● Need a more complete Solr: add DIH jars
      ● solrconfig.xml:
          <requestHandler name="/dataimport"
                          class="org.apache.solr.handler.dataimport.DataImportHandler">
            <lst name="defaults">
              <str name="config">/home/username/data-config.xml</str>
            </lst>
          </requestHandler>
      ● Configure data sources (RDBMS, XML files)
          – data-config.xml with connection and schema information
  • 23. Using DIH
      ● Send commands to DIH request handler
          http://<host>:<port>/solr/dataimport?command=<command>
          – full-import
          – delta-import
          – status
      ● You can use eZ Find raw Solr request API
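The command URL pattern on this slide can be sketched in a few lines of Python (standard library only). This is a minimal illustration, not part of the original deck: the helper name `dih_url` and the `handler` default are assumptions chosen to match the `/dataimport` request handler configured on the previous slide, and only the three commands listed above are accepted.

```python
from urllib.parse import urlencode

# Commands named on the slide; DIH knows others, which this sketch ignores.
DIH_COMMANDS = ("full-import", "delta-import", "status")


def dih_url(host, port, command, handler="/dataimport", **params):
    """Build the URL for a DIH command (hypothetical helper, not a Solr API)."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unsupported DIH command: {command}")
    # Extra keyword arguments are appended as additional query parameters.
    query = urlencode({"command": command, **params})
    return f"http://{host}:{port}/solr{handler}?{query}"


if __name__ == "__main__":
    # Host and port are placeholders for a local Solr instance.
    print(dih_url("localhost", 8983, "full-import"))
    print(dih_url("localhost", 8983, "status"))
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, against a running Solr instance.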
  • 24. Apache Manifold CF
      http://incubator.apache.org/connectors/
      ● ManifoldCF is a crawler framework
      ● Supports:
          – File System, Windows Shares
          – JDBC, RSS
          – Web, LiveLink (OpenText)
          – Documentum (EMC)
          – SharePoint (MSFT)
          – Meridio (Autonomy)
          – FileNet (IBM)
  • 25. With eZ Find API...
      <?php
      $solr = new eZSolrBase('http://localhost:8983/solr');
      $documents = array(
          array( 'id' => '1135',...
                 'tags_lk' => array('London','2011')));
      foreach ($documents as $doc){
          ezfSolrUtils::addDocument($solr, $doc);
      }
      $solr->commit();
      ?>
  • 26. Or With Zeta Components Search
      ● http://incubator.apache.org/zetacomponents/
      <?php
      require_once 'tutorial_autoload.php';
      // on localhost with the default port
      $handler = new ezcSearchSolrHandler;
      // on another host with a different port
      $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
      ?>
  • 27. Indexing workflow
      ● Assemble documents in the correct XML format
      ● Send one or more documents at a time
      ● Commit => it becomes searchable
      ● Optional parameters
          – Boosting at the document level
          – Boosting at the field level
          – Auto-commit heartbeat interval (commitWithin, milliseconds)
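To make "the correct XML format" concrete, here is a minimal Python sketch that assembles a Solr `<add>` update message with the optional parameters from this slide. It assumes the XML update syntax of the Solr versions contemporary with the talk (document-level and field-level `boost` attributes, `commitWithin` on `<add>`); the function name `build_add_message` and its arguments are hypothetical, and the sample document reuses the `id` and `tags_lk` values from the eZ Find API slide.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs, commit_within_ms=None, doc_boost=None, field_boosts=None):
    """Serialize a list of dicts into a Solr <add> XML message (illustrative helper)."""
    add = ET.Element("add")
    if commit_within_ms is not None:
        # commitWithin: ask Solr to commit within this many milliseconds.
        add.set("commitWithin", str(commit_within_ms))
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        if doc_boost is not None:
            doc_el.set("boost", str(doc_boost))  # document-level boost
        for name, value in doc.items():
            # Multi-valued fields are emitted as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", {"name": name})
                if field_boosts and name in field_boosts:
                    field.set("boost", str(field_boosts[name]))  # field-level boost
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    message = build_add_message(
        [{"id": "1135", "tags_lk": ["London", "2011"]}],
        commit_within_ms=5000,
    )
    print(message)
```

The message would be POSTed to the Solr update handler; with `commitWithin` set, no explicit commit is needed for the documents to become searchable within the stated interval.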
  • 28. Indexing workflow: important properties (...)
      ● Update = Add with same global id
      ● Deleting
          – An individual document (id)
          – A collection of documents (using a Solr query expression)
          – Needs a commit() to really disappear from search results
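The two deletion styles map onto two small XML update messages. The following Python sketch builds them with the standard library; the helper names are hypothetical, and the example query `tags_lk:London` is only an assumed field and value borrowed from the earlier eZ Find snippet.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Message that deletes one document by its unique id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Message that deletes every document matching a Solr query expression."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# A delete only disappears from search results after a commit message like this.
COMMIT_MESSAGE = "<commit/>"

if __name__ == "__main__":
    print(delete_by_id("1135"))
    print(delete_by_query("tags_lk:London"))
    print(COMMIT_MESSAGE)
```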
  • 29. Indexing: performance considerations
      ● Commits can become expensive
          – Use them wisely: in batches where you can
          – Delay options
              ● cron job
              ● commitWithin parameter
      ● From time to time, also need an optimize() command
          – Deletes leave “holes”
          – File fragmentation with adding/updating
          – Daily, weekly for very large indexes (multi GB)
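The "commit in batches" advice can be sketched as follows. This is an illustration under stated assumptions, not eZ Find code: `send` is a hypothetical callable standing in for whatever posts update messages to Solr, and the batch size of 100 is arbitrary.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_batches(docs, send, batch_size=100):
    """Send documents in batches and commit once at the end.

    `send(kind, payload)` is a placeholder for the real transport.
    Returns the number of add batches sent.
    """
    sent = 0
    for chunk in batches(docs, batch_size):
        send("add", chunk)
        sent += 1
    # One commit for the whole run instead of one per document.
    send("commit", None)
    return sent


if __name__ == "__main__":
    log = []
    documents = [{"id": str(i)} for i in range(250)]
    count = index_in_batches(documents, lambda kind, payload: log.append(kind))
    print(count, log)
```

Committing once per run keeps the expensive cache rebuilds to a minimum; a periodic optimize would be scheduled separately, for example from a cron job.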
  • 30. But you will also need to configure Solr
  • 31. Field definitions: schema.xml
      ● Field types
          – text
          – numerical
          – dates
          – location
          – … (about 25 in total)
      ● Actual fields (name, definition, properties)
      ● Dynamic fields
      ● Copy fields (as aggregators)
  • 32. schema.xml: simple field type examples
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
      <!-- boolean type: "true" or "false" -->
      <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
      <!-- A Trie based date field for faster date range queries and date faceting. -->
      <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
      <!-- A text field that only splits on whitespace for exact matching of words -->
      <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
  • 33. Analysis
      ● Solr does not really search your text, but rather the terms that result from the analysis of text
      ● Typically a chain of
          – Character filter(s)
          – Tokenisation
          – Filter A
          – Filter B
          – …
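The chain described on this slide can be imitated with a toy pipeline in Python. This is not how Solr implements analysis; it is a simplified sketch in which the character filter, tokenizer, filters and stopword list are all illustrative choices. It shows why a query only matches when it is analyzed into the same terms as the indexed text.

```python
import re


def strip_markup(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Filter A: normalize case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Filter B: drop very common words (toy stopword list)."""
    return [token for token in tokens if token not in stopwords]


def analyze(text):
    """Run the whole chain: character filter, tokenizer, then token filters in order."""
    tokens = whitespace_tokenize(strip_markup(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    indexed_terms = analyze("<b>The</b> Summer Conference of London")
    query_terms = analyze("LONDON")
    print(indexed_terms)
    print(all(term in indexed_terms for term in query_terms))
```

Swapping or removing a filter changes the terms that end up in the index, which is why one field is often analyzed in several ways for searching, filtering and faceting.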
  • 34. Solr comes with many tokenizers and filters
      ● Some are language specific
      ● Others are very specialised
      ● It is very important to get this right; otherwise, you may not get what you expect!
      Best practice: do as eZ Find does and provide multiple incarnations to suit facet, filter and search needs
  • 35. Semantic aspects
  • 36. Semantic aspects: using an annotation engine
      ● Main use cases for CMS systems
          – Suggest tags to use for editors
          – Enhance search engine relevancy
          – Enhance clustering (related content)
      ● Based on
          – Domain specific ontologies
          – Publicly available databases and (RESTful) services
  • 37. Annotation engine: “open” databases
  • 38. eZ Publish / eZ Find integration
      ● Personal initiative
          – Joined an EC funded project as “early adopter”
      ● Initial goals:
          – eZ Find relevancy optimisation
          – Annotation suggestions from public data
      ● More ambitious
          – eZ Publish based, domain specific ontology definition
          – TBD, as Apache Stanbol evolves
  • 39. Something extra ...
  • 40. The eZ Publish content model
      ● One of the main strengths
      ● But
          – Do you need versioning in all cases?
          – Translations: quite tightly coupled
          – Difficulties to have workflows independent of the published version
          – Variability in objects: sometimes too rigid
          – Want traveling objects (UUID)
          – ...
      ● And of course: scalability of the implementation is limited too
  • 41. So, a call for participation ….
  • 42. A new content repository project
      ● Provides a very powerful content model
          – adaptable to various scenarios and use-cases
      ● Exposes a rich service layer, including an optional security model
          – Role / policy based
      ● Exposes its content in a variety of ways
          – Simple-to-use PHP API
          – REST-style
          – Later: various standards (PHPCR, CMIS)
  • 43. A new content repository ...
      ● Builds on top of an IR (information retrieval) layer
          – initially Solr based
      ● Pluggable persistence layer
          – Traditional RDBMS
          – Highly scalable NoSQL stores (HBase, MongoDB, CouchDB, ..)
  • 44. Connects to eZ Publish through ..
      ● eZ Find
      ● Dedicated modules and after refactoring of the kernel
      ● Use it as a content store for eZ Publish itself
  • 45. Thank you! Questions? http://joind.in/3443 paul.borgermans@gmail.com @paulborgermans