SlideShare una empresa de Scribd logo
1 de 43
Apache Solr 1.4
Faster, Easier, and More Versatile than Ever




                                               1
About Erik

• Member of Technical Staff,
  Lucid Imagination
• Co-author "Lucene in Action"
• Frequent speaker at industry conferences
• Committer: Lucene and Solr
• Apache Lucene PMC member

                                             2
Lucene



         3
Lucene 2.9
• IndexReader#reopen()
• Faster filter performance, by 300% in some
  cases
• Per-segment FieldCache
• Reusable token streams
• Faster numeric/date range queries, thanks to
  trie
• and tons more, see Lucene 2.9's CHANGES.txt
                                                 4
reopen()




           5
Trie fields

• Trie* fields index multiple precisions steps
• Works for numerics & dates
• Result: up to 40x faster than standard range
  queries
• Configurable precision step


                                                 6
Trie Example




Trie-Range Prefix Tree Encoding and relevant nodes for range [215 TO 977]




                                                                           7
Solr



       8
New Logo!




            9
Performance Improvements

• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ

                                      10
Caching
• LRU cache now based on
  ConcurrentHashMap
• Reads are lockless, writes are partitioned
• Minimizes overhead of synchronization
• Improves filterCache, queryCache, and
  documentCache
• Anecdotal evidence: can double query
  throughput in some circumstances


                                               11
Concurrent file access


• Defaults to NIOFSDirectory on non-
  Windows platforms for better concurrency
• Pluggable DirectoryProvider



                                             12
Per-segment caching


• FieldCache per segment
• Improves searching, sorting, and function
  queries




                                              13
Faceting performance


• Major performance improvements on
  multi-valued fields!
• http://yonik.wordpress.com/2008/11/25/
  solr-faceted-search-performance-
  improvements/




                                           14
SolrJ
• Binary updates (bye bye XML)
• StreamingUpdateSolrServer
 • Streams multiple documents over
    multiple connections
  • Simple test went from 231 docs/sec to
    25000 docs/sec!
• LBHttpSolrServer
 • load balancing / failover
                                            15
Feature Improvements
•   Rich document indexing

•   DataImportHandler enhancements

•   Smoother replication

•   More choices for logging

•   Multi-select faceting

•   Speedier range queries

•   Duplicate detection

•   New request handler components


                                     16
Solr Cell

• Content Extraction Library L ?
• Richly extracts text from Microsoft Office,
  PDF, HTML, and many other formats
• Powered by Apache Tika
• http://wiki.apache.org/solr/
  ExtractingRequestHandler



                                               17
DataImportHandler
•   SQL deltaImportQuery

•   abort, skip, continue options

•   FieldReader

    •   example: can read XML from CLOB

•   HTML strip transformer

•   Event listener API (import start/end)

•   ContentStreamDataSource

•   MailEntityProcessor

•   LineEntityProcessor, FileListEntityProcessor




                                                   18
Replication
•   Solr 1.3: rsync-based
•   Solr 1.4: Java-based, request handler
•   Copy config files, and optionally rename
•   Basic authentication support
•   Custom Solr index deletion policy enables
    deletion of commit points on various criteria
    such as number of commits, age of commit
    point and optimized status (pluggable)


                                                    19
Replication Configuration
Master
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

Slave
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">
      http://masterhostname:8983/solr/replication
    </str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
                                                                       20
New faceting features


• multi-select
 • tag a filter query (fq)
 • exclude a tagged filter from facet counts
• key - for labeling response structure


                                              21
q=solr+1.4&facet=true&fq={!tag=proj}project:(lucene OR solr)&facet.field={!ex=proj}project

                                                                                             22
Deduplication

• Detect duplicates during indexing and
  handle them
• Adds a signature field to the document
  (could be uniqueKey)
• Exact (hash on certain fields) or Fuzzy
  duplicate detection
• http://wiki.apache.org/solr/Deduplication

                                              23
commit/rollback



• commitWithin
• rollback



                         24
Analysis
•   CharFilter
                                      •   avoid splitting or
                                          dropping international
•   DoubleMetaphone
                                          non-letter characters
•   PersianAnalyzer,                      such as non spacing
                                          marks.
    ArabicAnalyzer,
    SmartChineseAnalyzer
                                  •   PositionFilter support
•   WordDelimiterFilter
                                  •   SnowballPorterFilterFactory
    •   splitOnNumerics               - protected words support

    •   protected words support   •   HTMLStripCharFilter

    •   stemEnglishPossessive     •   CommonGrams




                                                                    25
omitTermFreqAndPositions


• Omits number of terms in that specific field
  & list of positions
• Saves time and index space for non-text
  fields




                                               26
Trie Numeric Field Types
<fieldType name="tint" class="solr.TrieIntField"
           precisionStep="8" omitNorms="true"
           positionIncrementGap="0"/>

<fieldType name="tfloat" class="solr.TrieFloatField"
           precisionStep="8" omitNorms="true"
           positionIncrementGap="0"/>

<fieldType name="tlong" class="solr.TrieLongField"
           precisionStep="8" omitNorms="true"
           positionIncrementGap="0"/>

<fieldType name="tdouble" class="solr.TrieDoubleField"
           precisionStep="8" omitNorms="true"
           positionIncrementGap="0"/>

                                                       27
Trie Date Field Types


<fieldType name="date" class="solr.TrieDateField"
           omitNorms="true" precisionStep="0"
           positionIncrementGap="0"/>

<fieldType name="tdate" class="solr.TrieDateField"
           omitNorms="true" precisionStep="6"
           positionIncrementGap="0"/>




                                                     28
Wildcard handling
•   ReversedWildcardFilterFactory

•   Index-time reversal, query-time handling

•   Uses special marker for end

•   Example:

    •   Document: <field   name="text_rev">Solr</field>

    •   Indexed: #rlos

    •   Query: *olr -> #rlo*

    •   # used here to denote marker character



                                                          29
Function queries

• milliseconds: ms()
• subtraction: sub()
• {!frange l=6 u=9}sqrt(sum(a,b))
 • http://yonik.wordpress.com/2009/07/06/
    ranges-over-functions-in-solr-1-4/




                                            30
Stats Component

• min, max, sum, sumOfSquares, count,
    missing, mean, stddev
• numeric fields only
•   http://localhost:8983/solr/select?
    q=*:*&stats=true&stats.field=price&stats.field=popularity
    &rows=0&indent=true




                                                              31
Terms Component


• Return indexed terms+docfreq in a field,
    use for auto-suggest, etc
•   http://localhost:8983/solr/terms?
    terms.fl=name&terms.lower=a&terms.sort=index




                                                  32
Term Vector Component


• Returns term info per document (tf,
    positions)
•   http://localhost:8983/solr/select/?q=*
    %3A*&version=2.2&start=0&rows=10&indent=on&qt=tvrh&
    tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true




                                                                 33
Field & Document Request Handlers




• Provides similar capabilites as Solr's admin
  analysis tool, but with response in Solr's
  flexible formats




                                                 34
Clustering

• Implemented with Carrot2
• Uses Carrot2 to dynamically cluster the
  top N search results
• Like dynamically discovered facets
• http://wiki.apache.org/solr/
  ClusteringComponent



                                            35
Example Clustering Output
    <lst name="cluster">
      <lst name="labels">
             <str name="label">Car Power Adapter</str>
      </lst>
      <lst name="docs">
             <str name="doc">F8V7067-APL-KIT</str>
             <str name="doc">IW-02</str>
      </lst>
     </lst>
     <lst name="cluster">
      <lst name="labels">
             <str name="label">Display</str>
      </lst>
      <lst name="docs">
             <str name="doc">MA147LL/A</str>
             <str name="doc">VA902B</str>
      </lst>
     </lst>
     <lst name="cluster">
      <lst name="labels">
             <str name="label">Hard Drive</str>
      </lst>
      <lst name="docs">
             <str name="doc">SP2514N</str>
             <str name="doc">6H500F0</str>
      </lst>
     </lst>
     <lst name="cluster">
      <lst name="labels">
             <str name="label">Retail</str>
      </lst>
      <lst name="docs">
             <str name="doc">TWINX2048-3200PRO</str>
             <str name="doc">VS1GB400C3</str>            36
Distributed Search


• facet.sort=lex
• timeout support: shard-socket-timeout &
  shard-connection-timeout




                                            37
VelocityResponseWriter

• celeritas: swiftness, speed (Latin), origin of
  the symbol "c" for the speed of light
• solritas:Velocity template rendering of Solr
  responses
• Useful for rapid prototyping and more


                                                   38
AJAX-Solr


• Newest Solr/AJAX library
• Improves upon the old SolrJS library that
  was to be in Solr 1.4
• http://github.com/evolvingweb/AJAX-Solr/


                                              39
Odds & Ends
•   omitHeader
                                •   binary field (Base64)
•   logging switched to SLF4J
                                •   Highlighter:
•   spellcheck rebuild on
                                    •   field globbing:
    optimize, if configured
                                        hl.fl=*_text
•   maxChars on copyField
                                    •   now supports range/
•   Nested query support in             wildcard/fuzzy/prefix
    function query parser
                                    •   FieldCache stats
•   expungeDeletes                      exposed to stats.jsp
                                        and JMX
•   XInclude support

•   multicore merge
                                    •   plugins: enable flag



                                                               40
Upgrading from 1.3
• omitTermFreqAndPositions
 • set version from 1.1 to 1.2 in schema.xml
• default query parser syntax no longer
  supports ;sort options, use &sort= instead,
  or switch defType=lucenePlusSort
• Potential analysis differences when using
  WordDelimiterFilterFactory (SOLR-1078)
• Reindexing not required, but can't hurt.
                                                41
Book




       42
lucidimagination.com

                       43

Más contenido relacionado

La actualidad más candente

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 

La actualidad más candente (20)

Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScript
 
ERRest and Dojo
ERRest and DojoERRest and Dojo
ERRest and Dojo
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
Native Code, Off-Heap Data & JSON Facet API for Solr (Heliosearch)
 
ERRest - Designing a good REST service
ERRest - Designing a good REST serviceERRest - Designing a good REST service
ERRest - Designing a good REST service
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013
 
ActiveRecord Query Interface (1), Season 1
ActiveRecord Query Interface (1), Season 1ActiveRecord Query Interface (1), Season 1
ActiveRecord Query Interface (1), Season 1
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
ERRest in Depth
ERRest in DepthERRest in Depth
ERRest in Depth
 
Analytics and Graph Traversal with Solr - Yonik Seeley, Cloudera
Analytics and Graph Traversal with Solr - Yonik Seeley, ClouderaAnalytics and Graph Traversal with Solr - Yonik Seeley, Cloudera
Analytics and Graph Traversal with Solr - Yonik Seeley, Cloudera
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 

Destacado

Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search Problem
Lucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
Lucidworks (Archived)
 

Destacado (20)

Beyond text similarity
Beyond text similarityBeyond text similarity
Beyond text similarity
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search Problem
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Updated: Preparing an investor presentation
Updated:  Preparing an investor presentationUpdated:  Preparing an investor presentation
Updated: Preparing an investor presentation
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Open Source Search Applications
Open Source Search ApplicationsOpen Source Search Applications
Open Source Search Applications
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
Solr lucene search revolution
Solr lucene search revolutionSolr lucene search revolution
Solr lucene search revolution
 
Davis mark advanced search analytics in 20 minutes
Davis mark   advanced search analytics in 20 minutesDavis mark   advanced search analytics in 20 minutes
Davis mark advanced search analytics in 20 minutes
 
Updated: You Have An Idea ... Do You Have A Business?
Updated: You Have An Idea ...  Do You Have A Business?Updated: You Have An Idea ...  Do You Have A Business?
Updated: You Have An Idea ... Do You Have A Business?
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
 
Discover the new techniques about search application
Discover the new techniques about search applicationDiscover the new techniques about search application
Discover the new techniques about search application
 
Already, just, still, yet
Already, just, still, yetAlready, just, still, yet
Already, just, still, yet
 
IE のサポート変更が Azure に及ぼす影響
IE のサポート変更が Azure に及ぼす影響IE のサポート変更が Azure に及ぼす影響
IE のサポート変更が Azure に及ぼす影響
 
E learning At The Library
E learning At The LibraryE learning At The Library
E learning At The Library
 
Windows 8 で魅力的なWeb サイトを作る
Windows 8 で魅力的なWeb サイトを作るWindows 8 で魅力的なWeb サイトを作る
Windows 8 で魅力的なWeb サイトを作る
 
Adobe Photoshop
Adobe PhotoshopAdobe Photoshop
Adobe Photoshop
 

Similar a Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever

Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
JSGB
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0
Erik Hatcher
 

Similar a Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever (20)

Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Apache Solr for begginers
Apache Solr for begginersApache Solr for begginers
Apache Solr for begginers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
OrigoDB - take the red pill
OrigoDB - take the red pillOrigoDB - take the red pill
OrigoDB - take the red pill
 
What's New in Apache Solr 4.10
What's New in Apache Solr 4.10What's New in Apache Solr 4.10
What's New in Apache Solr 4.10
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
 
Oracle by Muhammad Iqbal
Oracle by Muhammad IqbalOracle by Muhammad Iqbal
Oracle by Muhammad Iqbal
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 

Más de Lucidworks (Archived)

Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Lucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Lucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Lucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Lucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Lucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Lucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 

Más de Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Último (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever

  • 1. Apache Solr 1.4 Faster, Easier, and More Versatile than Ever 1
  • 2. About Erik • Member of Technical Staff, Lucid Imagination • Co-author "Lucene in Action" • Frequent speaker at industry conferences • Committer: Lucene and Solr • Apache Lucene PMC member 2
  • 3. Lucene 3
  • 4. Lucene 2.9 • IndexReader#reopen() • Faster filter performance, by 300% in some cases • Per-segment FieldCache • Reusable token streams • Faster numeric/date range queries, thanks to trie • and tons more, see Lucene 2.9's CHANGES.txt 4
  • 6. Trie fields • Trie* fields index multiple precisions steps • Works for numerics & dates • Result: up to 40x faster than standard range queries • Configurable precision step 6
  • 7. Trie Example Trie-Range Prefix Tree Encoding and relevant nodes for range [215 TO 977] 7
  • 8. Solr 8
  • 10. Performance Improvements • Caching • Concurrent file access • Per-segment index updates • Faceting • DocSet generation, avoids scoring • Streaming updates for SolrJ 10
  • 11. Caching • LRU cache now based on ConcurrentHashMap • Reads are lockless, writes are partitioned • Minimizes overhead of synchronization • Improves filterCache, queryCache, and documentCache • Anecdotal evidence: can double query throughput in some circumstances 11
  • 12. Concurrent file access • Defaults to NIOFSDirectory on non- Windows platforms for better concurrency • Pluggable DirectoryProvider 12
  • 13. Per-segment caching • FieldCache per segment • Improves searching, sorting, and function queries 13
  • 14. Faceting performance • Major performance improvements on multi-valued fields! • http://yonik.wordpress.com/2008/11/25/ solr-faceted-search-performance- improvements/ 14
  • 15. SolrJ • Binary updates (bye bye XML) • StreamingUpdateSolrServer • Streams multiple documents over multiple connections • Simple test went from 231 docs/sec to 25000 docs/sec! • LBHttpSolrServer • load balancing / failover 15
  • 16. Feature Improvements • Rich document indexing • DataImportHandler enhancements • Smoother replication • More choices for logging • Multi-select faceting • Speedier range queries • Duplicate detection • New request handler components 16
  • 17. Solr Cell • Content Extraction Library L ? • Richly extracts text from Microsoft Office, PDF, HTML, and many other formats • Powered by Apache Tika • http://wiki.apache.org/solr/ ExtractingRequestHandler 17
  • 18. DataImportHandler • SQL deltaImportQuery • abort, skip, continue options • FieldReader • example: can read XML from CLOB • HTML strip transformer • Event listener API (import start/end) • ContentStreamDataSource • MailEntityProcessor • LineEntityProcessor, FileListEntityProcessor 18
  • 19. Replication • Solr 1.3: rsync-based • Solr 1.4: Java-based, request handler • Copy config files, and optionally rename • Basic authentication support • Custom Solr index deletion policy enables deletion of commit points on various criteria such as number of commits, age of commit point and optimized status (pluggable) 19
  • 20. Replication Configuration Master <requestHandler name="/replication" class="solr.ReplicationHandler"> <lst name="master"> <str name="replicateAfter">commit</str> <str name="confFiles">schema.xml,stopwords.txt</str> </lst> </requestHandler> Slave <requestHandler name="/replication" class="solr.ReplicationHandler"> <lst name="slave"> <str name="masterUrl"> http://masterhostname:8983/solr/replication </str> <str name="pollInterval">00:00:60</str> </lst> </requestHandler> 20
  • 21. New faceting features • multi-select • tag a filter query (fq) • exclude a tagged filter from facet counts • key - for labeling response structure 21
  • 23. Deduplication • Detect duplicates during indexing and handle them • Adds a signature field to the document (could be uniqueKey) • Exact (hash on certain fields) or Fuzzy duplicate detection • http://wiki.apache.org/solr/Deduplication 23
  • 25. Analysis • CharFilter • avoid splitting or dropping international • DoubleMetaphone non-letter characters • PersianAnalyzer, such as non spacing marks. ArabicAnalyzer, SmartChineseAnalyzer • PositionFilter support • WordDelimiterFilter • SnowballPorterFilterFactory • splitOnNumerics - protected words support • protected words support • HTMLStripCharFilter • stemEnglishPossessive • CommonGrams 25
  • 26. omitTermFreqAndPositions • Omits number of terms in that specific field & list of positions • Saves time and index space for non-text fields 26
  • 27. Trie Numeric Field Types <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/> <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/> <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/> <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/> 27
  • 28. Trie Date Field Types <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/> <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/> 28
  • 29. Wildcard handling • ReversedWildcardFilterFactory • Index-time reversal, query-time handling • Uses special marker for end • Example: • Document: <field name="text_rev">Solr</field> • Indexed: #rlos • Query: *olr -> #rlo* • # used here to denote marker character 29
  • 30. Function queries • milliseconds: ms() • subtraction: sub() • {!frange l=6 u=9}sqrt(sum(a,b)) • http://yonik.wordpress.com/2009/07/06/ ranges-over-functions-in-solr-1-4/ 30
  • 31. Stats Component • min, max, sum, sumOfSquares, count, missing, mean, stddev • numeric fields only • http://localhost:8983/solr/select? q=*:*&stats=true&stats.field=price&stats.field=popularity &rows=0&indent=true 31
  • 32. Terms Component • Return indexed terms+docfreq in a field, use for auto-suggest, etc • http://localhost:8983/solr/terms? terms.fl=name&terms.lower=a&terms.sort=index 32
  • 33. Term Vector Component • Returns term info per document (tf, positions) • http://localhost:8983/solr/select/?q=* %3A*&version=2.2&start=0&rows=10&indent=on&qt=tvrh& tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true 33
  • 34. Field & Document Request Handlers • Provides similar capabilites as Solr's admin analysis tool, but with response in Solr's flexible formats 34
  • 35. Clustering • Implemented with Carrot2 • Uses Carrot2 to dynamically cluster the top N search results • Like dynamically discovered facets • http://wiki.apache.org/solr/ ClusteringComponent 35
  • 36. Example Clustering Output <lst name="cluster"> <lst name="labels"> <str name="label">Car Power Adapter</str> </lst> <lst name="docs"> <str name="doc">F8V7067-APL-KIT</str> <str name="doc">IW-02</str> </lst> </lst> <lst name="cluster"> <lst name="labels"> <str name="label">Display</str> </lst> <lst name="docs"> <str name="doc">MA147LL/A</str> <str name="doc">VA902B</str> </lst> </lst> <lst name="cluster"> <lst name="labels"> <str name="label">Hard Drive</str> </lst> <lst name="docs"> <str name="doc">SP2514N</str> <str name="doc">6H500F0</str> </lst> </lst> <lst name="cluster"> <lst name="labels"> <str name="label">Retail</str> </lst> <lst name="docs"> <str name="doc">TWINX2048-3200PRO</str> <str name="doc">VS1GB400C3</str> 36
  • 37. Distributed Search • facet.sort=lex • timeout support: shard-socket-timeout & shard-connection-timeout 37
  • 38. VelocityResponseWriter • celeritas: swiftness, speed (Latin), origin of the symbol "c" for the speed of light • solritas:Velocity template rendering of Solr responses • Useful for rapid prototyping and more 38
  • 39. AJAX-Solr • Newest Solr/AJAX library • Improves upon the old SolrJS library that was to be in Solr 1.4 • http://github.com/evolvingweb/AJAX-Solr/ 39
  • 40. Odds & Ends • omitHeader • binary field (Base64) • logging switched to SLF4J • Highlighter: • spellcheck rebuild on • field globbing: optimize, if configured hl.fl=*_text • maxChars on copyField • now supports range/ • Nested query support in wildcard/fuzzy/prefix function query parser • FieldCache stats • expungeDeletes exposed to stats.jsp and JMX • XInclude support • multicore merge • plugins: enable flag 40
  • 41. Upgrading from 1.3 • omitTermFreqAndPositions • set version from 1.1 to 1.2 in schema.xml • default query parser syntax no longer supports ;sort options, use &sort= instead, or switch defType=lucenePlusSort • Potential analysis differences when using WordDelimiterFilterFactory (SOLR-1078) • Reindexing not required, but can't hurt. 41
  • 42. Book 42