Solr Presentation

Open source Search

  1. 1. Solr
  2. 2. Solr Windows Installation
     • Download and install Tomcat for Windows using the MSI installer. Install it with the tcnative.dll file. Say you installed it in c:\tomcat
     • Check that Tomcat is installed correctly by going to http://localhost:8080/
     • Edit the c:\tomcat\conf\server.xml file to add the URIEncoding attribute to the Connector element, as shown here:
       <Connector port="8080" maxHttpHeaderSize="8192" URIEncoding="UTF-8"
                  maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
                  enableLookups="false" redirectPort="8443" acceptCount="100"
                  connectionTimeout="20000" disableUploadTimeout="true" />
     • Download and unzip the Solr distribution zip file into (say) c:\temp\solrZip
  3. 3. Solr Windows Installation
     • Make a directory called solr where you intend the application server to function, say c:\web\solr
     • Copy the contents of the example\solr directory, c:\temp\solrZip\example\solr, to c:\web\solr
     • Stop the Tomcat service
     • Copy the *solr*.war file from c:\temp\solrZip\dist to the Tomcat webapps directory c:\tomcat\webapps
     • Rename the *solr*.war file to solr.war
     • Use the system tray icon to configure Tomcat to start with the following Java option: -Dsolr.solr.home=c:\web\solr
     • As an alternative to the previous step, go to C:\tomcat\conf\Catalina\localhost and create a file named "solr.xml" containing the line of code given in the slide notes. To run the server this way, do not keep solr.war in Tomcat's webapps folder; keep it in some other folder instead. In this case it is kept at ${catalina.home}/newsolr/solr.war.
     • Start the Tomcat service
     • Go to the Solr admin page to verify that the installation is working. It will be at http://localhost:8080/solr/admin
  4. 4.
     • In Solr and Lucene, an index is built of one or more Documents. A Document consists of one or more Fields. A Field consists of a name, content, and metadata telling Solr how to handle the content. For instance, Fields can contain strings, numbers, booleans, or dates, as well as any types you wish to add. A Field can be described using a number of options that tell Solr how to treat the content during indexing and searching.
  5. 5. Field attributes (attribute name: description)
     • indexed: Indexed Fields are searchable and sortable. You also can run Solr's analysis process on indexed Fields, which can alter the content to improve or change results. The following section provides more information about Solr's analysis process.
     • stored: The contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file.
  6. 6. Example "Solr Home" Directory
     This directory is provided as an example of what a "Solr Home" directory should look like. It's not strictly necessary that you copy all of the files in this directory when setting up a new instance of Solr, but it is recommended.
     Basic Directory Structure
     The Solr Home directory typically contains the following subdirectories:
     • conf/: This directory is mandatory and must contain your solrconfig.xml and schema.xml. Any other optional configuration files would also be kept here.
     • data/: This directory is the default location where Solr will keep your index, and is used by the replication scripts for dealing with snapshots. You can override this location in the solrconfig.xml and scripts.conf files. Solr will create this directory if it does not already exist.
     • lib/: This directory is optional. If it exists, Solr will load any Jars found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e. Analyzers, Request Handlers, etc.)
     • bin/: This directory is optional. It is the default location used for keeping the replication scripts.
  7. 7. What Is Solr
     • SOLR is a REST layer for Lucene
     • Began life at CNET to provide a robust search system
     • Joined Apache Incubator in January 2006
     • Graduated to Lucene sub-project status in January 2007
     • A full text search server based on Lucene
     • XML/HTTP Interfaces
     • Loose Schema to define types and fields
     • Web Administration Interface
     • Extensive Caching
     • Index Replication
     • Extensible Open Architecture
     • Written in Java 5, deployable as a WAR
  8. 8. Why use SOLR?
     • Easy to set up and get started
     • Powerful full text searching
     • Cross-platform: Java and REST
     • Under active development
     • Fast
     • Adds extra functionality on top of Lucene:
         • replication
         • CSV importing
         • JSON results
         • results highlighting
         • synonym support
  9. 10. Adding Documents
     • HTTP POST to /update
       <add><doc boost="2">
         <field name="article">05991</field>
         <field name="title">Apache Solr</field>
         <field name="subject">An intro...</field>
         <field name="category">search</field>
         <field name="category">lucene</field>
         <field name="body">Solr is a full...</field>
       </doc></add>
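The update message on this slide can be generated rather than typed by hand. The following is a minimal sketch using only Python's standard library; the helper name `build_add_message` is hypothetical, and the field names and values are the ones from the slide. It only builds the XML string and does not contact a server.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> update message as a string.

    `fields` is a list of (name, value) pairs. Repeating a name yields a
    multi-valued field, like "category" on the slide.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("subject", "An intro..."),
        ("category", "search"),
        ("category", "lucene"),
        ("body", "Solr is a full..."),
    ],
    boost=2,
)
print(message)
```

Building the message through an XML library rather than string concatenation means field values containing `&` or `<` are escaped automatically.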
  10. 11. Deleting Documents
     • Delete by Id
       <delete><id>05591</id></delete>
     • Delete by Query (multiple documents)
       <delete>
         <query>manufacturer:microsoft</query>
       </delete>
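The two delete forms differ only in the child element. As a rough sketch (the function names `delete_by_id` and `delete_by_query` are made up for illustration), they could be produced like this:

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Return a <delete> message that removes one document by unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Return a <delete> message that removes every document matching `query`."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("05591"))
print(delete_by_query("manufacturer:microsoft"))
```

Like adds, these messages are sent to /update and only become visible after a commit.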
  11. 12. Commit
     • <commit/> makes changes visible
         • closes IndexWriter
         • removes duplicates
         • opens new IndexSearcher
         • newSearcher/firstSearcher events
         • cache warming
         • "register" the new IndexSearcher
     • <optimize/> does the same as commit, and also merges all index segments.
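Commit and optimize are sent to the same /update endpoint as adds and deletes. The sketch below shows one way to prepare such a request with Python's standard library. The base URL is an assumption taken from the search results slide (port 8983); a Tomcat install as described earlier would use port 8080 instead. The helper names are hypothetical.

```python
import urllib.request

# Assumption: Solr answers at this base URL; adjust host and port as needed.
SOLR_URL = "http://localhost:8983/solr"


def make_update_request(xml_message, base_url=SOLR_URL):
    """Prepare (but do not send) an HTTP POST of an XML message to /update."""
    return urllib.request.Request(
        base_url + "/update",
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_update(xml_message, base_url=SOLR_URL):
    """Send the message and return the response body (needs a running server)."""
    with urllib.request.urlopen(make_update_request(xml_message, base_url)) as resp:
        return resp.read().decode("utf-8")


request = make_update_request("<commit/>")
print(request.get_method(), request.full_url)

# With a server running, a typical sequence would be:
#   post_update("<delete><id>05591</id></delete>")
#   post_update("<commit/>")
```

Separating request construction from sending keeps the example inspectable without a live Solr instance.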
  12. 13. Lucene syntax
     • Required search term: use a "+"
         • +ipod
         • +belkin
     • Field-specific searching: use a fieldName: prefix
         • name:ipod
         • manu:belkin
     • Wildcard searching: use * or ?
         • ip?d
         • belk*
         • *deo (currently requires modifying solr source)
     • Range searching
         • timestamp:[2006-07-16T12:30:00Z TO *]
         • Time needs to be in full ISO format
     • Proximity searching: use a "~"
         • "video ipod"~3 matches the words up to 3 words apart
     • Fuzzy searches: use a "~"
         • ipod~ will find ipod and ipods
         • belkin~0.7 will find words with close spellings
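The operators above are plain string conventions, so small helpers can assemble them safely. This is an illustrative sketch only; all function names are invented here, and the date helper assumes timezone-aware `datetime` values so the full ISO timestamp can be rendered in UTC.

```python
from datetime import datetime, timezone


def required(*terms):
    """Prefix each term with '+' so it must appear in matches."""
    return " ".join("+" + term for term in terms)


def field(name, value):
    """Restrict a term to one field, e.g. name:ipod."""
    return f"{name}:{value}"


def date_range(name, start=None, end=None):
    """Build an inclusive range; None means an open end ('*')."""
    def fmt(moment):
        if moment is None:
            return "*"
        # Full ISO timestamp in UTC with a trailing Z, as on the slide.
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{name}:[{fmt(start)} TO {fmt(end)}]"


def proximity(phrase, distance):
    """Match the phrase's words within `distance` positions of each other."""
    return f'"{phrase}"~{distance}'


def fuzzy(term, similarity=None):
    """Fuzzy match, optionally with a minimum similarity such as 0.7."""
    return f"{term}~" if similarity is None else f"{term}~{similarity}"


since = datetime(2006, 7, 16, 12, 30, 0, tzinfo=timezone.utc)
print(required("ipod", "belkin"))
print(date_range("timestamp", start=since))
print(proximity("video ipod", 3))
print(fuzzy("belkin", 0.7))
```

Note that the range keyword is written in upper case (`TO`).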
  13. 14. Default Query Syntax
     • Lucene Query Syntax [; sort specification]
         1. mission impossible; releaseDate desc
         2. +mission +impossible -actor:cruise
         3. "mission impossible" -actor:cruise
         4. title:spiderman^10 description:spiderman
         5. description:"spiderman movie"~10
         6. +HDTV +weight:[0 TO 100]
         7. Wildcard queries: te?t, te*t, test*
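Queries like these contain spaces, quotes, and semicolons, so they need URL encoding before being sent to /select. Below is a small sketch of that step; `select_url` is a hypothetical helper, the base URL is an assumption (port 8983, as used on the search results slide), and the sort clause is appended after a semicolon exactly as in example 1.

```python
from urllib.parse import urlencode

# Assumption: Solr answers at this base URL.
SOLR_URL = "http://localhost:8983/solr"


def select_url(query, sort=None, start=0, rows=10, fields=None, base_url=SOLR_URL):
    """Build a /select URL, appending '; <sort>' to the query when given."""
    q = query if sort is None else f"{query}; {sort}"
    params = {"q": q, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    return base_url + "/select?" + urlencode(params)


print(select_url("mission impossible", sort="releaseDate desc"))
print(select_url('"mission impossible" -actor:cruise', rows=5))
```

`urlencode` handles the escaping, so the query text can be written exactly as it appears in the examples.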
  14. 15. Full control panel interface
     • Start row/max rows: pagination
     • Output type
         • standard (xml), python, json, ruby, xslt
     • Enable highlighting
         • fields to highlight
         • works on wildcard matches
  15. 16. Default Parameters
     • Query Arguments for HTTP GET/POST to /select
  16. 17. Search Results
     • http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
       <response><responseHeader><status>0</status>
         <QTime>1</QTime></responseHeader>
         <result numFound="16173" start="0">
           <doc>
             <str name="name">Apple 60 GB iPod with Video</str>
             <float name="price">399.0</float>
           </doc>
           <doc>
             <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
             <float name="price">479.95</float>
           </doc>
         </result>
       </response>
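A client typically turns this XML into native values. The sketch below parses the exact response shown on the slide with Python's standard library; `parse_results` is a hypothetical helper and only handles the `str`, `int`, and `float` element types, which is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET

# The response from the slide, embedded so the example runs without a server.
RESPONSE = """<response><responseHeader><status>0</status>
<QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
 <doc>
  <str name="name">Apple 60 GB iPod with Video</str>
  <float name="price">399.0</float>
 </doc>
 <doc>
  <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
  <float name="price">479.95</float>
 </doc>
</result>
</response>"""


def parse_results(xml_text):
    """Return (total number of matches, list of documents as dicts)."""
    result = ET.fromstring(xml_text).find("result")
    docs = []
    for doc in result.findall("doc"):
        record = {}
        for child in doc:
            value = child.text
            if child.tag == "float":
                value = float(value)
            elif child.tag == "int":
                value = int(value)
            record[child.get("name")] = value
        docs.append(record)
    return int(result.get("numFound")), docs


total, documents = parse_results(RESPONSE)
print(total, documents)
```

`numFound` is the total number of matches, while the number of `doc` elements is limited by the `rows` parameter, which is what makes pagination possible.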
  17. 18. Caching
     • IndexSearcher's view of an index is fixed
         • Aggressive caching possible
         • Consistency for multi-query requests
     • filterCache: unordered set of document ids matching a query
     • resultCache: ordered subset of document ids matching a query
     • documentCache: the stored fields of documents
     • userCaches: application specific, custom query handlers
  18. 19. Warming for Speed
     • Lucene IndexReader warming
         • field norms, FieldCache, tii (the term index)
     • Static Cache warming
         • Configurable static requests to warm new Searchers
     • Smart Cache Warming (autowarming)
         • Using MRU items in the current cache to prepopulate the new cache
     • Warming in parallel with live requests
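The autowarming idea can be illustrated with a toy model. This is not Solr's implementation; the `LRUCache` class and `autowarm` function are invented for this sketch. The point it shows is that the most recently used keys of the old cache are replayed against the new searcher, so values are regenerated rather than copied.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache; the last item is the most recent."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def most_recent_keys(self, count):
        """Return up to `count` keys, most recently used first."""
        if count <= 0:
            return []
        return list(self._items)[-count:][::-1]


def autowarm(old_cache, new_cache, regenerate, count):
    """Prepopulate new_cache using the MRU keys of old_cache.

    `regenerate(key)` recomputes the value against the new index view.
    Keys are replayed oldest first so recency order is preserved.
    """
    for key in reversed(old_cache.most_recent_keys(count)):
        new_cache.put(key, regenerate(key))


old = LRUCache(capacity=3)
for query in ["ipod", "belkin", "video"]:
    old.put(query, f"old results for {query}")
old.get("ipod")  # "ipod" becomes the most recently used entry

new = LRUCache(capacity=3)
autowarm(old, new, regenerate=lambda q: f"new results for {q}", count=2)
print(new.most_recent_keys(3))
```

In this model, warming can run while the old cache keeps serving requests, which mirrors the "warming in parallel with live requests" bullet.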
  19. 20. Smart Cache Warming
  20. 21. Schema
     • Lucene has no notion of a schema
         • Sorting: string vs. numeric
         • Ranges: is val:42 included in val:[1 TO 5]?
         • Lucene QueryParser has date-range support, but must guess.
     • Defines fields, their types, properties
     • Defines unique key field, default search field, Similarity implementation
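The sorting and range questions on this slide come down to whether values are compared as text or as numbers. The short sketch below demonstrates the difference in plain Python; it models the comparison only and is not how Lucene stores terms.

```python
def in_range_as_strings(value, low, high):
    """Compare lexicographically, as happens when everything is just text."""
    return str(low) <= str(value) <= str(high)


def in_range_as_numbers(value, low, high):
    """Compare numerically, which requires knowing the field's type."""
    return int(low) <= int(value) <= int(high)


values = ["10", "9", "100"]
print(sorted(values))           # text order
print(sorted(values, key=int))  # numeric order
print(in_range_as_strings("42", "1", "5"))
print(in_range_as_numbers("42", "1", "5"))
```

Compared as text, "42" falls between "1" and "5" because only the leading characters decide the order. A schema that declares the field numeric removes that ambiguity.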
  21. 22. Field Definitions
     • Field Attributes: name, type, indexed, stored, multiValued, omitNorms
       <field name="id" type="string" indexed="true" stored="true"/>
       <field name="sku" type="textTight" indexed="true" stored="true"/>
       <field name="name" type="text" indexed="true" stored="true"/>
       <field name="reviews" type="text" indexed="true" stored="false"/>
       <field name="category" type="text_ws" indexed="true" stored="true"
              multiValued="true"/>
     • Dynamic Fields, in the spirit of Lucene!
       <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
       <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
       <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
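Dynamic fields let a document use field names that were never declared one by one. The following toy resolver shows the idea using the declarations from this slide. It is a simplification: the lookup order (explicit fields first, then patterns in listed order) and the use of `fnmatchcase` are assumptions of this sketch, and Solr's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase

# Declarations copied from the slide: field name -> type name.
EXPLICIT_FIELDS = {
    "id": "string",
    "sku": "textTight",
    "name": "text",
    "reviews": "text",
    "category": "text_ws",
}

# Dynamic field patterns copied from the slide, in declaration order.
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_field_type(field_name):
    """Return the type for a field name, or None if nothing matches."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return type_name
    return None


for name in ["id", "price_i", "color_s", "summary_t", "unknown"]:
    print(name, "->", resolve_field_type(name))
```

A document can therefore add a field such as `price_i` without a schema change, and it is still handled as a sortable integer.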
  22. 23. Search Relevancy
  23. 24. Configuring Relevancy
       <fieldtype name="text" class="solr.TextField">
         <analyzer>
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymFilterFactory"
                   synonyms="synonyms.txt"/>
           <filter class="solr.StopFilterFactory"
                   words="stopwords.txt"/>
           <filter class="solr.EnglishPorterFilterFactory"
                   protected="protwords.txt"/>
         </analyzer>
       </fieldtype>
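The analyzer above is a pipeline: one tokenizer followed by filters applied in order. The toy model below mirrors that order in plain Python. Everything in it is a stand-in: the word lists are invented rather than read from synonyms.txt, stopwords.txt, or protwords.txt, and the "stemmer" only strips a trailing "s", which is far cruder than Porter stemming.

```python
# Hypothetical stand-ins for synonyms.txt, stopwords.txt and protwords.txt.
SYNONYMS = {"tv": ["tv", "television"]}
STOPWORDS = {"a", "an", "the", "of", "about"}
PROTECTED = {"news"}


def whitespace_tokenize(text):
    """Split on whitespace only, like a whitespace tokenizer."""
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def expand_synonyms(tokens):
    """Replace each token by its synonym list when one is configured."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def toy_stem(tokens):
    """Strip a trailing 's' unless the word is protected (not real Porter)."""
    stemmed = []
    for token in tokens:
        if token not in PROTECTED and len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        stemmed.append(token)
    return stemmed


def analyze(text):
    """Apply the stages in the same order as the fieldtype definition."""
    tokens = whitespace_tokenize(text)
    for stage in (lowercase, expand_synonyms, remove_stopwords, toy_stem):
        tokens = stage(tokens)
    return tokens


print(analyze("The News about TV sets"))
```

Order matters: lowercasing runs before the synonym and stop word lookups, so those lists only need lower-case entries.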
  24. 25. copyField
     • Copies one field to another at index time
     • Use case: Analyze same field different ways
         • copy into a field with a different analyzer
         • boost exact-case, exact-punctuation matches
         • language translations, thesaurus, soundex
       <field name="title" type="text"/>
       <field name="title_exact" type="text_exact" stored="false"/>
       <copyField source="title" dest="title_exact"/>
     • Use case: Index multiple fields into single searchable field
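Both use cases can be pictured as a preprocessing step that runs on each document before analysis. The sketch below is a toy model of that step, not Solr code. `apply_copy_fields` is a hypothetical name, documents are modelled as dicts of value lists, and the catch-all field name `text` in the second rule set is an assumption.

```python
def apply_copy_fields(doc, rules):
    """Return a copy of `doc` with source values appended to each dest field.

    `doc` maps field names to lists of values; `rules` is a list of
    (source, dest) pairs, mirroring <copyField source=... dest=.../>.
    """
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in rules:
        if source in doc:
            result.setdefault(dest, []).extend(doc[source])
    return result


document = {"title": ["Apache Solr"], "body": ["Solr is a full..."]}

# Use case 1: the same content analyzed in a second way.
exact = apply_copy_fields(document, [("title", "title_exact")])
print(exact)

# Use case 2: several fields gathered into one searchable field.
combined = apply_copy_fields(document, [("title", "text"), ("body", "text")])
print(combined)
```

The raw source value is what gets copied, so each destination field can apply its own analyzer to the same original text.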
  25. 30. Web Admin Interface
     • Show Config, Schema, Distribution info
     • Query Interface
     • Statistics
         • Caches: lookups, hits, hitratio, inserts, evictions, size
         • RequestHandlers: requests, errors
         • UpdateHandler: adds, deletes, commits, optimizes
         • IndexReader: open-time, index-version, numDocs, maxDocs
     • Analysis Debugger
         • Shows tokens after each Analyzer stage
         • Shows token matches for query vs index
  26. 31. Selling Points
     • Fast
     • Powerful & Configurable
     • High Relevancy
     • Mature Product
     • Same features as software costing $$$
     • Leverage Community
         • Lucene committers, IR experts
         • Free consulting: shared problems & solutions
  27. 32. Where are we going?
     • OOTB Simple Faceted Browsing
     • Automatic Database Indexing
     • Federated Search
     • HA with failover
     • Alternate output formats (JSON, Ruby)
     • Highlighter integration
     • Spellchecker
     • Alternate APIs (Google Data, OpenSearch)