48. Custom Analyzers
- WhitespaceTokenizer: tokenizes at whitespace
- KeywordTokenizer: treats the entire input as a single token
- StandardTokenizer: tokenizes at whitespace, but keeps high-level entities (email addresses, etc.) as single tokens
- LowerCaseFilter: lowercases token text
- StopFilter: removes words that exist in a provided set of stop words
- PorterStemFilter: stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
Some descriptions from Lucene in Action, 2nd Edition
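The tokenizer-plus-filter chain above can be sketched conceptually. This is a minimal Python sketch of the pipeline idea, not the Lucene API; the stop-word set is an assumed example and stemming is omitted:

```python
def whitespace_tokenize(text):
    # WhitespaceTokenizer: split at whitespace
    return text.split()

def lowercase(tokens):
    # LowerCaseFilter: lowercase each token
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words):
    # StopFilter: drop tokens found in the stop set
    return [t for t in tokens if t not in stop_words]

def analyze(text, stop_words=frozenset({"the", "and", "a"})):
    # Chain: tokenizer -> lowercase filter -> stop filter
    return stop_filter(lowercase(whitespace_tokenize(text)), stop_words)

print(analyze("The Quick Fox and the Dog"))  # ['quick', 'fox', 'dog']
```

In Lucene the same idea is expressed by wrapping a Tokenizer in successive TokenFilters inside a custom Analyzer.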
80. Solr Indexation
<add>
  <doc>
    <field name="id">002</field>
    <field name="title">Lucene And Solr Introduction</field>
    <field name="presenter">Pascal Dimassimo</field>
    <field name="date">2010-11-18T00:00:00Z</field>
    <field name="abstract">...</field>
  </doc>
  <doc>...</doc>
</add>
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml
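Building that add payload programmatically can be sketched with the standard library. This is an illustrative helper (the function name is hypothetical); the resulting XML is what the curl command above would POST to the update handler:

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    # Build a Solr <add> payload like the one shown on the slide:
    # one <doc> per dict, one <field name="..."> per key/value pair
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")

xml = build_add_xml([{"id": "002", "title": "Lucene And Solr Introduction"}])
print(xml)
```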
83. Responses are in XML by default, but other formats are supported (JSON, PHP, Ruby)
84. Solr Query
curl http://localhost:8983/solr/select?q=title:Lucene
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">269</int>
    <lst name="params">
      <str name="q">title:Lucene</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">002</str>
      <str name="title">Lucene And Solr Introduction</str>
      <str name="presenter">Pascal Dimassimo</str>
      <date name="date">2010-11-18T00:00:00Z</date>
      <str name="abstract">...</str>
    </doc>
  </result>
</response>
85. Solr Query Parameters
- q: Lucene query
- sort: field to sort on. Defaults to score
- start: offset of the results page to display. Default 0
- rows: number of results to display per page. Default 10
- fq: filter query. Defaults to all documents
- fl: list of fields to display per document. Defaults to all fields
- wt: format of the results. Default xml
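Assembling a /select URL from these parameters can be sketched with the standard library (the helper name is hypothetical; the base URL matches the slides):

```python
from urllib.parse import urlencode

def solr_select_url(base, **params):
    # Assemble a /select URL from the query parameters listed above
    return base + "/select?" + urlencode(params)

url = solr_select_url("http://localhost:8983/solr",
                      q="title:Lucene", start=0, rows=10,
                      fl="id,title", wt="json")
print(url)
```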
Do one thing well. Apache License. 10 years old. Version 3.0. It is fast!
Analyze documents: split text into words. Get documents in. Lucene returns a list of documents as the search result.
Book example: you have to search from the beginning every time you look for a word. Much simpler to use an index. Inverted index: for a word, list the documents that contain it.
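The inverted-index idea can be sketched in a few lines. This is a conceptual Python sketch, not Lucene's on-disk format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; index maps word -> set of doc ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_inverted_index({1: "lucene is fast", 2: "solr uses lucene"})
print(sorted(index["lucene"]))  # [1, 2]
```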
Analysis: transform the content into terms. A term could be more than one word: "New York". Position is also stored. Binary search: O(log n) -> logarithmic. Boolean search. Wildcard search.
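Storing positions and looking terms up by binary search can be sketched together. A conceptual Python sketch (not Lucene's term dictionary), using the standard bisect module for the O(log n) lookup:

```python
import bisect
from collections import defaultdict

def index_with_positions(docs):
    # word -> {doc_id: [positions]}; positions enable phrase queries
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

idx = index_with_positions({1: "new york is big", 2: "york new"})
terms = sorted(idx)  # term dictionary kept in sorted order

def term_exists(term):
    # Binary search over the sorted term list: O(log n)
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(idx["new"][1], term_exists("york"))  # [0] True
```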
Lucene generates an id for each document. Stored = original content stored "as is" on disk; it can be returned to the user when the document is returned. When Lucene returns a document, it returns the id. You can retrieve the stored content with the id.
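The split between indexed terms and stored fields can be sketched as two separate structures. A conceptual Python sketch (not the Lucene API; here the doc id is supplied by the caller, whereas Lucene generates it):

```python
# Indexed terms drive search; stored fields keep the original content
# so it can be retrieved by doc id when a hit is returned.
stored = {}     # doc_id -> original stored fields, kept "as is"
postings = {}   # term -> set of doc ids

def add_document(doc_id, fields):
    stored[doc_id] = dict(fields)
    for word in fields["title"].lower().split():
        postings.setdefault(word, set()).add(doc_id)

add_document(1, {"title": "Lucene Introduction"})
hit_ids = postings["lucene"]                    # search returns doc ids
print([stored[i]["title"] for i in hit_ids])    # ['Lucene Introduction']
```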
Document: email, article, user. Email fields: sender, recipient, title, content, attachment. Article fields: author, title, category, content, publication date. Database analogy: document = row, field = column. Documents with different fields can be stored together.
Lucene can return results sorted by a field
Term: almost a synonym of word
Basic Query instance: TermQuery. Use PerFieldAnalyzerWrapper to specify a specific analyzer for each field.
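The per-field analyzer idea can be sketched as a field-to-analyzer mapping with a default. A simplified Python sketch of the PerFieldAnalyzerWrapper concept, not the Lucene API:

```python
def keyword_analyzer(text):
    # Like KeywordTokenizer: the whole input as a single token
    return [text]

def simple_analyzer(text):
    # Lowercase and split at whitespace
    return text.lower().split()

# Fields with a dedicated analyzer; everything else uses the default
field_analyzers = {"id": keyword_analyzer}

def analyze_field(field, text, default=simple_analyzer):
    return field_analyzers.get(field, default)(text)

print(analyze_field("id", "DOC-002"))           # ['DOC-002']
print(analyze_field("title", "Lucene Intro"))   # ['lucene', 'intro']
```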
Terms are stored in alphabetical order, compared using String.compareTo. A range query returns all docs for every term in the range.
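A range query over a sorted term dictionary can be sketched with binary-search bounds. A conceptual Python sketch (the terms and postings are assumed example data, not Lucene internals):

```python
import bisect

# Sorted term dictionary and the doc ids posted under each term
terms = ["apple", "banana", "cherry", "date"]
postings = {"apple": {1}, "banana": {1, 2}, "cherry": {3}, "date": {2}}

def range_query(lo, hi):
    # Find the [lo, hi] slice of the sorted terms by binary search,
    # then union the postings of every term in that slice
    start = bisect.bisect_left(terms, lo)
    end = bisect.bisect_right(terms, hi)
    docs = set()
    for term in terms[start:end]:
        docs |= postings[term]
    return docs

print(sorted(range_query("banana", "cherry")))  # [1, 2, 3]
```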
Supports AND, OR, NOT Supports +, -
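Boolean operators over an inverted index map directly onto set operations. A conceptual Python sketch with assumed example postings (scoring is ignored):

```python
# AND / + -> intersection, OR -> union, NOT / - -> difference
postings = {"lucene": {1, 2, 3}, "solr": {2, 3}, "intro": {3}}
all_docs = {1, 2, 3}

and_result = postings["lucene"] & postings["solr"]   # lucene AND solr
or_result = postings["lucene"] | postings["intro"]   # lucene OR intro
not_result = all_docs - postings["solr"]             # NOT solr

print(sorted(and_result), sorted(or_result), sorted(not_result))
```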
CNET used it to help users find products more easily