48. Custom Analyzers
- WhitespaceTokenizer: tokenizes at whitespace
- KeywordTokenizer: treats the entire input as a single token
- StandardTokenizer: tokenizes at whitespace, but keeps high-level entities (email addresses, etc.) as single tokens
- LowerCaseFilter: lowercases token text
- StopFilter: removes words that exist in a provided set of stop words
- PorterStemFilter: stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
Some descriptions from Lucene in Action, 2nd Edition
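The tokenizer-plus-filter chain above can be sketched conceptually. This is a minimal Python sketch of the pipeline idea, not the Lucene API; the stop-word set is an assumed example and stemming is omitted:

```python
def whitespace_tokenize(text):
    # WhitespaceTokenizer: split at whitespace
    return text.split()

def lowercase(tokens):
    # LowerCaseFilter: lowercase each token
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words):
    # StopFilter: drop tokens found in the stop set
    return [t for t in tokens if t not in stop_words]

def analyze(text, stop_words=frozenset({"the", "and", "a"})):
    # Chain: tokenizer -> lowercase filter -> stop filter
    return stop_filter(lowercase(whitespace_tokenize(text)), stop_words)

print(analyze("The Quick Fox and the Dog"))  # ['quick', 'fox', 'dog']
```

In Lucene the same idea is expressed by wrapping a Tokenizer in successive TokenFilters inside a custom Analyzer.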
80. Solr Indexation
<add>
  <doc>
    <field name="id">002</field>
    <field name="title">Lucene And Solr Introduction</field>
    <field name="presenter">Pascal Dimassimo</field>
    <field name="date">2010-11-18T00:00:00Z</field>
    <field name="abstract">...</field>
  </doc>
  <doc>...</doc>
</add>
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml
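Building that add payload programmatically can be sketched with the standard library. This is an illustrative helper (the function name is hypothetical); the resulting XML is what the curl command above would POST to the update handler:

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    # Build a Solr <add> payload like the one shown on the slide:
    # one <doc> per dict, one <field name="..."> per key/value pair
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")

xml = build_add_xml([{"id": "002", "title": "Lucene And Solr Introduction"}])
print(xml)
```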
83. Responses are in XML by default, but other formats are supported (JSON, PHP, Ruby)
84. Solr Query
curl http://localhost:8983/solr/select?q=title:Lucene
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">269</int>
    <lst name="params">
      <str name="q">title:Lucene</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">002</str>
      <str name="title">Lucene And Solr Introduction</str>
      <str name="presenter">Pascal Dimassimo</str>
      <date name="date">2010-11-18T00:00:00Z</date>
      <str name="abstract">...</str>
    </doc>
  </result>
</response>
85. Solr Query Parameters
- q: Lucene query
- sort: field to sort on. Defaults to score
- start: offset of the results page to display. Default 0
- rows: number of results to display per page. Default 10
- fq: filter query. Defaults to all documents
- fl: list of fields to display per document. Defaults to all fields
- wt: format of the results. Default xml
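Assembling a /select URL from these parameters can be sketched with the standard library (the helper name is hypothetical; the base URL matches the slides):

```python
from urllib.parse import urlencode

def solr_select_url(base, **params):
    # Assemble a /select URL from the query parameters listed above
    return base + "/select?" + urlencode(params)

url = solr_select_url("http://localhost:8983/solr",
                      q="title:Lucene", start=0, rows=10,
                      fl="id,title", wt="json")
print(url)
```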
Do one thing well. Apache License. 10 years old. Version 3.0. It is fast!
Analyze documents: split text into words. Get documents in. Lucene returns a list of documents as the search result.
Book example: you have to search from the beginning every time you look for a word. Much simpler to use an index. Inverted index: for a word, list the documents that contain it.
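The inverted-index idea can be sketched in a few lines. This is a conceptual Python sketch, not Lucene's on-disk format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; index maps word -> set of doc ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_inverted_index({1: "lucene is fast", 2: "solr uses lucene"})
print(sorted(index["lucene"]))  # [1, 2]
```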
Analysis: transform the content into terms. A term could be more than one word: "New York". Position is also stored. Binary search: O(log n) -> logarithmic. Boolean search. Wildcard search.
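Storing positions and looking terms up by binary search can be sketched together. A conceptual Python sketch (not Lucene's term dictionary), using the standard bisect module for the O(log n) lookup:

```python
import bisect
from collections import defaultdict

def index_with_positions(docs):
    # word -> {doc_id: [positions]}; positions enable phrase queries
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

idx = index_with_positions({1: "new york is big", 2: "york new"})
terms = sorted(idx)  # term dictionary kept in sorted order

def term_exists(term):
    # Binary search over the sorted term list: O(log n)
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(idx["new"][1], term_exists("york"))  # [0] True
```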
Lucene generates an id for each document. Stored = original content stored "as is" on disk; it can be returned to the user when the document is returned. When Lucene returns a document, it returns the id. You can retrieve the stored content with the id.
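The split between indexed terms and stored fields can be sketched as two separate structures. A conceptual Python sketch (not the Lucene API; here the doc id is supplied by the caller, whereas Lucene generates it):

```python
# Indexed terms drive search; stored fields keep the original content
# so it can be retrieved by doc id when a hit is returned.
stored = {}     # doc_id -> original stored fields, kept "as is"
postings = {}   # term -> set of doc ids

def add_document(doc_id, fields):
    stored[doc_id] = dict(fields)
    for word in fields["title"].lower().split():
        postings.setdefault(word, set()).add(doc_id)

add_document(1, {"title": "Lucene Introduction"})
hit_ids = postings["lucene"]                    # search returns doc ids
print([stored[i]["title"] for i in hit_ids])    # ['Lucene Introduction']
```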
Document: email, article, user. Email fields: sender, recipient, title, content, attachment. Article fields: author, title, category, content, publication date. Database analogy: document = row, field = column. Documents with different fields can be stored together.
Lucene can return results sorted by a field
Term: almost a synonym of word
Basic Query instance: TermQuery. Use PerFieldAnalyzerWrapper to specify a specific analyzer for each field.
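The per-field analyzer idea can be sketched as a field-to-analyzer mapping with a default. A simplified Python sketch of the PerFieldAnalyzerWrapper concept, not the Lucene API:

```python
def keyword_analyzer(text):
    # Like KeywordTokenizer: the whole input as a single token
    return [text]

def simple_analyzer(text):
    # Lowercase and split at whitespace
    return text.lower().split()

# Fields with a dedicated analyzer; everything else uses the default
field_analyzers = {"id": keyword_analyzer}

def analyze_field(field, text, default=simple_analyzer):
    return field_analyzers.get(field, default)(text)

print(analyze_field("id", "DOC-002"))           # ['DOC-002']
print(analyze_field("title", "Lucene Intro"))   # ['lucene', 'intro']
```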
Terms are stored in alphabetical order, compared using String.compareTo. A range query returns all docs for every term in the range.
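A range query over a sorted term dictionary can be sketched with binary-search bounds. A conceptual Python sketch (the terms and postings are assumed example data, not Lucene internals):

```python
import bisect

# Sorted term dictionary and the doc ids posted under each term
terms = ["apple", "banana", "cherry", "date"]
postings = {"apple": {1}, "banana": {1, 2}, "cherry": {3}, "date": {2}}

def range_query(lo, hi):
    # Find the [lo, hi] slice of the sorted terms by binary search,
    # then union the postings of every term in that slice
    start = bisect.bisect_left(terms, lo)
    end = bisect.bisect_right(terms, hi)
    docs = set()
    for term in terms[start:end]:
        docs |= postings[term]
    return docs

print(sorted(range_query("banana", "cherry")))  # [1, 2, 3]
```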
Supports AND, OR, NOT Supports +, -
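Boolean operators over an inverted index map directly onto set operations. A conceptual Python sketch with assumed example postings (scoring is ignored):

```python
# AND / + -> intersection, OR -> union, NOT / - -> difference
postings = {"lucene": {1, 2, 3}, "solr": {2, 3}, "intro": {3}}
all_docs = {1, 2, 3}

and_result = postings["lucene"] & postings["solr"]   # lucene AND solr
or_result = postings["lucene"] | postings["intro"]   # lucene OR intro
not_result = all_docs - postings["solr"]             # NOT solr

print(sorted(and_result), sorted(or_result), sorted(not_result))
```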
CNET used it to help users find products more easily