Illuminating Lucene.Net:
Bringing Full-Text Search to Light
W. Dean Thrasher
14 May 2013
Agenda
• About the presenter
• About Lucene.Net
– What it is
– What it does
– How it works
– Who uses it
– Why you should care
More Agenda
• Core concepts
– Lucene structure
– Luke
– Terminology
• Code examples
• Things to know
• Recap
• References
W. Dean Thrasher
Dean.thrasher@infovark.com
www.infovark.com
www.linkedin.com/in/deanthrasher
@DThrasher
@infovark
BACKGROUND
What is Lucene.Net?
Lucene.Net is a port of the Lucene search engine
library, written in C# and targeted at .NET
runtime users.
What is Lucene?
Apache Lucene is a high-performance, full-featured
text search engine library written
entirely in Java.
Apache Lucene is an open source project
available for free download.
History
1997 – Lucene project begun by Doug Cutting
2000 – First open source release
2002 – First Apache Jakarta release
2005 – Lucene becomes a top-level project
2006 – Lucene.Net gets Apache incubation status
2010 – Lucene.Net orphaned by original committers
2011 – Lucene.Net reaccepted into Apache Incubator
2012 – Lucene.Net graduates from the Incubator
Why you should care
• You want to provide customers with a “Google-like” search experience
• You want to tune incoming queries or results ranking
• You want better performance than SQL “like” searches
• You want to avoid deploying a separate search tool with your website or application
What does it do?
• Allows you to index and search vast amounts
of text quickly
• Provides a powerful query syntax
• Integrates into applications easily
How it works
• Lucene uses an inverted index
– Maps terms to the documents that contain them
• Lucene manages its index
– Stores the index in memory or on disk
– Allows documents to be added or removed
• Makes an index for each document
• Merges the index with a set of other indices
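The inverted index idea above can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene stores its index; the function name `build_inverted_index` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents keyed by document id
docs = {
    1: "frankly my dear",
    2: "my dear Watson",
    3: "elementary",
}
index = build_inverted_index(docs)
print(index["dear"])  # ids of the documents that contain "dear"
```

Looking up a term is a single dictionary access, which is why this layout suits search better than scanning every document.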
Who uses Lucene.Net?
• Stack Overflow
• RavenDB
• Sitecore
• Orchard
• MindTouch
• Umbraco
• Sitefinity
• SubText
CONCEPTS
Differences between Java and .NET
The Lucene.Net API:
• Lags a few steps behind the Java version of
Lucene
• Takes advantage of advanced .NET features not
found in Java
But it:
• Preserves the core Lucene concepts
• Maintains indexes that are compatible with the
Java version
Logical Index Storage
• Field – a name/value pair
• Document – a sequence of fields
• Index – a collection of documents
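A rough way to picture the three logical layers is as nested data structures. The classes below are hypothetical stand-ins written for this sketch; they borrow the slide's vocabulary but are not the Lucene.Net types.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Field:
    name: str    # e.g. "Path"
    value: str   # e.g. "books/gone.txt"

@dataclass
class Document:
    fields: List[Field] = field(default_factory=list)

    def get(self, name: str) -> Optional[str]:
        """Return the value of the first field with this name, if any."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None

@dataclass
class Index:
    documents: List[Document] = field(default_factory=list)

# Hypothetical sample data
index = Index()
index.documents.append(
    Document([Field("Path", "books/gone.txt"), Field("Content", "frankly my dear")])
)
```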
Physical Index Storage
• Lucene generates a
series of files within a
single directory
• Moving an index is a
copy-and-paste
operation
• You can compress or zip
an index to archive it
Luke
• Lucene Index Toolbox
• Built in Java, but can
read Lucene.Net
indexes
• http://code.google.com/p/luke/
Analyzers and Tokens
• Analyzers take strings of text and break them
into tokens
• Tokens are chunks of text and associated
metadata
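As a minimal sketch of what "chunks of text and associated metadata" can mean, the toy analyzer below records each token's position and character offsets. The `Token` fields and the `analyze` function are assumptions for illustration, not the Lucene.Net analysis API.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str      # the chunk of text, lowercased
    position: int  # ordinal position in the token stream
    start: int     # character offset where the token starts
    end: int       # character offset just past the token

def analyze(text: str) -> List[Token]:
    """Split on runs of word characters and keep offsets for each token."""
    return [
        Token(m.group().lower(), position, m.start(), m.end())
        for position, m in enumerate(re.finditer(r"\w+", text))
    ]

tokens = analyze("Frankly, my dear")
```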
Terms, Queries and Hits
• Terms – the basic unit for searching. A field
name and a value to seek.
• Queries – combine terms to form search
criteria
• Hits – a ranked list of pointers to documents
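The relationship between a term, a query and ranked hits can be sketched as follows. Scoring by raw term count is an assumption chosen to keep the example short; it is not Lucene's relevance formula, and `search` is a hypothetical helper.

```python
def search(docs, field, value):
    """Return (doc_id, score) hits for one term, best score first.

    The score here is simply how often the term occurs in the field.
    """
    hits = []
    for doc_id, doc in docs.items():
        count = doc.get(field, "").lower().split().count(value)
        if count:
            hits.append((doc_id, count))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical sample documents: id -> {field name: text}
docs = {
    1: {"Content": "war and peace"},
    2: {"Content": "war war war"},
    3: {"Content": "peace"},
}
hits = search(docs, "Content", "war")
```

The hits hold document ids rather than the documents themselves, matching the slide's description of hits as pointers.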
CODE EXAMPLES
Create documents demo
• IndexWriter
• Directory
• Analyzer
• Document
• Field
Read documents demo
• IndexReader
• Term
• Query
• Hits
Update documents demo
• IndexWriter
• Document
• Term
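An update by term behaves like a delete followed by an add. The toy class below sketches that behaviour under the assumption that a term is a field name plus an exact value; `ToyIndex` and its methods are hypothetical and only loosely mirror the classes listed in the demo.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not the Lucene.Net API."""

    def __init__(self):
        self.docs = []  # each document is a dict of field name -> value

    def add_document(self, doc):
        self.docs.append(doc)

    def delete_documents(self, field, value):
        """Remove every document whose field exactly matches the term."""
        self.docs = [d for d in self.docs if d.get(field) != value]

    def update_document(self, field, value, new_doc):
        """Delete everything matching the term, then add the new document."""
        self.delete_documents(field, value)
        self.add_document(new_doc)

idx = ToyIndex()
idx.add_document({"Id": "1", "Content": "draft"})
idx.add_document({"Id": "2", "Content": "other"})
idx.update_document("Id", "1", {"Id": "1", "Content": "final"})
```

This is one reason an unanalyzed, unique Id field is useful: it gives updates and deletes an exact term to match.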
Delete documents demo
• IndexWriter
• Query
• Term
Search demo
• IndexSearcher
• QueryParser
• Query
• Term
• TopDocs
• ScoreDoc
THINGS TO KNOW
Transactional Lucene
• Lucene supports ACID commits to its indexes
• Lucene uses the Commit and Rollback syntax,
much like relational databases.
• Source:
http://blog.mikemccandless.com/2012/03/transactional-lucene.html
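The commit and rollback idea can be sketched with a buffer of pending changes. `ToyWriter` is a hypothetical class for illustration; it shows only the visibility rule (nothing is visible until commit) and none of the durability work a real index performs.

```python
class ToyWriter:
    """Hypothetical writer illustrating commit/rollback; not the Lucene.Net API."""

    def __init__(self):
        self.committed = []  # what readers can see
        self.pending = []    # changes buffered since the last commit

    def add_document(self, doc):
        self.pending.append(doc)

    def commit(self):
        """Make all pending changes visible at once."""
        self.committed = self.committed + self.pending
        self.pending = []

    def rollback(self):
        """Discard everything added since the last commit."""
        self.pending = []

writer = ToyWriter()
writer.add_document("doc A")
writer.commit()
writer.add_document("doc B")
writer.rollback()  # "doc B" never becomes visible
```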
Lucene index types
FSDirectory
• Stores indexed documents
on disk
• Persists data across sessions
• Best choice for most
applications
Your first choice
RAMDirectory
• Stores indexed documents
in memory
• Entire index must fit into
available memory
• Does not persist data
• Faster than FSDirectory
Useful for unit testing
Precalculation
• How you store things in Lucene matters –
choose field options and analyzers carefully
• The way you retrieve information determines
how it should be stored
• Smaller indexes give you better performance
Field.Store
Yes – stores the text in its original form
No – the original text is not preserved
Field.Index
• No – the field is not indexed, so it is not
searchable
• Not analyzed – the text is treated as single
unit and indexed whole
• Analyzed – the text is broken down into
tokens and indexed
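The difference between the analyzed and not-analyzed options comes down to which terms end up in the index. A minimal sketch, assuming a whitespace-and-lowercase analyzer and a hypothetical `index_terms` helper:

```python
def index_terms(value, analyzed):
    """Return the terms that would be indexed for a field value."""
    if analyzed:
        return value.lower().split()  # broken into tokens
    return [value]                    # kept whole as a single term

title = "Gone With The Wind"
analyzed_terms = index_terms(title, analyzed=True)
whole_terms = index_terms(title, analyzed=False)
```

With the analyzed form a search for "wind" can match; with the not-analyzed form only the exact original string matches, which suits identifiers and dates.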
Field.TermVector
• No – Does not store term vectors
• Yes – Stores the term vectors of each
document (terms and number of occurrences)
• With Positions Offsets – Term vector, token
position and offset information
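A term vector with positions and offsets can be pictured as a per-document table of terms. The sketch below is illustrative only; the dictionary layout and the `term_vector` function are assumptions, not Lucene's storage format.

```python
import re
from collections import defaultdict

def term_vector(text):
    """Return {term: {"freq": n, "positions": [...], "offsets": [(start, end), ...]}}."""
    vector = defaultdict(lambda: {"freq": 0, "positions": [], "offsets": []})
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector[match.group().lower()]
        entry["freq"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return dict(vector)

tv = term_vector("tomorrow is another day tomorrow")
```

Storing positions and offsets costs index space, which is why the option is usually reserved for fields that need features such as highlighting.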
Field types indexing options
Field     Stored  Analyzed      Vectored
Id        Yes     Not analyzed  No
Modified  Yes     Not analyzed  No
Path      Yes     Analyzed      No
Content   No      Analyzed      With Positions Offsets
An example of storing fields related to files on
your computer.
Analyzers
• Break apart text into tokens; each token gets
indexed separately
• Remove stop words
• Decide how to handle punctuation
• Handle languages and case sensitivity
• You can create your own by building from
scratch or chaining existing analyzers
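Chaining can be sketched as a tokenizer followed by a list of token filters. Everything here is an assumption for illustration: the stop word list is made up for the example, and `make_analyzer` is a hypothetical helper rather than a Lucene.Net class.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # illustrative list only

def tokenize(text):
    """Split into words; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9']+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def make_analyzer(*filters):
    """Chain the tokenizer with any number of token filters, applied in order."""
    def analyzer(text):
        tokens = tokenize(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyzer

analyzer = make_analyzer(lowercase, remove_stop_words)
terms = analyzer("The War and the Peace!")
```

Filter order matters: lowercasing must run before stop word removal here, otherwise "The" would slip past the lowercase stop list. The same analyzer should normally be applied at index time and at query time so the terms line up.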
Types of Queries
• TermQuery
• PhraseQuery
• RangeQuery
• PrefixQuery, WildcardQuery
• FuzzyQuery
• Use BooleanQuery to combine them
Query syntax
Query Type     Purpose                                          Sample
TermQuery      Single word query                                scarlett
PhraseQuery    Matches terms in order                           "frankly my dear"
RangeQuery     Matches documents between the terms              [1861 TO 1865] (inclusive)
                                                                {1861 TO 1865} (exclusive)
WildcardQuery  Lightweight regex-like term matching             Atl*
                                                                D?m?
PrefixQuery    Matches terms that begin with the string         War*
FuzzyQuery     Closeness matching                               cry~
BooleanQuery   Combines other queries into complex expressions  Scarlett AND "frankly my dear" -voldemort
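To make the matching rules in the table concrete, here is a rough sketch of how each query type decides whether a single term matches. These helpers are hypothetical and work on plain strings; the `max_edits=1` threshold for fuzzy matching is an assumption of the sketch, not Lucene's default.

```python
from fnmatch import fnmatchcase

def prefix_match(term, prefix):
    return term.startswith(prefix)

def wildcard_match(term, pattern):
    # * matches any run of characters, ? matches exactly one character
    return fnmatchcase(term, pattern)

def range_match(term, low, high, inclusive=True):
    # Terms are compared as strings, so "1861" <= "1863" <= "1865"
    if inclusive:
        return low <= term <= high
    return low < term < high

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_match(term, target, max_edits=1):
    return edit_distance(term, target) <= max_edits
```

For example, `wildcard_match("Atlanta", "Atl*")` and `fuzzy_match("dry", "cry")` both hold, while `range_match("1861", "1861", "1865", inclusive=False)` does not because the curly-brace form excludes the endpoints.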
Query, Filter, and Sort
• Lucene.Net can handle all three
• Default sort is by relevance
• Prefer queries to filters – they perform better
Using Dispose()
Linq Providers
• LINQ to Lucene
• http://linqtolucene.codeplex.com/
• Lucene.Net.Linq
• https://github.com/themotleyfool/Lucene.Net.Linq
• Chris Eldredge
• MotleyFool
Recap
• Why would I use a search engine?
• Why would I use Lucene.Net?
• How would I add Lucene.Net to my project?
– Web
– Desktop
• Where could I go to learn more?
• When can I buy Dean a beer?
REFERENCES
Web References
• Lucene.Net – http://lucenenet.apache.org
• Solr – http://lucene.apache.org/solr
• Wikipedia
– http://en.wikipedia.org/wiki/Lucene
– http://en.wikipedia.org/wiki/Search_engine_indexing
• Academic discussions
– http://lucene.sourceforge.net/talks/pisa/
– http://lucene.sourceforge.net/talks/inktomi/
Books
• Lucene in Action,
Second Edition
• Michael McCandless,
Erik Hatcher, Otis
Gospodnetić
• Manning Publications
• July 2010
• http://www.manning.com/hatcher3/
Books
• Taming Text
• Grant S. Ingersoll,
Thomas S. Morton,
Andrew L. Farris
• Manning Publications
• January 2013
• http://www.manning.com/ingersoll/
Books
• Introduction to
Information Retrieval
• Christopher D. Manning,
Prabhakar Raghavan,
Hinrich Schütze
• Cambridge University Press
• 2008
• http://www-nlp.stanford.edu/IR-book/
Presentations
• http://www.slideshare.net/nitin_stephens/lucene-basics
Blogs
• http://blog.mikemccandless.com/
Sample Files
All the literature shown in the code samples
comes from Project Gutenberg.
http://www.gutenberg.org/

Más contenido relacionado

La actualidad más candente

Integrating Doctrine with Laravel
Integrating Doctrine with LaravelIntegrating Doctrine with Laravel
Integrating Doctrine with LaravelMark Garratt
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Ceylon introduction by Stéphane Épardaud
Ceylon introduction by Stéphane ÉpardaudCeylon introduction by Stéphane Épardaud
Ceylon introduction by Stéphane ÉpardaudUnFroMage
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesRahul Singh
 
Ceylon SDK by Stéphane Épardaud
Ceylon SDK by Stéphane ÉpardaudCeylon SDK by Stéphane Épardaud
Ceylon SDK by Stéphane ÉpardaudUnFroMage
 
Apachesolr presentation
Apachesolr presentationApachesolr presentation
Apachesolr presentationfreeformkurt
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Python first day
Python first dayPython first day
Python first dayfarkhand
 
Java SE 7 New Features and Enhancements
Java SE 7 New Features and EnhancementsJava SE 7 New Features and Enhancements
Java SE 7 New Features and EnhancementsFu Cheng
 
Ceylon module repositories by Aleš Justin
Ceylon module repositories by Aleš JustinCeylon module repositories by Aleš Justin
Ceylon module repositories by Aleš JustinUnFroMage
 
WordPress 4.4
WordPress 4.4WordPress 4.4
WordPress 4.4Toru Miki
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyondAnshum Gupta
 

La actualidad más candente (20)

Integrating Doctrine with Laravel
Integrating Doctrine with LaravelIntegrating Doctrine with Laravel
Integrating Doctrine with Laravel
 
Solr vs ElasticSearch
Solr vs ElasticSearchSolr vs ElasticSearch
Solr vs ElasticSearch
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Drupal7 and Apache Solr
Drupal7 and Apache SolrDrupal7 and Apache Solr
Drupal7 and Apache Solr
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Ceylon introduction by Stéphane Épardaud
Ceylon introduction by Stéphane ÉpardaudCeylon introduction by Stéphane Épardaud
Ceylon introduction by Stéphane Épardaud
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Ceylon SDK by Stéphane Épardaud
Ceylon SDK by Stéphane ÉpardaudCeylon SDK by Stéphane Épardaud
Ceylon SDK by Stéphane Épardaud
 
Apachesolr presentation
Apachesolr presentationApachesolr presentation
Apachesolr presentation
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Python first day
Python first dayPython first day
Python first day
 
Python first day
Python first dayPython first day
Python first day
 
Java SE 7 New Features and Enhancements
Java SE 7 New Features and EnhancementsJava SE 7 New Features and Enhancements
Java SE 7 New Features and Enhancements
 
Ceylon module repositories by Aleš Justin
Ceylon module repositories by Aleš JustinCeylon module repositories by Aleš Justin
Ceylon module repositories by Aleš Justin
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
Akka.Net Overview
Akka.Net OverviewAkka.Net Overview
Akka.Net Overview
 
WordPress 4.4
WordPress 4.4WordPress 4.4
WordPress 4.4
 
Java and the JVM
Java and the JVMJava and the JVM
Java and the JVM
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyond
 

Similar a Illuminating Lucene.Net

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineDaniel N
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Vinay Kumar
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesAnant Corporation
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for DrupalChris Caple
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 

Similar a Illuminating Lucene.Net (20)

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for Drupal
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Illuminating Lucene.Net

  • 1. Illuminating Lucene.Net: Bringing Full-Text Search to Light W. Dean Thrasher 14 May 2013
  • 2. Agenda • About the presenter • About Lucene.Net – What it is – What it does – How it works – Who uses it – Why you should care
  • 3. More Agenda • Core concepts – Lucene structure – Luke – Terminology • Code examples • Things to know • Recap • References
  • 6. What is Lucene.Net? Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
  • 7. What is Lucene? Apache Lucene is a high-performance, full- featured text search engine library written entirely in Java. Apache Lucene is an open source project available for free download.
  • 8. History 1997 – Lucene project began by Doug Cutting 2000 – First open source release 2002 – First Apache Jakarta release 2005 – Lucene becomes a top-level project 2006 – Lucene.Net gets Apache incubation status 2010 – Lucene.Net orphaned by original committers 2011 – Lucene.Net reaccepted into Apache Incubator 2012 – Lucene.Net graduates from the Incubator
  • 9. Why you should care You want to provide customers with a “Google-like” search experience You want to tune incoming queries or results ranking You want better performance than SQL “like” searches You want to avoid deploying a separate search tool with your website or application
  • 10. What does it do? • Allows you to index and search vast amounts of text quickly • Provides a powerful query syntax • Integrates into applications easily
  • 11. How it works • Lucene uses an inverted index – Maps terms to the documents that contain them • Lucene manages its index – Stores the index in memory or on disk – Allows documents to be added or removed • Makes an index for each document • Merges the index with a set of other indices
  • 12. Who uses Lucene.Net? • Stackoverflow • RavenDB • Sitecore • Orchard • MindTouch • Umbraco • Sitefinity • SubText
  • 14. Differences between Java and .Net The Lucene.Net API: • Lags a few steps behind the Java version of Lucene • Takes advantage of advanced .Net features not found in Java But it: • Preserves the core Lucene concepts • Maintains indexes that are compatible with the Java version
  • 15. Logical Index Storage • Field – a name/value pair • Document – a sequence of fields • Index – a collection of documents
  • 16. Physical Index Storage • Lucene generates a series of files within a single directory • Moving an index is a copy-and-paste operation • You can compress or zip an index to archive it
  • 17. Luke • Lucene Index Toolbox • Built in Java, but can read Lucene.Net indexes • http://code.google.com/p/luke/
  • 18. Analyzers and Tokens • Analyzers take strings of text and break them into tokens • Tokens are chunks of text and associated metadata
  • 19. Terms, Queries and Hits • Terms – the basic unit for searching. A field name and a value to seek. • Queries – combine terms to form search criteria • Hits – a ranked list of pointers to documents
  • 21. Create documents demo • IndexWriter • Directory • Analyzer • Document • Field
  • 22. Read documents demo • IndexReader • Term • Query • Hits
  • 23. Update documents demo • IndexWriter • Document • Term
  • 24. Delete documents demo • IndexWriter • Query • Term
  • 25. Search demo • IndexSearcher • QueryParser • Query • Term • TopDocs • ScoreDoc
  • 27. Transactional Lucene • Lucene supports ACID commits to its indexes • Lucene uses the Commit and Rollback syntax, much like relational databases • Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
  • 28. Lucene index types • FSDirectory (your first choice) – Stores indexed documents on disk – Persists data across sessions – Best choice for most applications • RAMDirectory (useful for unit testing) – Stores indexed documents in memory – Entire index must fit into available memory – Does not persist data – Faster than FSDirectory
  • 29. Precalculation • How you store things in Lucene matters – choose field options and analyzers carefully • The way you retrieve information determines how it should be stored • Smaller indexes give you better performance
  • 30. Field.Store Yes – stores the text in its original form No – the original text is not preserved
  • 31. Field.Index • No – the field is not indexed, so it is not searchable • Not analyzed – the text is treated as single unit and indexed whole • Analyzed – the text is broken down into tokens and indexed
  • 32. Field.TermVector • No – Does not store term vectors • Yes – Stores the term vectors of each document (terms and number of occurrences) • With Positions Offsets – Term vector, token position and offset information
  • 33. Field types indexing options (an example of storing fields related to files on your computer) • Id – Stored: Yes; Analyzed: Not analyzed; Vectored: No • Modified – Stored: Yes; Analyzed: Not analyzed; Vectored: No • Path – Stored: Yes; Analyzed: Analyzed; Vectored: No • Content – Stored: No; Analyzed: Analyzed; Vectored: With Positions Offsets
  • 34. Analyzers • Break apart text into tokens; each token gets indexed separately • Remove stop words • Decide how to handle punctuation • Handle languages and case sensitivity • You can create your own by building from scratch or chaining existing analyzers
  • 35. Types of Queries • TermQuery • PhraseQuery • RangeQuery • PrefixQuery, WildcardQuery • FuzzyQuery • Use BooleanQuery to combine them
  • 36. Query syntax (query type: purpose; sample) • TermQuery: single-word query; scarlett • PhraseQuery: matches terms in order; “frankly my dear” • RangeQuery: matches documents between the terms; [1861 TO 1865] (inclusive), {1861 TO 1865} (exclusive) • WildcardQuery: lightweight regex-like term matching; Atl*, D?m? • PrefixQuery: matches terms that begin with the string; War* • FuzzyQuery: closeness matching; cry~ • BooleanQuery: combines other queries into complex expressions; Scarlett AND “frankly my dear” -voldemort
  • 37. Query, Filter, and Sort • Lucene.Net can handle all three • Default sort is by relevance • Prefer queries to filters – they perform better
  • 39. Linq Providers • LINQ to Lucene – http://linqtolucene.codeplex.com/ • Lucene.Net.Linq – https://github.com/themotleyfool/Lucene.Net.Linq – Chris Eldredge – MotleyFool
  • 40. Recap • Why would I use a search engine? • Why would I use Lucene.Net? • How would I add Lucene.Net to my project? – Web – Desktop • Where could I go to learn more? • When can I buy Dean a beer?
  • 42. Web References • Lucene.Net – http://lucenenet.apache.org • Solr – http://lucene.apache.org/solr • Wikipedia – http://en.wikipedia.org/wiki/Lucene – http://en.wikipedia.org/wiki/Search_engine_indexing • Academic discussions – http://lucene.sourceforge.net/talks/pisa/ – http://lucene.sourceforge.net/talks/inktomi/
  • 43. Books • Lucene in Action, Second Edition • Michael McCandless, Erik Hatcher, Otis Gospodnetić • Manning Publications • July 2010 • http://www.manning.com/hatcher3/
  • 44. Books • Taming Text • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris • Manning Publications • January 2013 • http://www.manning.com/ingersoll/
  • 45. Books • Introduction to Information Retrieval • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze • Cambridge University Press • 2008 • http://www-nlp.stanford.edu/IR-book/
  • 48. Sample Files All the literature shown in the code samples comes from Project Gutenberg. http://www.gutenberg.org/
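
The inverted index from slide 11, the analyzer behavior from slides 18 and 34, and the add/remove operations on documents can be sketched in a few lines. This is a conceptual illustration in Python, not the Lucene.Net API: the `analyze` function, the `InvertedIndex` class, the stop-word list, and the sample sentences are all assumptions made up for this example. Real Lucene indexes also store positions, frequencies, and ranking data that this sketch leaves out.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}
        self.documents = {}               # doc_id -> original text

    def add(self, doc_id, text):
        self.documents[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        text = self.documents.pop(doc_id, None)
        if text is None:
            return
        for term in analyze(text):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        # An update is modeled as a delete followed by an add.
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, query):
        """Return ids of documents containing every term in the query (AND)."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "Frankly, my dear, I don't give a damn")
index.add(2, "The war of the worlds")
print(index.search("my dear"))
```

Because the query passes through the same `analyze` function as the documents, a search for "The War" and a search for "war" hit the same postings. That symmetry is the reason the choice of analyzer matters at both index time and query time.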

Editor's notes

  1. Egad, the PUNishment! Well, at least I didn’t have a boring “Introduction to Lucene.NET” title.
  2. Oooh, an agenda. Aren’t I organized?
  3. Please send me an email to get in touch with me. Keep up with what I’m doing on the Infovark website or on my LinkedIn profile. I’ve listed my twitter handles – personal and work – but I rarely log into Twitter for any length of time. Send me a private message if you want to get my attention on Twitter.
  4. Doug Cutting had written search engines in other languages, but he wanted to teach himself Java. So the Lucene project began. Although he started building a commercial venture around the project, he decided that he preferred writing code to running a business. He open sourced the code in 2000. Lucene got adopted by the Apache Software Foundation in 2001. Lucene.Net, which began as an independent port of Lucene, was accepted by the ASF in 2006. In 2010, Lucene.Net hit a rough patch, but thanks to the efforts of the Alt.Net community, it was reintroduced to the Apache Incubator. In 2012, it graduated from the Incubator and became a full-fledged Apache project.
  5. Inverted index: maps terms to the documents that contain them. Terms may include metadata to improve ranking. Terms may include position data for proximity searches.
  6. These are a few examples of websites, applications, and platforms that use Lucene.Net. If I included those that use Lucene, the Java version, the list would be huge. Even if you don’t use Lucene.Net directly, chances are good that you use something that does. Lucene has become a foundational technology for many of the tools and sites we use today, but not many folks working on the Microsoft side are familiar with it. Some prominent Java examples include: LinkedIn, Twitter, IBM’s OmniFind, and many more.
  7. The .Net version is catching up with the Java version, but it remains nearly a full version behind. The .Net API is much nicer to work with, having good collections and generics support. Tools that interact with a Lucene index will work regardless of the Lucene library that created it.
  8. Although we’ll be working with the Lucene.NET API tonight, many of the concepts you’ll hear will apply to any search engine, though the specific terminology may differ a little. Let’s review some basic definitions we’ll use throughout the rest of the presentation. Index – a collection of documents. Document – a sequence of fields. Field – a string name/value pair.
  9. Luke is one of the ugliest applications I’ve ever seen, but it’s extremely useful. It exposes just about every aspect of the Lucene API, so it makes a great test-bed for trying out different ideas.
  10. Analyzer – breaks field values into tokens. Token – a tuple consisting of a chunk of text and its associated metadata. Tokens are the raw bits that get indexed. (Tokens and terms are closely related.)
  11. Query – a way to ask a question of an index. Term – a tuple containing a field and a value to seek.
  12. Here are some of the key classes used to add documents to the index. I really ought to add some details to the slide for folks who can’t see the code sample.
  13. Updating is a fairly new operation in the Lucene.Net API. Under the hood, it’s doing a Delete operation then an Add operation.
  14. Did you know that you can use an IndexReader to update and delete documents, too? Yes, but I don’t recommend it. This is one of the parts of the API that’s getting revised in the near future.
  15. Unlike a relational database, there’s no “normal form” to guide you when structuring a Lucene index. The key thing to remember is that the
  16. Keeping the original text within the Lucene index is convenient, but can vastly increase the size of your indexes.
  17. Term Vector Yes
  18. Just an example of how you might combine the flags when adding fields to a document.
  19. TermQuery – retrieves documents by a key. PrefixQuery – matches the start of a string value. RangeQuery – searches starting at one term and ending at another (useful for date searches). BooleanQuery – lets you combine other queries using AND, OR, NOT operations. PhraseQuery – finds terms a specified distance from one another. FuzzyQuery – matches terms similar to a specified term.
  20. Examples of query syntax.
  21. Some odds and ends on Queries, filters and sorting.
  22. We can finally dispose of our Lucene objects in versions 2.9.4 and later. If you’re using older versions, you must remember to try/finally the FSDirectory and IndexWriter. Remember that it’s much more efficient to add a bunch of documents within a single using statement than to open a new IndexWriter each time.
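
The query types described in note 19 can be illustrated against a plain list of indexed terms. The Python below is a conceptual sketch, not Lucene.Net code: the function names, the sample term list, and the fixed edit-distance threshold used for fuzzy matching are assumptions chosen for this example. It also assumes the terms were lowercased at index time, so the patterns are written in lowercase.

```python
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def prefix_query(terms, prefix):
    """Terms that begin with the given string."""
    return sorted(t for t in terms if t.startswith(prefix))


def wildcard_query(terms, pattern):
    """'*' matches any run of characters, '?' matches exactly one."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


def range_query(terms, lower, upper, inclusive=True):
    """Terms between two bounds, compared as strings."""
    if inclusive:
        return sorted(t for t in terms if lower <= t <= upper)
    return sorted(t for t in terms if lower < t < upper)


def fuzzy_query(terms, target, max_edits=1):
    """Terms within max_edits single-character edits of the target."""
    return sorted(t for t in terms if edit_distance(t, target) <= max_edits)


# Hypothetical term dictionary for one field.
terms = ["atlanta", "atlas", "cry", "crypt", "dame", "dry", "dumb",
         "war", "warden", "1861", "1863", "1866"]

print(prefix_query(terms, "war"))
print(wildcard_query(terms, "d?m?"))
print(range_query(terms, "1861", "1865"))
print(fuzzy_query(terms, "cry"))
```

Each helper scans the whole term list, which is fine for a dozen terms. A real search library keeps its term dictionary sorted so that prefix and range lookups can skip most of it.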