This document provides an overview of Lucene and how it can be used with MySQL. It discusses:
- What Lucene is and its origins as an open source information retrieval library.
- How Lucene works as a toolkit for building search applications rather than a turnkey search engine.
- Core Lucene classes like IndexWriter, Directory, Analyzer, and Document that are used for indexing data.
- Classes like IndexSearcher and Query that support basic search operations through queries and hits.
- Examples of loading data from a MySQL database into a Lucene index and performing searches on that indexed data.
3. What is Lucene?
Started in 1997 as a “self-serving project”
2001: Apache adopts Lucene
Open Source Information Retrieval (IR) Library
- available from the Apache Software Foundation
- Search and Index any textual data
- Doesn’t care about language, source and format of data
4. Lucene?
Not a turnkey search engine
Standard
- for building open-source based large-scale search applications
- a high performance, scalable, cross-platform search toolkit
- Today: translated into C++, C#, Perl, Python, Ruby
- for embedded and customizable search
- widely adopted by OEM software vendors and enterprise IT departments
6. What types of queries it supports
Single and multi-term queries
Phrase queries
Wildcards
Result ranking
+apple -computer +pie
country:USA
country:USA AND state:CA
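The "+" and "-" prefixes mark required and prohibited terms. The matching rule can be sketched in plain Java with no Lucene dependency. The class name PrefixQuerySketch and the method matches are hypothetical, and this is only an illustration of the two rules: real Lucene also scores and ranks results, and handles unprefixed (optional) terms, which this sketch ignores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of +required / -prohibited matching; not Lucene code.
final class PrefixQuerySketch {

    /** True if the text contains every +term and none of the -terms. */
    static boolean matches(String query, String text) {
        // Lower-case the text and split it into a set of words.
        Set<String> words = new HashSet<String>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        for (String clause : query.trim().split("\\s+")) {
            if (clause.length() < 2) {
                continue; // empty clause or a bare operator
            }
            String term = clause.substring(1).toLowerCase(Locale.ROOT);
            if (clause.charAt(0) == '+' && !words.contains(term)) {
                return false; // a required term is missing
            }
            if (clause.charAt(0) == '-' && words.contains(term)) {
                return false; // a prohibited term is present
            }
            // Unprefixed (optional) terms are ignored in this sketch.
        }
        return true;
    }

    public static void main(String[] args) {
        String query = "+apple -computer +pie";
        System.out.println(matches(query, "Grandma's apple pie recipe")); // true
        System.out.println(matches(query, "Apple computer pie chart"));   // false
    }
}
```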
7. Cons
Need Java resources (programmers)
- JSP experience a plus
Implementation and Maintenance Cost
By default
- No installer or wizard for setup (it’s a toolkit )
- No administration or command line tools (demo avail.)
- No spider
- Coding yourself is always an option
- No complex script language support by default
- 3rd-party tools available
8. Cons 2
- No built-in support for the following (demos available showing how to implement):
- HTML format
- PDF format
- Microsoft Office Documents
- Advanced XML queries
- “How tos” available.
- No database gateway
- Integrates with MySQL with little work
- Web interface
- JSP sample available
- Missing enterprise support
9. Lucene Libraries
The Lucene libraries include core search components such as a document indexer, index searcher, query parser, and text analyzer.
10. Who is behind Lucene?
Doug Cutting (Author)
Previously at Excite
Apache Software Foundation
11. Who uses Lucene?
IBM
- IBM OmniFind Yahoo! Edition
CNET
- http://reviews.cnet.com/
- http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
Wikipedia
Fedex
Akamai’s EdgeComputing platform
Technorati
FURL
Sun
- Open Solaris Source Browser
12. When to use Lucene?
Search applications
Search functionality for existing applications
Search enabling database application
13. When not to use?
Not ideal for
- Adding generic search to site
- Enterprise systems needing support for proprietary formats
- Extremely high volume systems
- Through a better architecture this can be solved
- Investigate carefully if
- You need more than 100 QPS per system
- Highly volatile data
- Updates are actually Deletes and Additions
- Additions visible to new sessions only
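The last two points (updates are a delete plus an add, and additions are visible to new sessions only) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Lucene's implementation: the class UpdateSketch and its methods are hypothetical, and a "session" is modelled as a snapshot copy taken when the searcher is opened.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of an index; not Lucene code.
final class UpdateSketch {
    private final Map<String, String> live = new LinkedHashMap<String, String>();

    void add(String id, String body) { live.put(id, body); }

    void delete(String id) { live.remove(id); }

    // An "update" is modelled as a delete followed by an add.
    void update(String id, String body) {
        delete(id);
        add(id, body);
    }

    // Opening a searcher takes a snapshot; later writes are invisible to it.
    Map<String, String> openSearcher() {
        return Collections.unmodifiableMap(new LinkedHashMap<String, String>(live));
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add("1", "old text");
        Map<String, String> oldSession = index.openSearcher();
        index.update("1", "new text");
        Map<String, String> newSession = index.openSearcher();
        System.out.println(oldSession.get("1")); // old text
        System.out.println(newSession.get("1")); // new text
    }
}
```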
14. Why Lucene?
What problems does Lucene solve?
- Full text with MySQL
- Pros and Cons
Powerful features
Simple API
Scalable, cost-effective, efficient Indexing
- Powerful Searching through multiple query types
17. IndexWriter
IndexWriter
- Creates new index
- Adds document to new index
- Gives you “write” access but no “read” access
- Not the only class used to modify an index
- Lucene API can be used as well
18. Directory
Directory
- Represents location of the Lucene Index
- Abstract class
- Allows its subclasses to store the index as they see fit
- FSDirectory
- RAMDirectory
- Interface Identical to FSDirectory
19. Analyzer
Analyzer
- Text passed through analyzer before indexing
- Specified in the IndexWriter constructor
- In charge of extracting tokens out of text to be indexed
- Rest is eliminated
- Several implementations available (stop words, lower case, etc.)
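What an analyzer does (split text into tokens, lower-case them, drop stop words) can be sketched in plain Java. This is a minimal illustration, not one of Lucene's analyzers: the class AnalyzerSketch, the method analyze, and the tiny stop-word list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer; not a Lucene class.
final class AnalyzerSketch {
    // Tiny illustrative stop list; not Lucene's actual list.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    /** Splits on non-letter/digit characters, lower-cases, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [art, indexing, simple]
        System.out.println(analyze("The Art of Indexing is Simple"));
    }
}
```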
20. Document
Document
- Collection of fields (virtual document)
- Chunk of data
- Fields of a document represent the document or meta-data associated with that document
- Original source of Document data (Word, PDF) irrelevant
- Metadata indexed and stored separately as fields of a document
- Text only: java.lang.String and java.io.Reader are the only things handled by core
21. Field 1
Field
- Document in an index contains one or more fields (in a class called Field)
- Each field represents data that is either queried against or retrieved from index during search.
- Four different types:
- Keyword
- Isn’t analyzed
- But indexed and stored in the index
- Ideal for:
- URLs
- Paths
- SSN
- Names
- Original value is preserved in its entirety
22. Field types
- Unindexed
- Neither analyzed nor indexed
- Value stored in index as is
- Fields that need to be displayed with search results (URL etc.)
- But you won’t search based on these fields
- Because original values are stored
- Don’t store fields with very large values
- Especially if index size will be an issue
23. Field types
- Unstored
- Opposite of UnIndexed
- Field type is analyzed and indexed but isn’t stored in the index
- Suitable for indexing a large amount of text that’s not going to be needed in original form
- E.g.
- HTML of a webpage etc.
24. Field types
- Text
- Analyzed and indexed
- Field of this type can be searched against
- Be careful about the field size
- If data indexed is String, it will be stored
- If Data is from a Reader
- It will not be stored
25. Note:
Field.Text(String, String) and Field.Text(String, Reader) are different.
- (String, String) stores the field data
- (String, Reader) does not
To index a String, but not store it, use
- Field.UnStored(String, String)
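The four field types described on the preceding slides differ only in three yes/no properties: analyzed, indexed, stored. The summary below restates them as a small Java enum. The names FieldTypeSketch and FieldType are hypothetical and are not part of the Lucene API; the flag values are taken from the slides.

```java
// Hypothetical summary of the four field types; not part of the Lucene API.
final class FieldTypeSketch {

    enum FieldType {
        KEYWORD(false, true, true),    // not analyzed, indexed, stored
        UNINDEXED(false, false, true), // not analyzed, not indexed, stored as is
        UNSTORED(true, true, false),   // analyzed, indexed, not stored
        TEXT(true, true, true);        // stored when built from a String, not from a Reader

        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        FieldType(boolean analyzed, boolean indexed, boolean stored) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }

        /** Only indexed fields can be searched against. */
        boolean searchable() { return indexed; }

        /** Only stored fields can be shown with search results. */
        boolean retrievable() { return stored; }
    }

    public static void main(String[] args) {
        for (FieldType type : FieldType.values()) {
            System.out.println(type + " analyzed=" + type.analyzed
                    + " indexed=" + type.indexed + " stored=" + type.stored);
        }
    }
}
```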
26. Classes for Basic Search Operations
IndexSearcher
- Opens an index in read-only mode
- Offers a number of search methods
- Some of which are implemented in the Searcher class
IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory("/tmp/index", false));
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
27. Classes for Basic Search Operations
Term
- Basic unit for searching
- Consists of a pair of string elements: name of field and value of field
- Term objects are involved in the indexing process
- Term objects can be constructed and used with TermQuery
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
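Why a (field, value) pair is the basic unit of search becomes clearer with a toy inverted index keyed by that pair. The sketch below is plain Java and only an illustration of the idea: TermIndexSketch, add, and search are hypothetical names, and Lucene's real index structure is far more elaborate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy inverted index; not Lucene code.
final class TermIndexSketch {
    // Key is "field:value", value is the sorted set of matching document ids.
    private final Map<String, SortedSet<Integer>> postings =
            new HashMap<String, SortedSet<Integer>>();

    /** Indexes every word of the text under the given field name. */
    void add(int docId, String field, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            String key = field + ":" + word;
            SortedSet<Integer> ids = postings.get(key);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(key, ids);
            }
            ids.add(docId);
        }
    }

    /** Looks up one term (field name plus value), like a TermQuery would. */
    SortedSet<Integer> search(String field, String value) {
        SortedSet<Integer> ids = postings.get(field + ":" + value);
        if (ids == null) {
            return Collections.unmodifiableSortedSet(new TreeSet<Integer>());
        }
        return Collections.unmodifiableSortedSet(ids);
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add(1, "contents", "Lucene in Action");
        index.add(2, "contents", "MySQL full text");
        index.add(3, "title", "Lucene");
        System.out.println(index.search("contents", "lucene")); // [1]
        System.out.println(index.search("title", "lucene"));    // [3]
    }
}
```

The same word indexed under a different field is a different term, which is why the search on "contents" does not return document 3.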
28. Classes for Basic Search Operations
Query
- A number of query subclasses
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- RangeQuery
- FilteredQuery
- SpanQuery
29. Classes for Basic Search Operations
TermQuery
- Most basic type of query supported by Lucene
- Used for matching documents that contain fields with specific values
Hits
- Simple container of pointers to ranked search results.
- Hits instances don’t load from index all documents that match a query but only a small portion (performance)
30. Indexing
Multiple type indexing
- Scalable
- High Performance
- “over 20MB/minute on Pentium M 1.5GHz”
- Incremental indexing and batch indexing have same cost
- Index Size
- index size roughly 20-30% the size of text indexed
- Compare to MySQL’s FULL-TEXT index size
- Cost-effective
- 1 MB heap (small RAM needed)
31. Powerful Searching & Sorting
- Ranked Searching
- Multiple Powerful Query Types
- phrase queries, wildcard queries, proximity queries, range queries and more
- Fielded Searching
- fielded searching (e.g., title, author, contents)
- Date Range Searching
- date-range searching
- Multiple Index Searching with Merged Results
- Sort by any field
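Searching several indexes and merging the results amounts to combining ranked hit lists and keeping the best-scoring entries. The sketch below shows that merge step in plain Java. It assumes the scores from the two indexes are directly comparable, which is a simplification; MergeSketch, Hit, and merge are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical merge of ranked results from two indexes; not Lucene code.
final class MergeSketch {

    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Combines two hit lists and returns the top `limit` hits by descending score. */
    static List<Hit> merge(List<Hit> first, List<Hit> second, int limit) {
        List<Hit> all = new ArrayList<Hit>(first);
        all.addAll(second);
        // Highest score first; assumes scores are comparable across indexes.
        all.sort((x, y) -> Float.compare(y.score, x.score));
        int end = Math.max(0, Math.min(limit, all.size()));
        return new ArrayList<Hit>(all.subList(0, end));
    }

    public static void main(String[] args) {
        List<Hit> articles = Arrays.asList(new Hit("article-1", 0.9f), new Hit("article-2", 0.2f));
        List<Hit> comments = Arrays.asList(new Hit("comment-7", 0.5f));
        for (Hit hit : merge(articles, comments, 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```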
32. How to Integrate Your Application With Lucene
Install JDK (5 or 6)
Testing Lucene Demo
33. Prerequisites: JDK
Installing JDK
- To download, visit the JDK 5 page: http://java.sun.com/javase/downloads/index_jdk5.jsp
- or the JDK 6 download page: http://java.sun.com/javase/downloads/index.jsp
- Once downloaded:
- Change Permissions
- [root@srv31 jdk-install]# chmod 755 jdk-1_5_0_09-linux-i586.bin
- Install
- [root@srv31 jdk-install]# ./jdk-1_5_0_09-linux-i586.bin
34. Testing Lucene Demo
Step 2: Testing Lucene Demo
- Set up your environment
- vi /root/.bashrc
- export PATH=/var/www/html/java/jdk1.5.0_09/bin:$PATH
- export CLASSPATH=.:/var/www/html/java/jdk1.5.0_09:/var/www/html/java/jdk1.5.0_09/lib:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-core-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-demos-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
- Now download the following and place them in /var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/
- Lucene Java
- http://www.apache.org/dyn/closer.cgi/lucene/java/
- XMLRPC Library
- [root@srv31 lib]# wget http://mirror.candidhosting.com/pub/apache/lucene/java/lucene-2.1.0.zip
[root@srv31 lib]# unzip lucene-2.1.0.zip
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-core-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-demos-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p /var/www/html/java/jdk1.5.0_06/lib/xmlrpc-3.0a1.jar /var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
Now "dot" (source) the .bashrc file edited above:
[root@srv31 lib]# . /root/.bashrc
35. Testing Lucene Demo 2
- Believe it or not, we are now ready to test the Lucene Demo.
- Indexing
- I just let it loose on a randomly picked directory to give you an idea:
[root@srv31 lib]# java org.apache.lucene.demo.IndexFiles /var/www/html/java/jdk1.5.0_09/
adding /var/www/html/java/jdk1.5.0_09/include/jni.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jawt_md.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jni_md.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmti.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmdi.h
Optimizing...
157013 total milliseconds
37. Loading data from MySQL
…
String url = "jdbc:mysql://127.0.0.1/odp";
Connection con = DriverManager.getConnection(url, "user", "pass");
Statement Stmt = con.createStatement();
ResultSet RS = Stmt.executeQuery("SELECT * FROM articles");
38. Loading data from MySQL 2
while (RS.next()) {
// System.out.print("\"" + RS.getString(1) + "\"");
try {
final Document doc = new Document();
// create Document
doc.add(Field.Text("title", RS.getString("title")));
doc.add(Field.Text("type", "article"));
doc.add(Field.Text("author", RS.getString("author")));
doc.add(Field.Text("body", RS.getString("body")));
doc.add(Field.Text("extended", RS.getString("extended")));
…
39. Loading data from MySQL 3
…
doc.add(Field.Text("tags", RS.getString("tags")));
doc.add(Field.UnIndexed("permalink", RS.getString("permalink") ));
doc.add(Field.UnIndexed("id", RS.getString("id")));
doc.add(Field.UnIndexed("member_id", RS.getString("member_id")));
doc.add(Field.UnIndexed("portal_id", RS.getString("portal_id")));
//doc.add(Field.Text("id", RS.getString("id")));
writer.addDocument(doc);
}
catch (IOException e) { System.err.println("Unable to index article"); }
}
// close connection
40. Searching Data using XML RPC
public static void searchArticles( final String search, final int numberOfResults)
throws Exception
{
final Query query;
Analyzer analyzer = new StandardAnalyzer();
query = QueryParser.parse(search, "title", analyzer);
final ArrayList ids = new ArrayList();
try {
final IndexReader reader = IndexReader.open(INDEX_DIR);
final IndexSearcher searcher = new IndexSearcher(reader);
final Hits hits = searcher.search(query);
for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
final Document doc = hits.doc(i);
// id field needs to be added //ids.add(new Integer(doc.getField("id").stringValue()));
…
42. Searching Data using XML RPC 3
}
searcher.close();
reader.close();
}
catch (IOException e) {
System.out.println("Error while reading article data from index");
}
}
43. Future of Lucene
Advanced Linguistics Modules that integrate with Lucene
- Support for complex script languages
- Basis Technology’s Rosette® Linguistics Platform
- The same linguistic software that powers multilingual web search on Google, Live.com, Yahoo! and leading enterprise search engines
- “allows Lucene-based applications to index and search text in multiple languages concurrently, including complex script languages such as Arabic, Chinese, Farsi, Japanese and Korean.”
- www.basistech.com/lucene
44. What are the ports of Lucene
Lucene4c - C
CLucene - C++
MUTIS - Delphi
Lucene.Net - a straight C#/.NET port of Lucene by the Apache Software Foundation, fully compatible with it.
Plucene - Perl
KinoSearch - Perl
PyLucene - Lucene interfaced with a Python front-end
Ferret and RubyLucene - Ruby
Zend Framework (Search) - PHP
Montezuma - Common Lisp
45. Where to get help about Lucene?
http://lucene.apache.org/java/docs/mailinglists.html
IRC