Finding stuff under the Couch with CouchDB-Lucene

Martin Rehfeld
@ RUG-B 01-Apr-2010
CouchDB

•   JSON document store
•   all documents in a given database reside in
    one large pool and may be retrieved using
    their ID ...
•   ... or through Map & Reduce based indexes
So how do you do full text search?
You potentially could achieve this with just Map & Reduce functions
But that would mean implementing an actual search engine ...
... and this has been done before.
Enter Lucene
Apache Lucene is a high-performance, full-featured text search
engine library written entirely in Java. It is a technology suitable
for nearly any application that requires full-text search, especially
cross-platform.
                  Courtesy of The Apache Foundation
Lucene Features
•   ranked searching
•   many powerful query types: phrase queries,
    wildcard queries, proximity queries, range
    queries and more
•   fielded searching (e.g., title, author, contents)
•   boolean operators
•   sorting by any field
•   allows simultaneous update and searching
CouchDB Integration
•   couchdb-lucene (ready to run Lucene plus CouchDB interface)
•   Search interface via http_db_handlers, usually _fti
•   Indexer interface via CouchDB update_notification facility and fulltext design docs
Sample design document, i.e., _id: "_design/search"

{
    "fulltext": {
        "by_name": {
            "defaults": { "store": "yes" },
            "index": "function(doc) { var ret=new Document(); ret.add(doc.name); return ret }"
        }
    }
}
Sample design document, annotated

•   "by_name": name of the index
•   "defaults": default options (can be overridden per field)
•   "index": the index function; it builds and returns documents to be put into Lucene's index (may return an array of multiple documents)
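The sample design document above could also be assembled from Ruby, which makes it easier to see that the index function is simply a JavaScript string inside a JSON structure. This is a minimal sketch; the helper name fulltext_design_doc is hypothetical and not part of couchdb-lucene.

```ruby
require 'json'

# The JavaScript index function travels as a plain string inside the design document.
INDEX_FUNCTION = 'function(doc) { var ret=new Document(); ret.add(doc.name); return ret }'

# Hypothetical helper: builds the design document from the slide as a Ruby hash.
def fulltext_design_doc(index_name, index_function)
  {
    '_id' => '_design/search',
    'fulltext' => {
      index_name => {
        'defaults' => { 'store' => 'yes' },
        'index' => index_function
      }
    }
  }
end

design_doc = fulltext_design_doc('by_name', INDEX_FUNCTION)
puts JSON.pretty_generate(design_doc)
```

The generated JSON is what would be stored in CouchDB under the _id "_design/search", for example with an HTTP PUT to that document's URL.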
Querying the index
http://localhost:5984/your-couch-db/_fti/your-design-document-name/your-index-name?

 q=             query string
 sort=          comma-separated fields to sort on
 limit=         max number of results to return
 skip=          offset
 include_docs=  include CouchDB documents in response
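As an illustration of the URL pattern above, the following sketch assembles a search URL from its parts using only Ruby's standard library. The helper name fti_url and the example values (mydb, search, global) are assumptions chosen to match the later slides, not fixed names.

```ruby
require 'uri'

# Hypothetical helper: assembles a couchdb-lucene search URL.
# Host and port are the localhost defaults used on the slides.
def fti_url(db, design_doc, index, params = {})
  query = URI.encode_www_form(params) # percent-encodes keys and values
  "http://localhost:5984/#{db}/_fti/#{design_doc}/#{index}?#{query}"
end

url = fti_url('mydb', 'search', 'global',
              q: 'name:rehfeld', limit: 10, skip: 20, include_docs: true)
puts url
```

Encoding the parameters with URI.encode_www_form keeps characters such as the colon in a fielded query from being misread as part of the URL.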
A full stack example
CouchDB Person Document
{
    "_id": "9db68c69726e486b811859937fbb6b09",
    "_rev": "1-c890039865e37eb8b911ff762162772e",
    "name": "Martin Rehfeld",
    "email": "martin.rehfeld@glnetworks.de",
    "notes": "Talks about CouchDB Lucene"
}
Objectives

•   Search for people by name
•   Search for people by any field's content
•   Querying from Ruby
•   Paginating results
Index Function
function(doc) {
  // first check if doc is a person document!
  ...
  var ret=new Document();
  ret.add(doc.name);
  ret.add(doc.email);
  ret.add(doc.notes);
  ret.add(doc.name, {field:"name", store:"yes"});
  ret.add(doc.email, {field:"email", store:"yes"});
  return ret;
}
Index Function, annotated

•   ret.add(doc.name), ret.add(doc.email), ret.add(doc.notes): content added to the "default" field
•   ret.add(doc.name, {field:"name", store:"yes"}), ret.add(doc.email, {field:"email", store:"yes"}): content added to named fields
Field Options
name    description                                    available options
field   the field name to index under                  user-defined
type    the type of the field                          date, double, float, int, long, string
store   whether the data is stored; the value will     yes, no
        be returned in the search result
index   whether (and how) the data is indexed          analyzed, analyzed_no_norms, no,
                                                       not_analyzed, not_analyzed_no_norms
Querying the Index I
http://localhost:5984/mydb/_fti/search/global?q=couchdb
 {
     "q": "default:couchdb",
     "etag": "119e498956048ea8",
     "skip": 0,
     "limit": 25,
     "total_rows": 1,
     "search_duration": 0,
     "fetch_duration": 8,
     "rows": [
       {
         "id": "9db68c69726e486b811859937fbb6b09",
         "score": 4.520571708679199,
         "fields": {
           "name": "Martin Rehfeld",
           "email": "martin.rehfeld@glnetworks.de"
         }
       }
     ]
 }
Querying the Index I, annotated

•   "q" reads "default:couchdb": the default field is queried
•   "fields": content of fields with the store:"yes" option is returned with the query results
Querying the Index II
http://localhost:5984/mydb/_fti/search/global?q=name:rehfeld
 {
     "q": "name:rehfeld",
     "etag": "119e498956048ea8",
     "skip": 0,
     "limit": 25,
     "total_rows": 1,
     "search_duration": 0,
     "fetch_duration": 8,
     "rows": [
       {
         "id": "9db68c69726e486b811859937fbb6b09",
         "score": 4.520571708679199,
         "fields": {
           "name": "Martin Rehfeld",
           "email": "martin.rehfeld@glnetworks.de"
         }
       }
     ]
 }
Querying the Index II, annotated

•   "q" reads "name:rehfeld": the name field is queried
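To show how the stored fields in such a response could be used without touching CouchDB, here is a small sketch that parses a response and collects the hits. The response body is abridged from the slides (etag and durations omitted); the shape of the hits hashes is an assumption made for illustration.

```ruby
require 'json'

# Sample response abridged from the slides.
response_body = <<~JSON
  {
    "q": "name:rehfeld",
    "skip": 0,
    "limit": 25,
    "total_rows": 1,
    "rows": [
      {
        "id": "9db68c69726e486b811859937fbb6b09",
        "score": 4.520571708679199,
        "fields": {
          "name": "Martin Rehfeld",
          "email": "martin.rehfeld@glnetworks.de"
        }
      }
    ]
  }
JSON

result = JSON.parse(response_body)

# Each row carries the document id, the score, and the stored fields.
hits = result['rows'].map do |row|
  { id: row['id'], name: row['fields']['name'], email: row['fields']['email'] }
end

hits.each { |hit| puts "#{hit[:name]} <#{hit[:email]}>" }
```

Only fields indexed with store:"yes" appear under "fields", so anything else would still require fetching the document by its id or passing include_docs.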
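Querying from Ruby

require 'httparty'
require 'couch_potato'

class Search
  include HTTParty

  # All searches go to the "search" design document of the configured database
  base_uri "localhost:5984/#{CouchPotato::Config.database_name}/_fti/search"
  format :json

  # :index selects the index; all remaining options become query parameters
  def self.query(options = {})
    index = options.delete(:index)
    get("/#{index}", :query => options)
  end
end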
Controller / Pagination
class SearchController < ApplicationController
  HITS_PER_PAGE = 10

  def index
    result = Search.query(params.merge(:skip => skip, :limit => HITS_PER_PAGE))
    @hits = WillPaginate::Collection.create(params[:page] || 1, HITS_PER_PAGE,
                                            result['total_rows']) do |pager|
      pager.replace(result['rows'])
    end
  end

  private

  def skip
    params[:page] ? (params[:page].to_i - 1) * HITS_PER_PAGE : 0
  end
end
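The skip calculation in the controller can be checked in isolation. This sketch pulls the same arithmetic into a standalone method; the name skip_for is hypothetical, and the page values are passed as strings because that is how they would arrive in params.

```ruby
HITS_PER_PAGE = 10

# Converts a 1-based page number into the skip offset sent to the search index.
# A missing page (nil) is treated as page 1.
def skip_for(page, per_page = HITS_PER_PAGE)
  page ? (page.to_i - 1) * per_page : 0
end

[nil, '1', '2', '5'].each do |page|
  puts "page=#{page.inspect} -> skip=#{skip_for(page)}"
end
```

A page value of 0 or a non-numeric string would produce a negative offset with this arithmetic, so a production controller would probably want to clamp the page to at least 1.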
Resources

•   http://couchdb.apache.org/
•   http://lucene.apache.org/java/docs/index.html
•   http://github.com/rnewson/couchdb-lucene
•   http://lucene.apache.org/java/3_0_1/queryparsersyntax.html
Q & A

    Martin Rehfeld

    http://inside.glnetworks.de
    martin.rehfeld@glnetworks.de

    @klickmich


Editor's Notes

  1. short recap of what CouchDB is
  2. some (very) limited examples are actually floating around
  3. map all documents, split them into words, push the words through a stemmer, and cross-index them with the documents containing them
  4. ... multiple times, in fact
  5. add all searchable content to the default field, add fields for searching by individual field or using contents in view
  6. the stored field contents can be used to render search results without touching CouchDB
  7. the stored field contents can be used to render search results without touching CouchDB
  8. could be as simple as that (using the httparty gem & Couch Potato), sans error handling
  9. using the Search class in a controller + pagination; utilizing the will_paginate gem