19. WHY SEARCH SUCKS?
How do you implement search?
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
20. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
MAGIC
25. HOW DOES SEARCH WORK?
A collection of documents
file_1.txt
The ruby is a pink to blood-red colored gemstone ...
file_2.txt
Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...
file_3.txt
"Ruby" is a song by English rock band Kaiser Chiefs ...
26. HOW DOES SEARCH WORK?
How do you search documents?
File.read('file_1.txt').include?('ruby')
File.read('file_2.txt').include?('ruby')
...
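The scan above touches every document for every query. Here is a minimal sketch of that approach, with in-memory strings standing in for the three files; the `DOCUMENTS` hash and the `naive_search` method are illustrative names, not part of the talk. Note that `include?` is case-sensitive, so the check on the slide would miss "Ruby" in file_2.txt and file_3.txt; the sketch downcases both sides.

```ruby
# Hypothetical in-memory stand-ins for file_1.txt .. file_3.txt
DOCUMENTS = {
  'file_1.txt' => 'The ruby is a pink to blood-red colored gemstone ...',
  'file_2.txt' => 'Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...',
  'file_3.txt' => '"Ruby" is a song by English rock band Kaiser Chiefs ...'
}

# Scan every document for the term; the work grows with the size of the collection
def naive_search(term)
  DOCUMENTS.select { |_name, content| content.downcase.include?(term.downcase) }.keys
end

p naive_search('ruby')
p naive_search('song')
```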
27. HOW DOES SEARCH WORK?
The inverted index
TOKENS       POSTINGS
ruby         file_1.txt file_2.txt file_3.txt
pink         file_1.txt
gemstone     file_1.txt
dynamic      file_2.txt
reflective   file_2.txt
programming  file_2.txt
song         file_3.txt
english      file_3.txt
rock         file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
28. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "ruby"
29. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "song"
30. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "ruby AND song"
TOKENS       POSTINGS
ruby         file_1.txt file_2.txt file_3.txt
pink         file_1.txt
gemstone     file_1.txt
dynamic      file_2.txt
reflective   file_2.txt
programming  file_2.txt
song         file_3.txt
english      file_3.txt
rock         file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
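A query such as "ruby AND song" can be answered by intersecting the posting lists of both tokens. This is a minimal sketch under assumptions: `MySearchLib` on the slides is a placeholder, and the `POSTINGS` hash and `and_search` method below are names made up for illustration.

```ruby
# The inverted index from the slide: token => posting list
POSTINGS = {
  'ruby'        => %w[file_1.txt file_2.txt file_3.txt],
  'pink'        => %w[file_1.txt],
  'gemstone'    => %w[file_1.txt],
  'dynamic'     => %w[file_2.txt],
  'reflective'  => %w[file_2.txt],
  'programming' => %w[file_2.txt],
  'song'        => %w[file_3.txt],
  'english'     => %w[file_3.txt],
  'rock'        => %w[file_3.txt]
}

# Split the query on AND, look up each token, keep documents present in every list
def and_search(query)
  tokens = query.downcase.split(/\s+and\s+/)
  tokens.map { |token| POSTINGS.fetch(token, []) }.reduce { |a, b| a & b } || []
end

p and_search('ruby')
p and_search('ruby AND song')
```

Each lookup is a hash access, so the cost depends on the length of the posting lists, not on the size of the documents.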
31. A naïve Ruby implementation
module SimpleSearch
  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words into "tokens"
    content.split(/\W/).
    # >>> Downcase every word
    map { |word| word.downcase }.
    # >>> Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Save the "posting"
      ( (INDEX[token] ||= []) << document_id ).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # >>> Print documents stored in index for this token (none if the token is unknown)
    (INDEX[token] || []).each { |document| puts " * #{document}" }
  end

  INDEX = {}
  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there t
  extend self
end
32. HOW DOES SEARCH WORK?
Indexing documents
SimpleSearch.index "file1", "Ruby is a language. Java is also a language."
SimpleSearch.index "file2", "Ruby is a song."
SimpleSearch.index "file3", "Ruby is a stone."
SimpleSearch.index "file4", "Java is a language."
Indexed document file1 with tokens:
["ruby", "language", "java", "also", "language"]
Indexed document file2 with tokens:
["ruby", "song"]
Indexed document file3 with tokens:
["ruby", "stone"]
Indexed document file4 with tokens:
["java", "language"]
Words downcased, stopwords removed.
33. HOW DOES SEARCH WORK?
The index
puts "What's in our index?"
p SimpleSearch::INDEX
{
"ruby" => ["file1", "file2", "file3"],
"language" => ["file1", "file4"],
"java" => ["file1", "file4"],
"also" => ["file1"],
"stone" => ["file3"],
"song" => ["file2"]
}
34. HOW DOES SEARCH WORK?
Search the index
SimpleSearch.search "ruby"
Results for token 'ruby':
* file1
* file2
* file3
35. HOW DOES SEARCH WORK?
The inverted index
TOKENS          POSTINGS
ruby         3  file_1.txt file_2.txt file_3.txt
pink         1  file_1.txt
gemstone        file_1.txt
dynamic         file_2.txt
reflective      file_2.txt
programming     file_2.txt
song            file_3.txt
english         file_3.txt
rock            file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
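Assuming the numbers shown beside `ruby` and `pink` count the documents that contain each token (they match the length of each posting list), they can be derived from the index itself. The `TOKEN_POSTINGS` and `DOCUMENT_COUNTS` names are illustrative only.

```ruby
# A few rows of the inverted index from the slide
TOKEN_POSTINGS = {
  'ruby'     => %w[file_1.txt file_2.txt file_3.txt],
  'pink'     => %w[file_1.txt],
  'gemstone' => %w[file_1.txt],
  'song'     => %w[file_3.txt]
}

# Number of documents per token: the size of its posting list
DOCUMENT_COUNTS = TOKEN_POSTINGS.map { |token, postings| [token, postings.size] }.to_h

DOCUMENT_COUNTS.each { |token, count| puts "#{token.ljust(10)} #{count}" }
```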
36. It is very practical to know how search works.
For instance, now you know that
the analysis step is very important.
It's more important than the “search” step.
ElasticSearch
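A small sketch of why analysis matters: the index only holds analyzed tokens, so a query that skips the same analysis finds nothing. The stopword list, `demo_analyze`, and the constants below are assumptions made for this example, not the code from the talk.

```ruby
# Tiny illustrative stopword list (an assumption, not the list from the talk)
DEMO_STOPWORDS = %w[a is the]

# Split into words, downcase, drop empty strings and stopwords
def demo_analyze(text)
  text.split(/\W+/).map(&:downcase).reject { |word| word.empty? || DEMO_STOPWORDS.include?(word) }
end

# Build a small inverted index from two documents
DEMO_INDEX = Hash.new { |hash, token| hash[token] = [] }
{ 'file1' => 'Ruby is a language.', 'file2' => 'Ruby is a song.' }.each do |id, text|
  demo_analyze(text).each { |token| DEMO_INDEX[token] |= [id] }
end

# Looking up the raw user input misses, because the stored token is "ruby"
RAW_HITS = DEMO_INDEX.fetch('Ruby', [])

# Running the query through the same analysis finds both documents
ANALYZED_HITS = demo_analyze('Ruby').flat_map { |token| DEMO_INDEX.fetch(token, []) }.uniq

p RAW_HITS
p ANALYZED_HITS
```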
38. HOW DOES SEARCH WORK?
The Search Engine Textbook
Search Engines
Information Retrieval in Practice
Bruce Croft, Donald Metzler and Trevor Strohman
Addison Wesley, 2009
http://search-engines-book.com
39. SEARCH IMPLEMENTATIONS
The Baseline Information Retrieval Implementation
Lucene in Action
Michael McCandless, Erik Hatcher and Otis Gospodnetic
July, 2010
http://manning.com/hatcher3
41. ElasticSearch is an open source, scalable,
distributed, cloud-ready, highly-available full-text
search engine and database with powerful
aggregation features, communicating by JSON
over RESTful HTTP, based on Apache Lucene.
44. ELASTICSEARCH FEATURES
HTTP / JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
# Add a document
# URL path = /INDEX/TYPE/ID; the -d body is the DOCUMENT
curl -X POST "http://localhost:9200/articles/article/1" -d '{ "title" : "One" }'
50. ELASTICSEARCH FEATURES
# Different type: "comment"
curl -X POST "http://localhost:9200/articles/comment" -d '
{
  "body" : "Wow! Really nice JSON support.",
  "published_on" : "2011/05/27 10:05:00",
  "author" : {
    "first_name" : "John",
    "last_name" : "Pear",
    "email" : "john@pear.org"
  }
}'
curl -X POST "http://localhost:9200/articles/_refresh"
curl -X GET "http://localhost:9200/articles/comment/_search?q=author.first_name:john"
51. ELASTICSEARCH FEATURES
# Search single type
curl -X GET "http://localhost:9200/articles/comment/_search?q=body:json"
# Search whole index
curl -X GET "http://localhost:9200/articles/_search?q=body:json"
# Search multiple indices
curl -X GET "http://localhost:9200/articles,users/_search?q=body:json"
# Search all indices
curl -X GET "http://localhost:9200/_search?q=body:json"
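The four URLs above differ only in their path segments, so they can be built by one small helper. This is a sketch: `search_url` and `SEARCH_BASE` are hypothetical names, and the index/type layout follows the ElasticSearch version shown on these slides. The query is form-encoded so that characters such as spaces stay valid in a URL.

```ruby
require 'uri'

SEARCH_BASE = 'http://localhost:9200'

# Build a _search URL for zero or more indices and an optional type
def search_url(query, indices: [], type: nil)
  segments = [indices.join(','), type, '_search'].compact.reject(&:empty?)
  "#{SEARCH_BASE}/#{segments.join('/')}?q=#{URI.encode_www_form_component(query)}"
end

puts search_url('body:json', indices: ['articles'], type: 'comment') # single type
puts search_url('body:json', indices: ['articles'])                  # whole index
puts search_url('body:json', indices: %w[articles users])            # multiple indices
puts search_url('body:json')                                         # all indices
```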
52. ELASTICSEARCH FEATURES
curl -X DELETE "http://localhost:9200/articles"; sleep 1
curl -X POST "http://localhost:9200/articles/article" -d '
{
  "id" : "abc123",
  "title" : "ElasticSearch Understands JSON!",
  "body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
  "published_on" : "2011/05/27 10:00:00",
  "tags" : ["search", "json"],
  "author" : {
    "first_name" : "Clara",
    "last_name" : "Rice",
    "email" : "clara@rice.org"
  }
}'
curl -X POST "http://localhost:9200/articles/_refresh"
curl -X GET "http://localhost:9200/articles/article/abc123"
53. ELASTICSEARCH FEATURES
{"_index":"articles","_type":"article","_id":"1","_version":1, "_source" :
{
"id" : "1",
"title" : "ElasticSearch Understands JSON!",
"body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s
"published_on" : "2011/05/27 10:00:00",
"tags" : ["search", "json"],
"author" : {
"first_name" : "Clara",
"last_name" : "Rice",
"email" : "clara@rice.org"
}
}}
“The Index Is Your Database”
55. ELASTICSEARCH FEATURES
The “Sliding Window” problem
curl -X DELETE http://localhost:9200/logs_2010_01
logs_2010_02
logs
logs_2010_03
logs_2010_04
“We can really store only three months worth of data.”
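One way to automate the sliding window, sketched under assumptions: indices are named `logs_YYYY_MM` as on the slide, one index per month, and anything older than the retention window is dropped with a single DELETE. The `expired_indices` method and its parameters are illustrative, not part of ElasticSearch.

```ruby
require 'date'

# Return the monthly index names that fall outside the retention window
def expired_indices(index_names, today:, keep_months: 3, prefix: 'logs_')
  # First day of the oldest month that is still kept
  cutoff = Date.new(today.year, today.month, 1) << (keep_months - 1)
  index_names.select do |name|
    year, month = name.sub(prefix, '').split('_').map(&:to_i)
    Date.new(year, month, 1) < cutoff
  end
end

LOG_INDICES = %w[logs_2010_01 logs_2010_02 logs_2010_03 logs_2010_04]

expired_indices(LOG_INDICES, today: Date.new(2010, 4, 15)).each do |name|
  puts "curl -X DELETE http://localhost:9200/#{name}"
end
```

Deleting a whole index is cheap compared with deleting individual documents, which is what makes one index per month attractive here.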
56. ELASTICSEARCH FEATURES
Index Templates
# "template": apply this configuration for every matching index being created
curl -X PUT localhost:9200/_template/bookmarks_template -d '
{
  "template" : "users_*",
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 3
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "url": {
          "type": "string", "analyzer": "url_ngram", "boost": 10
        },
        "title": {
          "type": "string", "analyzer": "snowball", "boost": 5
        }
        // ...
      }
    }
  }
}
'
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html
58. ELASTICSEARCH FEATURES
Index A is split into 3 shards, and duplicated in 2 replicas.
Shards   Replicas
A1       A1'  A1''
A2       A2'  A2''
A3       A3'  A3''
curl -XPUT 'http://localhost:9200/A/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}'
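The layout above can be worked out from the two settings: every shard exists once as a primary plus once per replica. A minimal sketch, where `shard_layout` is a made-up helper that only mirrors the naming used in the diagram (A1, A1', A1''):

```ruby
# One row per shard: the primary followed by its replicas
def shard_layout(name, number_of_shards:, number_of_replicas:)
  (1..number_of_shards).map do |shard|
    (0..number_of_replicas).map do |copy|
      primes = "'" * copy # zero primes marks the primary
      "#{name}#{shard}#{primes}"
    end
  end
end

SHARD_LAYOUT = shard_layout('A', number_of_shards: 3, number_of_replicas: 2)

SHARD_LAYOUT.each { |row| puts row.join(' ') }
puts "Total shard copies: #{SHARD_LAYOUT.flatten.size}" # 3 * (1 + 2)
```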
59. ELASTICSEARCH FEATURES
SHARDS: Improve indexing performance
REPLICAS: Improve search performance
60. ELASTICSEARCH FEATURES
Y U NO ASK FIRST???
61. ELASTICSEARCH FEATURES
Indexing 100 000 documents (~ 56MB), one shard, no replicas, MacBookAir SSD 2GB
# Index all at once
time curl -s -X POST "http://localhost:9200/_bulk" \
  --data-binary @data/bulk_all.json > /dev/null
real 2m1.142s
# Index in batches of 1000
for file in data/bulk_*.json; do
  time curl -s -X POST "http://localhost:9200/_bulk" \
    --data-binary @$file > /dev/null
done
real 1m36.697s (-25sec, 80%)
# Do not refresh during indexing in batches
"settings" : { "refresh_interval" : "-1" }
for file in data/bulk_*.json; do
...
real 0m38.859s (-82sec, 32%)
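The batch files used above could be produced along these lines. This is a sketch, assuming the newline-delimited bulk format (one action line followed by one document line) and the `articles`/`article` names from the earlier slides; `bulk_payload` and the generated documents are hypothetical.

```ruby
require 'json'

# Build one _bulk request body: an action line and a source line per document
def bulk_payload(docs, index:, type:)
  lines = docs.flat_map do |doc|
    [{ 'index' => { '_index' => index, '_type' => type } }.to_json, doc.to_json]
  end
  lines.join("\n") + "\n" # the body must end with a newline
end

# Made-up documents, split into batches of 1000
BULK_DOCS = (1..2500).map { |i| { 'title' => "Document #{i}" } }
BULK_BATCHES = BULK_DOCS.each_slice(1000).map do |batch|
  bulk_payload(batch, index: 'articles', type: 'article')
end

puts "#{BULK_BATCHES.size} batches"
puts BULK_BATCHES.first.lines.first(2)
```

Each element of `BULK_BATCHES` could be written to a `data/bulk_*.json` file and posted with `--data-binary`, as in the loop above.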
62. ELASTICSEARCH FEATURES
$ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
Terms      apple
           apple iphone
Phrases    "apple iphone"
Proximity  "apple safari"~5
Fuzzy      apple~0.8
Wildcards  app*
           *pp*
Boosting   apple^10 safari
Range      [2011/05/01 TO 2011/05/31]
           [java TO json]
Boolean    apple AND NOT iphone
           +apple -iphone
           (apple OR iphone) AND NOT review
Fields     title:iphone^15 OR body:iphone
           published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
http://lucene.apache.org/java/3_1_0/queryparsersyntax.html
71. ELASTICSEARCH FEATURES
Tire.index 'articles' do
  delete
  create
  store :title => 'One', :tags => ['ruby'], :published_on => '2011-01-01'
  store :title => 'Two', :tags => ['ruby', 'python'], :published_on => '2011-01-02'
  store :title => 'Three', :tags => ['java'], :published_on => '2011-01-02'
  store :title => 'Four', :tags => ['ruby', 'php'], :published_on => '2011-01-03'
  refresh
end
s = Tire.search 'articles' do
  query { string 'title:T*' }
  filter :terms, :tags => ['ruby']
  sort { title 'desc' }
  # Parentheses are needed so the block binds to facet, not to the string
  facet('global-tags') { terms :tags, :global => true }
  facet('current-tags') { terms :tags }
end
http://github.com/karmi/tire
72. ELASTICSEARCH FEATURES
class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort { published_on 'desc' }
end
http://github.com/karmi/tire
73. ELASTICSEARCH FEATURES
class Article
  include Whatever::ORM
  include Tire::Model::Search
  include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort { published_on 'desc' }
end
http://github.com/karmi/tire
74. Try ElasticSearch in a Ruby on Rails application with a one-line command
$ rails new tired -m "https://gist.github.com/raw/951343/tired.rb"
A “batteries included” installation.
Downloads and launches ElasticSearch.
Sets up a Rails application and launches it.
When you're tired of it, just delete the folder.