Search with Apache Lucene & Co
Christian Meder
Bernhard Pflugfelder
inovex GmbH
Background
‣  open source (free software)
‣  Linux
‣  Web
‣  Java
‣  Android
‣  CTO@inovex
‣  Christian Meder
Speaker: Christian Meder
2
Background
‣  Lucene
‣  Solr
‣  Text Mining Technologies,
Information Retrieval
‣  Hadoop
‣  Java
‣  Big Data Engineer@inovex
‣  bpflugfelder@inovex.de
Speaker: Bernhard Pflugfelder
3
‣  09:00 - 09:30
Introduction, Search in a nutshell
‣  09:30 - 10:00
Solr Exercise 1: Installation, Web Admin Interface
‣  10:00 - 10:30
Solr Exercise 2: Indexing, Queries I
‣  10:30 - 11:00
Coffee Break
‣  11:30 - 12:00
Solr Exercise 3: Data ingestion XML / SQL, Queries II
Agenda: Session I
4
‣  12:00 - 12:30
Solr Exercise 4: Schema, Data types, Analyzers, Stemming
‣  12:30 – 13:30
Lunch
‣  13:30 - 14:00
Solr Exercise 5: Facet search, Filter search, Interval search
‣  14:00 - 14:30
Solr Exercise 6: Dismax, Autosuggestion, MoreLikeThis
Agenda: Session II
5
‣  14:30 - 15:00
ES Exercise 1: Installation, Indexing, Queries I
‣  15:00 - 15:30
Coffee Break
‣  15:30 - 16:00
ES Exercise 2: Schema, Data types, Analyzers, Queries II
‣  16:00 - 16:30
ES Exercise 3: Data ingestion SQL / XML
‣  16:30 - 17:00
ES Exercise 4: Facet search, Filter search, Interval search
Agenda: Session III
6
Introduction: Search tag cloud
7
‣  Classical search applications are applications focusing on
information or document retrieval
‣  Requirement: find information the user asks for!
‣  Some examples:
‣  Web search
‣  Enterprise search
‣  Document search (within DMS or CMS)
‣  Search on portals and archives
‣  Product search
‣  Specialized searches for people, companies, etc.
Classical search
applications
Introduction
8
Introduction: Where search is in, Enterprise Search
9
Introduction: Where search is in, Online shops
10
Introduction: Where search is in, Semantic search @ Google
11
Introduction: Where search is in
12
Navigation &
Information access
Data Analysis Search-based
applications
Introduction
13
http://datarpm.com/product
‣  Can you think of other scenarios where search applications
will also do a good job?
‣  Recall the key capabilities of search technologies:
‣  Persistency
‣  Flexible data model
‣  Unstructured data, but not only
‣  Extremely quick access to data
‣  Horizontal scalability
There are plenty of application scenarios out there where
search technologies should be considered!
Introduction: NoSQL Database
14
Document store
Hot open source
search technologies
Projects
15
http://lucene.apache.org
http://lucene.apache.org/solr/
http://www.elasticsearch.org
Lucene is an open source, pure Java API
for enabling information retrieval
‣  Originally developed by Doug Cutting in 1999; joined Apache (Jakarta) in 2001 and became an Apache TLP in 2005
‣  Licensed by Apache License 2.0
‣  Pure Java library with ports to other platforms:
‣  Lucene.NET (http://lucenenet.apache.org)
‣  PyLucene (http://lucene.apache.org/pylucene/)
‣  and more:
http://wiki.apache.org/lucene-java/LuceneImplementations
‣  Large and very active developer community, well documented and supported (38
active committers!)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects:
http://wiki.apache.org/lucene-java/PoweredBy
Projects
16
Overview
http://lucene.apache.org/
‣  Scalable, High-Performance Indexing
‣  over 95GB/hour on modern hardware
‣  small RAM requirements
‣  incremental indexing as fast as batch indexing
‣  index size roughly 20-30% the size of text indexed
‣  Powerful, Accurate and Efficient Search Algorithms
‣  ranked searching -- best results returned first
‣  many powerful query types
‣  fielded searching (e.g., title, author, contents)
‣  date-range searching
‣  sorting by any field
‣  multiple-index searching with merged results
‣  allows simultaneous update and searching
[From http://lucene.apache.org/core/features.html]
Projects
17
Highlights
http://lucene.apache.org/
Solr is a standalone enterprise search server & document
store based on Lucene
‣  Created by Yonik Seeley at CNET Networks in 2004
‣  Entered the Apache Incubator in 2006, graduated in 2007 as a Lucene subproject
‣  Licensed by Apache License 2.0
‣  Seeley and others founded Lucid Imagination -> LucidWorks
‣  Large and very active developer community, well documented and supported
(strong relationship to Lucene community also)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects:
http://wiki.apache.org/solr/PublicServers
Projects: Overview
18
http://lucene.apache.org/solr/
‣  Architectural highlights
‣  Extensible Plugin Architecture
‣  SolrCloud – distributed indexing and search architecture
‣  Efficient Replication to other Solr Search Servers
‣  Configurable Query Result, Filter, and Document cache instances
‣  Access & Monitoring
‣  Standards Based Open Interfaces
‣  XML,JSON and HTTP
‣  REST-like API
‣  Comprehensive HTML Administration Interfaces
‣  Server statistics exposed over JMX for monitoring
Projects: Highlights
19
http://lucene.apache.org/solr/
‣  Data model
‣  Lucene’s document oriented index data structure
‣  Schema for field types and fields of documents
‣  Analysis & Indexing highlights
‣  Out-of-box support for JSON, XML, CSV/delimited-text, DBMS
‣  Support of PDF, DOC, XLS, PPT, HTML
‣  Declarative Lucene Analyzer specification
‣  Many additional text analysis components including word splitting, regex and
sounds-like filters
‣  External file-based configuration of stopword lists, synonym lists, and
protected word lists
Projects: Highlights
20
http://lucene.apache.org/solr/
Open source search technologies
‣  Search highlights
‣  Facet search and filtering (values, queries, date/time ranges)
‣  Geospatial search (e.g. local search)
‣  Configurable caching
‣  Sorting (number of fields, complex functions of numeric fields)
‣  Autocomplete
‣  Highlighted context snippets
‣  Spelling suggestions for user queries
‣  More Like This suggestions for given document
‣  Function Query
‣  Advanced query parser for high relevancy results from user-entered queries
Projects: Highlights
21
http://lucene.apache.org/solr/
‣  Solr clients in various languages are freely available:
‣  Java, Scala, Ruby, Python, .NET, Javascript (AJAX), …
‣  http://wiki.apache.org/solr/IntegratingSolr
‣  Very helpful tools:
‣  Grep (log file analysis)
‣  Luke (index analysis)
‣  Solrmeter (performance analysis)
‣  Scalable Performance Monitoring for Solr (Monitoring)
Projects: Clients & Tools
22
http://lucene.apache.org/solr/
Documentation                            URL
Getting started                          http://lucene.apache.org/solr/4_0_0/tutorial.html
Release documentation                    http://lucene.apache.org/solr/4_0_0/
Javadocs                                 http://lucene.apache.org/solr/4_0_0/solr-core/index.html
Solr Wiki                                http://wiki.apache.org/solr/
Mailing lists                            http://lucene.apache.org/solr/discussion.html
Apache Solr 3 Enterprise Search Server   http://link.packtpub.com/2LjDxE
Apache Solr 3.1 Cookbook                 http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book
LucidWorks Technical Support             http://support.lucidworks.com/home
Projects: Documentation
23
http://lucene.apache.org/solr/
+  Solr is a mature technology widely used in commercial applications
‣  Easy integration in third-party application
‣  Big community, good documentation, good support
‣  You have a Solr problem - most likely someone else had it already
‣  Very helpful tools for analysis and monitoring
+  Solr provides a large bundle of features:
‣  Lots of analyzers and specific query types
‣  Individual relevance boosting
‣  Admin interface
-  Because Solr can do so much, it is a heavyweight technology:
‣  much to configure
‣  most of the configuration is static (no API access)
‣  includes redundant functionality (e.g. similar request handlers)
Projects: Pros & Cons
24
http://lucene.apache.org/solr/
Projects: Search Architecture
25
‣  Installation
‣  Administration
‣  Solr Web Admin Interface
Solr Exercise I
26
‣  Solr is a pure Java application
‣  Solr is built upon:
‣  Lucene
‣  Zookeeper
‣  Guava-libraries
‣  HttpComponents, SLF4J, Various Commons libraries
‣  Solr source code available at:
‣  http://svn.apache.org/viewcvs.cgi/lucene/dev/ (Web access)
‣  http://svn.apache.org/repos/asf/lucene/dev/ (anonymous access)
‣  Solr needs a servlet container such as Jetty, Tomcat or Glassfish to run
‣  Embedded Jetty for easily playing and testing Solr
Solr Exercise I
27
Overview
http://lucene.apache.org/solr/
Run Solr on embedded Jetty:
1.  Unpack the Solr distribution to your desired location (= SOLR_MAIN)
2.  Change to directory SOLR_MAIN/example
3.  Start the example Solr instance: java -jar start.jar
To verify the installation open your browser and go to the Solr Admin page
http://localhost:8983/solr
Solr Exercise I
28
Installation
http://lucene.apache.org/solr/
‣  Solr Core (aka Core)
‣  basically an isolated running instance of a Solr index
‣  each Core has its own solrconfig.xml, schema.xml and index data
‣  search results can not be computed over Cores
‣  Solr Collection (aka Collection)
‣  Logical index distributed over multiple machines
‣  Physical partitioning using sharding
‣  Part of SolrCloud (Scalability, High Availability)
Solr Exercise I
29
Core vs. Collection
http://lucene.apache.org/solr/
Solr Home Directory as recommended:
‣  solr.xml
‣  primary configuration file Solr looks for when starting
‣  this file specifies the list of SolrCores it should load
‣  Solr Core Instance Directories
‣  contains configuration and data of a SolrCore
‣  lib/
‣  shared lib directory for solr instance
‣  zoo.cfg
‣  Zookeeper configuration when using SolrCloud
‣  How to tell Solr where SOLR_HOME is located?
‣  Use the Java system property: solr.solr.home
‣  e.g. java -Dsolr.solr.home=/some/dir -jar start.jar
Solr Exercise I
30
Solr Home
http://lucene.apache.org/solr/
Solr Core Instance Directory as recommended:
‣  conf/
‣  This directory is mandatory and must contain your solrconfig.xml and
schema.xml.
‣  Any other optional configuration files would also be kept here.
‣  data/
‣  This directory is the default location where Solr will keep your index, and is
used by the replication scripts for dealing with snapshots.
‣  You can override this location in the conf/solrconfig.xml.
‣  lib/
‣  This directory is optional. If it exists, Solr will load any Jars found in this
directory and use them to resolve any "plugins” specified in your
solrconfig.xml or schema.xml (ie: Analyzers, Request Handlers, etc...).
Solr Exercise I
31
Instance Directory
http://lucene.apache.org/solr/
Solr includes an Admin Web interface providing you with
‣  General configuration details
‣  Core-specific configuration details
‣  Log information
‣  Run queries
‣  Document field / Term statistics
‣  Document fields
‣  Cache statistics
‣  Server cluster information
Access it via http://localhost:8983/solr
Solr Exercise I
32
Admin Web interface
http://lucene.apache.org/solr/
‣  Indexing the first XML data
‣  Try first simple queries
‣  Different query types
‣  Get result score
‣  Highlighting
Solr Exercise II
33
Solr Exercise II: Search Basics
34
[Figure: documents are indexed into a token representation; the query is analyzed into a token representation; query evaluation matches the two.]
Index-based search
‣  An inverted index is an index data
structure that
‣  stores mappings from tokens to
their locations (e.g. documents)
‣  allows fast access to those
documents that contain specific
tokens
‣  The purpose of an inverted index
is to allow fast full-text searches
Solr Exercise II: Search Basics
35
Inverted index
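The token-to-location mapping described above can be sketched in a few lines of Python. This is only an illustration, not Lucene's actual data structure: the function name build_inverted_index, the whitespace tokenization and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based word offsets within the document.
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return index

# Hypothetical sample documents
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}
index = build_inverted_index(docs)
print(index["football"])
```

Looking up a token is then a single dictionary access, which is why full-text search over an inverted index is fast.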
Solr Exercise II
36
[Figure: data model. An index contains documents; a document contains fields; a field has a name and a value.]
Search Basics: Data model
Solr Exercise II
37
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting table:
Posting id  Word      Doc    Offset
1           football  Doc 1  3
                      Doc 1  67
                      Doc 2  1
2           penn      Doc 1  1
3           players   Doc 2  2
4           state     Doc 1  2
                      Doc 2  13
Search Basics: Data model
‣  How to select important
terms?
‣  Simple method: using
middle-frequency words
Solr Exercise II
38
[Figure: frequency and informativity plotted against term rank (1, 2, 3, …), each on a scale from Min. to Max.]
Search Basics: Term selection
‣  tf = term frequency
‣  frequency of a term/keyword in a document
‣  The higher the tf, the higher the importance (weight) for the doc.
‣  df = document frequency
‣  no. of documents containing the term
‣  distribution of the term
‣  idf = inverse document frequency
‣  the unevenness of term distribution in the corpus
‣  the specificity of term to a document
‣  The more the term is distributed evenly, the less it is specific to a document
weight(t,D) = tf(t,D) * idf(t)
Solr Exercise II
39
Search Basics Term selection
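A minimal sketch of the weight formula above. It assumes idf(t) = log(N / df(t)), which is one common variant and not necessarily the exact formula Lucene uses; the function name and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return {doc_id: {term: tf * idf}} with idf = log(N / df) (assumed variant)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # df: number of documents containing the term
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # tf: occurrences of the term in this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tf_idf_weights({"d1": "football football penn", "d2": "football players"})
print(weights["d1"])
```

A term that occurs in every document ("football" here) gets weight 0, however often it occurs: it is evenly distributed and therefore not specific to any document.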
‣  1-word query:
The documents to be retrieved are those that include the word
‣  Retrieve the inverted list for the word
‣  Sort in decreasing order of the weight of the word
‣  Multi-word query?
-  Combining several lists
-  How to combine matches of these different lists?
-  How to interpret the weight? (IR model)
Solr Exercise II
40
Search Basics Querying
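One simple answer to the multi-word question is sketched below: require every query term (AND), intersect the posting lists and sum the per-term weights. This is an illustrative assumption, not how Lucene scores; the index layout (term to {doc_id: weight}) and the function name are hypothetical.

```python
def search_all_terms(index, terms):
    """Return ids of docs containing every term, highest summed weight first."""
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Documents that appear in every posting list
    common = set(postings[0]).intersection(*postings[1:])
    scored = [(sum(p[doc] for p in postings), doc) for doc in common]
    # Sort by descending score, then by doc id for a stable order
    return [doc for score, doc in sorted(scored, key=lambda x: (-x[0], x[1]))]

# Hypothetical weighted index
index = {
    "goethe": {"d1": 0.5, "d2": 0.2, "d3": 0.9},
    "schiller": {"d2": 0.7, "d3": 0.1},
}
print(search_all_terms(index, ["goethe", "schiller"]))
```

Other combinations (OR, weighting by query term, taking the maximum instead of the sum) correspond to different retrieval models.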
‣  Vector space = all the terms encountered
<t1, t2, t3, …, tn>
‣  Document
D = < a1, a2, a3, …, an>
ai = weight of ti in D
‣  Query
Q = < b1, b2, b3, …, bn>
bi = weight of ti in Q
‣  R(D,Q) = Sim(D,Q)
‣  Cosine Similarity (TF*IDF)
‣  Okapi BM25
Search Basics: Vector-space model
41
[Figure: document vector D and query vector Q drawn in a term space with axes t1 and t2.]
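The cosine similarity between a document vector D and a query vector Q can be sketched as follows. The vectors are plain lists of term weights in the same term order; the function name is chosen for the example.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)

print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Vectors pointing in the same direction score 1, vectors without any common term score 0.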
‣  The Solr UpdateRequestHandler defines the logic for handling index update
actions based on a specific data source or data format
‣  UpdateRequestHandlers must be defined in solrconfig.xml and are mapped
to a specific URL path in order to be accessible via HTTP
‣  Solr supports several file types out of the box through specific update
handlers:
‣  Standard UpdateRequestHandler
‣  supporting XML, XSLT, JSON, CSV and javabin
‣  DataImportHandler
‣  Indexing events: Add/Replace, Commit, Soft Commit, Delete
Solr Indexing Update Request
handlers
Solr Exercise II
42
<requestHandler name="/update" class="solr.UpdateRequestHandler"/>
Solr Exercise II: Solr Indexing, XML Add
43
curl http://localhost:8983/solr/jax2013/update \
  -H 'Content-Type: text/xml' --data-binary \
'<add>
  <doc>
    <field name="id">etext78942</field>
    <field name="title">Solr textbook</field>
    <field name="subject">search technology</field>
    <field name="author">Bernhard Pflugfelder</field>
  </doc>
  <!-- further <doc> ... </doc> elements may follow -->
</add>'
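When documents come from a program rather than a hand-written file, the add message can be generated. The sketch below builds the XML payload with Python's standard library, which also takes care of escaping; the function name build_add_message is an assumption, and sending the payload to the update URL is left out.

```python
import xml.etree.ElementTree as ET

def build_add_message(docs):
    """Build a Solr <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # special characters are escaped on output
    return ET.tostring(add, encoding="unicode")

payload = build_add_message([{"id": "etext78942", "title": "Solr textbook"}])
print(payload)
```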
Solr Exercise II: Solr Indexing, XML Update
44
curl http://localhost:8983/solr/jax2013/update \
  -H 'Content-Type: text/xml' --data-binary \
'<add>
  <doc>
    <field name="id">etext78942</field>
    <field name="author" update="set">Christian Meder</field>
    <field name="subject" update="add">open source</field>
  </doc>
</add>'
Solr Exercise II: Solr Indexing, XML Delete
45
curl http://localhost:8983/solr/jax2013/update \
  -H 'Content-Type: text/xml' --data-binary \
'<delete>
  <id>etext78942</id>
  <query>author:meder</query>
</delete>'
Solr Exercise II: Solr Indexing, XML Commit
46
curl http://localhost:8983/solr/jax2013/update \
  -H 'Content-Type: text/xml' --data-binary \
  '<commit waitSearcher="false"/>'
curl 'http://localhost:8983/solr/jax2013/update?optimize=true&waitFlush=false'
‣  Multiple index actions in one JSON
Solr Indexing JSON Add / Delete /
Commit
Solr Exercise II
47
curl http://localhost:8983/solr/jax2013/update/json \
  -H 'Content-type:application/json' -d '
{
  "add": {
    "commitWithin": 5000,
    "doc": {
      "f1": "v1",
      "f1": "v2"
    }
  },
  "commit": {},
  "delete": { "id":"ID" },
  "delete": { "query":"QUERY" },
  "delete": { "query":"QUERY", "commitWithin": 500 }
}'
‣  Commands add, set and inc
Solr Exercise II: Solr Indexing, JSON Atomic updates
48
curl http://localhost:8983/solr/jax2013/update/json \
  -H 'Content-type:application/json' -d '
[
  {
    "id" : "etext78942",
    "title" : {"set":"solr 4.2.1 textbook"},
    "viewcount" : {"inc":3},
    "author" : {"add":"Bernhard Pflugfelder"}
  }
]'
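The atomic update document above is easy to generate. The following sketch builds the JSON body for the add, set and inc commands; the helper name atomic_update and its keyword-argument convention are assumptions made for the example, and posting the body to /update/json is not shown.

```python
import json

ALLOWED_OPS = {"add", "set", "inc"}

def atomic_update(doc_id, **modifiers):
    """Build an atomic update body; each modifier is field=(operation, value)."""
    doc = {"id": doc_id}
    for field, (op, value) in modifiers.items():
        if op not in ALLOWED_OPS:
            raise ValueError("unsupported operation: " + op)
        doc[field] = {op: value}
    return json.dumps([doc])

body = atomic_update("etext78942", viewcount=("inc", 3), author=("add", "Bernhard Pflugfelder"))
print(body)
```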
Solr Exercise II: Solr Indexing, Try out
49
cd SOLR_MAIN/example/exampledocs
curl 'http://localhost:8983/solr/collection1/update/json?commit=true' \
  --data-binary @books.json \
  -H 'Content-type:application/json'
cd SOLR_MAIN/example/exampledocs
java -jar post.jar -h
java -jar post.jar *.xml
‣  q=+content:goethe +content:schiller
‣  q=+content:goethe -content:schiller
‣  q=title:faust
‣  q=title:faust AND -content:goethe
‣  q=content:"romeo and juliet"
‣  q=title:water*
‣  q=title:water~0.5
‣  q=created:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
‣  q=viewcount:[20 TO 50]
‣  q=viewcount:[100 TO *]
Solr Exercise II: Solr Queries
50
curl -XPOST 'http://localhost:8983/solr/jax2013/select' -d
Query Syntax
Solr Queries Common
parameters
Solr Exercise II
51
Param name   Param value       Description
q            string            The user query string
start        number            Offset in the list of returned documents
rows         number            Number of documents returned
fq           string            A filter query
fl           string,string,…   Fields returned for each document
debugQuery   true / false      Include debug info in the response
curl -XPOST 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q=+solr -elasticsearch' -d 'start=20&rows=40&fl=*,score'
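Characters such as "+" and spaces have a special meaning in URLs and form data, so the parameters should be URL-encoded before they are sent. A minimal sketch with Python's standard library; the parameter values are taken from the example above, with rows and fl=*,score assumed as the intended spelling.

```python
from urllib.parse import urlencode

params = {
    "q": "+solr -elasticsearch",  # "+" would otherwise be decoded as a space
    "start": 20,
    "rows": 40,
    "fl": "*,score",
}
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to the select URL after a "?" or sent as the POST body.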
Solr Exercise II: Highlighting, Overview
52
Param name                        Param value       Description
hl                                true / false      Switch on / off highlighting
hl.q                              string            Alternative highlighting query
hl.fl                             string,string,…   Fields used for highlighting
hl.snippets                       number            Maximum number of snippets
hl.fragsize                       number            Number of characters per snippet
hl.simple.pre / hl.simple.post    string            Text inserted before / after a match
curl -XPOST 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q=+solr -elasticsearch' -d 'start=20&rows=40&fl=*,score' \
  -d 'hl=true&hl.fl=title,abstract'
‣  DataImportHandler SQL
‣  DataImportHandler XML
Solr Exercise III
53
‣  DataImportHandler makes it possible to:
‣  index data from relational databases
‣  compose documents from multiple columns and tables
‣  run bulk imports or incremental updates using the delta query mechanism
‣  schedule full imports and delta imports
‣  index data from XML/HTML using XPath expressions
‣  DataImportHandler is part of Solr Contrib
‣  It is defined in solrconfig.xml
Solr Exercise III: DataImportHandler, Overview
54
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/home/username/data-config.xml</str>
</lst>
</requestHandler>
‣  http://localhost:8983/solr/dataimport?command=full-import
‣  http://localhost:8983/solr/dataimport?command=delta-import
‣  http://localhost:8983/solr/dataimport?command=status
‣  http://localhost:8983/solr/dataimport?command=reload-config
‣  http://localhost:8983/solr/dataimport?command=abort
Solr Exercise III: DataImportHandler, Commands
55
‣  The data-config.xml defines the data source and which data is used to
populate Solr documents during import
‣  It defines the tags:
‣  dataSource
‣  document
‣  entity
‣  The entity defines a specific data selection resulting in a Solr document
‣  The query gives the data needed to populate fields of the Solr document
Solr Exercise III: DataImportHandler, Configuration
56
<dataConfig>
  <dataSource … />
  <document name="products">
    <entity name="item" query="select * from item" />
  </document>
</dataConfig>
‣  MySQL
‣  Oracle
‣  Use multiple data sources within one DIH config by giving each a name
‣  Each entity definition must then reference its data source by that name
Solr Exercise III: DataImportHandler, DataSource
57
<dataSource name="jdbc" driver="com.mysql.jdbc.Driver"
  url="jdbc:mysql://localhost/dbname"
  user="db_username" password="db_password"/>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
  url="jdbc:oracle:thin:@//hostname:port/SID"
  user="db_username" password="db_password"/>
Solr Exercise III: DataImportHandler, SQL full-import
58
<dataConfig>
<dataSource … />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<field column="MANU" name="manu" />
<field column="WEIGHT" name="weight" />
<field column="PRICE" name="price" />
<field column="POPULARITY" name="popularity" />
<field column="INSTOCK" name="inStock" />
<field column="INCLUDES" name="includes" />
</entity>
</document>
</dataConfig>
Solr Exercise III: DataImportHandler, SQL full-import
59
<dataConfig>
<dataSource … />
<document>
<entity name="item" query="select * from item">
<entity name="feature" query="select description as
features from feature where item_id='${item.ID}'"/>
<entity name="item_category" query="select CATEGORY_ID
from item_category where item_id='${item.ID}'">
<entity name="category" query="select description as cat
from category where id = '${item_category.CATEGORY_ID}'"/>
</entity>
</entity>
</document>
</dataConfig>
‣  Incremental update of specific content of a relational database
‣  Avoids indexing already indexed data again
‣  http://localhost:8983/solr/dataimport?command=delta-import
‣  Provide three specific queries for each entity except root:
‣  The deltaImportQuery gives the data needed to populate fields when
running a delta-import
‣  The deltaQuery gives the primary keys of the current entity which have
changed since the last index time
‣  The parentDeltaQuery uses the changed rows of the current table
(fetched with deltaQuery) to give the changed rows in the parent table.
This is necessary because whenever a row in the child table changes, we
need to re-generate the document which has that field.
Solr Exercise III: DataImportHandler, SQL Delta-Import
60
Solr Exercise III: DataImportHandler, SQL Delta-Import
61
<entity name="item" pk="ID"
    query="select * from item"
    deltaImportQuery="select * from item where ID='${dih.delta.id}'"
    deltaQuery="select id from item where last_modified &gt;
        '${dih.last_index_time}'">
  <entity name="feature" pk="ITEM_ID" query="select description as
      features from feature where item_id='${item.ID}'" />
  <entity name="item_category" pk="ITEM_ID, CATEGORY_ID" query="select
      CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
    <entity name="category" pk="ID" query="select description as cat
        from category where id = '${item_category.CATEGORY_ID}'" />
  </entity>
</entity>
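The interplay of deltaQuery and deltaImportQuery can be tried out without Solr. The sketch below imitates the two steps against an in-memory SQLite database; the table layout, the sample rows and the timestamp are made up for the example, and the real DataImportHandler does more (for example tracking the last index time itself).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    (1, "old item", "2013-04-01 10:00:00"),
    (2, "changed item", "2013-04-20 09:30:00"),
])
last_index_time = "2013-04-10 00:00:00"  # hypothetical time of the previous import

# Step 1 (deltaQuery): primary keys of rows changed since the last import
changed_ids = [row[0] for row in conn.execute(
    "SELECT id FROM item WHERE last_modified > ?", (last_index_time,))]

# Step 2 (deltaImportQuery): fetch the full row for each changed key
rows = [conn.execute("SELECT id, name FROM item WHERE id = ?", (item_id,)).fetchone()
        for item_id in changed_ids]
print(rows)
```

Only the changed row is fetched and would be re-indexed; the unchanged row is skipped.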
‣  HTTP source
‣  XML File source
Solr Exercise III: DataImportHandler, Other DataSources
62
<dataConfig>
<dataSource type="HttpDataSource" />
…
</dataConfig>
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8"/>
…
</dataConfig>
‣  The entity defines the location of the XML file
‣  Solr document fields are populated by evaluating XPath expressions
Solr Exercise III: DataImportHandler, XML full-import
63
<entity name="page"
    processor="XPathEntityProcessor"
    stream="true"
    forEach="/RDF/etext/"
    url="../../catalog.rdf.xml"
    transformer="RegexTransformer,DateFormatTransformer">
  <field column="id" xpath="/RDF/etext/@id" />
  <field column="title" xpath="/RDF/etext/title" />
  <field column="alternative" xpath="/RDF/etext/alternative" />
  <field column="author" xpath="/RDF/etext/creator" />
  <field column="multi_author" xpath="/RDF/etext/creator/Bag/li" />
  <field column="subject" xpath="//LCSH/value" />
  <field column="viewcount"
      xpath="/RDF/etext/downloads/nonNegativeInteger/value" />
  <field column="created"
      xpath="/RDF/etext/created/W3CDTF/value" dateTimeFormat="yyyy-MM-dd" />
</entity>
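What the XPath-based field population does can be illustrated with Python's ElementTree, which supports a subset of XPath. The sample XML is a simplified, namespace-free stand-in for the catalog file and is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample of the catalog structure
xml_data = """
<RDF>
  <etext id="etext78942">
    <title>Faust</title>
    <creator>Goethe</creator>
  </etext>
  <etext id="etext100">
    <title>Wilhelm Tell</title>
    <creator>Schiller</creator>
  </etext>
</RDF>
"""

root = ET.fromstring(xml_data)
docs = []
for etext in root.findall("./etext"):  # corresponds to forEach="/RDF/etext"
    docs.append({
        "id": etext.get("id"),                # /RDF/etext/@id
        "title": etext.findtext("title"),     # /RDF/etext/title
        "author": etext.findtext("creator"),  # /RDF/etext/creator
    })
print(docs)
```

Each matched etext element yields one document, just as each forEach match yields one Solr document.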
‣  Schema,
‣  Data types
‣  Analyzers, Tokenizers
Solr Exercise IV
64
‣  Defines document representation by specifying fields
‣  with a specific field type
‣  with specific field type properties
‣  Dynamic fields
‣  CopyField
‣  Define analyzers:
‣  Tokenizers
‣  Filters
‣  Synonym lists, stop word lists
‣  additional text analysis
‣  Assign analyzers to the Text-based data types (solr.TextField)
‣  Example schema.xml
Solr Exercise IV: Schema
65
Overview
‣  Field types
‣  int, long, float, double, boolean
‣  string, date, binary
‣  derived from solr.TextField
‣  text_general, text_de, text_en, …
‣  Field type properties
‣  indexed (true / false)
‣  stored (true / false)
‣  multiValued (true / false)
‣  termVectors (true / false)
Solr Exercise IV: Schema, Fields
66
Break stream of characters into
tokens / terms
‣  Normalization (e.g. case)
‣  Stopwords
‣  Stemming
‣  Lemmatizer / Decomposer
‣  Part of Speech Tagger
‣  Information Extraction
Solr Exercise IV: Analyzing / Tokenization, Overview
67
‣  Function words carry no useful information for searching:
of, in, about, with, I, although, …
‣  Stopword list: contains the stopwords, which are not indexed
‣  Prepositions
‣  Articles
‣  Pronouns
‣  Some adverbs and adjectives
‣  Some frequent words (e.g. document)
‣  The removal of stopwords usually improves search quality
‣  Solr provides default stopword lists for various languages
Solr Exercise IV: Analyzing / Tokenization, Stopwords
68
‣  Apply strict algorithmic normalization of inflected forms (e.g. Porter)
‣  Strategy: removing some endings of words.
Example:
computer, compute, computes, computing, computed, computation are all
normalized to comput
‣  But: going -> go, king -> k (over-stemming)
‣  Stemming might work well for English
‣  However, be careful when using stemming, especially for German
Solr Exercise IV: Analyzing / Tokenization, Stemming
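The pitfall is easy to reproduce with a deliberately naive suffix stripper. This is not the Porter algorithm, which applies additional conditions before removing a suffix; the suffix list and the function name are assumptions chosen to mirror the examples above.

```python
def naive_stem(word, suffixes=("ation", "ing", "ed", "es", "er", "e", "s")):
    """Strip the first matching suffix, without any further checks."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for word in ["computer", "computing", "computation", "going", "king"]:
    print(word, "->", naive_stem(word))
```

Real stemmers guard against cases like "king" by requiring a minimum stem length or a vowel in the remaining stem.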
69
Solr Exercise IV: Analyzing / Tokenization, Define an analyzer
70
<fieldType name="<name>" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <!-- tokenizer and filters for indexing -->
    <tokenizer class="CLASS" PARAMS />
    <filter class="CLASS" PARAMS />
  </analyzer>
  <analyzer type="query">
    <!-- tokenizer and filters for search -->
    <tokenizer class="CLASS" PARAMS />
    <filter class="CLASS" PARAMS />
  </analyzer>
</fieldType>
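An analyzer is a tokenizer followed by a chain of filters. The sketch below mimics that pipeline in Python with a whitespace tokenizer, a lowercase filter and a stop filter; the function names and the tiny stopword set are assumptions for illustration, not Solr classes.

```python
STOPWORDS = {"of", "in", "about", "with", "the", "a"}  # tiny sample list

def whitespace_tokenizer(text):
    """Break the character stream into tokens at whitespace."""
    return text.split()

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stop_filter(tokens):
    return [token for token in tokens if token not in STOPWORDS]

def analyze(text, filters=(lowercase_filter, stop_filter)):
    """Run the tokenizer, then each filter in the configured order."""
    tokens = whitespace_tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The History of Search in Java"))
```

The order of the filters matters: the stop filter only matches "The" because the lowercase filter ran first.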
‣  TokenizerFactories
‣  solr.StandardTokenizerFactory
‣  solr.WhitespaceTokenizerFactory
‣  solr.KeywordTokenizerFactory
‣  TokenFilterFactories
‣  solr.LowerCaseFilterFactory
‣  solr.TrimFilterFactory
‣  solr.StopFilterFactory
‣  solr.WordDelimiterFilterFactory
‣  solr.SynonymFilterFactory
‣  solr.EdgeNGramFilterFactory
Solr Exercise IV: Analyzing / Tokenization, Tokenizers & Filters
71
‣  English
‣  solr.PorterStemFilterFactory
‣  solr.SnowballPorterFilterFactory
‣  solr.EnglishMinimalStemFilterFactory
‣  German
‣  solr.SnowballPorterFilterFactory
‣  solr.GermanLightStemFilterFactory
‣  solr.GermanMinimalStemFilterFactory
‣  More information at http://wiki.apache.org/solr/LanguageAnalysis
Solr Exercise IV: Analyzing / Tokenization, Language analysis
72
‣  Faceted search
‣  Filter query
‣  MoreLikeThis query
Solr Exercise V
73
Solr Exercise V: Faceted search, Overview
74
‣  "The statement of one participant in a usability test of a faceted search
solution, conducted as part of this study, points the way:
‣  'With this filter I get the feeling that even a plain old search can be
real fun.'"
‣  Source: Faceted Search: Die neue Suche im Usability-Test (free
download at http://usability.de)
Solr Exercise V: Faceted search, Motivation
75
‣  Faceted search (aka faceted
navigation) organizes search results
based on different categories or
dimensions giving the user the
possibility to drill down the search
results
‣  Facets can be authors, titles, tags,
dates, languages, file types …
‣  Typically, meta data describing
concepts and meaning of documents
are useful as facets
‣  Facets can be shown with counts
Solr Exercise V: Faceted search, Overview
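Conceptually, a facet is a count of the distinct values of one field over the current result set. A minimal sketch in Python; the sample results and the function name facet_counts are assumptions for illustration.

```python
from collections import Counter

# Hypothetical search results
results = [
    {"title": "Faust", "author": "Goethe", "language": "de"},
    {"title": "Wilhelm Tell", "author": "Schiller", "language": "de"},
    {"title": "Hamlet", "author": "Shakespeare", "language": "en"},
    {"title": "Egmont", "author": "Goethe", "language": "de"},
]

def facet_counts(docs, field):
    """Return (value, count) pairs for one field, most frequent first."""
    return Counter(doc[field] for doc in docs).most_common()

print(facet_counts(results, "author"))
print(facet_counts(results, "language"))
```

Drilling down then means restricting the result set to one facet value and recomputing the counts.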
76
‣  Solr provides a faceting mechanism out of the box, including the returning of counts
‣  Important: facet fields must be defined with indexed=true
‣  Facet fields are often analyzed differently than search fields. Therefore it is
common to define separate document fields for faceting in schema.xml
‣  Facet fields should not be tokenized, lower-cased or stemmed
‣  Facet fields can be of type
‣  int, long, float, double, boolean
‣  solr.TextField
‣  date
‣  From a performance point of view, also define
‣  stored=false
‣  omitNorms=true
Solr Exercise V: Faceted search, Solr Faceting
77
<field name="facet_author" type="string" indexed="true" stored="false" omitNorms="true" />
‣  Solr provides two basic mechanism to build facets
‣  Arbitrary faceting (facet.query=query)
‣  Field value faceting (facet.field=fieldname)
‣  In case of Field value faceting two faceting methods can be chosen
‣  Enum Based Field Queries (facet.method=enum)
‣  Field Cache (facet.method=fc)
‣  Other common parameters
Solr Exercise V: Faceted search, Solr Faceting
78
Param name       Param value     Description
facet            true / false    Switch on / off faceting
facet.prefix     string          Facet results must start with prefix
facet.sort       count / index   Sort facet results
facet.limit      number          Limit number of facet results
facet.mincount   number          Minimal count to be considered
Solr Exercise V: Faceted search, Date faceting
79
Param name         Param value       Description
facet.date         fieldname         The name of the field of type date used for date faceting
facet.date.start   date expression   The start date of the first date facet interval
facet.date.end     date expression   The upper bound for the last date facet interval
facet.date.gap     date expression   The size of each date range interval
q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.date=created&
facet.date.start=1996-01-31T23:00:00Z&
facet.date.end=2013-04-21T00:00:00Z&facet.date.gap=%2B1YEAR
Solr Exercise V: Faceted search, Range faceting
80
Param name          Param value   Description
facet.range         fieldname     The name of a field with a numeric field type
facet.range.start   number        The lower bound of the first range interval
facet.range.end     number        The upper bound for the last range interval
facet.range.gap     number        The size of each range interval
q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.range=viewcount&
facet.range.start=0&facet.range.end=150&facet.range.gap=20
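How start, end and gap turn into buckets can be sketched as follows. The sketch assumes that end - start is a multiple of gap; it does not reproduce how Solr treats a last interval that overshoots the end value. The function name and sample values are made up.

```python
def range_facets(values, start, end, gap):
    """Count values per interval [lower, lower + gap), keyed by lower bound.

    Assumes (end - start) is a multiple of gap.
    """
    counts = {lower: 0 for lower in range(start, end, gap)}
    for value in values:
        if start <= value < end:
            lower = start + ((value - start) // gap) * gap
            counts[lower] += 1
    return counts

viewcounts = [5, 25, 30, 99, 100, 150]  # hypothetical field values
print(range_facets(viewcounts, start=0, end=100, gap=20))
```

Values outside the overall range (100 and 150 here) do not fall into any bucket.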
‣  Filter queries restrict the result set of the original query to a specific
subset
‣  The scores of the documents are not influenced by filter queries
‣  Examples
‣  access permissions (ACLs)
‣  categories or tags
‣  Importantly, the results of filter queries are cached by default
‣  Solr uses a separate in-memory filter cache
‣  Thus, filter queries are evaluated very fast once they are cached
‣  Complex, frequently used queries are good candidates for filter queries
‣  Keep in mind that the right size of the filter cache depends on the search scenario and must
therefore be tuned explicitly
Solr Exercise V: Filter query, Overview
81
‣  Filter queries are defined by the query parameter fq
‣  Caching can be switched off for an individual filter query with {!cache=false}
Solr Exercise V: Filter query, Examples
82
q=content:arthur&fq=subject:fantasy&fl=title,author&rows=5
q=content:arthur&fq=subject:fantasy&fq=viewcount:[* TO 100]&
fl=title,author&rows=5
q=content:arthur&fq=subject:fantasy
&fq={!cache=false}viewcount:[* TO 100]&fl=title,author&rows=5
‣  Idea of MoreLikeThis
‣  MoreLikeThis constructs a query based on the terms of a given set of fields
‣  Matching documents are "similar" based on the chosen set of fields
‣  Fields used by MoreLikeThis should define termVectors="true"
Solr Exercise V: MoreLikeThis, Overview
83
Param name   Param value   Description
mlt.fl       fieldnames    Fields to be used by MLT
mlt.mintf    number        Minimum term frequency
mlt.mindf    number        Minimum document frequency
mlt.minwl    number        Minimum word length
mlt.maxwl    number        Maximum word length
mlt.maxqt    number        Maximum number of query terms
q=content:schiller&mlt=true&mlt.fl=subject&mlt.mindf=50
&mlt.mintf=1
‣  Advanced queries:
‣  Dismax query parser
‣  Sorting
‣  Grouping
‣  Autosuggestion
Solr Exercise VI
84
‣  Motivation
‣  The standard Solr parser only supports simple query control
‣  One field can be defined as the default search field
‣  Supports only Boolean combinations of sub-queries (AND / OR)
‣  Strict query syntax to perform e.g. phrase queries
‣  Dismax (and eDismax) query parsers are more robust query parsers offering
various additional query parameters and controls to optimize queries
‣  These additional query parameters and controls are hidden from the user
‣  Dismax stands for Disjunction Max
‣  Disjunction means that multiple fields can be searched simultaneously with
different field weights
‣  Max means that the maximum score of the field matches is taken as the
document score (instead of the sum)
Solr Exercise VI: DisMax Parser, Overview
85
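The difference between taking the maximum and taking the sum can be shown with a small sketch. The tie factor mixes in the scores of the non-best fields; the function name and the sample scores are assumptions, and real Lucene scoring involves more factors.

```python
def dismax_score(field_scores, tie=0.0):
    """Score = best field score + tie * (sum of the other field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others

scores = {"author": 2.0, "content": 0.5}  # hypothetical per-field scores
print(dismax_score(scores))           # pure maximum
print(dismax_score(scores, tie=1.0))  # behaves like a plain sum
```

With tie=0.0 a document is ranked by its best matching field alone, so a strong match in one field is not outweighed by many weak matches elsewhere.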
Param name | Description
q.alt | Alternative query executed if the user query is not specified or blank
qf | The query fields to be searched. Each field can be given an individual field weight.
mm | Minimum number of query words that must match for a document to count as a match
pf | Defines phrase fields. Boosts documents that have the search terms in close proximity within the phrase fields.
ps | The phrase slop affecting the boosting of phrase queries evaluated on the pf fields
qs | The phrase slop for user-defined phrase queries
bq | A raw query that is added to the user query to influence scoring
bf | Function queries that are added to the user query to influence scoring
DisMax Parser Parameters (Solr Exercise VI)
DisMax Parser Examples (Solr Exercise VI)
http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3
http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&bq=subject:drama^5.0
‣  Ranking (= ordering) the document results based on some criteria
‣  By default, ranking is based on the document score
‣  The sort parameter ranks the document results by an arbitrary field or even a function
‣  Sort fields must be defined as indexed=true and multiValued=false
‣  Syntax: …&sort=fieldname [asc/desc],fieldname [asc/desc],…
Sorting Overview (Solr Exercise VI)
http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&sort=viewcount+desc
Grouping Overview (Solr Exercise VI)
‣  Motivation
‣  Documents with a common value for some field are partitioned into groups
‣  Documents with the same field value are collapsed to a single result
Grouping Parameters (Solr Exercise VI)
Query parameter | Query value | Description
group | true / false | Switch grouping on / off
group.field | fieldname | Field to group on
rows | number | Number of groups returned
start | number | Offset into the list of returned groups
group.limit | number | Number of docs returned for each group
group.offset | number | Offset into the list of returned documents per group
sort | fieldname [asc/desc] | Sort groups on some field
group.sort | fieldname [asc/desc] | Sort the documents within every group on some field
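How rows, start, group.limit and group.offset interact can be illustrated with a small in-memory sketch. The function group_results and the sample documents are hypothetical. The sketch assumes the documents arrive already in rank order and ignores sort and group.sort.

```python
def group_results(docs, field, rows=10, start=0, group_limit=1, group_offset=0):
    """Partition ranked docs into groups by a field value and page through them."""
    groups = {}
    for doc in docs:  # docs are assumed to be in rank order already
        groups.setdefault(doc[field], []).append(doc)
    # rows / start page through the list of groups ...
    selected = list(groups.items())[start:start + rows]
    # ... while group.limit / group.offset page through the docs of each group
    return [
        (value, members[group_offset:group_offset + group_limit])
        for value, members in selected
    ]


docs = [
    {"title": "Don Carlos", "author": "schiller"},
    {"title": "Faust", "author": "goethe"},
    {"title": "Wilhelm Tell", "author": "schiller"},
    {"title": "Egmont", "author": "goethe"},
    {"title": "Nathan der Weise", "author": "lessing"},
]

first_page = group_results(docs, "author", rows=2, group_limit=1)
for author, members in first_page:
    print(author, [d["title"] for d in members])
```

With group_limit=1 every author is collapsed to the best-ranked book, which is the "collapsed to a single result" behaviour described on the overview slide.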
Autosuggestion Overview (Solr Exercise VI)
‣  Autosuggestion (aka Autocomplete) is a common search feature that supports
the user by providing query suggestions during typing
‣  Autosuggestion functionality can include
‣  the search index
‣  separate word lists
‣  synonyms / black lists
‣  grouping suggestions
‣  Fuzziness
‣  Whatever mechanism is actually used to provide autosuggestion, it must return suggestions very quickly.
‣  Solr provides different mechanisms to build autosuggestion functionality:
‣  using facet search
‣  using standard search (standard query parser)
‣  using spellchecker Solr plugin
Autosuggestion Overview (Solr Exercise VI)
Autosuggestion Using faceting (Solr Exercise VI)
‣  Define a new field title_auto used for autosuggestion
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="title_auto" type="text_auto" indexed="true" stored="true" />
<copyField source="title" dest="title_auto" />
‣  Define the field type text_auto providing specific analysis for autosuggestion
<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
‣  How to get suggestions for a user query?
q=*:*&facet=true&facet.field=title_auto&facet.mincount=1&facet.prefix=schi
Autosuggestion Using standard search (Solr Exercise VI)
‣  Again, define the new field title_auto as on the previous slide
‣  Next, redefine the field type text_auto as follows
<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
</fieldType>
‣  Now, you can use the standard Solr query parser to get suggestions
q=title_auto:query&q.op=AND&rows=5&fl=title
q=title_auto:query&q.op=AND&rows=0&facet=true&facet.field=tag&facet.mincount=1&facet.limit=5
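What the keyword tokenizer, lowercase filter and front edge n-gram filter do to a title can be sketched in a few lines. This is only an illustration of the analysis chain, not Solr code; the functions edge_ngrams and suggest and the sample titles are hypothetical, and the lookup uses the raw lowercased prefix rather than analyzing the query.

```python
def edge_ngrams(text, min_gram=1, max_gram=25):
    """Front edge n-grams of the whole lowercased value (treated as one token)."""
    token = text.lower()
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def suggest(prefix, titles, limit=5):
    """Look up a typed prefix in a tiny inverted index of edge n-grams."""
    index = {}
    for title in titles:
        for gram in edge_ngrams(title):
            index.setdefault(gram, []).append(title)
    return index.get(prefix.lower(), [])[:limit]


titles = ["Schiller: Don Carlos", "Schopenhauer", "Goethe: Faust"]

print(edge_ngrams("Schiller", 1, 3))
print(suggest("schi", titles))
```

Because every prefix is indexed as its own term, a suggestion is a plain term lookup at query time, which is what makes this variant fast.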
Elasticsearch is a “distributed-from-scratch” search server
based on Lucene
Created by Shay Banon with a first version made public in 02/2010:
ElasticSearch itself was born out of my frustration with the fact that there isn’t really
a good, open source, solution for distributed search engine out there, which also
combines what I expect of search engines after building Compass (and on that, I will
blog later…).
I have been working on this for the past several months, pouring my search and
distributed knowledge into this (and portions of my heart and time ;) )
[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
Overview (Projects)
http://www.elasticsearch.org/
‣  Current stable version 0.20.6
‣  Licensed under the Apache License 2.0
‣  Small group of core developers, but strong support from valued Lucene committers
‣  Already a promising list of users (small and big companies)
‣  github, soundcloud, stackoverflow, mozilla, klout
‣  http://www.elasticsearch.org/users/
Overview (Projects)
‣  Pure Java application
‣  Search, indexing and scoring are done by Lucene
‣  Document-oriented
‣  Schema-less
‣  Well, ElasticSearch might be schema-less, Lucene isn’t!
‣  ElasticSearch therefore detects field types automatically
‣  However, a schema is still needed! Why?
‣  HTTP & JSON API for all interactions
‣  Indexing / Updating
‣  Searching
‣  Administration / Monitoring
‣  Distribution is a fundamental feature of ElasticSearch!
Highlights (Projects)
‣  Facet search and filtering (values, queries, date/time ranges)
‣  Lots of query types
‣  Script filters
‣  Geospatial search called GeoShape Query
‣  Configurable caching for
‣  Filters
‣  Field data
‣  NRT search with separate API
‣  Sorting, Highlighting
‣  MoreLikeThis based on document or field
‣  Multi Tenancy:
‣  Define multiple indices that, e.g., handle documents differently during indexing
‣  Still, you can search over them with one query
Highlights (Projects)
‣  ElasticSearch Gateway Module stores indices and metadata to:
‣  Local FS, Shared FS, Hadoop, Amazon S3
‣  River Interface:
‣  Pluggable service to constantly pull data
‣  Managed via a specific REST endpoint
‣  Implementations for CouchDB, MongoDB
‣  Lucene analyzers are specified in elasticsearch.yml or via the API
‣  Bulk indexing
‣  Default: single document indexing
‣  Bulk indexing via specific REST endpoints
Highlights (Projects)
+  Simple but effective architecture
+  Ease of use, even when using distributed search
+  High maturity, even though ES is young
+  Modern technologies used
+  HTTP and JSON only
-  Shard splitting is not trivial
-  Still a small community and a small group of core developers
-  Compared to Solr:
-  Fewer query types
-  Fewer possibilities for boosting
-  Fewer analyzers
-  Missing features such as clustering, autocomplete, spell checking
Pros & Cons (Projects)
‣  Installation
‣  Indexing
‣  Queries I
ES Exercise I
Installation (ES Exercise I)
‣  On Linux systems
unzip elasticsearch-0.20.6.zip
cd elasticsearch-0.20.6
bin/elasticsearch -f
‣  On Windows systems
[unzip elasticsearch-0.20.6.zip]
cd elasticsearch-0.20.6
bin/elasticsearch.bat -f
‣  Run
curl -X GET http://localhost:9200/
Installation (ES Exercise I)
‣  On Linux systems
unzip elasticsearch-0.20.6.zip
cd elasticsearch-0.20.6
bin/elasticsearch -p path/to/pidfile
‣  Run
curl -X GET http://localhost:9200/
‣  Shutdown
curl -XPOST 'http://localhost:9200/_shutdown'
curl -XPOST 'http://localhost:9200/_cluster/nodes/_shutdown'
‣  bin/
‣  elasticsearch [elasticsearch.bat] script to start the elasticsearch server
‣  plugin [plugin.bat] script to install plugins
‣  config/
‣  contains the global configuration
‣  server config file elasticsearch.yml
‣  logging config file logging.yml
‣  data/
‣  standard directory containing index data
‣  configurable by path.data
ES_HOME (ES Exercise I)
‣  lib/
‣  shared library directory
‣  place additional libraries here
‣  logs/
‣  log files will be placed here using default log configuration
‣  configurable by path.logs in elasticsearch.yml
ES_HOME (ES Exercise I)
‣  cluster
‣  one or more nodes form a cluster
‣  usually distributed over various machines
‣  one master node that is automatically chosen
‣  node
‣  running instance of elasticsearch
‣  a node automatically discovers other nodes at start up
‣  node discovery is done either using unicast or multicast messages
‣  index
‣  separate document database model with own mapping and types
‣  is partitioned in one or more primary and replica shards
Terminology (ES Exercise I)
‣  mapping
‣  schema definition defining types with their associated fields
‣  field types and properties
‣  shard
‣  low level data structure of elasticsearch
‣  single Lucene index
‣  managed automatically by elasticsearch
‣  primary shard
‣  every document is stored in exactly one primary shard
‣  all primary shards together make up the documents of the index
‣  default: 5 primary shards
Terminology (ES Exercise I)
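That every document lives in exactly one primary shard can be pictured as a hash of the document ID taken modulo the number of primary shards. The sketch below is an illustration only: zlib.crc32 is a stand-in hash chosen for this example and is not the hash function Elasticsearch uses, and the function name primary_shard_for is hypothetical.

```python
import zlib


def primary_shard_for(doc_id, number_of_primary_shards=5):
    """Map a document ID to one primary shard (illustrative stand-in hash)."""
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = ["1", "2", "3", "book-42", "6a8ca01c-7896-48e9-81cc-9f70661fcb32"]
placement = {doc_id: primary_shard_for(doc_id) for doc_id in doc_ids}

for doc_id, shard in placement.items():
    print(doc_id, "-> primary shard", shard)
```

The same ID always maps to the same shard, so a document can be found again without asking every shard. The result depends on the shard count, which is one reason changing the number of primary shards later is not trivial.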
‣  replica shard
‣  each primary shard is replicated 0 or more times
‣  replica shards are distributed automatically
‣  replica shards are used for search and primary shard fail-over
‣  type
‣  within an index zero or more types can be defined
‣  a type defines a certain set of fields, similar to a table structure
‣  types are defined in the mapping
Terminology (ES Exercise I)
‣  Index API
‣  index (PUT/POST)
‣  update (PUT/POST)
‣  delete (DELETE)
‣  delete by query (DELETE)
‣  Documents are defined as JSON objects
‣  index and type are defined in the url path
‣  automatic creation of an index and mapping
‣  action.auto_create_index
‣  index.mapper.dynamic
‣  elasticsearch automatically identifies field types based on the JSON input
‣  automatic ID generation
Index API (ES Exercise I)
Index API (ES Exercise I)
‣  Index a book
$ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
  "author" : "bernhard pflugfelder",
  "post_date" : "2013-04-22T14:12:12",
  "title" : "my first book",
  "abstract" : "this book is about elasticsearch"
}'
‣  Index a book, defining a named type
$ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
  "book" : {
    "author" : "bernhard pflugfelder",
    "post_date" : "2013-04-22T14:12:12",
    "title" : "my first book",
    "abstract" : "this book is about elasticsearch"
  }
}'
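The same index request can be prepared from Python with the standard library. This is a minimal sketch, assuming a hypothetical helper build_index_request; it only constructs the request object and does not send it. Serializing with json.dumps avoids hand-written JSON mistakes such as trailing commas.

```python
import json
from urllib.request import Request


def build_index_request(index, doc_type, doc, doc_id=None,
                        host="http://localhost:9200"):
    """Prepare an index request: PUT with an explicit ID, POST without one."""
    body = json.dumps(doc).encode("utf-8")
    if doc_id is None:
        # No ID given: POST to the type and let the server generate one
        url = "%s/%s/%s/" % (host, index, doc_type)
        method = "POST"
    else:
        url = "%s/%s/%s/%s" % (host, index, doc_type, doc_id)
        method = "PUT"
    return Request(url, data=body, method=method)


book = {
    "author": "bernhard pflugfelder",
    "post_date": "2013-04-22T14:12:12",
    "title": "my first book",
    "abstract": "this book is about elasticsearch",
}

request = build_index_request("books", "book", book, doc_id=1)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it to a running node.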
Index API (ES Exercise I)
‣  Index a book with automatic ID generation
$ curl -XPOST 'http://localhost:9200/books/book/' -d '{
  "author" : "bernhard pflugfelder",
  "post_date" : "2013-04-22T14:12:12",
  "title" : "my first book",
  "abstract" : "this book is about elasticsearch"
}'
‣  Result
{
  "ok" : true,
  "_index" : "books",
  "_type" : "book",
  "_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
  "_version" : 1
}
‣  Update operations are done by providing a script that manipulates the field structure
‣  The update process consists of the following steps:
‣  fetch the requested document
‣  apply the script
‣  index the result as a new document
‣  Only the source field _source can be updated
‣  _source is stored in the index by default
‣  stores the actual JSON used at index time
‣  can be disabled for every type separately
‣  can be compressed (from version 0.90 compression is done automatically)
Index API (ES Exercise I)
{
  "book" : {
    "_source" : { "enabled" : false }
  }
}
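The three update steps (fetch, apply the script, index again) can be mimicked on an in-memory store. This is a sketch under simplifying assumptions: the store is a plain dictionary, the "script" is a Python function instead of a server-side script, and update_document is a hypothetical name.

```python
def update_document(store, doc_id, script):
    """Fetch a document, apply a script to its source, index the result again."""
    current = store[doc_id]                # 1. fetch the requested document
    source = dict(current["_source"])      #    work on a copy of _source
    script(source)                         # 2. apply the script
    store[doc_id] = {                      # 3. index the result as a new version
        "_source": source,
        "_version": current["_version"] + 1,
    }
    return store[doc_id]


def set_tag(source):
    """Stand-in for the script ctx._source.tag = "search"."""
    source["tag"] = "search"


store = {"1": {"_source": {"title": "my first book"}, "_version": 1}}
updated = update_document(store, "1", set_tag)
print(updated)
```

The sketch also shows why _source is needed for updates: without the stored original JSON there would be nothing to fetch and modify.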
Index API (ES Exercise I)
‣  Create a new field tag
curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tag = \"search\""
}'
‣  Replace the value of field tag
curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tag = \"search technologies\""
}'
‣  Add an additional value for the field tag
curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
  "script" : "ctx._source.tags += tag",
  "params" : {
    "tag" : "open source technologies"
  }
}'
Index API (ES Exercise I)
‣  Delete a document based on its unique ID
$ curl -XDELETE 'http://localhost:9200/books/book/1'
‣  Delete a document based on a search query
$ curl -XDELETE 'http://localhost:9200/books/book/_query' -d '{
  "term" : { "author" : "bernhard pflugfelder" }
}'
Search API (ES Exercise I)
‣  Term query
$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : {
    "term" : { "author" : "bernhard" }
  }
}'
‣  Terms query
$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : {
    "terms" : {
      "author" : [ "bernhard", "pflugfelder" ],
      "minimum_match" : 1
    }
  }
}'
‣  Match queries accept text, numeric and date values
‣  Match queries are applied per field; the proper analyzer is chosen automatically
‣  Types of match queries
‣  boolean (default)
‣  phrase match
‣  phrase prefix match
‣  multi match (two or more fields are searched)
Search API (ES Exercise I)
Search API (ES Exercise I)
‣  Simple syntax
$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : {
    "match" : { "abstract" : "about elasticsearch" }
  }
}'
‣  Extended syntax
{ "match" : {
    "abstract" : {
      "query" : "about elasticsearch",
      "operator" : "and"
    }
} }
Param name | Param value | Description
operator | "and", "or" | boolean operator
fuzziness | 0.0 – 1.0 | add fuzziness to the original terms
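The effect of the operator parameter of a boolean match query can be sketched as follows. The analyzer here is a deliberately naive lowercase-and-split stand-in, and the function names analyze and matches are hypothetical.

```python
def analyze(text):
    """Naive stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches(field_value, query_text, operator="or"):
    """Boolean match: 'and' needs every query term, 'or' needs at least one."""
    doc_terms = set(analyze(field_value))
    hits = [term in doc_terms for term in analyze(query_text)]
    return all(hits) if operator == "and" else any(hits)


abstract_es = "this book is about elasticsearch"
abstract_solr = "this book is about solr"

print(matches(abstract_es, "about elasticsearch", operator="and"))
print(matches(abstract_solr, "about elasticsearch", operator="and"))
print(matches(abstract_solr, "about elasticsearch", operator="or"))
```

The query text is analyzed first and the resulting terms are then combined, which is the difference to a term query that looks up a single exact term.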
Search API (ES Exercise I)
‣  Simple syntax
$ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
  "query" : {
    "match_phrase" : { "abstract" : "about elasticsearch" }
  }
}'
‣  Extended syntax
{ "match_phrase" : {
    "abstract" : {
      "query" : "about elasticsearch",
      "operator" : "and"
    }
} }
Param name | Param value | Description
slop | number | phrase sloppiness
analyzer | string | analyzer name to be used for the query
‣  Mapping (aka schema)
‣  Field types
‣  Analyzers
‣  Queries II
ES Exercise II
‣  The schema mapping defines the index structure and document representation
‣  Elasticsearch works without an explicit schema (“schema-less”)
‣  Automatic inference is, however, dangerous in many situations
‣  Thus, defining an explicit schema is the preferred way
‣  A mapping consists of:
‣  type name
‣  list of fields (i.e. properties)
‣  each property defines a field type and, optionally, field attributes
‣  Mappings are formatted in JSON
‣  Mappings are managed using the Mapping API (PUT / POST / GET)
Mapping (ES Exercise II)
Mapping (ES Exercise II)
‣  Define a mapping for type book
$ echo '{
  "mappings" : {
    "books" : {
      "properties" : {
        "id" : { "type" : "string" },
        "title" : { "type" : "string" },
        "author" : { "type" : "string" },
        "subject" : { "type" : "string" },
        "view_count" : { "type" : "integer" },
        "created" : { "type" : "date", "format" : "dateOptionalTime" }
      }
    }
  }
}' > book.json
$ curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json
‣  Retrieve the current mapping for type book
$ curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
‣  Field types
‣  string, date
‣  number
‣  byte, short, integer, long, float, double
‣  boolean, binary (BASE64)
‣  Common field attributes
Mapping (ES Exercise II)
Name | Value | Description
index_name | string | Field name stored within the index
index | yes / no | Whether the field shall be searchable
store | yes / no | Whether original values shall be stored
analyzer | string | Analyzer used for that field
null_value | value | Default field value if a document does not assign one
Analyzers (ES Exercise II)
‣  Analyzers are defined either
‣  in elasticsearch.yml or elasticsearch.json
‣  by the Index API
‣  Common analyzers
‣  standard
‣  whitespace
‣  stop
‣  keyword
‣  language
‣  snowball
curl 'localhost:9200/_analyze?analyzer=standard' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=whitespace' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=stop' -d 'elasticsearch is groovy!'
curl 'localhost:9200/_analyze?analyzer=keyword' -d 'elasticsearch is groovy!'
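To get a feeling for how these analyzers differ before running the requests, the following sketch approximates four of them in plain Python. These are rough stand-ins, not the Lucene implementations: the stop-word list is a tiny sample chosen for this example, and the stop-word handling of the real standard analyzer, which depends on version and configuration, is left out.

```python
import re

# Tiny sample of English stop words, chosen for this example only
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def standard(text):
    """Approximation: lowercase, split at non-word characters."""
    return re.findall(r"\w+", text.lower())


def whitespace(text):
    """Split at whitespace only; case and punctuation are kept."""
    return text.split()


def stop(text):
    """Approximation: lowercase runs of letters, stop words removed."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def keyword(text):
    """The whole input becomes a single token."""
    return [text]


text = "elasticsearch is groovy!"
for analyzer in (standard, whitespace, stop, keyword):
    print(analyzer.__name__, analyzer(text))
```

Comparing the printed token lists with the output of the _analyze requests is a quick way to see where the real analyzers behave differently.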
Analyzers (ES Exercise II)
discovery.zen.multicast.enabled: false
http:
  max_content_length: 100000
index:
  number_of_shards: 1
  analysis:
    analyzer:
      default:
        type: standard
      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase]
‣  Elasticsearch provides two highlighting algorithms
‣  fast vector highlighter
‣  highlighter (standard implementation)
‣  Requirement for using the fast vector highlighter
Highlighting (ES Exercise II)
{ "books" : {
    "title" : { "type" : "string",
                "term_vector" : "with_positions_offsets" }
} }
{
  "query" : {...},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "_all" : {}
    }
  }
}
‣  Faceted search
‣  Filter query
‣  Sorting
‣  More Like This
ES Exercise III
‣  Elasticsearch provides the following facet mechanisms:
‣  Group results by a field value
‣  Group by numeric or date ranges
‣  Group numeric or date values in equally sized buckets (histogram)
‣  Group results around a coordinate based on the geo distance
‣  Basic facet definition
‣  Facet types: terms, range, histogram, date_histogram, geo_distance
Faceted search (ES Exercise III)
{
  "facets" : {
    "<FACET NAME>" : {
      "<FACET TYPE>" : { ... },
      "global" : true
    }
  }
}
Faceted search (ES Exercise III)
curl -X POST 'http://localhost:9200/gutenberg/books/_search?pretty=1' -d '
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "author": "schiller"
    }
  },
  "facets": {
    "tagsFacet": {
      "terms": {
        "field": "subject",
        "size": 10
      }
    }
  }
}'
Faceted search (ES Exercise III)
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "range1" : {
      "range" : {
        "view_count" : [
          { "to" : 50 },
          { "from" : 20, "to" : 70 },
          { "from" : 70, "to" : 120 },
          { "from" : 150 }
        ]
      }
    }
  }
}
‣  Histogram facet works on any numeric field
‣  Field values are rounded to fit in the respective bucket
‣  The property interval defines the bucket size
Faceted search (ES Exercise III)
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "histo1" : {
      "histogram" : {
        "field" : "view_count",
        "interval" : 100
      }
    }
  }
}
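The bucketing of a histogram facet can be reproduced in a few lines: every value is rounded down to the nearest multiple of the interval, and the values per bucket are counted. The function histogram_facet and the sample view counts are hypothetical and only illustrate the idea.

```python
from collections import Counter


def histogram_facet(values, interval):
    """Count values per bucket; a bucket key is the value rounded down to the interval."""
    counts = Counter((value // interval) * interval for value in values)
    return sorted(counts.items())


view_counts = [5, 42, 99, 100, 250, 270]
buckets = histogram_facet(view_counts, interval=100)

for key, count in buckets:
    print(key, count)
```

Buckets without any value simply do not appear in the result of this sketch.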
‣  Elasticsearch also provides filter queries, which are cached internally for optimal performance
‣  A filter can be applied to the result of the search query, as shown here
Filter query (ES Exercise III)
curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '
{
  "query" : {
    "term" : { "title" : "schiller" }
  },
  "filter" : {
    "term" : { "subject" : "drama" }
  },
  "facets" : {
    "tag" : {
      "terms" : { "field" : "subject" }
    }
  }
}'
‣  Alternatively, the filter is applied while the user query is being executed
‣  Difference to previous filter query?
Filter query (ES Exercise III)
curl -XPOST 'localhost:9200/books/_search?pretty=1' -d '
{
  "query" : {
    "filtered" : {
      "query" : {
        "term" : { "author" : "schiller" }
      },
      "filter" : {
        "range" : {
          "view_count" : { "from" : 50, "to" : 100 }
        }
      }
    }
  }
}'
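One way to approach the question of how the two filter variants differ is to look at the facet counts. The sketch below simulates both variants on a handful of made-up documents. It assumes the behaviour described in the Elasticsearch documentation of that time, namely that a top-level filter narrows the hits but not the facets, whereas a filtered query narrows both. All function names are hypothetical.

```python
from collections import Counter

docs = [
    {"title": "schiller dramas", "subject": "drama"},
    {"title": "schiller poems", "subject": "poetry"},
    {"title": "schiller letters", "subject": "letters"},
    {"title": "goethe dramas", "subject": "drama"},
]


def term(field, value):
    """Return a predicate that checks for a single term in a field."""
    return lambda doc: value in doc[field].split()


def facet_counts(matched, field):
    return dict(Counter(doc[field] for doc in matched))


def search_with_top_level_filter(docs, query, flt, facet_field):
    matched = [doc for doc in docs if query(doc)]
    facets = facet_counts(matched, facet_field)   # facets see all query matches
    hits = [doc for doc in matched if flt(doc)]   # filter narrows only the hits
    return hits, facets


def search_with_filtered_query(docs, query, flt, facet_field):
    matched = [doc for doc in docs if query(doc) and flt(doc)]
    return matched, facet_counts(matched, facet_field)  # facets are narrowed too


query = term("title", "schiller")
drama_only = term("subject", "drama")

print(search_with_top_level_filter(docs, query, drama_only, "subject"))
print(search_with_filtered_query(docs, query, drama_only, "subject"))
```

Both variants return the same hits here; only the facet counts differ, which matters for drill-down navigation where the other facet values should stay visible.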
‣  Sorting is done based on one or multiple fields
‣  With multiple sort fields, results are ordered by the first field; later fields break ties
‣  ascending / descending sorting
‣  _score refers to sorting by the document score
Sorting (ES Exercise III)
curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '
{
  "sort" : [
    { "view_count" : { "order" : "desc" } },
    "_score"
  ],
  "query" : {
    "term" : { "title" : "schiller" }
  }
}'
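The sort definition above (view_count descending, then _score) corresponds to an ordinary multi-key sort. A minimal sketch with made-up hits, assuming that sorting on _score ranks higher scores first; the function sort_hits is hypothetical.

```python
def sort_hits(hits):
    """Order by view_count descending, then by _score descending as tie-breaker."""
    return sorted(hits, key=lambda hit: (-hit["view_count"], -hit["_score"]))


hits = [
    {"title": "a", "view_count": 10, "_score": 1.2},
    {"title": "b", "view_count": 50, "_score": 0.4},
    {"title": "c", "view_count": 50, "_score": 0.9},
]

ordered = sort_hits(hits)
print([hit["title"] for hit in ordered])
```

The score only decides the order of "b" and "c" because their view counts are equal.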
mlt query (ES Exercise III)
curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '
{
  "query" : {
    "more_like_this" : {
      "fields" : ["title", "subject"],
      "like_text" : "text like this one",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}'
Name | Value | Description
fields | fieldname(s) | List of fields used for mlt
like_text | string | The text to find docs like
min_term_freq | number | Minimum term frequency
max_query_terms | number | Maximum number of query terms
min_doc_freq | number | Minimum document frequency
max_doc_freq | number | Maximum document frequency
percent_terms_to_match | 0.0 – 1.0 | Percentage of terms that must match

Indexing Text and HTML Files with Solr
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Lecture11_LaravelGetStarted_SPring2023.pdf
Lecture11_LaravelGetStarted_SPring2023.pdfLecture11_LaravelGetStarted_SPring2023.pdf
Lecture11_LaravelGetStarted_SPring2023.pdf
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Introduction to Apache Roller
Introduction to Apache RollerIntroduction to Apache Roller
Introduction to Apache Roller
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to back
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
E commerce Search using Apache Solr
E commerce Search using Apache SolrE commerce Search using Apache Solr
E commerce Search using Apache Solr
 

More from inovex GmbH

lldb – Debugger auf Abwegen
lldb – Debugger auf Abwegenlldb – Debugger auf Abwegen
lldb – Debugger auf Abwegeninovex GmbH
 
Are you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AIAre you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AIinovex GmbH
 
Why natural language is next step in the AI evolution
Why natural language is next step in the AI evolutionWhy natural language is next step in the AI evolution
Why natural language is next step in the AI evolutioninovex GmbH
 
Network Policies
Network PoliciesNetwork Policies
Network Policiesinovex GmbH
 
Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learninginovex GmbH
 
Jenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen UmgebungenJenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen Umgebungeninovex GmbH
 
AI auf Edge-Geraeten
AI auf Edge-GeraetenAI auf Edge-Geraeten
AI auf Edge-Geraeteninovex GmbH
 
Prometheus on Kubernetes
Prometheus on KubernetesPrometheus on Kubernetes
Prometheus on Kubernetesinovex GmbH
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systemsinovex GmbH
 
Representation Learning von Zeitreihen
Representation Learning von ZeitreihenRepresentation Learning von Zeitreihen
Representation Learning von Zeitreiheninovex GmbH
 
Talk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale AssistentenTalk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale Assistenteninovex GmbH
 
Künstlich intelligent?
Künstlich intelligent?Künstlich intelligent?
Künstlich intelligent?inovex GmbH
 
Das Android Open Source Project
Das Android Open Source ProjectDas Android Open Source Project
Das Android Open Source Projectinovex GmbH
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretabilityinovex GmbH
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseinovex GmbH
 
People & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madnessPeople & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madnessinovex GmbH
 
Infrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with PulumiInfrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with Pulumiinovex GmbH
 

More from inovex GmbH (20)

lldb – Debugger auf Abwegen
lldb – Debugger auf Abwegenlldb – Debugger auf Abwegen
lldb – Debugger auf Abwegen
 
Are you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AIAre you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AI
 
Why natural language is next step in the AI evolution
Why natural language is next step in the AI evolutionWhy natural language is next step in the AI evolution
Why natural language is next step in the AI evolution
 
WWDC 2019 Recap
WWDC 2019 RecapWWDC 2019 Recap
WWDC 2019 Recap
 
Network Policies
Network PoliciesNetwork Policies
Network Policies
 
Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
 
Jenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen UmgebungenJenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen Umgebungen
 
AI auf Edge-Geraeten
AI auf Edge-GeraetenAI auf Edge-Geraeten
AI auf Edge-Geraeten
 
Prometheus on Kubernetes
Prometheus on KubernetesPrometheus on Kubernetes
Prometheus on Kubernetes
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Azure IoT Edge
Azure IoT EdgeAzure IoT Edge
Azure IoT Edge
 
Representation Learning von Zeitreihen
Representation Learning von ZeitreihenRepresentation Learning von Zeitreihen
Representation Learning von Zeitreihen
 
Talk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale AssistentenTalk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale Assistenten
 
Künstlich intelligent?
Künstlich intelligent?Künstlich intelligent?
Künstlich intelligent?
 
Dev + Ops = Go
Dev + Ops = GoDev + Ops = Go
Dev + Ops = Go
 
Das Android Open Source Project
Das Android Open Source ProjectDas Android Open Source Project
Das Android Open Source Project
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretability
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
People & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madnessPeople & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madness
 
Infrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with PulumiInfrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with Pulumi
 

Recently uploaded

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Suche mit Apache Lucene & Co.

  • 1. Suche mit Apache Lucene & Co (Search with Apache Lucene & Co)
    Christian Meder, Bernhard Pflugfelder, inovex GmbH
  • 2. Speaker: Christian Meder
    ‣ CTO@inovex
    ‣ Background: open source (free software), Linux, Web, Java, Android
  • 3. Speaker: Bernhard Pflugfelder
    ‣ bpflugfelder@inovex.de
    ‣ Big Data Engineer@inovex
    ‣ Background: Lucene, Solr, Text Mining Technologies, Information Retrieval, Hadoop, Java
  • 4. Agenda: Session I
    ‣ 09:00 - 09:30 Introduction, Search in a nutshell
    ‣ 09:30 - 10:00 Solr Exercise 1: Installation, Web Admin Interface
    ‣ 10:00 - 10:30 Solr Exercise 2: Indexing, Queries I
    ‣ 10:30 - 11:00 Coffee Break
    ‣ 11:30 - 12:00 Solr Exercise 3: Data ingestion XML / SQL, Queries II
  • 5. Agenda: Session II
    ‣ 12:00 - 12:30 Solr Exercise 4: Schema, Data types, Analyzers, Stemming
    ‣ 12:30 - 13:30 Lunch
    ‣ 13:30 - 14:00 Solr Exercise 5: Facet search, Filter search, Interval search
    ‣ 14:00 - 14:30 Solr Exercise 6: Dismax, Autosuggestion, MoreLikeThis
  • 6. Agenda: Session III
    ‣ 14:30 - 15:00 ES Exercise 1: Installation, Indexing, Queries I
    ‣ 15:00 - 15:30 Coffee Break
    ‣ 15:30 - 16:00 ES Exercise 2: Schema, Data types, Analyzers, Queries II
    ‣ 16:00 - 16:30 ES Exercise 3: Data ingestion SQL / XML
    ‣ 16:30 - 17:00 ES Exercise 4: Facet search, Filter search, Interval search
  • 8. Introduction: Classical search applications
    ‣ Classical search applications focus on information or document retrieval
    ‣ Requirement: find the information the user asks for!
    ‣ Some examples:
      ‣ Web search
      ‣ Enterprise search
      ‣ Document search (within DMS or CMS)
      ‣ Search on portals and archives
      ‣ Product search
      ‣ Specialized searches for people, companies, etc.
  • 9. Introduction: Where search is used: Enterprise search
  • 10. Introduction: Where search is used: Online shops
  • 11. Introduction: Where search is used: Semantic search @ Google
  • 12. Introduction: Where search is used: Navigation & Information access
  • 14. Introduction: NoSQL database / document store
    ‣ Can you think of other scenarios where search applications would also do a good job?
    ‣ Recall the key capabilities of search technologies:
      ‣ Persistence
      ‣ Flexible data model
      ‣ Unstructured data, but not only
      ‣ Extremely quick access to data
      ‣ Horizontal scalability
    There are plenty of application scenarios out there where search technologies should be considered!
  • 15. Projects: Hot open source search technologies
    ‣ Lucene: http://lucene.apache.org
    ‣ Solr: http://lucene.apache.org/solr/
    ‣ Elasticsearch: http://www.elasticsearch.org
  • 16. Projects: Lucene overview (http://lucene.apache.org/)
    Lucene is an open source, pure Java API for enabling information retrieval
    ‣ Originally developed by Doug Cutting in 1999; joined Apache (Jakarta) in 2001 and has been a top-level project since 2005
    ‣ Licensed under the Apache License 2.0
    ‣ Pure Java library, with ports to other languages:
      ‣ Lucene.NET (http://lucenenet.apache.org)
      ‣ PyLucene (http://lucene.apache.org/pylucene/)
      ‣ and more: http://wiki.apache.org/lucene-java/LuceneImplementations
    ‣ Large and very active developer community, well documented and supported (38 active committers!)
    ‣ Current stable release: 4.2.1
    ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy
  • 17. Projects: Lucene highlights
    ‣ Scalable, High-Performance Indexing
      ‣ over 95GB/hour on modern hardware
      ‣ small RAM requirements
      ‣ incremental indexing as fast as batch indexing
      ‣ index size roughly 20-30% the size of text indexed
    ‣ Powerful, Accurate and Efficient Search Algorithms
      ‣ ranked searching -- best results returned first
      ‣ many powerful query types
      ‣ fielded searching (e.g., title, author, contents)
      ‣ date-range searching
      ‣ sorting by any field
      ‣ multiple-index searching with merged results
      ‣ allows simultaneous update and searching
    [From http://lucene.apache.org/core/features.html]
  • 18. Projects: Solr overview (http://lucene.apache.org/solr/)
    Solr is a standalone enterprise search server & document store based on Lucene
    ‣ Created by Yonik Seeley at CNET Networks in 2004
    ‣ Entered the Apache Incubator in 2006, became TLP in 2007
    ‣ Licensed under the Apache License 2.0
    ‣ Seeley and others founded Lucid Imagination -> LucidWorks
    ‣ Large and very active developer community, well documented and supported (with a strong relationship to the Lucene community)
    ‣ Current stable release: 4.2.1
    ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers
  • 19. Projects: Solr highlights
    ‣ Architectural highlights
      ‣ Extensible Plugin Architecture
      ‣ SolrCloud: distributed indexing and search architecture
      ‣ Efficient Replication to other Solr Search Servers
      ‣ Configurable Query Result, Filter, and Document cache instances
    ‣ Access & Monitoring
      ‣ Standards Based Open Interfaces: XML, JSON and HTTP
      ‣ REST-like API
      ‣ Comprehensive HTML Administration Interfaces
      ‣ Server statistics exposed over JMX for monitoring
  • 20. Projects: Solr highlights
    ‣ Data model
      ‣ Lucene's document-oriented index data structure
      ‣ Schema for field types and fields of documents
    ‣ Analysis & Indexing highlights
      ‣ Out-of-the-box support for JSON, XML, CSV/delimited text, DBMS
      ‣ Support for PDF, DOC, XLS, PPT, HTML
      ‣ Declarative Lucene Analyzer specification
      ‣ Many additional text analysis components including word splitting, regex and sounds-like filters
      ‣ External file-based configuration of stopword lists, synonym lists, and protected word lists
  • 21. Projects: Open source search technologies, Solr highlights
    ‣ Search highlights
      ‣ Facet search and filtering (values, queries, date/time ranges)
      ‣ Geospatial search (e.g. local search)
      ‣ Configurable caching
      ‣ Sorting (by any number of fields, by complex functions of numeric fields)
      ‣ Autocomplete
      ‣ Highlighted context snippets
      ‣ Spelling suggestions for user queries
      ‣ More Like This suggestions for a given document
      ‣ Function Query
      ‣ Advanced query parser for high relevancy results from user-entered queries
  • 22. Projects: Solr clients & tools
    ‣ Solr clients in various languages are freely available:
      ‣ Java, Scala, Ruby, Python, .NET, Javascript (AJAX), …
      ‣ http://wiki.apache.org/solr/IntegratingSolr
    ‣ Very helpful tools:
      ‣ Grep (log file analysis)
      ‣ Luke (index analysis)
      ‣ Solrmeter (performance analysis)
      ‣ Scalable Performance Monitoring for Solr (monitoring)
  • 23. Projects: Solr documentation
    ‣ Getting started: http://lucene.apache.org/solr/4_0_0/tutorial.html
    ‣ Release documentation: http://lucene.apache.org/solr/4_0_0/
    ‣ Javadocs: http://lucene.apache.org/solr/4_0_0/solr-core/index.html
    ‣ Solr Wiki: http://wiki.apache.org/solr/
    ‣ Mailing lists: http://lucene.apache.org/solr/discussion.html
    ‣ Apache Solr 3 Enterprise Search Server: http://link.packtpub.com/2LjDxE
    ‣ Apache Solr 3.1 Cookbook: http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book
    ‣ LucidWorks Technical Support: http://support.lucidworks.com/home
  • 24. Projects: Solr pros & cons
    + Solr is a mature technology widely used in commercial applications
      ‣ Easy integration into third-party applications
      ‣ Big community, good documentation, good support
      ‣ If you have a Solr problem, most likely someone else has had it already
      ‣ Very helpful tools for analysis and monitoring
    + Solr provides a large bundle of features:
      ‣ Lots of analyzers and specific query types
      ‣ Individual relevance boosting
      ‣ Admin interface
    - Because Solr can do so much, it is a heavyweight technology:
      ‣ much to configure
      ‣ most of the configuration is static / no API access
      ‣ includes redundant functionality (e.g. similar request handlers)
  • 26. Solr Exercise I
    ‣ Installation
    ‣ Administration
    ‣ Solr Web Admin Interface
  • 27. Solr Exercise I: Overview
    ‣ Solr is a pure Java application
    ‣ Solr is built upon:
      ‣ Lucene
      ‣ Zookeeper
      ‣ Guava libraries
      ‣ HttpComponents, SLF4J, various Commons libraries
    ‣ Solr source code is available at:
      ‣ http://svn.apache.org/viewcvs.cgi/lucene/dev/ (Web access)
      ‣ http://svn.apache.org/repos/asf/lucene/dev/ (anonymous access)
    ‣ Solr needs a servlet container such as Jetty, Tomcat, or Glassfish to run
    ‣ An embedded Jetty is included for easily trying out and testing Solr
  • 28. Solr Exercise I: Installation
    Run Solr on embedded Jetty:
    1. Unpack the Solr distribution to your desired location (= SOLR_MAIN)
    2. Change to the directory SOLR_MAIN/example
    3. Start the example Solr instance: java -jar start.jar
    To verify the installation, open your browser and go to the Solr Admin page: http://localhost:8983/solr
  • 29. Solr Exercise I: Core vs. Collection
    ‣ Solr Core (aka Core)
      ‣ basically an isolated running instance of a Solr index
      ‣ each Core has its own solrconfig.xml, schema.xml and index data
      ‣ search results cannot be computed across Cores
    ‣ Solr Collection (aka Collection)
      ‣ Logical index distributed over multiple machines
      ‣ Physical partitioning using sharding
      ‣ Part of SolrCloud (Scalability, High Availability)
  • 30. Solr Exercise I: Solr Home
    Recommended layout of the Solr Home directory:
    ‣ solr.xml
      ‣ primary configuration file Solr looks for when starting
      ‣ this file specifies the list of SolrCores it should load
    ‣ Solr Core instance directories
      ‣ contain the configuration and data of a SolrCore
    ‣ lib/
      ‣ shared lib directory for the Solr instance
    ‣ zoo.cfg
      ‣ Zookeeper configuration when using SolrCloud
    ‣ How to tell Solr where SOLR_HOME is located?
      ‣ Use the Java system property: solr.solr.home
      ‣ e.g. java -Dsolr.solr.home=/some/dir -jar start.jar
  • 31. Solr Exercise I: Instance Directory
    Recommended layout of a Solr Core instance directory:
    ‣ conf/
      ‣ This directory is mandatory and must contain your solrconfig.xml and schema.xml.
      ‣ Any other optional configuration files are also kept here.
    ‣ data/
      ‣ This directory is the default location where Solr keeps your index, and is used by the replication scripts for dealing with snapshots.
      ‣ You can override this location in conf/solrconfig.xml.
    ‣ lib/
      ‣ This directory is optional. If it exists, Solr will load any JARs found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e. Analyzers, Request Handlers, etc.).
  • 32. Solr Exercise I: Admin Web interface
    Solr includes an Admin Web interface providing you with:
    ‣ General configuration details
    ‣ Core-specific configuration details
    ‣ Log information
    ‣ Run queries
    ‣ Document field / term statistics
    ‣ Document fields
    ‣ Cache statistics
    ‣ Server cluster information
    Access it via http://localhost:8983/solr
  • 33. Solr Exercise II
    ‣ Indexing the first XML data
    ‣ Trying first simple queries
    ‣ Different query types
    ‣ Getting the result score
    ‣ Highlighting
  • 34. Solr Exercise II: Search Basics, Index-based search
    [Diagram: documents are turned into a token representation by indexing; queries are turned into a token representation by indexing (query analysis); query evaluation matches the two representations.]
  • 35. Solr Exercise II: Search Basics, Inverted index
    ‣ An inverted index is an index data structure that
      ‣ stores mappings from tokens to their locations (e.g. documents)
      ‣ allows fast access to those documents that contain specific tokens
    ‣ The purpose of an inverted index is to allow fast full-text searches
  • 36. Solr Exercise II: Search Basics, Data model
    [Diagram: an index contains documents; each document contains fields; each field has a name and a value.]
  • 37. Solr Exercise II: Search Basics, Data model
    Doc 1: Penn State Football … football
    Doc 2: Football players … State
    Posting table:
      id  word      doc    offset
      1   football  Doc 1  3
                    Doc 1  67
                    Doc 2  1
      2   penn      Doc 1  1
      3   players   Doc 2  2
      4   state     Doc 1  2
                    Doc 2  13
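The posting table can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercasing, which is far simpler than a real Lucene analyzer, and it only uses the visible words of the two example documents (the elided parts that produce offsets 67 and 13 are not reproduced). The function name build_inverted_index is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to a list of (doc_id, position) postings.

    Positions are 1-based word offsets, as in the posting table.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


# Only the visible words of the example documents are used here.
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players",
}
index = build_inverted_index(docs)

for word in sorted(index):
    print(word, index[word])
```

Looking up a word is then a single dictionary access, which is what makes full-text search on an inverted index fast.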
  • 38. Solr Exercise II: Search Basics, Term selection
    ‣ How to select important terms?
    ‣ Simple method: using middle-frequency words
    [Figure: Frequency/Informativity. Frequency and informativity (Min. to Max.) plotted against term rank (1, 2, 3, …).]
  • 39. Solr Exercise II: Search Basics, Term selection
    ‣ tf = term frequency
      ‣ frequency of a term/keyword in a document
      ‣ The higher the tf, the higher the importance (weight) for the doc.
    ‣ df = document frequency
      ‣ no. of documents containing the term
      ‣ distribution of the term
    ‣ idf = inverse document frequency
      ‣ the unevenness of term distribution in the corpus
      ‣ the specificity of a term to a document
      ‣ The more evenly the term is distributed, the less specific it is to a document
    weight(t,D) = tf(t,D) * idf(t)
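A small worked sketch of weight(t,D) = tf(t,D) * idf(t). The slide does not define idf, so this assumes the common textbook form idf(t) = log(N / df(t)) with N the number of documents; it is not necessarily what Lucene or Solr compute. The function name tf_idf_weights and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return {doc_id: {term: tf * idf}} using idf = log(N / df) (an assumption)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # df: number of documents that contain the term at least once.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term counts within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {
    "d1": "solr search server",
    "d2": "lucene search library",
    "d3": "search server",
}
weights = tf_idf_weights(docs)
print(weights["d1"])
```

In this toy corpus "search" occurs in every document, so its weight is 0 everywhere, while "solr" occurs in only one document and gets the highest weight there.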
  • 40. Solr Exercise II: Search Basics, Querying
    ‣ 1-word query: the documents to be retrieved are those that include the word
      ‣ Retrieve the inverted list for the word
      ‣ Sort in decreasing order of the weight of the word
    ‣ Multi-word query?
      - Combining several lists
      - How to combine matches of these different lists?
      - How to interpret the weight? (IR model)
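One possible answer to the multi-word question, sketched under stated assumptions: treat the query as a boolean AND, intersect the posting lists, and rank by the sum of the per-term weights. This is only one of several ways to combine lists and is not claimed to be how Lucene or Solr score; the function name search_and and the weights in the sample index are invented for illustration.

```python
def search_and(index, query):
    """Return doc ids containing all query terms, best summed weight first.

    index maps term -> {doc_id: weight}.
    """
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]

    # Intersect the posting lists: keep only docs that contain every term.
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)

    # Combine the matches by summing the per-term weights of each doc.
    scored = [(sum(posting[doc] for posting in postings), doc) for doc in matching]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored]


# Hypothetical weights, chosen only to make the ranking visible.
index = {
    "penn": {"Doc 1": 0.7},
    "state": {"Doc 1": 0.3, "Doc 2": 0.3},
    "football": {"Doc 1": 0.2, "Doc 2": 0.4},
}
print(search_and(index, "state football"))
print(search_and(index, "penn football"))
```

A 1-word query falls out as the special case with a single posting list, sorted by decreasing weight.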
  • 41. Search Basics: Vector-space model
    ‣ Vector space = all the terms encountered <t1, t2, t3, …, tn>
    ‣ Document D = <a1, a2, a3, …, an>, where ai = weight of ti in D
    ‣ Query Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q
    ‣ R(D,Q) = Sim(D,Q)
      ‣ Cosine Similarity (TF*IDF)
      ‣ Okapi BM25
    [Diagram: vectors D and Q in the plane spanned by t1 and t2.]
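The cosine similarity named above can be sketched for sparse term-weight vectors. The dictionaries stand in for the vectors D and Q; the function name and the sample weights are assumptions for illustration.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse term-weight vectors given as dicts."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in d.items())
    norm_d = math.sqrt(sum(weight * weight for weight in d.values()))
    norm_q = math.sqrt(sum(weight * weight for weight in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)

document = {"t1": 1.0, "t2": 1.0}
query = {"t1": 1.0}
print(cosine_similarity(document, query))
```

Because the result depends only on the angle between the vectors, a long document is not favoured merely for containing more terms.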
  • 42. Solr Indexing: Update Request handlers (Solr Exercise II)
    ‣ The Solr UpdateRequestHandler defines the logic for handling index update actions for a specific data source or data format
    ‣ UpdateRequestHandlers must be defined in solrconfig.xml and are mapped to a specific URL path in order to be accessible via HTTP
    ‣ Solr supports several input formats out of the box through specific update handlers:
      ‣ Standard UpdateRequestHandler
        ‣ supports XML, XSLT, JSON, CSV and javabin
      ‣ DataImportHandler
    ‣ Indexing events: Add/Replace, Commit, Soft Commit, Delete
      <requestHandler name="update" class="solr.UpdateRequestHandler"/>
  • 43. Solr Indexing: XML Add (Solr Exercise II)
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
      <add>
        <doc>
          <field name="id">etext78942</field>
          <field name="title">Solr textbook</field>
          <field name="subject">search technology</field>
          <field name="author">Bernhard Pflugfelder</field>
        </doc>
        <!-- optionally further <doc> ... </doc> elements -->
      </add>'
  • 44. Solr Indexing: XML Update (Solr Exercise II)
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
      <add>
        <doc>
          <field name="id">etext78942</field>
          <field name="author" update="set">Christian Meder</field>
          <field name="subject" update="add">open source</field>
        </doc>
      </add>'
  • 45. Solr Indexing: XML Delete (Solr Exercise II)
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '
      <delete>
        <id>etext78942</id>
        <query>author:meder</query>
      </delete>'
  • 46. Solr Indexing: XML Commit (Solr Exercise II)
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type:text/xml' --data-binary '<commit waitSearcher="false"/>'
      curl 'http://localhost:8983/solr/jax2013/update?optimize=true&waitFlush=false'
  • 47. Solr Indexing: JSON Add / Delete / Commit (Solr Exercise II)
    ‣ Multiple index actions in one JSON document
      curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
      {
        "add": { "commitWithin": 5000, "doc": { "f1": "v1", "f1": "v2" } },
        "commit": {},
        "delete": { "id": "ID" },
        "delete": { "query": "QUERY" },
        "delete": { "query": "QUERY", "commitWithin": 500 }
      }'
  • 48. Solr Indexing: JSON Atomic updates (Solr Exercise II)
    ‣ Commands add, set and inc
      curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
      [
        {
          "id" : "etext78942",
          "title" : {"set": "solr 4.2.1 textbook"},
          "viewcount" : {"inc": 3},
          "author" : {"add": "Bernhard Pflugfelder"}
        }
      ]'
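What the three modifiers do to a stored document can be illustrated with a small in-memory model. This is only a sketch of the behaviour named on the slide, not Solr's implementation; the function name apply_atomic_update and the stored sample document are assumptions.

```python
def apply_atomic_update(doc, update):
    """Apply set/add/inc modifiers to a copy of doc; plain values overwrite the field."""
    result = dict(doc)
    for field, change in update.items():
        if not isinstance(change, dict):
            result[field] = change  # e.g. the "id" that selects the document
            continue
        for op, value in change.items():
            if op == "set":
                result[field] = value
            elif op == "inc":
                result[field] = result.get(field, 0) + value
            elif op == "add":
                current = result.get(field, [])
                if not isinstance(current, list):
                    current = [current]  # promote a single value to a list
                result[field] = current + [value]
            else:
                raise ValueError(f"unsupported modifier: {op}")
    return result

stored = {"id": "etext78942", "title": "Solr textbook", "viewcount": 10,
          "author": "Christian Meder"}
update = {"id": "etext78942",
          "title": {"set": "solr 4.2.1 textbook"},
          "viewcount": {"inc": 3},
          "author": {"add": "Bernhard Pflugfelder"}}
print(apply_atomic_update(stored, update))
```

The point of the sketch is that only the listed fields change: set replaces a value, inc adds to a number, and add appends to a multi-valued field.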
  • 49. Solr Indexing: Try out (Solr Exercise II)
      cd SOLR_MAIN/example/exampledocs
      curl 'http://localhost:8983/solr/collection1/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

      cd SOLR_MAIN/example/exampledocs
      java -jar post.jar -h
      java -jar post.jar *.xml
  • 50. Solr Queries: Query Syntax (Solr Exercise II)
    ‣ q=+content:goethe +content:schiller
    ‣ q=+content:goethe -content:schiller
    ‣ q=title:faust
    ‣ q=title:faust AND -content:goethe
    ‣ q=content:"romeo and juliet"
    ‣ q=title:water*
    ‣ q=title:water~0.5
    ‣ q=created:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
    ‣ q=viewcount:[20 TO 50]
    ‣ q=viewcount:[100 TO *]
      curl -XPOST 'http://localhost:8983/solr/jax2013/select' -d
  • 51. Solr Queries: Common parameters (Solr Exercise II)
    Param name    Param value        Description
    q             string             The user query string
    start         number             Offset in the list of returned documents
    rows          number             Number of documents returned
    fq            string             A filter query
    fl            string,string,…    Fields returned for each document
    debugQuery    true / false       Include debug info in the response

      curl -XPOST 'http://localhost:8983/solr/collection1/select' -d 'q=+solr -elasticsearch&start=20&rows=40&fl=* score'
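Characters such as "+", "*" and spaces have a special meaning in URLs, so it is usually safer to let a library encode the parameters from the table than to concatenate them by hand. The following sketch uses Python's standard urllib.parse; the parameter values are taken from the example above, and no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs

# Parameters from the table above; values are plain, unencoded strings.
params = {
    "q": "+solr -elasticsearch",
    "start": 20,
    "rows": 40,
    "fl": "* score",
}

# urlencode escapes "+" as %2B and encodes spaces as "+".
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
print(url)

# Decoding gives back the original values, which is what the server sees.
print(parse_qs(query_string)["q"])
```

Without the encoding step, a literal "+" in the query would typically be decoded as a space, and the "required" operator would silently disappear.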
  • 52. Highlighting: Overview (Solr Exercise II)
    Param name            Param value        Description
    hl                    true / false       Switch highlighting on / off
    hl.q                  string             Alternative highlighting query
    hl.fl                 string,string,…    Fields used for highlighting
    hl.snippets           number             Maximum number of snippets
    hl.fragsize           number             Number of characters per snippet
    hl.simple.pre[post]   string             Text that appears before / after a match

      curl -XPOST 'http://localhost:8983/solr/collection1/select' -d 'q=+solr -elasticsearch&start=20&rows=40&fl=* score&hl=true&hl.fl=title,abstract'
  • 53. Solr Exercise III
    ‣ DataImportHandler SQL
    ‣ DataImportHandler XML
  • 54. DataImportHandler: Overview (Solr Exercise III)
    ‣ The DataImportHandler makes it possible to:
      ‣ index data from relational databases
      ‣ compose documents from multiple columns and tables
      ‣ run bulk imports or incremental updates using the delta query mechanism
      ‣ schedule full imports and delta imports
      ‣ index data from XML/HTML using XPath expressions
    ‣ The DataImportHandler is part of Solr Contrib
    ‣ Define it in solrconfig.xml:
      <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">/home/username/data-config.xml</str>
        </lst>
      </requestHandler>
  • 55. DataImportHandler: Commands (Solr Exercise III)
    ‣ http://localhost:8983/solr/dataimport?command=full-import
    ‣ http://localhost:8983/solr/dataimport?command=delta-import
    ‣ http://localhost:8983/solr/dataimport?command=status
    ‣ http://localhost:8983/solr/dataimport?command=reload-config
    ‣ http://localhost:8983/solr/dataimport?command=abort
  • 56. DataImportHandler: Configuration (Solr Exercise III)
    ‣ The data-config.xml defines the data source and which data is used to populate Solr documents during import
    ‣ It defines the tags:
      ‣ dataSource
      ‣ document
      ‣ entity
    ‣ The entity defines a specific data selection resulting in a Solr document
    ‣ The query returns the data needed to populate the fields of the Solr document
      <dataConfig>
        <dataSource … />
        <document name="products">
          <entity name="item" query="select * from item" />
        </document>
      </dataConfig>
  • 57. DataImportHandler: DataSource (Solr Exercise III)
    ‣ MySQL
      <dataSource name="jdbc" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname" user="db_username" password="db_password"/>
    ‣ Oracle
      <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//hostname:port/SID" user="db_username" password="db_password"/>
    ‣ Multiple data sources can be used within one DIH config; they are distinguished by the name property
    ‣ Each entity definition must then state the name of the data source it uses
  • 58. DataImportHandler: SQL full-import (Solr Exercise III)
      <dataConfig>
        <dataSource … />
        <document name="products">
          <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
            <field column="MANU" name="manu" />
            <field column="WEIGHT" name="weight" />
            <field column="PRICE" name="price" />
            <field column="POPULARITY" name="popularity" />
            <field column="INSTOCK" name="inStock" />
            <field column="INCLUDES" name="includes" />
          </entity>
        </document>
      </dataConfig>
  • 59. DataImportHandler: SQL full-import with nested entities (Solr Exercise III)
      <dataConfig>
        <dataSource … />
        <document>
          <entity name="item" query="select * from item">
            <entity name="feature"
                    query="select description as features from feature where item_id='${item.ID}'"/>
            <entity name="item_category"
                    query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
              <entity name="category"
                      query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"/>
            </entity>
          </entity>
        </document>
      </dataConfig>
  • 60. DataImportHandler: SQL Delta-Import (Solr Exercise III)
    ‣ Incremental update of specific content of a relational database
    ‣ Avoids indexing already indexed data again
    ‣ http://localhost:8983/solr/dataimport?command=delta-import
    ‣ Provide three specific queries for each entity except root:
      ‣ The deltaImportQuery returns the data needed to populate fields when running a delta-import
      ‣ The deltaQuery returns the primary keys of the rows of the current entity that have changed since the last index time
      ‣ The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to determine the changed rows in the parent table. This is necessary because whenever a row in the child table changes, the document that contains that field must be regenerated.
  • 61. DataImportHandler: SQL Delta-Import (Solr Exercise III)
      <entity name="item" pk="ID"
              query="select * from item"
              deltaImportQuery="select * from item where ID='${dih.delta.id}'"
              deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
        <entity name="feature" pk="ITEM_ID"
                query="select description as features from feature where item_id='${item.ID}'" />
        <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
          <entity name="category" pk="ID"
                  query="select description as cat from category where id = '${item_category.CATEGORY_ID}'" />
        </entity>
      </entity>
  • 62. DataImportHandler: Other DataSources (Solr Exercise III)
    ‣ HTTP source
      <dataConfig>
        <dataSource type="HttpDataSource" />
        …
      </dataConfig>
    ‣ XML file source
      <dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8"/>
        …
      </dataConfig>
  • 63. DataImportHandler: XML full-import (Solr Exercise III)
    ‣ The entity defines the location of the XML file
    ‣ Solr document fields are populated by evaluating XPath expressions
      <entity name="page" processor="XPathEntityProcessor" stream="true"
              forEach="/RDF/etext/" url="../../catalog.rdf.xml"
              transformer="RegexTransformer,DateFormatTransformer">
        <field column="id" xpath="/RDF/etext/@id" />
        <field column="title" xpath="/RDF/etext/title" />
        <field column="alternative" xpath="/RDF/etext/alternative" />
        <field column="author" xpath="/RDF/etext/creator" />
        <field column="multi_author" xpath="/RDF/etext/creator/Bag/li" />
        <field column="subject" xpath="//LCSH/value" />
        <field column="viewcount" xpath="/RDF/etext/downloads/nonNegativeInteger/value" />
        <field column="created" xpath="/RDF/etext/created/W3CDTF/value" dateTimeFormat="yyyy-MM-dd" />
      </entity>
  • 64. Solr Exercise IV
    ‣ Schema
    ‣ Data types
    ‣ Analyzers, Tokenizers
  • 65. Schema: Overview (Solr Exercise IV)
    ‣ Defines the document representation by specifying fields
      ‣ with a specific field type
      ‣ with specific field type properties
    ‣ Dynamic fields
    ‣ CopyField
    ‣ Define analyzers:
      ‣ Tokenizers
      ‣ Filters
      ‣ Synonym lists, stop word lists
      ‣ additional text analysis
    ‣ Assign analyzers to the text-based data types (solr.TextField)
    ‣ Example: schema.xml
  • 66. Schema: Fields (Solr Exercise IV)
    ‣ Field types
      ‣ int, long, float, double, boolean
      ‣ string, date, binary
      ‣ derived from solr.TextField
        ‣ text_general, text_de, text_en, …
    ‣ Field type properties
      ‣ indexed (true / false)
      ‣ stored (true / false)
      ‣ multiValued (true / false)
      ‣ termVectors (true / false)
  • 67. Analyzing / Tokenization: Overview (Solr Exercise IV)
    Break a stream of characters into tokens / terms
    ‣ Normalization (e.g. case)
    ‣ Stopwords
    ‣ Stemming
    ‣ Lemmatizer / Decomposer
    ‣ Part of Speech Tagger
    ‣ Information Extraction
  • 68. Analyzing / Tokenization: Stopwords (Solr Exercise IV)
    ‣ Function words do not carry useful information for searching: of, in, about, with, I, although, …
    ‣ A stopword list contains the stopwords, which are not to be used as index terms
      ‣ Prepositions
      ‣ Articles
      ‣ Pronouns
      ‣ Some adverbs and adjectives
      ‣ Some frequent words (e.g. document)
    ‣ The removal of stopwords usually improves search quality
    ‣ Solr provides default stopword lists for various languages
  • 69. Analyzing / Tokenization: Stemming (Solr Exercise IV)
    ‣ Apply strict algorithmic normalization of inflected forms (e.g. Porter)
    ‣ Strategy: remove certain word endings. Example: computer, compute, computes, computing, computed, computation are all normalized to comput
    ‣ But: going -> go, king -> k (?)
    ‣ Stemming might work well for English
    ‣ However, be careful when using stemming, especially for German
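The over-stemming risk on this slide is easy to reproduce with a deliberately naive suffix stripper. The sketch below is not the Porter algorithm (which adds conditions on the remaining stem); the suffix list and the function name naive_stem are made up for illustration.

```python
# Hypothetical suffix list, checked in this order; not taken from any real stemmer.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")

def naive_stem(word):
    """Strip the first matching suffix, with no check on what remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for word in ["computer", "compute", "computes", "computing",
             "computed", "computation", "going", "king"]:
    print(word, "->", naive_stem(word))
```

All "comput…" variants collapse to the same stem as intended, but "king" is reduced to "k" because nothing guards against stripping a suffix from a word that is too short.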
  • 70. Analyzing / Tokenization: Define an analyzer (Solr Exercise IV)
      <fieldType name="<name>" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <!-- tokenizer and filters for indexing -->
          <tokenizer class="CLASS" PARAMS />
          <filter class="CLASS" PARAMS />
        </analyzer>
        <analyzer type="query">
          <!-- tokenizer and filters for search -->
          <tokenizer class="CLASS" PARAMS />
          <filter class="CLASS" PARAMS />
        </analyzer>
      </fieldType>
  • 71. Analyzing / Tokenization: Tokenizers & Filters (Solr Exercise IV)
    ‣ TokenizerFactories
      ‣ solr.StandardTokenizerFactory
      ‣ solr.WhitespaceTokenizerFactory
      ‣ solr.KeywordTokenizerFactory
    ‣ TokenFilterFactories
      ‣ solr.LowerCaseFilterFactory
      ‣ solr.TrimFilterFactory
      ‣ solr.StopFilterFactory
      ‣ solr.WordDelimiterFilterFactory
      ‣ solr.SynonymFilterFactory
      ‣ solr.EdgeNGramFilterFactory
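An analyzer is a tokenizer followed by a chain of filters. The following sketch mimics that idea in plain Python, roughly in the spirit of a whitespace tokenizer, a lowercase filter and a stop filter; it does not call Solr, and the stopword set and function names are assumptions.

```python
# Hypothetical stopword list for the example.
STOPWORDS = {"of", "in", "about", "with", "the", "a"}

def whitespace_tokenizer(text):
    """Split the character stream on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Normalize case so that 'Search' and 'search' match."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in stopwords]

def analyze(text):
    """Tokenizer first, then each filter in order."""
    return stop_filter(lowercase_filter(whitespace_tokenizer(text)))

print(analyze("The History of Search"))
```

The order of the filters matters: if the stop filter ran before lowercasing, "The" would not match the lowercase stopword "the" and would stay in the index.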
  • 72. Analyzing / Tokenization: Language analysis (Solr Exercise IV)
    ‣ English
      ‣ solr.PorterStemFilterFactory
      ‣ solr.SnowballPorterFilterFactory
      ‣ solr.EnglishMinimalStemFilterFactory
    ‣ German
      ‣ solr.SnowballPorterFilterFactory
      ‣ solr.GermanLightStemFilterFactory
      ‣ solr.GermanMinimalStemFilterFactory
    ‣ More information at http://wiki.apache.org/solr/LanguageAnalysis
  • 73. Solr Exercise V
    ‣ Faceted search
    ‣ Filter query
    ‣ MoreLikeThis query
  • 75. Faceted search: Motivation (Solr Exercise V)
    ‣ "The statement of one participant in a usability test of a faceted search solution, conducted as part of this study, thus points the way:
    ‣ 'With this filter here I get the feeling that even a mundane search can be real fun.'"
    ‣ Source: Faceted Search: Die neue Suche im Usability-Test (free download at http://usability.de)
  • 76. Faceted search: Overview (Solr Exercise V)
    ‣ Faceted search (aka faceted navigation) organizes search results by different categories or dimensions, giving the user the possibility to drill down into the search results
    ‣ Facets can be authors, titles, tags, dates, languages, file types …
    ‣ Typically, metadata describing the concepts and meaning of documents is useful as facets
    ‣ Facets can be shown with counts
  • 77. Faceted search: Solr Faceting (Solr Exercise V)
    ‣ Solr provides a faceting mechanism out of the box, including the returning of counts
    ‣ Important: facet fields must be defined with indexed=true
    ‣ Facet fields are often analyzed differently than search fields. It is therefore common to define separate document fields for faceting in schema.xml
    ‣ Facet fields should not be tokenized, lower-cased or stemmed
    ‣ Facet fields can be of type
      ‣ int, long, float, double, boolean
      ‣ solr.TextField
      ‣ date
    ‣ For performance reasons, also define
      ‣ stored=false
      ‣ omitNorms=true
      <field name="facet_author" indexed="true" stored="false" omitNorms="true" />
  • 78. Faceted search: Solr Faceting (Solr Exercise V)
    ‣ Solr provides two basic mechanisms to build facets
      ‣ Arbitrary faceting (facet.query=query)
      ‣ Field value faceting (facet.field=fieldname)
    ‣ For field value faceting, one of two faceting methods can be chosen
      ‣ Enum Based Field Queries (facet.method=enum)
      ‣ Field Cache (facet.method=fc)
    ‣ Other common parameters

    Param name       Param value      Description
    facet            true / false     Switch faceting on / off
    facet.prefix     string           Facet results must start with this prefix
    facet.sort       count / index    Sort facet results
    facet.limit      number           Limit the number of facet results
    facet.mincount   number           Minimum count for a facet value to be included
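What field value faceting computes can be sketched in memory: count how often each value of a field occurs in the result set, then apply prefix, minimum count and limit. This is only an illustration of the parameters in the table, not Solr's algorithm; the function name facet_field and the sample documents are assumptions.

```python
from collections import Counter

def facet_field(docs, field, prefix="", mincount=0, limit=None):
    """Count field values over docs, sorted by descending count, then by value."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat single-valued fields like one-element lists
        counts.update(v for v in values if v.startswith(prefix))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ranked = [(value, count) for value, count in ranked if count >= mincount]
    return ranked[:limit]

results = [
    {"author": "Schiller"},
    {"author": "Schiller"},
    {"author": "Shakespeare"},
    {"author": "Goethe"},
]
print(facet_field(results, "author"))
print(facet_field(results, "author", prefix="S", limit=1))
```

The prefix variant is also the basis of the facet-based autosuggestion technique shown later in the deck.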
  • 79. Faceted search: Date faceting (Solr Exercise V)
    Param name         Param value        Description
    facet.date         fieldname          The name of the date field used for date faceting
    facet.date.start   date expression    The start date of the first date facet interval
    facet.date.end     date expression    The upper bound of the last date facet interval
    facet.date.gap     date expression    The size of each date range interval

      q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.date=created&facet.date.start=1996-01-31T23:00:00Z&facet.date.end=2013-04-21T00:00:00Z&facet.date.gap=%2B1YEAR
  • 80. Faceted search: Range faceting (Solr Exercise V)
    Param name          Param value   Description
    facet.range         fieldname     The name of a field with a numeric field type
    facet.range.start   number        The lower bound of the first range interval
    facet.range.end     number        The upper bound of the last range interval
    facet.range.gap     number        The size of each range interval

      q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.range=viewcount&facet.range.start=0&facet.range.end=150&facet.range.gap=20
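The bucket arithmetic behind start, end and gap can be sketched as follows. In the example request the gap of 20 does not divide 150 evenly; the sketch assumes the behaviour documented for Solr's default (facet.range.hardend=false), where the last bucket keeps the full gap width and therefore extends past the end value. The function name facet_range and the sample values are illustrative, and only integers are handled.

```python
def facet_range(values, start, end, gap):
    """Count integer values per [lower, lower + gap) bucket; the last bucket keeps the full gap."""
    lowers = list(range(start, end, gap))
    # Assumed default: the last bucket is not cut off at `end`.
    effective_end = lowers[-1] + gap if lowers else start
    buckets = {lower: 0 for lower in lowers}
    for value in values:
        if start <= value < effective_end:
            lower = start + ((value - start) // gap) * gap
            buckets[lower] += 1
    return buckets

viewcounts = [5, 19, 20, 155, 200]
print(facet_range(viewcounts, start=0, end=150, gap=20))
```

With these inputs there are eight buckets (0, 20, …, 140); the value 155 lands in the bucket starting at 140, while 200 falls outside every bucket.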
  • 81. Filter query: Overview (Solr Exercise V)
    ‣ Filter queries restrict the result set of the original query to a specific subset
    ‣ The scores of the documents are not influenced by filter queries
    ‣ Examples
      ‣ access permissions (ACLs)
      ‣ categories or tags
    ‣ Importantly, the results of filter queries are cached by default
      ‣ Solr uses a separate in-memory filter cache
      ‣ Thus, filter queries are evaluated very fast if they are cached
    ‣ Complex, frequently used queries are good candidates for filter queries
    ‣ Keep in mind that the appropriate size of the filter cache depends on the search scenario and must therefore be tuned explicitly
  • 82. Filter query: Examples (Solr Exercise V)
    ‣ Filter queries are defined by the query parameter fq
      q=content:arthur&fq=subject:fantasy&fl=title,author&rows=5
      q=content:arthur&fq=subject:fantasy&fq=viewcount:[* TO 100]&fl=title,author&rows=5
    ‣ Caching can be switched off for an individual filter query
      q=content:arthur&fq=subject:fantasy&fq={!cache=false}viewcount:[* TO 100]&fl=title,author&rows=5
  • 83. MoreLikeThis: Overview (Solr Exercise V)
    ‣ Idea of MoreLikeThis
      ‣ MoreLikeThis constructs a query based on the terms of a given set of fields
      ‣ Matching documents are "similar" based on the chosen set of fields
    ‣ Fields used by MoreLikeThis should define termVectors="true"

    Param name   Param value   Description
    mlt.fl       fieldnames    Fields to be used by MLT
    mlt.mintf    number        Minimum term frequency
    mlt.mindf    number        Minimum document frequency
    mlt.minwl    number        Minimum word length
    mlt.maxwl    number        Maximum word length
    mlt.maxqt    number        Maximum number of query terms

      q=content:schiller&mlt=true&mlt.fl=subject&mlt.mindf=50&mlt.mintf=1
  • 84. Solr Exercise VI
    ‣ Advanced queries:
      ‣ Dismax query parser
      ‣ Sorting
      ‣ Grouping
    ‣ Autosuggestion
  • 85. DisMax Parser: Overview (Solr Exercise VI)
    ‣ Motivation
      ‣ The standard Solr parser only supports simple query control
      ‣ One field can be defined as the default search field
      ‣ Supports only boolean combination of sub-queries (AND / OR)
      ‣ Strict query syntax to perform e.g. phrase queries
    ‣ The Dismax (and eDismax) query parsers are more robust and offer various additional query parameters and controls to optimize queries
    ‣ These additional query parameters and controls are hidden from the user
    ‣ Dismax stands for Disjunction Max
      ‣ Disjunction means that multiple fields can be searched simultaneously with different field weights
      ‣ Max means that the maximum score of the field matches is taken as the document score (instead of the sum)
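The "max instead of sum" idea can be made concrete with a few numbers. The sketch below assumes per-field match scores are already known and applies the field weights from qf; the optional tie factor corresponds to the dismax tie parameter, which blends the remaining field scores back in. The function name dismax_score and the raw scores are invented for illustration.

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Best boosted field score, plus tie times the sum of the other boosted scores."""
    weighted = [score * boosts.get(field, 1.0)
                for field, score in field_scores.items()]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)

# Hypothetical raw scores of one document for the query term in two fields.
raw = {"author": 0.5, "content": 2.0}
# Field weights as in qf=author^20.0 content^0.3
qf = {"author": 20.0, "content": 0.3}

print(dismax_score(raw, qf))           # only the best field counts
print(dismax_score(raw, qf, tie=1.0))  # behaves like a plain sum
```

With tie=0 a document is scored by its single best field, so a strong author match is not outvoted by many weak content matches; raising tie moves the behaviour towards summing all fields.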
  • 86. DisMax Parser: Parameters (Solr Exercise VI)
    Param name   Description
    q.alt        Alternative query executed if the user query is not specified or blank
    qf           The query fields to be searched. Each field can be given an individual field weight.
    mm           Minimum number of query words that must match for a document to count as a match
    pf           Defines phrase fields. Boosts documents that have the search terms in close proximity within the phrase fields.
    ps           The phrase slop affecting the boosting of phrase queries evaluated on the pf fields
    qs           The phrase slop for user-defined phrase queries
    bq           A raw query that is added to the user query to influence scoring
    bf           Function queries that are added to the user query to influence scoring
  • 87. DisMax Parser: Examples (Solr Exercise VI)
      http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3
      http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&bq=subject:drama^5.0
  • 88. Sorting: Overview (Solr Exercise VI)
    ‣ Ranking (= ordering) the document results based on some criteria
    ‣ The default ranking is based on the document score
    ‣ The sort parameter allows ranking the document results by an arbitrary field or even a function
    ‣ Sort fields must be defined as indexed=true and multiValued=false
    ‣ Syntax: …&sort=fieldname [asc/desc],fieldname [asc/desc],…
      http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&sort=viewcount+desc
  • 90. Grouping: Parameters (Solr Exercise VI)
    ‣ Motivation
      ‣ Documents with a common value for some field are partitioned into groups
      ‣ Documents with the same field value are collapsed into a single result

    Query parameter   Query value            Description
    group             true / false           Switch grouping on / off
    group.field       fieldname              Field to group on
    rows              number                 Number of groups returned
    start             number                 Offset into the list of returned groups
    group.limit       number                 Number of docs returned for each group
    group.offset      number                 Offset into the list of returned documents per group
    sort              fieldname [asc/desc]   Sort groups on some field
    group.sort        fieldname [asc/desc]   Sort the documents of every group on some field
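How rows, start and group.limit interact is easiest to see on a small result list. The sketch below groups already-ranked documents by a field value and then applies paging on the group level; it is an in-memory illustration under those assumptions, not Solr's grouping implementation, and the function name group_results is hypothetical.

```python
def group_results(docs, field, rows=10, start=0, group_limit=1):
    """Group ranked docs by field value; page over groups, cap docs per group."""
    groups = {}
    for doc in docs:  # docs are assumed to be sorted by relevance already
        groups.setdefault(doc[field], []).append(doc)
    # dicts keep insertion order, so groups are ordered by their best document
    selected = list(groups.items())[start:start + rows]
    return [(value, members[:group_limit]) for value, members in selected]

ranked = [
    {"id": 1, "author": "schiller"},
    {"id": 2, "author": "goethe"},
    {"id": 3, "author": "schiller"},
]
print(group_results(ranked, "author", group_limit=1))
```

Note that rows and start count groups, not documents, which is why a grouped response can contain fewer or more documents than the rows value suggests.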
  • 92. Autosuggestion: Overview (Solr Exercise VI)
    ‣ Autosuggestion (aka autocomplete) is a common search feature that supports the user by providing query suggestions during typing
    ‣ Autosuggestion functionality can include
      ‣ the search index
      ‣ separate word lists
      ‣ synonyms / black lists
      ‣ grouping suggestions
      ‣ fuzziness
    ‣ Whatever mechanism is actually used to provide autosuggestion, it must return suggestions very quickly
    ‣ Solr provides different mechanisms to build autosuggestion functionality:
      ‣ using facet search
      ‣ using standard search (standard query parser)
      ‣ using the spellchecker Solr plugin
  • 93. Autosuggestion: Using faceting (Solr Exercise VI)
    ‣ Define a new field title_auto used for autosuggestion
      <field name="title" type="text_general" indexed="true" stored="true" />
      <field name="title_auto" type="text_auto" indexed="true" stored="true" />
      <copyField source="title" dest="title_auto" />
    ‣ Define the field type text_auto providing specific analysis for autosuggestion
      <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
    ‣ How to get suggestions for a user query?
      q=*:*&facet=true&facet.field=title_auto&facet.mincount=1&facet.prefix=schi
  • 94. Autosuggestion: Using standard search (Solr Exercise VI)
    ‣ Again, define a new field title_auto as on the previous slide
    ‣ Next, redefine the field type text_auto as follows
      <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
        </analyzer>
      </fieldType>
    ‣ Now you can use the standard Solr query parser to get suggestions
      q=title_auto:query&q.op=AND&rows=5&fl=title
      q=title_auto:query&q.op=AND&rows=0&facet=true&facet.field=tag&facet.mincount=1&facet.limit=5
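The effect of the edge n-gram filter in this analyzer can be sketched directly: the keyword tokenizer keeps the whole title as one token, lowercasing normalizes it, and the filter then emits every prefix between the minimum and maximum gram size. The function name edge_ngrams is an assumption; the defaults mirror minGramSize and maxGramSize from the field type above.

```python
def edge_ngrams(text, min_gram=1, max_gram=25):
    """Return all front prefixes of the lowercased text within the gram size limits."""
    token = text.lower()  # keyword tokenizer + lowercase filter: one lowercase token
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

indexed_terms = edge_ngrams("Faust")
print(indexed_terms)

# A typed prefix matches if it equals one of the indexed grams.
print("fau" in indexed_terms)
```

Because every prefix is stored as its own term at index time, a suggestion lookup becomes an ordinary exact term match, at the cost of a larger index.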
  • 95. Projects: Overview
    Elasticsearch is a "distributed-from-scratch" search server based on Lucene.
    Created by Shay Banon, with a first version made public in 02/2010:
      "ElasticSearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )"
      [http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
    http://www.elasticsearch.org/
  • 96. Projects: Overview
    ‣ Current stable version 0.20.6
    ‣ Licensed under the Apache License 2.0
    ‣ Small group of core developers, but strong support from valuable Lucene committers
    ‣ Already a promising list of users (small and big companies)
      ‣ github, soundcloud, stackoverflow, mozilla, klout
      ‣ http://www.elasticsearch.org/users/
  • 97. Projects: Highlights
    ‣ Pure Java application
    ‣ Search, indexing and scoring are done by Lucene
    ‣ Document-oriented
    ‣ Schema-less
      ‣ Well, ElasticSearch might be schema-less, Lucene isn’t!
      ‣ ElasticSearch therefore automatically detects the correct types
      ‣ However, a schema is still needed! Why?
    ‣ HTTP & JSON API for all interactions
      ‣ Indexing / Updating
      ‣ Searching
      ‣ Administration / Monitoring
    ‣ Distribution is a fundamental feature of ElasticSearch!
  • 98. Projects: Highlights
    ‣ Facet search and filtering (values, queries, date/time ranges)
    ‣ Lots of query types
    ‣ Script filters
    ‣ Geospatial search called GeoShape Query
    ‣ Configurable caching for
      ‣ Filters
      ‣ Field data
    ‣ NRT search with a separate API
    ‣ Sorting, Highlighting
    ‣ MoreLikeThis based on document or field
    ‣ Multi-tenancy:
      ‣ Define multiple indices that e.g. handle documents differently during indexing
      ‣ Still, you can search over them with one query
  • 99. Projects: Highlights
    ‣ The ElasticSearch Gateway Module stores indices and metadata to:
      ‣ Local FS, Shared FS, Hadoop, Amazon S3
    ‣ River Interface:
      ‣ Pluggable service to constantly pull data
      ‣ Managed via a specific REST endpoint
      ‣ Implementations for CouchDB, MongoDB
    ‣ Lucene analyzers are specified in elasticsearch.yml or via the API
    ‣ Bulk indexing
      ‣ Default: single document indexing
      ‣ Bulk indexing via specific REST endpoints
  • 100. Projects: Pros & Cons
    + Simple but effective architecture
    + Ease of use, even when using distributed search
    + High maturity, even though ES is young
    + Modern technologies used
    + HTTP and JSON only
    - Shard splitting is not trivial
    - Still a small community and a small group of core developers
    - Compared to Solr:
      - Fewer query types
      - Fewer possibilities for boosting
      - Fewer analyzers
      - Missing features such as clustering, autocomplete, spell checking
  • 101. ES Exercise I
    ‣ Installation
    ‣ Indexing
    ‣ Queries I
  • 102. Installation (ES Exercise I)
    ‣ On Linux systems
      unzip elasticsearch-0.20.6.zip
      cd elasticsearch-0.20.6
      bin/elasticsearch -f
    ‣ On Windows systems
      [unzip elasticsearch-0.20.6.zip]
      cd elasticsearch-0.20.6
      bin\elasticsearch.bat -f
    ‣ Run
      curl -X GET http://localhost:9200/
  • 103. Installation (ES Exercise I)
    ‣ On Linux systems
      unzip elasticsearch-0.20.6.zip
      cd elasticsearch-0.20.6
      bin/elasticsearch -p path/to/pidfile
    ‣ Run
      curl -X GET http://localhost:9200/
    ‣ Shutdown
      curl -XPOST 'http://localhost:9200/_shutdown'
      curl -XPOST 'http://localhost:9200/_cluster/nodes/_shutdown'
  • 104. ES_HOME (ES Exercise I)
    ‣ bin/
      ‣ elasticsearch [elasticsearch.bat] to start the elasticsearch server
      ‣ script plugin [plugin.bat] to install plugins
    ‣ config/
      ‣ contains the global configuration
      ‣ server config file elasticsearch.yml
      ‣ logging config file logging.yml
    ‣ data/
      ‣ standard directory containing index data
      ‣ configurable by path.data
  • 105. ES_HOME (ES Exercise I)
    ‣ lib/
      ‣ shared library directory
      ‣ place additional libraries here
    ‣ logs/
      ‣ log files are placed here with the default log configuration
      ‣ configurable by path.logs in elasticsearch.yml
  • 106. Terminology (ES Exercise I)
    ‣ cluster
      ‣ one or more nodes form a cluster
      ‣ usually distributed over various machines
      ‣ one master node that is chosen automatically
    ‣ node
      ‣ running instance of elasticsearch
      ‣ a node automatically discovers other nodes at start-up
      ‣ node discovery is done using either unicast or multicast messages
    ‣ index
      ‣ separate document database model with its own mapping and types
      ‣ is partitioned into one or more primary and replica shards
  • 107. Terminology (ES Exercise I)
    ‣ mapping
      ‣ schema definition defining types with their associated fields
      ‣ field types and properties
    ‣ shard
      ‣ low-level data structure of elasticsearch
      ‣ single Lucene index
      ‣ managed automatically by elasticsearch
    ‣ primary shard
      ‣ every document is stored in exactly one primary shard
      ‣ all primary shards together make up the documents of the index
      ‣ default: 5 primary shards
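That every document lives in exactly one primary shard follows from routing: the target shard is derived from a hash of the routing value (by default the document ID) modulo the number of primary shards. The sketch below illustrates the principle only; zlib.crc32 is a stand-in and not the hash function elasticsearch actually uses, and the function name pick_shard is hypothetical.

```python
import zlib

def pick_shard(doc_id, num_primary_shards=5):
    """Map a document ID to one primary shard: hash(id) modulo shard count."""
    # crc32 is only a placeholder hash for this illustration.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

for doc_id in ["1", "2", "etext78942"]:
    print(doc_id, "-> shard", pick_shard(doc_id))
```

Since the result depends on the number of primary shards, changing that number would send existing IDs to different shards, which is one way to see why shard splitting is listed as non-trivial among the cons.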
  • 108. Terminology (ES Exercise I)
    ‣ replica shard
      ‣ each primary shard is replicated 0 or more times
      ‣ replica shards are distributed automatically
      ‣ replica shards are used for search and for primary shard fail-over
    ‣ type
      ‣ within an index, zero or more types can be defined
      ‣ a type defines a certain set of fields, similar to a table structure
      ‣ types are defined in the mapping
  • 109. Index API (ES Exercise I)
    ‣ Index API
      ‣ index (PUT/POST)
      ‣ update (PUT/POST)
      ‣ delete (DELETE)
      ‣ delete by query (DELETE)
    ‣ Documents are defined as JSON objects
    ‣ Index and type are defined in the URL path
    ‣ Automatic creation of an index and mapping
      ‣ action.auto_create_index
      ‣ index.mapper.dynamic
    ‣ elasticsearch automatically identifies field types based on the JSON input
    ‣ Automatic ID generation
  • 110. Index API (ES Exercise I)
    ‣ Index a book
      $ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
          "author" : "bernhard pflugfelder",
          "post_date" : "2013-04-22T14:12:12",
          "title" : "my first book",
          "abstract" : "this book is about elasticsearch"
      }'
    ‣ Index a book while explicitly naming the type in the document
      $ curl -XPUT 'http://localhost:9200/books/book/1' -d '{
          "book" : {
            "author" : "bernhard pflugfelder",
            "post_date" : "2013-04-22T14:12:12",
            "title" : "my first book",
            "abstract" : "this book is about elasticsearch"
          }
      }'
  • 111. Index API (ES Exercise I)
    ‣ Index a book with automatic ID generation
      $ curl -XPOST 'http://localhost:9200/books/book/' -d '{
          "author" : "bernhard pflugfelder",
          "post_date" : "2013-04-22T14:12:12",
          "title" : "my first book",
          "abstract" : "this book is about elasticsearch"
      }'
    ‣ Result
      {
        "ok" : true,
        "_index" : "books",
        "_type" : "book",
        "_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
        "_version" : 1
      }
  • 112. Index API (ES Exercise I)
    ‣ Update operations are done by providing a script that manipulates the field structure
    ‣ The update process consists of the following steps:
      ‣ fetch the requested document
      ‣ apply the script
      ‣ index the result as a new document
    ‣ Only the source field _source can be updated
    ‣ _source is always stored in the index
      ‣ stores the actual JSON used at index time
      ‣ can be disabled for every type separately
      ‣ can be compressed (from version 0.90 on, compression is done automatically)
      { "book" : { "_source" : {"enabled" : false}} }
  • 113. Index API (ES Exercise I)
    ‣ Create a new field tag
      curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
          "script" : "ctx._source.tag = \"search\""
      }'
    ‣ Replace the value of field tag
      curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
          "script" : "ctx._source.tag = \"search technologies\""
      }'
    ‣ Add an additional value for the field tag
      curl -XPOST 'localhost:9200/books/book/1/_update' -d '{
          "script" : "ctx._source.tag += tag",
          "params" : { "tag" : "open source technologies" }
      }'
  • 114. Index API (ES Exercise I)
    ‣ Delete a document based on its unique ID
      $ curl -XDELETE 'http://localhost:9200/books/book/1'
    ‣ Delete documents based on a search query
      $ curl -XDELETE 'http://localhost:9200/books/book/_query' -d '{
          "term" : { "author" : "bernhard pflugfelder" }
      }'
  • 115. ES Exercise I: Search API
      ‣  Term query
            $ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
                "query" : { "term" : { "author" : "bernhard" } }
              }'
      ‣  Terms query
            $ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
                "query" : {
                  "terms" : {
                    "author" : [ "bernhard", "pflugfelder" ],
                    "minimum_match" : 1
                  }
                }
              }'
  • 116. ES Exercise I: Search API
      ‣  Match queries accept text, numeric and date values
      ‣  Match queries are applied per field; the proper analyzer is chosen automatically
      ‣  Types of match queries
          ‣  boolean (default)
          ‣  phrase match
          ‣  phrase prefix match
          ‣  multi match (two or more fields are searched)
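The match query types above differ mainly in the keyword used in the request body. The following sketch builds such bodies as plain Python dictionaries; the helper names `match_query` and `multi_match_query` are made up for illustration, and the bodies are only printed, not sent to a server.

```python
import json

# Query keyword per match type (boolean is the default type).
MATCH_KEYWORDS = {
    "boolean": "match",
    "phrase": "match_phrase",
    "phrase_prefix": "match_phrase_prefix",
}


def match_query(field, text, match_type="boolean", operator=None):
    """Build a single-field match query in simple or extended syntax."""
    keyword = MATCH_KEYWORDS[match_type]
    if operator is None:
        clause = {field: text}                                   # simple syntax
    else:
        clause = {field: {"query": text, "operator": operator}}  # extended syntax
    return {"query": {keyword: clause}}


def multi_match_query(fields, text):
    """Build a multi match query that searches two or more fields."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


simple = match_query("abstract", "about elasticsearch")
extended = match_query("abstract", "about elasticsearch", operator="and")
phrase = match_query("abstract", "about elasticsearch", match_type="phrase")
multi = multi_match_query(["title", "abstract"], "elasticsearch")
print(json.dumps(extended, indent=2))
```

Each resulting dictionary can be serialized with `json.dumps` and passed to `curl -d` against the `_search` endpoint used in the exercises.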
  • 117. ES Exercise I: Search API (match query)
      ‣  Simple syntax
            $ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
                "query" : { "match" : { "author" : "bernhard" } }
              }'
      ‣  Extended syntax
            { "match" : {
                "abstract" : { "query" : "about elasticsearch", "operator" : "and" }
            }}

            Param name   Param value    Description
            operator     "and", "or"    boolean operator
            fuzziness    0.0 – 1.0      add fuzziness to the original terms
  • 118. ES Exercise I: Search API (phrase match query)
      ‣  Simple syntax
            $ curl -XGET 'http://localhost:9200/books/book/_search' -d '{
                "query" : { "match_phrase" : { "abstract" : "about elasticsearch" } }
              }'
      ‣  Extended syntax
            { "match_phrase" : {
                "abstract" : { "query" : "about elasticsearch", "operator" : "and" }
            }}

            Param name   Param value   Description
            slop         number        phrase sloppiness
            analyzer     string        name of the analyzer to be used for the query
  • 119. ES Exercise II
      ‣  Mapping (aka schema)
      ‣  Field types
      ‣  Analyzers
      ‣  Queries II
  • 120. ES Exercise II: Mapping
      ‣  The schema mapping defines the index structure and the document representation
      ‣  Elasticsearch works without an explicit schema ("schema-less")
      ‣  Automatic inference of the schema is, however, dangerous in many situations
      ‣  Thus, defining an explicit schema is the preferred way
      ‣  A mapping consists of:
          ‣  type name
          ‣  list of fields (i.e. properties)
          ‣  each property defines a field type and, optionally, field attributes
      ‣  Mappings are formatted in JSON
      ‣  Mappings are managed using the Mapping API (PUT / POST / GET)
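As a rough illustration of the structure described above (type name, properties, and per-property type plus optional attributes), a mapping can be assembled as a nested dictionary and serialized to JSON. The helper `build_mapping` is hypothetical; only the resulting JSON shape is meant to mirror the slide.

```python
import json


def build_mapping(type_name, fields):
    """Build {type_name: {"properties": {...}}} from a field specification.

    Each field is given either as a type string ("string") or as a dict
    holding "type" plus optional attributes such as "format".
    """
    properties = {}
    for name, spec in fields.items():
        if isinstance(spec, str):
            spec = {"type": spec}          # shorthand: only the field type
        if "type" not in spec:
            raise ValueError("field %r needs a type" % name)
        properties[name] = dict(spec)
    return {type_name: {"properties": properties}}


mapping = build_mapping("books", {
    "title": "string",
    "view_count": "integer",
    "created": {"type": "date", "format": "dateOptionalTime"},
})
print(json.dumps(mapping, indent=2))
```

Writing the mapping out explicitly like this avoids relying on automatic type inference for fields such as dates or counters.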
  • 121. ES Exercise II: Mapping
      ‣  Define a mapping for type books
            $ echo '{ "books" : { "properties" : {
                "id" : { "type" : "string" },
                "title" : { "type" : "string" },
                "author" : { "type" : "string" },
                "subject" : { "type" : "string" },
                "view_count" : { "type" : "integer" },
                "created" : { "type" : "date", "format" : "dateOptionalTime" }
              }}}' > book.json
            $ curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json
      ‣  Retrieve the current mapping for type books
            $ curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
  • 122. ES Exercise II: Mapping
      ‣  Field types
          ‣  string, date
          ‣  number: byte, short, integer, long, float, double
          ‣  boolean, binary (BASE64)
      ‣  Common field attributes

            Name         Value      Description
            index_name   string     field name stored within the index
            index        yes / no   field shall be searchable
            store        yes / no   original values shall be stored
            analyzer     string     analyzer used for that field
            null_value   value      default field value if no value is assigned to a document
  • 123. ES Exercise II: Analyzers
      ‣  Analyzers are defined either
          ‣  in elasticsearch.yml or elasticsearch.json
          ‣  by the Index API
      ‣  Common analyzers
          ‣  standard
          ‣  whitespace
          ‣  stop
          ‣  keyword
          ‣  language
          ‣  snowball

            $ curl 'localhost:9200/_analyze?analyzer=standard' -d 'elasticsearch is groovy!'
            $ curl 'localhost:9200/_analyze?analyzer=whitespace' -d 'elasticsearch is groovy!'
            $ curl 'localhost:9200/_analyze?analyzer=stop' -d 'elasticsearch is groovy!'
            $ curl 'localhost:9200/_analyze?analyzer=keyword' -d 'elasticsearch is groovy!'
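To get a feeling for how the analyzers differ before running the `_analyze` calls, here is a deliberately simplified Python approximation of three of them. It is an assumption-laden sketch, not the Lucene implementation: the stop word list is a tiny illustrative subset, only ASCII letters are handled, and the standard, language and snowball analyzers are left out.

```python
import re

# Tiny illustrative stop word list (assumption; the real list is longer).
STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword_analyzer(text):
    """Treat the whole input as one single token."""
    return [text]


def stop_analyzer(text):
    """Lowercase, keep runs of letters, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


text = "elasticsearch is groovy!"
print(whitespace_analyzer(text))  # ['elasticsearch', 'is', 'groovy!']
print(keyword_analyzer(text))     # ['elasticsearch is groovy!']
print(stop_analyzer(text))        # ['elasticsearch', 'groovy']
```

Comparing these lists with the tokens returned by the real `_analyze` endpoint is a quick way to check which analyzer suits a given field.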
  • 124. ES Exercise II: Analyzers (configuration in elasticsearch.yml)
            discovery.zen.multicast.enabled: false
            http:
              max_content_length: 100000
            index:
              number_of_shards: 1
              analysis:
                analyzer:
                  default:
                    type: standard
                  lowercase_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [standard, lowercase]
  • 125. ES Exercise II: Highlighting
      ‣  Elasticsearch provides two highlighting algorithms
          ‣  fast vector highlighter
          ‣  highlighter (standard implementation)
      ‣  Requirement for using the fast vector highlighter: the field must be mapped with term vectors
            { "books" : { "properties" : {
                "title" : { "type" : "string", "term_vector" : "with_positions_offsets" }
            }}}
      ‣  Highlighting request
            {
              "query" : {...},
              "highlight" : {
                "pre_tags" : ["<tag1>", "<tag2>"],
                "post_tags" : ["</tag1>", "</tag2>"],
                "fields" : { "_all" : {} }
              }
            }
  • 126. ES Exercise III
      ‣  Faceted search
      ‣  Filter query
      ‣  Sorting
      ‣  More Like This
  • 127. ES Exercise III: Faceted search
      ‣  Elasticsearch provides the following facet mechanisms:
          ‣  Group results by a field value
          ‣  Group by numeric or date ranges
          ‣  Group numeric or date values in equally sized buckets (histogram)
          ‣  Group results around a coordinate based on the geo distance
      ‣  Facet types: terms, range, histogram, date_histogram, geo_distance
      ‣  Basic facet definition
            {
              "facets" : {
                "<FACET NAME>" : {
                  "<FACET TYPE>" : { ... },
                  "global" : true
                }
              }
            }
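The basic facet definition can be filled in programmatically. The sketch below builds a search body with one facet; `facet_request` is a hypothetical helper, and the body is only printed so it can be inspected or passed to curl. Setting `global_scope=True` adds `"global" : true`, which (as far as the facet API is concerned) computes the facet over all documents rather than only those matching the query.

```python
import json

FACET_TYPES = {"terms", "range", "histogram", "date_histogram", "geo_distance"}


def facet_request(query, name, facet_type, params, global_scope=False):
    """Build a search body with one facet, following the template above."""
    if facet_type not in FACET_TYPES:
        raise ValueError("unknown facet type: " + facet_type)
    facet = {facet_type: dict(params)}
    if global_scope:
        facet["global"] = True     # facet ignores the query scope
    return {"query": query, "facets": {name: facet}}


body = facet_request(
    {"match": {"author": "schiller"}},
    "tagsFacet",
    "terms",
    {"field": "subject", "size": 10},
)
global_body = facet_request({"match_all": {}}, "all", "terms",
                            {"field": "subject"}, global_scope=True)
print(json.dumps(body, indent=2))
```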
  • 128. ES Exercise III: Faceted search (terms facet)
            $ curl -XPOST 'http://localhost:9200/gutenberg/books/_search?pretty=1' -d '{
                "from" : 0,
                "size" : 10,
                "query" : { "match" : { "author" : "schiller" } },
                "facets" : {
                  "tagsFacet" : { "terms" : { "field" : "subject", "size" : 10 } }
                }
              }'
  • 129. ES Exercise III: Faceted search (range facet)
            {
              "query" : { "match_all" : {} },
              "facets" : {
                "range1" : {
                  "range" : {
                    "view_count" : [
                      { "to" : 50 },
                      { "from" : 20, "to" : 70 },
                      { "from" : 70, "to" : 120 },
                      { "from" : 150 }
                    ]
                  }
                }
              }
            }
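The ranges in this facet overlap (20 to 70 overlaps with "up to 50") and leave a gap (120 to 150), so one document may be counted in several ranges or in none. The following sketch counts made-up `view_count` values per range, assuming that `from` is inclusive and `to` is exclusive; the sample values and the helper `range_facet` are hypothetical.

```python
def range_facet(values, ranges):
    """Count values per range; "from" is inclusive, "to" is exclusive (assumption)."""
    counts = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        count = sum(1 for v in values if low <= v < high)
        counts.append({"range": r, "count": count})
    return counts


ranges = [
    {"to": 50},
    {"from": 20, "to": 70},
    {"from": 70, "to": 120},
    {"from": 150},
]
view_counts = [10, 25, 60, 70, 130, 200]   # made-up sample data
result = range_facet(view_counts, ranges)
for entry in result:
    print(entry)
```

With this sample data the value 25 is counted twice and the value 130 is not counted at all, which is worth keeping in mind when choosing range boundaries.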
  • 130. ES Exercise III: Faceted search (histogram facet)
      ‣  The histogram facet works on any numeric field
      ‣  Field values are rounded down to fit into the respective bucket
      ‣  The property interval defines the bucket size
            {
              "query" : { "match_all" : {} },
              "facets" : {
                "histo1" : {
                  "histogram" : { "field" : "view_count", "interval" : 100 }
                }
              }
            }
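The rounding of field values into buckets can be sketched in a few lines. This assumes the bucket key is the value rounded down to the nearest multiple of `interval`; the sample values and the helper `histogram_facet` are made up for illustration.

```python
from collections import Counter


def histogram_facet(values, interval):
    """Group numeric values into equally sized buckets of width `interval`."""
    # Bucket key: value rounded down to a multiple of the interval.
    buckets = Counter((v // interval) * interval for v in values)
    return dict(sorted(buckets.items()))


view_counts = [5, 99, 100, 250, 260]       # made-up sample data
histogram = histogram_facet(view_counts, 100)
print(histogram)  # {0: 2, 100: 1, 200: 2}
```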
  • 131. ES Exercise III: Filter query
      ‣  Elasticsearch also provides filter queries, which are cached internally for optimal performance
      ‣  A filter can be applied to the results returned by the query, as here:
            $ curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
                "query" : { "term" : { "title" : "schiller" } },
                "filter" : { "term" : { "subject" : "drama" } },
                "facets" : {
                  "tag" : { "terms" : { "field" : "subject" } }
                }
              }'
  • 132. ES Exercise III: Filter query
      ‣  Alternatively, the filter is applied while the user query is executed (filtered query)
      ‣  What is the difference from the previous filter query?
            $ curl -XPOST 'localhost:9200/books/_search?pretty=1' -d '{
                "query" : {
                  "filtered" : {
                    "query" : { "term" : { "author" : "schiller" } },
                    "filter" : {
                      "range" : { "view_count" : { "from" : 50, "to" : 100 } }
                    }
                  }
                }
              }'
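One way to approach the question on this slide is to simulate both variants on a handful of documents. The sketch assumes that a top-level filter narrows only the returned hits while facets are still computed on everything the query matched, whereas a filtered query narrows both. The documents and the `search` helper are hypothetical.

```python
from collections import Counter

# Made-up sample documents.
DOCS = [
    {"title": "schiller", "subject": "drama"},
    {"title": "schiller", "subject": "drama"},
    {"title": "schiller", "subject": "poetry"},
    {"title": "goethe", "subject": "drama"},
]


def search(docs, query, doc_filter, filtered_query):
    """Return (hits, subject facet counts) for the two filter variants."""
    matched = [d for d in docs if query(d)]
    if filtered_query:
        # Filtered query: the filter is part of the query itself.
        matched = [d for d in matched if doc_filter(d)]
        hits = matched
    else:
        # Top-level filter: applied to the hits only, after the query.
        hits = [d for d in matched if doc_filter(d)]
    facets = dict(Counter(d["subject"] for d in matched))
    return hits, facets


def is_schiller(doc):
    return doc["title"] == "schiller"


def is_drama(doc):
    return doc["subject"] == "drama"


top_hits, top_facets = search(DOCS, is_schiller, is_drama, filtered_query=False)
flt_hits, flt_facets = search(DOCS, is_schiller, is_drama, filtered_query=True)
print(len(top_hits), top_facets)   # 2 {'drama': 2, 'poetry': 1}
print(len(flt_hits), flt_facets)   # 2 {'drama': 2}
```

Both variants return the same hits here; only the facet counts differ, which matters when facets are used to offer further navigation options.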
  • 133. ES Exercise III: Sorting
      ‣  Sorting is done based on one or multiple fields
      ‣  With multiple sort fields, results are sorted field by field in the order given
      ‣  Ascending / descending sorting
      ‣  _score refers to sorting based on the relevance score
            $ curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
                "sort" : [
                  { "view_count" : { "order" : "desc" } },
                  "_score"
                ],
                "query" : { "term" : { "title" : "schiller" } }
              }'
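The effect of sorting by `view_count` first and `_score` second can be mimicked with a composite sort key. The hits below are made up, and the sketch assumes that `_score` sorts in descending order when no order is given.

```python
# Made-up search hits with a view count and a relevance score.
hits = [
    {"title": "wilhelm tell", "view_count": 80, "_score": 0.4},
    {"title": "die raeuber", "view_count": 120, "_score": 0.2},
    {"title": "don karlos", "view_count": 80, "_score": 0.9},
]

# First key: view_count descending. Second key: _score descending,
# used only to break ties between hits with the same view_count.
ordered = sorted(hits, key=lambda h: (-h["view_count"], -h["_score"]))

for hit in ordered:
    print(hit["title"], hit["view_count"], hit["_score"])
```

The second sort field only matters for hits that tie on the first one, as with the two titles that share a view count of 80.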
  • 134. ES Exercise III: mlt query (More Like This)
            $ curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
                "query" : {
                  "more_like_this" : {
                    "fields" : ["title", "subject"],
                    "like_text" : "text like this one",
                    "min_term_freq" : 1,
                    "max_query_terms" : 12
                  }
                }
              }'

            Name                     Value          Description
            fields                   fieldname(s)   list of fields used for mlt
            like_text                string         the text to find similar documents for
            min_term_freq            number         minimum term frequency
            max_query_terms          number         maximum number of query terms
            min_doc_freq             number         minimum document frequency
            max_doc_freq             number         maximum document frequency
            percent_terms_to_match   0.0 – 1.0      percentage of terms that must match