Elastic 101
Antonios Giannopoulos DBA @ Rackspace/ObjectRocket
Alex Cercel DBA @ Rackspace/ObjectRocket
Mihai Aldoiu CDE @ Rackspace/ObjectRocket
linkedin.com/in/antonis | linkedin.com/in/alexcercel | linkedin.com/in/aldoiu
Overview
• Introduction
• Working with data
• Scaling the cluster
• Operating the cluster
• Troubleshooting the cluster
• Upgrade the cluster
• Security best practices
• Working with data – Advanced operations
• Best Practices
Labs
1. Unzip the provided .vmdk file
2. Install and/or open VirtualBox
3. Select New
4. Enter A Name
5. Select Type: Linux
6. Select Version: Red Hat (64-bit)
7. Set Memory to at least 4096 (more won’t hurt)
8. Select "Use an existing ... disk file", select the provided .vmdk file
9. Select Create
10. Select Start
11. Login with username: elasticuser , password: elasticuser
12. Navigate to /Percona2018/Lab01 for the first lab.
https://bit.ly/2D1tXL6
Introduction
● Key Terms
● Installation
● Configuration files
● JVM fundamentals
● Lucene basics
What is elasticsearch?
Lucene:
- A search engine library entirely written in Java
- Developed in 1999 by Doug Cutting
- Suitable for any application that requires full text indexing and searching capability
But:
- Challenging to use
- Not originally designed for scaling
Elasticsearch:
- Built on top of Lucene
- Provides scaling
- Language independent
What is ELK stack?
ElasticSearch:
- The main datastore
- Provides distributed search capabilities
Logstash:
- Parse & transform data for ingestion
- Ingests from multiple sources simultaneously
Kibana:
- An analytics and visualization platform
- Search, visualize & interact with Elasticsearch data
Installing Elasticsearch
Download:
Latest Version: https://www.elastic.co/downloads/elasticsearch
Older Version: Navigate to https://www.elastic.co/downloads/past-releases
The simplest way:
1) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz
2) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz.sha512
3) shasum -a 512 -c elasticsearch-6.3.2.tar.gz.sha512 (it should return elasticsearch-6.3.2.tar.gz: OK)
4) tar -xzf elasticsearch-6.3.2.tar.gz
Installing Java
ElasticSearch requires JRE (JavaSE runtime environment) or JDK (Java
Development Kit)
- OpenJDK CentOS: yum install java-1.8.0-openjdk
- OpenJDK Ubuntu: apt-get install openjdk-8-jre
ES version 6 requires Java 8 or higher https://www.elastic.co/support/matrix
set JAVA_HOME appropriately
- Create a file under /etc/profile.d for example jdk.sh
- Add the following lines:
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-*"
export PATH=$JAVA_HOME/bin:$PATH
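A quick way to verify the setup, assuming OpenJDK 8 was installed via the package manager as above (the exact JVM path may differ on your system):
source /etc/profile.d/jdk.sh
echo $JAVA_HOME    # should print the OpenJDK 8 directory
java -version      # should report openjdk version "1.8.0_..."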
Start the server
Create a user elasticuser*
Using elasticuser execute:
bin/elasticsearch
After some noise:
[INFO ][o.e.n.Node] [name] started
How do I know it is up and running?
*You can’t start ES using root
$ curl -X GET "localhost:9200/"
{
  "name" : "KG-_6s9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "T9uHpto6QtWRmsjzNFrReA",
  "version" : {
    "number" : "6.3.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "053779d",
    "build_date" : "2018-07-20T05:20:23.451332Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Explore the directories
Folder    Description                                                  Setting
bin       Contains the binary scripts, like elasticsearch
config    Contains the configuration files                             ES_PATH_CONF
data      Holds the data (shards/indexes)                              path.data
lib       Contains JAR files
logs      Contains the log files                                       path.logs
modules   Contains the modules
plugins   Contains the plugins. Each plugin has its own subdirectory
Configuration files
elasticsearch.yml
- The primary way of configuring a node.
- It's a template that lists the most important settings for a production cluster
jvm.options
- JVM related options
log4j2.properties
- Elasticsearch uses Log4j 2 for logging
Variables can be set either:
- using the configuration file jvm.options: -Xms512m
- or using the command line: ES_JAVA_OPTS="-Xms512m" ./bin/elasticsearch
Elasticsearch.yml
node.name
- Every node should have a unique node.name
- Set it to something meaningful (aws-zone1-objectrocket-es-01)
cluster.name
- A cluster is a set of nodes sharing the same cluster.name
- Set it to something meaningful (production, qa, staging)
path.data
- Path to directory where to store the data (accepts multiple locations)
path.logs
- Path to log files
Elasticsearch.yml
cluster.name: production
node.name: dc1-prd-es1
path.data: /data/es1
path.logs: /logs/es1
bin/elasticsearch -d -p 'elastic.pid'
$ curl -X GET "localhost:9200/"
{
"name" : "dc1-prd-es1",
"cluster_name" : "production",
…
jvm.Options
Each Elasticsearch node runs on its own JVM instance
JVM is a virtual machine that enables a computer to run Java programs
The most important setting is the Heap Size:
- Xms: Represents the initial size of total heap space
- Xmx: Represents the maximum size of total heap space
Best Practices
- Set Xms and Xmx to the same size
- Set Xmx to no more than 50% of your physical RAM
- Do not set Xms and Xmx over 30ish GiB
- Use the server version of OpenJDK
- Lock the RAM for the heap with bootstrap.memory_lock
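As a minimal sketch of these practices, on a host with 64 GiB of RAM the relevant settings might look like this (the 26g figure is an example, not a rule):
# jvm.options – fixed heap, identical min/max, below the ~30 GiB cutoff
-Xms26g
-Xmx26g
# elasticsearch.yml – lock the heap in RAM (requires a raised memlock ulimit)
bootstrap.memory_lock: true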
jvm.Options
Heap vs. off heap
On the heap live, among other things:
- Indexing buffer
- Completion suggester
- Cluster state
- … and more
- Caches:
  - query cache (10%)
  - field data cache (unbounded)
  - …
jvm.Options
Garbage collector
- It is a form of automatic memory management
- Gets rid of objects which are not being used by a Java application anymore
- Automatically reclaims memory for reuse
Garbage collectors
- ConcMarkSweepGC (CMS)
- G1GC (has some issues with JDK 8)
Elasticsearch uses -XX:+UseConcMarkSweepGC
GC threads
-XX:ParallelGCThreads=N, where N varies on the platform
-XX:ParallelCMSThreads=N , where N varies on the platform
jvm.Options
[Diagram: JVM heap layout – the New Generation (Eden, S0, S1; sized with -Xmn, cleaned by minor GCs), the Old Generation (cleaned by major/full GCs) and PermGen (-XX:PermSize / -XX:MaxPermSize); the total heap is bounded by -Xms/-Xmx]
1) A new object is stored in Eden
2) If it survives a GC, it moves to S0/S1
3) After multiple GCs, when S0 or S1 gets full, objects move to the Old Generation
OS settings
Disable swap
- sysctl vm.swappiness=1
- Remove Swap
File descriptors
- Set nofile to 65536
- curl -X GET "<host>:<port>/_nodes/stats/process?filter_path=**.max_file_descriptors"
Virtual Memory
- sysctl -w vm.max_map_count=262144
Max user process
- nproc to 4096
DNS cache settings
- networkaddress.cache.ttl=<timeout>
- networkaddress.cache.negative.ttl=<timeout>
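A hedged example of applying these settings on a CentOS-style host, assuming the elasticuser account from the labs runs Elasticsearch (paths and limits may differ in your environment):
# runtime kernel settings (persist them in /etc/sysctl.conf or /etc/sysctl.d/)
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.max_map_count=262144
# /etc/security/limits.conf – per-user limits for the Elasticsearch user
elasticuser  -  nofile   65536
elasticuser  -  nproc    4096
elasticuser  -  memlock  unlimited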
Network settings
Two network communication mechanisms in Elasticsearch
- HTTP: which is how the Elasticsearch REST APIs are exposed
- Transport: used for internal communication between nodes within the cluster
Node 1 Client
Node 2
HTTP
Transport
Network settings
The REST APIs of Elasticsearch are exposed over HTTP
- The HTTP module binds to localhost by default
- Configure with http.host on elasticsearch.yml
- Default port is the first available between 9200-9299
- Configure with http.port on elasticsearch.yml
Each call that goes from one node to another uses the transport module
- Transport binds to localhost by default
- Configure with transport.host on elasticsearch.yml
- Default port is the first available between 9300-9399
- Configure with transport.tcp.port on elasticsearch.yml
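Putting it together, a sketch of the relevant elasticsearch.yml lines for a node that should be reachable from other machines (0.0.0.0 and the ports are example values):
http.host: 0.0.0.0        # REST API, first free port in 9200-9299
http.port: 9200
transport.host: 0.0.0.0   # node-to-node traffic, first free port in 9300-9399
transport.tcp.port: 9300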
Network settings
network.host sets the bind host and the publish host at the same time
network.publish_host
- The single address the node publishes to the cluster. Defaults to the "best" address from network.host. One interface only
network.bind_host
- The address(es) the node binds to. Defaults to network.host. Multiple interfaces allowed
network.host value Description
_[networkInterface]_ Addresses of a network interface, for example _en0_.
_local_ Any loopback addresses on the system, for example 127.0.0.1.
_site_ Any site-local addresses on the system, for example 192.168.0.1.
_global_ Any globally-scoped addresses on the system, for example 8.8.8.8.
Network settings
Zen discovery
- built-in & default discovery module
- Provides unicast discovery
- Uses the transport module
On elasticsearch.yml:
discovery.zen.ping.unicast.hosts: ["node1", "node2"]
[Diagram: Node 1, Node 2 and Node 3 discovering each other over the transport layer]
1) Retrieves IPs/hostnames from the list of hosts
2) Tries all hosts until it finds a reachable one
3) If the cluster name matches, joins the cluster
4) If not, starts its own cluster
Bootstrap tests
Development mode: if it does not bind transport to an external interface (the default)
Production mode: if it does bind transport to an external interface
Bypass production mode: Set discovery.type to single-node
Bootstrap Tests
- Inspect a variety of Elasticsearch and system settings
- A node in production mode must pass all Bootstrap tests to start
- es.enforce.bootstrap.checks=true on jvm.options
- Highly recommended to have this setting enabled
Bootstrap tests
List of Bootstrap Tests
- Heap size check
- File descriptor check
- Memory lock check
- Maximum number of threads check
- Max file size check
- Maximum size virtual memory check
- Maximum map count check
- Client JVM check
- Use serial collector check
- System call filter check
- OnError and OnOutOfMemoryError checks
- Early-access check
- G1GC check
- All permission check
Lucene
Lucene uses a data structure called Inverted Index.
An Inverted Index inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages).
It allows fast full text searches, at the cost of increased processing when a document is added to the database.
1) Give us your name
2) Give us your home number
3) Give us your home address
Word      Frequency  Location
give      3          1,2,3
us        3          1,2,3
your      3          1,2,3
name      1          1
number    1          2
home      2          2,3
address   1          3
Lucene – Key Terms
A Document is the unit of search and index.
A Document consists of one or more Fields. A Field is simply a name-value pair.
An index consists of one or more Documents.
Indexing: involves adding Documents to an Index
Searching:
- involves retrieving Documents from an index.
- Searching requires an index to have already been built
- Returns a list of Hits
Kibana
Download:
Latest Version: https://www.elastic.co/guide/en/kibana/current/targz.html
Simplest way to install it:
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-linux-x86_64.tar.gz
shasum -a 512 kibana-6.3.2-linux-x86_64.tar.gz
tar -xzf kibana-6.3.2-linux-x86_64.tar.gz
Run Kibana:
kibana-6.3.2-linux-x86_64/bin/kibana
Access Kibana:
http://localhost:5601
Kibana - Devtools
Lab 1
Install and configure Elastic
Objectives:
Learn how to install and configure a standalone Elastic instance.
Steps:
1. Navigate to /Percona2018/Lab01
2. Read the instructions on Lab01.txt
https://bit.ly/2D1tXL6
Working with Data
● Indexes
● Shards
● CRUD Operations
● Read Operations
● Mappings
● Analyzers
Working with Data - Index
• An index in Elasticsearch is a logical way of grouping data:
‒ an index has a mapping that defines the fields in the index
‒ an index is a logical namespace that maps to where its contents are stored in the
cluster
• There are two different concepts in this definition:
‒ an index has some type of data schema mechanism
‒ an index has some type of mechanism to distribute data across a cluster
An index means ....
In the Elasticsearch world, index is used as a:
‒ Noun: a document is put into an index in Elasticsearch
‒ Verb: to index a document is to put the document into an index in Elasticsearch
{
  "type":"line",
  "line_id":4,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.1",
  "speaker":"KING HENRY IV",
  "text_entry":"So shaken as we are, so wan with care,"
}
{
  "type":"line",
  "line_id":5,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.2",
  "speaker":"KING HENRY IV",
  "text_entry":"Find we a time for frighted peace to pant"
}
{
  "type":"line",
  "line_id":6,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.3",
  "speaker":"KING HENRY IV",
  "text_entry":"And breathe short-winded accents of new broils"
}
These documents are indexed to the index my_index.
Define an index
• Clients communicate with a cluster using Elasticsearch’s REST APIs
• An index is defined using the Create Index API, which can be accomplished with a
simple PUT command
# curl -XPUT 'http://localhost:9200/my_index' -i
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 48
{"acknowledged":true,"shards_acknowledged":true}
Shard
• A shard is a single piece of an Elasticsearch index
‒ Indexes are partitioned into shards so they can be distributed across multiple nodes
• Each shard is a standalone Lucene index
‒ The default number of shards for an index is 5. Number of shards can be changed at index
creation time.
[Diagram: my_index partitioned into shards 0–4, distributed across Node 1 and Node 2]
Working with Data - Document
Documents must be JSON objects.
• A document can be any text or numeric data you want to search and/or analyze
‒ Specifically, a document is a top-level object that is serialized into JSON and
stored in Elasticsearch
• Every document has a unique ID
‒ which either you provide, or Elasticsearch generates one for you
{
"type":"line",
"line_id":4,
"play_name":"Henry IV",
"speech_number":1,
"line_number":"1.1.1",
"speaker":"KING HENRY IV",
"text_entry":"So shaken as we are, so wan with care,"
}
Index compression
• Elasticsearch compresses your documents during indexing
‒ documents are grouped into blocks of 16KB, and then compressed together using LZ4
by default
‒ if your documents are larger than 16KB, you will have larger chunks that contain only
one document
• You can change the compression to DEFLATE using the index.codec setting:
‒ reduced storage size at slightly higher CPU usage
PUT my_index
{ "settings": {
"number_of_shards": 3,
"number_of_replicas": 2,
"index.codec" : "best_compression"
}
}
Index a document
The Index API is used to index a document
‒ use a PUT or a POST and add the document in the body request
‒ notice we specify the index, the type and an ID
‒ if no ID is provided, elasticsearch will generate one
# curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d
'{
  "line_id":5,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.2",
  "speaker":"KING HENRY IV",
  "text_entry":"Find we a time for frighted peace to pant"
}'
{"_index":"my_index","_type":"my_type","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
Index without specifying an ID
You can leave off the id and let Elasticsearch generate one for you:
‒ But notice that only works with POST, not PUT
‒ The generated id comes back in the response
# curl -XPOST 'http://localhost:9200/my_index/my_type/' -H 'Content-Type: application/json' -d '
{
  "line_id":6,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.3",
  "speaker":"KING HENRY IV",
  "text_entry":"And breathe short-winded accents of new broils"
}'
{"_index":"my_index","_type":"my_type","_id":"AWZIq227Unvtccn4Vvrz","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
Reindexing a document
What do you think happens if we add another document with the same ID?
curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H
'Content-Type: application/json' -d '
{
"new_field" : "new_value"
}'
...Overwrites the document
• The old field/value pairs of the document are gone
‒ the old document is deleted, and the new one gets indexed
• Notice every document has a _version that is incremented whenever the document is changed
# curl -XGET http://localhost:9200/my_index/my_type/1?pretty -H 'Content-Type: application/json'
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_version" : 2,
"found" : true,
"_source" : {
"new_field" : "new_value"
}
}
The _create endpoint
If you do not want a document to be overwritten if it already exists, use the _create
endpoint
‒ if the document already exists, no indexing occurs and a 409 error message is returned:
# curl -XPUT 'http://localhost:9200/my_index/my_type/1/_create' -H 'Content-Type:
application/json' -d '
{"new_field" : "new_value"}'
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[my_type][
1]: version conflict, document already exists (current version
[2])","index_uuid":"JGY3Q_9NRjWe-wU-
MlK44Q","shard":"3","index":"my_index"}],"type":"version_conflict_engine_exception","rea
son":"[my_type][1]: version conflict, document already exists (current version
[2])","index_uuid":"JGY3Q_9NRjWe-wU-
MlK44Q","shard":"3","index":"my_index"},"status":409}
Locking ?
- Every indexed document has a version number
- Elasticsearch uses Optimistic concurrency control without locking
# curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=3' -d
'{
...
}'
# 200 OK
# curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=2' -d
'{
...
}'
# 409 Conflict
The _update endpoint
To update fields in a document use the _update endpoint.
- Make sure to add the “doc” context
curl -XPOST 'http://localhost:9200/my_index/my_type/1/_update' -H 'Content-Type:
application/json' -d '
{ "doc":
{
"line_id":10,
"play_name":"Henry IV",
"speech_number":1,
"line_number":"1.1.7",
"speaker":"KING HENRY IV",
"text_entry":"Nor more shall trenching war channel her fields"
}
}'
{"_index":"my_index","_type":"my_type","_id":"1","_version":3,"result":"updated","_shar
ds":{"total":2,"successful":2,"failed":0}}
Retrieve a document
Use GET to retrieve an indexed document
‒ Notice we specify the index, the type and an ID
‒ Returns a 200 code if document found or a 404 error if the document is not found
# curl -XGET http://localhost:9200/my_index/my_type/1?pretty
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"line_id" : 5,
"play_name" : "Henry IV",
"speech_number" : 1,
"line_number" : "1.1.2",
"speaker" : "KING HENRY IV",
"text_entry" : "Find we a time for frighted peace to pant"
}
}
Deleting a document
Use DELETE to delete an indexed document
‒ response code is 200 if the document is found, 404 if not
# curl -XDELETE 'http://localhost:9200/my_index/my_type/1/' -H 'Content-Type: application/json'
{"found":true,"_index":"my_index","_type":"my_type","_id":"1","_version":7,"result":"deleted","_shards":{"total":2,"successful":2,"failed":0}}
A simple search
Use a GET request sent to the _search endpoint
‒ every document is a hit for this search
‒ by default, Elasticsearch returns 10 hits
curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json'
{
  "took" : 1,          <- number of ms it took to process the query
  "timed_out" : false,
  ….
  },
  "hits" : {
    "total" : 2,       <- number of documents that were hits for this query
    "max_score" : 1.0,
    "hits" : [ ...     <- array containing the documents hit by the search criteria
    ]
  } }
This GET with no body searches for all docs of my_type in my_index.
CRUD Operations Summary
Index:  PUT my_index/my_type/4
Create: PUT my_index/my_type/4/_create
        { "speaker":"KING HENRY IV",
          "text_entry":"To be commenced in strands afar remote." }
Read:   GET my_index/my_type/4
Update: POST my_index/my_type/4/_update
        { "doc" : { "text_entry":"No more the thirsty entrance of this soil" } }
Delete: DELETE my_index/my_type/4
Mapping – what is it?
• Elasticsearch will index any document without knowing its details (number of fields,
their data types, etc.) - dynamic mapping
‒ However, behind-the-scenes Elasticsearch assigns data types to your fields in a
mapping. Mapping is the process of defining how a document, and the fields it contains,
are stored and indexed
A mapping is a schema definition that contains:
‒ names of fields
‒ data types of fields
‒ how the field should be indexed and stored by Lucene
• Mappings map your complex JSON documents into the simple flat documents that
Lucene expects.
Defining a mapping
• In most use cases you will want to define your own mappings, but it is not required. When you index a document, Elasticsearch dynamically creates or updates the mapping
• Mappings are defined in the "mappings" section of an index. You can:
‒ define mappings at index creation, or
‒ add to a mapping of an existing index
PUT my_index
{
"mappings": {
define mapping here
}
}
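For instance, a minimal sketch of a mapping defined at index creation for the Shakespeare-style documents used earlier (the field choices are illustrative, not the dataset's official mapping):
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "speaker":    { "type": "keyword" },
        "line_id":    { "type": "integer" },
        "text_entry": { "type": "text" }
      }
    }
  }
}'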
Let's view a mapping
GET my_index/_mapping
{
"my_index" : {
"mappings" : {
"my_type" : {
"properties" : {
"line_id" : {
"type" : "long"
},
"line_number" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"play_name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
...
The “properties” section
contains the fields and data
types in your documents
Elasticsearch data types for fields
• Simple types, including:
‒ text: for full text (analyzed) strings
‒ keyword: for exact value strings
‒ date: string formatted as dates, or numeric dates
‒ integer types: like byte, short, integer, long
‒ floating-point numbers: float, double, half_float, scaled_float
‒ boolean
‒ ip: for IPv4 or IPv6 addresses
• Hierarchical Types: like object and nested
• Specialized types: geo_point, geo_shape and percolator
• Range types and more
Updating existing mapping
• Existing field mappings cannot be updated. Changing the mapping would mean
invalidating already indexed documents.
- Instead, you should create a new index with the correct mappings
and reindex your data into that index.
There are some exceptions to this rule:
• new properties can be added to Object datatype fields.
• new multi-fields can be added to existing fields.
• the ignore_above parameter can be updated.
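As an example of the first exception, a new field can be added to an existing mapping with the Put Mapping API; the act_number field below is hypothetical:
curl -X PUT "localhost:9200/my_index/_mapping/my_type" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "act_number": { "type": "integer" }
  }
}'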
Prevent mapping explosion
• Defining too many fields in an index is a condition that can lead to a mapping
explosion, which can cause out of memory errors and difficult situations to recover
from.
- For example, when using dynamic mapping and every newly inserted document introduces new fields.
• The following settings allow you to limit the number of field mappings that can be
created manually or dynamically
index.mapping.total_fields.limit - maximum number of fields in an index, defaults to 1000
index.mapping.depth.limit - maximum depth for a field, which is measured as the number of
inner objects, defaults to 20
index.mapping.nested_fields.limit - maximum number of nested fields in an index,
defaults to 50
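These are index settings, so a sketch of raising the field limit on an existing index might look like this (2000 is an arbitrary example value):
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.mapping.total_fields.limit": 2000
}'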
Analysis
• Analysis is the process of converting full text into terms (tokens) which are added to
the inverted index for searching.
- Analysis is performed by an analyzer which can be either a built-in analyzer or
a custom analyzer defined per index.
For example, at index time the built-in standard analyzer will first convert the sentence
into distinct tokens:
"Welcome to Percona Live - Open Source Database Conference 2018"
[ welcome to percona live open source database conference 2018 ]
The analyzer then lowercases each token; depending on the analyzer it may also remove frequent stopwords
The analyze api
• The _analyze api can be used to test what an analyzer will do to your text
curl -s -XGET "localhost:$ES_PORT/_analyze" -H 'Content-Type: application/json' -d'
{"analyzer": "standard", "text": "Welcome to Percona Live - Open Source Database Conference 2018"}' |
python -m json.tool | grep token
"tokens": [
"token": "welcome",
"token": "to",
"token": "percona",
"token": "live",
"token": "open",
"token": "source",
"token": "database",
"token": "conference",
"token": "2018",
Built-in analyzers
• Standard - the default analyzer
• Simple – breaks text into terms whenever it encounters a character which is not a letter
• Keyword – simply indexes the text exactly as is
• Others include:
‒ whitespace, stop, pattern, language, and more are described in the docs at
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
- custom analyzers built by you
Analyzer components
• An analyzer consists of three parts:
1. Character Filters
2. Tokenizer
3. Token Filters
[Diagram: Input string → Character Filters → string → Tokenizer → tokens → Token Filters → tokens → Output]
Specifying an analyzer
• At index time:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
• At search time:
Usually, the same analyzer should be applied at index
time and at search time, to ensure that the terms in the
query are in the same format as the terms in the inverted
index.
By default, queries will use the analyzer defined in
the field mapping, but this can be overridden with
the search_analyzer setting:
PUT my_index {
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" }}}}}
Custom analyzer
• Best described with an example: let's create a custom analyzer based on the standard one, but which also removes stop words
PUT my_index {
"settings": {
"analysis": {
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": ["to", "and", "or", "is", "the"]
} },
"analyzer": { "my_content_analyzer": {
"type": "custom",
"char_filter": [],
"tokenizer": "standard",
"filter": ["lowercase","my_stopwords"] } } }}}
Scaling the cluster
● 10 000ft view on scaling
● Node roles
● Adding a node to a cluster
● Understanding shards
● Replicas
● Read/Write model
● Sample Architectures
10 000ft view on scaling
• ElasticSearch has the potential to be always available as long as we take advantage of its scaling features.
• With vertical scaling (better hardware) having its limitations, we'll take a look at horizontal scaling (more nodes in the same cluster).
• While horizontal scaling has its challenges with other datastores, such as sharding for MongoDB (Antonios has written an amazing tutorial on managing a sharded cluster; you must check it out), ElasticSearch is designed to be distributed by nature. So as long as replicas are being used, the application development and administration overhead of scaling out the cluster is minimal.
10 000ft view on scaling
• We defined shards as the elements that compose an index; each shard is itself a Lucene index.
• By default, ElasticSearch will create 5 per index. But what if everything lives on one node and that node goes down? We face disaster. This is where replicas come in.
• A replica of a shard is an exact copy of that shard that lives on another node.
• A node is simply an ElasticSearch process. One or more nodes sharing the same value for the "cluster.name" directive in the config file make up a cluster.
10 000ft view on scaling
• All nodes know about all the others in the cluster and can direct a request to another node if needed.
• Nodes handle both HTTP (external) traffic and transport (intra-cluster) traffic. If a node receives an HTTP request for data that lives on another node, it forwards the request over the transport layer.
• Nodes can have one or more roles in the cluster.
Node Roles
• Master-eligible node: A node that has "node.master" set to true (default), which makes it eligible to be elected as the master node, which controls the cluster and carries out administrative functions such as deleting and creating indexes.
• Data node: A node that has "node.data" set to true (default). Data nodes hold data and perform data related operations such as CRUD, search, and aggregations.
• Ingest node: A node that has "node.ingest" set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing, such as adding a field that wasn't there before. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as "node.ingest: false"
Node Roles
• Tribe node: A tribe node, configured via the tribe.* settings, is a special type of coordinating
only node that can connect to multiple clusters and perform search and other operations across all
connected clusters. In later versions of Elastic, this role became obsolete
• Kibana node: In case Kibana is being used on a large scale with many users running
complex queries, you can have a dedicated node or nodes for it.
• To summarize: by default any node is master-eligible, acts as a data node and handles ingestion, including ingest pipelines. As the cluster grows, it makes sense to define roles in order to separate the overhead of different operations (maintaining the cluster, ingest pipelines, connecting clusters, etc).
Adding a node to a cluster
• To add a node or start a cluster, we need to set the directive "cluster.name" to a descriptive value in /etc/elasticsearch/elasticsearch.yml; all nodes need to have the same cluster.name.
• By default, ElasticSearch binds to the loopback interface, so we must edit the networking section of the config file and bind the daemon to a specific IP, or use 0.0.0.0 for all interfaces.
• We must name our nodes, again, with descriptive values (node.name).
• Nodes running on the same host will be auto-discovered, but remote nodes will use zen discovery, which takes a list of IPs that will assemble the cluster. The firewall must allow communication on ports 9200 and 9300.
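A minimal sketch of the resulting elasticsearch.yml for the second node of a two node cluster (node name and IPs are placeholders):
cluster.name: democluster
node.name: demo-node-2
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"]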
Adding a node to a cluster
• Of course, there are more options that you can configure but for the sake of this exercise, these
will be enough.
• Once these are set, restart the daemon and a /_cluster/health?pretty should return something like:
curl -X GET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "democluster",
  ….
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  ……
}
Understanding shards
A shard is a worker unit that holds data, can be assigned to a node, and is itself a Lucene index. Think of it as a self-contained search engine that handles a portion of the data.
‒ An index is merely a virtual namespace which points to a number of shards
[Diagram: my_index in my_cluster – five shards spread across Node1 and Node2]
An index is "split" into shards before any documents are indexed
Primary and Replica
• There are two types of shards:
- primary: the original shards of an index
- replicas: copies of the primary
• Documents are replicated between a primary and its replicas
- a primary and all replicas are guaranteed to be on different nodes
[Diagram: my_cluster – primaries P0–P4 and replicas R0–R4 spread across Node1 and Node2, with a primary and its replica never on the same node]
Number of Primary shards
• The number of primary shards is fixed – the default for an index is 5
• You can specify a different number of shards when you create the index.
• Changing the number of shards after the index has been created can be done with the
split or shrink index API but it’s NOT a trivial operation. It’s basically the same as
reindexing. Plan accordingly.
PUT my_new_index
{
"settings": {
"number_of_shards": 3
}
}
Replicas are good for
• High availability
- We can lose a node and still have all the data available
- After losing a primary, Elasticsearch will automatically promote a replica to a
primary and start replicating unassigned replicas
• Read throughput
- Replicas can handle query/read requests from client applications
- Allows you to scale your data and better utilize cluster resources
You can change the number
of replicas for an index at
any time using the
_settings endpoint:
PUT my_index/_settings
{
"number_of_replicas": 2
}
Replicas
• Let’s play a bit with replicas. In this example I’ve indexed Shakespeare’s work again. Here is the
cluster and the index:
curl -X GET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "democluster",
  "status" : "yellow",
  ….
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "unassigned_shards" : 5,
  …
}
curl -XGET localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 22.4mb 22.4mb
Yellow indicates a problem. What do you think the
problem is? What would be a solution here?
Replicas
• Replicas will get automatically assigned if the topology permits it. All I did was start a second node:
• We can change the number of replicas dynamically in the index settings. This is a trivial operation, unlike changing the number of shards.
curl -XGET localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 44.9mb 22.4mb
curl -X PUT "localhost:9200/shakespeare/_settings" -H 'Content-
Type: application/json' -d'
> {
> "index" : {
> "number_of_replicas" : 0
> }
> }
> '
{"acknowledged”}
Write Path
• The process of keeping the primary shard in sync with its replicas is called a data replication model. ElasticSearch's data replication model is based on the primary-backup model: one primary and n backups.
• This model runs on top of replication groups. We've seen that, by default, we have 5 primary shards and each of these shards has 1 replica. In the diagram below, we have two replication groups and each primary has 3 replicas.
• Within a replication group, the primary shard is responsible for indexing and for keeping the replicas up to date. At any point some replicas might be offline, so the master node keeps an in-sync group of the copies that are online and have received all the writes the user has acknowledged.
[Diagram: two replication groups, each consisting of one primary and three replicas]
Write Path
• The primary validates the incoming operation and its documents, executes the operation locally, forwards it to all replicas in the in-sync list, and acknowledges the write once all replicas in the list have run the operation.
• Some notes about failure handling: in case a primary fails, indexing will stop for up to 1 minute while the master promotes a new primary. A primary will also check with its replicas to make sure it is still primary and wasn't demoted for whatever reason; an operation coming from a stale primary will be declined by the replicas.
[Diagram: write flow – 1) primary validates and executes locally, 2) forwards to the in-sync replicas, 3) acknowledges the write]
Read Path
• The node that receives the query (called the coordinating node) will find the relevant shards for the read request, select an active copy of the data (primary or replica, round-robin) from each replication group, send the read request to the selected copies, combine the results and respond.
• The request to each shard is single-threaded, but multiple shards can be queried in parallel.
Read Path
• Because the active copy is selected round-robin, this is where adding more replicas helps: each new request can hit a different replica, so the work is spread out.
• Failure handling is much easier here. If for some reason a response is not received, the coordinating node will resubmit the read request to the relevant replication group, pick a different replica, and the same flow reapplies.
Sample Architectures
• For lightweight searches, and where the data can be reindexed without suffering loss, single node clusters are not unheard of.
• A basic deployment with data resilience is the two node cluster. Most SaaS providers start with this deployment.
• The two node model can be scaled as much as needed, but is usually recommended only if you are running basic indexing/search operations. If more granularity is needed, the data can be reindexed with a higher number of shards and replicas.
• When the number of nodes in the cluster gets really high or the operations get complex, it's time to separate the roles. Separating the nodes also needs to take into consideration the cases where you would lose one or more nodes of a specific role. For instance, if you're using ingest-only, data-only and master-only nodes, you need to consider what happens if you lose one or more of each.
Sample Architectures
• ObjectRocket starts with 4 ingest nodes, 2 Kibana nodes, 2 data nodes and 3 master nodes.
• We don't care how many client nodes we lose as long as we have 1 remaining.
• The master nodes pick an active master based on quorum. This helps with split brain.
• Data nodes, of course, we can lose at most one.
• Consider redundant components as much as possible.
• We will cover security in a later chapter. By default, in the community version, there is no built-in security. In this case, firewall limitations are a must-have.
Lab 2
Scaling the cluster
Objectives:
Adding nodes to your cluster
Change the number of Replicas
Steps:
1. Navigate to /Percona2018/Lab02/
2. Read the instructions on Lab02.txt
https://bit.ly/2D1tXL6
Operating the cluster
● Working with nodes
● Working with shards
● Reindex
● Backup/Restore
● Plugins
Cheatsheet
curl -X GET "<host>:<port>/_cluster/settings"
curl -X GET "<host>:<port>/_cluster/settings?include_defaults=true"
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent" : {
    "name of the setting" : value
  }}'
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "name of the setting" : null
  }}'
Shard Allocation
Allow control over how and where shards are allocated
Shard Allocation settings (cluster.routing.allocation)
- enable
- node_concurrent_incoming_recoveries
- node_concurrent_outgoing_recoveries
- node_concurrent_recoveries
- same_shard.host
Shard Rebalancing settings (cluster.routing.rebalance)
- enable
- allow_rebalance
- cluster_concurrent_rebalance
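For example, temporarily disabling allocation before maintenance and re-enabling it afterwards is a matter of toggling cluster.routing.allocation.enable (a sketch using the transient scope):
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }}'
# ... perform maintenance, then restore the default:
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }}'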
Shard Allocation - Disk
cluster.routing.allocation.disk.threshold_enabled: Defaults to true
Low: Do not allocate new shards. Defaults to 85%
High: Try to relocate shards. Defaults to 90%
Flood_stage: Enforces a read-only index block. Must be released manually. Defaults to 95%
cluster.info.update.interval How often Elasticsearch should check on disk usage (Defaults to 30s)
cluster.routing.allocation.disk.include_relocations: Defaults to true – Could lead to false alerts
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }}'
Shard Allocation – Rack/Zone
Make Elasticsearch aware of the topology
- it can ensure that the primary shard and its replica shards are spread across different:
  - physical servers (node.attr.phy_host)
  - racks (node.attr.rack_id)
  - availability zones (node.attr.zone)
- Minimize the risk of losing all shard copies at the same time
- Minimize latency
Configuration:
cluster.routing.allocation.awareness.attributes: zone, rack_id
Force awareness:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
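Each node also needs to advertise its own attribute value; a sketch of the per-node side of this configuration (zone1 is a placeholder):
# elasticsearch.yml on a node located in zone1
node.attr.zone: zone1
cluster.routing.allocation.awareness.attributes: zone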
Restart node(s)
Elasticsearch wants your data to be fully replicated and evenly balanced.
When a node goes down:
- The cluster immediately recognizes the change
- Rebalancing begins
- Rebalancing takes time and can become costly
During a planned maintenance you should hold off on rebalancing
Restart node(s)
Steps:
1) Flush pending indexing operations POST /_flush/synced
2) Disable shard allocation
3) Shut down a single node
4) Perform maintenance
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}
Restart node(s)
5) Restart the node, and confirm that it joins the cluster.
6) Re-enable shard allocation as follows:
7) Check the cluster health
PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.enable" : "all"
}
}
Restart node(s)
You can also make Elastic less sensitive to changes.
By default, the master waits 1m after a node leaves before it instructs shard relocations.
During planned restarts we can raise this threshold.
Useful setting for slow or unreliable networks.
PUT _all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m"
}
}
Remove a node
Elastic automatically detects topology changes.
In order to remove a node you need to drain it and then stop it
Where attribute:
_name: Match nodes by node names
_ip: Match nodes by IP addresses (the IP address associated with the hostname)
_host: Match nodes by hostnames
PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude.{attribute}" : "<value>"
  }
}
Remove a node
Additional considerations:
- Master-eligible node
- Seed nodes
- Space considerations
- Performance considerations
- If possible stop writes
- Do not allow new allocations ("cluster.routing.allocation.enable" : "none")
- Overhead from the shard drains
- Throttle (indices.recovery.max_bytes_per_sec)
- One node at a time (cluster.routing.allocation.disk.watermark)
Move shards manually (Reroute API) – see the sketch below
- Flush and if possible stop writes
- Safe for replicas, not recommended for primaries (may lead to data loss)
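A sketch of such a manual move with the reroute API (index, shard number and node names are placeholders):
curl -X POST "<host>:<port>/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index": "my_index",
        "shard": 0,
        "from_node": "node-to-drain",
        "to_node": "node-to-keep"
      }
    }
  ]}'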
Remove a node
Cancel the drain of a node by removing the exclusion, i.e. resetting the attribute:
Where attribute:
_name: Match nodes by node names
_ip: Match nodes by IP addresses (the IP address associated with the hostname)
_host: Match nodes by hostnames
PUT _cluster/settings
{
"transient" : {
"cluster.routing.allocation.exclude.{attribute}": ""
}
}
Replace a node
Similar to removing a node, with the difference that you need to add a node as well.
Simplest approach: add a new node and then drain the old node
Additional considerations:
- Master-eligible/Seed nodes
- Do not allow new allocations (cluster.routing.allocation.exclude._name)
- Overhead from drain/throttle (indices.recovery.max_bytes_per_sec)
- Space considerations
- Max amount of data each node can get. Watermark
Alternatively use the reroute API to drain the node
Working with Shards
Number of Shards/Replicas
- Defined on Index creation
- Number of Replicas changes dynamically
- Number of Shards can change using:
- shrink API
- split API
- reindex API
Why increase the number of shards:
- Index size
- Performance considerations
- Hard limits (LUCENE-5843)
Almost the same reasons apply when decreasing the number of shards
Shrink API
Shrinks an existing index into a new one with fewer primary shards:
- The number of shards in the target index must be a factor of the number of shards in the source index
- If the number of shards is prime, the index can only be shrunk into a single primary shard
- Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node
Works as follows:
- First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards.
- Then it hard-links segments from the source index into the target index.
- Finally, it recovers the target index as though it were a closed index which had just been re-opened.
Shrink API
In order to shrink an index, the index must be marked as read-only, and a copy of every shard in the
index must be relocated to the same node and have health green
Note that it may take a while…
Check progress using GET _cat/recovery?v
curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }}'
Shrink API
Finally it's time to shrink the index:
It is similar to the create index API – almost the same arguments
Some constraints apply
curl -X POST "<host>:<port>/my_source_index/_shrink/my_target_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_replicas": <number>,
    "index.number_of_shards": <number>,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }}'
Split API
Splits an existing index into a new index:
- The original primary shard is split into two or more primary shards.
- The number of splits is determined by the index.number_of_routing_shards setting
The _split API requires the source index to be created with a specific number_of_routing_shards in
order to be split in the future. This requirement has been removed in Elasticsearch 7.0
Works as follows:
- First, it creates a new target index with a larger number of primary shards.
- Then it hard-links segments from the source index into the target index.
- Once the low level files are created all documents will be hashed again to delete documents that
belong to a different shard.
- Finally, it recovers the target index as though it were a closed index which had just been re-opened.
Split API
In order to split an index, the index must be marked as read-only (assuming the index has number_of_routing_shards set)
curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.blocks.write": true
  }}'
Split the index:
curl -X POST "<host>:<port>/my_source_index/_split/my_target_index?copy_settings=true" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 2
  }}'
Reindex API - Definition
- Does not copy the settings of the source index
- version_type : internal/external
- source supports “query”, multi-indexes & remote location
- URL parameters: refresh, wait_for_completion, wait_for_active_shards, timeout, scroll and
requests_per_second
- Supports painless scripts to manipulate indexing
curl -X POST "<host>:<port>/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "<source index>"
  },
  "dest": {
    "index": "<destination index>"
  }}'
Reindex API – Response Body
"took": 1200,
"timed_out": false,
"total": 10,
"updated": 0,
"created": 10,
"deleted": 0,
"batches": 1,
"noops": 0,
"version_conflicts": 2,
"retries": {
"bulk": 0,
"search": 0},
"throttled_millis": 0,
"requests_per_second": 1,
"throttled_until_millis": 0,
"failures": [ ]
Total milliseconds the entire operation took
The number of documents that were successfully processed
Summary of the operation counts
The number of version conflicts that reindex hit
Throttling Statistics
Reindex API
Active Reindex jobs:
GET _tasks?detailed=true&actions=*reindex
Cancel a Reindex job:
POST _tasks/<id of the reindex>/_cancel
Re-Throttle:
POST _reindex/<id of the reindex>/_rethrottle?requests_per_second=-1
Reindexing from a remote server:
- Uses an on-heap buffer that defaults to a maximum size of 100mb
- May need to use a smaller batch size
- Configure socket_timeout and connect_timeout. Both default to 30 seconds
Snapshots - Backup
A snapshot is a backup taken from a running Elasticsearch cluster
Snapshots are taken incrementally
Version compatibility – one major version behind
You must register a snapshot repository before you can perform snapshot operations
path.repo must exist in elasticsearch.yml
curl -X GET "<host>:<port>/_snapshot/_all"
curl -X PUT "<host>:<port>/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "backup location"
  }}'
Snapshots - Backup
Shared location: On elasticsearch.yml: path.repo: ["/mount/backups0", "/mount/backups1"]
Don’t forget to register it!!!
Registration options
location: Location of the snapshots
compress: Turns on compression of the snapshot files. Defaults to true.
chunk_size: Big files can be broken down into chunks. Defaults to null (unlimited chunk size)
max_restore_bytes_per_sec: Throttles per node restore rate. Defaults to 40mb/second
max_snapshot_bytes_per_sec: Throttles per node snapshot rate. Defaults to 40mb/second
readonly: Makes repository read-only. Defaults to false
Snapshots - Backup
wait_for_completion: whether the request should return immediately after snapshot initialization (default) or wait until the snapshot completes
ignore_unavailable: Ignores indexes that don't exist
include_global_state: Set to false to prevent the cluster global state from being stored as part of the snapshot
curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2,index_3",
  "ignore_unavailable": true,
  "include_global_state": false
}'
curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
Snapshots - Backup
IN_PROGRESS: The snapshot is currently running.
SUCCESS: The snapshot finished and all shards were stored successfully.
FAILED: The snapshot finished with an error and failed to store any data.
PARTIAL: The global cluster state was stored, but data of at least one shard wasn’t stored
successfully.
INCOMPATIBLE: The snapshot was created with an old version of ES incompatible with the current
version of the cluster.
Delete snapshot:
Unregister Repo:
curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1"
curl -X DELETE "<host>:<port>/_snapshot/my_backup/snapshot_2"
curl -X DELETE "<host>:<port>/_snapshot/my_backup"
Snapshots - Restore
Check the progress:
Also supported:
- Partial restore
- Restore with different settings
- Restore to a different cluster
curl -X POST "<host>:<port>/_snapshot/my_backup/snapshot_1/_restore"
curl -X GET "<host>:<port>/_snapshot/_status"
curl -X GET "<host>:<port>/_snapshot/my_backup/_status"
curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1/_status"
Snapshots - Restore
Restore with different settings
Select indices that should be restored
Renames indices on restore using a regular expression that supports referencing the original text.
Restore global state
"index_settings": {"index.number_of_replicas": 0}
"ignore_index_settings": ["index.refresh_interval”]
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": true,
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}'
Plugins
A way to enhance the basic Elasticsearch functionality in a custom manner.
They range from:
- Mapping and analysis
- Discovery
- Security
- Management
- Alerting
- And many many more…
Installation: bin/elasticsearch-plugin install [plugin_name]
Considerations:
- Security
- Maintainability between versions
We make heavy use of Cerebro (https://github.com/lmenezes/cerebro) in this tutorial
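For illustration, installing, listing and removing a core plugin (analysis-icu is just an example; a node restart is needed for the change to take effect):
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin list
bin/elasticsearch-plugin remove analysis-icu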
Lab 3
Operating the cluster
Objectives:
Learn how to:
o Remove a node from a cluster.
o Use the ReIndex API
Steps:
1. Navigate to /Percona2018/Lab03
2. Read the instructions on Lab03.txt
3. Execute ./run_cluster.sh to begin
https://bit.ly/2D1tXL6
Troubleshooting
● Cluster health
● Improving Performance
● Diagnostics
Cluster health
• The cluster health API allows you to get a very simple status on the health of the cluster
• The health status is either green, yellow or red and exists at three levels: shard,
index, and cluster
• Shard health
‒ red: at least one primary shard is not allocated in the cluster
‒ yellow: all primaries are allocated but at least one replica is not
‒ green: all shards are allocated
• Index health
‒ status of the worst shard in that index
• Cluster health
‒ status of the worst index in the cluster
Cluster health
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 5,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50.0
}
GET _cluster/health
Green status
• The state your cluster should have
– All of your primary and replica shards
are allocated and active
[Diagram: my_cluster – P0 on Node1, R0 on Node2 and R0 on Node3; every shard copy is allocated]
PUT my_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2
}
}
Yellow status
It means all your primary shards are allocated,
but one or more replicas are not.
- you may not have enough nodes in the
cluster, or a node may have failed
PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 3
  }
}
[Diagram: P0 on Node1, R0 on Node2 and R0 on Node3; the fourth copy (R0) stays unassigned because there are only three nodes]
Red status
• At least one primary shard is missing
- searches will return partial results and
indexing might fail
PUT my_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}
[Diagram: my_cluster – with one shard and one replica, losing the nodes that hold both P0 and R0 leaves the primary missing]
Resolve unassigned shards
Causes:
• Shard allocation is purposefully delayed
• Too many shards, not enough nodes
• You need to re-enable shard allocation
• Shard data no longer exists in the cluster
• Low disk watermark
• Multiple Elasticsearch versions
The _cat endpoint will tell you which shards are unassigned, and
why:
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
Resolve unassigned shards
• You can also use the cluster allocation explain API to get more
information about shard allocation issues:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"index" : "testing",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
…
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
… {
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard
already exists" }]}]}
Reason 1 – Shard allocation delayed
• When a node leaves the cluster, the
master node temporarily delays
shard reallocation to avoid
needlessly wasting resources on
rebalancing shards, in the event the
original node is able to recover within
a certain period of time (one minute,
by default)
Modify the delay dynamically:
curl -XPUT 'localhost:9200/my_index/_settings' -d
'{
"settings": {
"index.unassigned.node_left.delayed_timeout":
"30s"
}
}'
Reason 2 – Not enough nodes
• As nodes join and leave the cluster, the master node reassigns shards
automatically, ensuring that multiple copies of a shard aren’t assigned to
the same node
• A shard may linger in an unassigned state if there are not enough nodes
to distribute the shards accordingly.
• Make sure that every index in your cluster is initialized with fewer
replicas per primary shard than the number of nodes in your cluster
Reason 3 – re-enable shard allocation
• Shard allocation is enabled by default on all nodes, but you may have disabled
shard allocation at some point (for example, in order to perform a rolling
restart) and forgotten to re-enable it.
• To enable shard allocation, update the _cluster settings API:
curl -XPUT 'localhost:9200/_cluster/settings' -d
'{ "transient":
{ "cluster.routing.allocation.enable" : "all"
}
}'
Reason 4 – Shard data no longer exists
• Primary shard is not available anymore because the index may have been created on
a node without any replicas (a technique used to speed up the initial
indexing process), and the node left the cluster before the data could be replicated.
• Another possibility is that a node may have encountered an issue while rebooting or
has storage issues
• In this scenario, you have to decide how to proceed: try to get the original node to
recover and rejoin the cluster (and do not force allocate the primary shard), or force
allocate the shard using the _reroute API and reindex the missing data using the
original data source, or from a backup.
Reason 4 – Shard data no longer exists
www.objectrocket.com
124
• To allocate an unassigned primary shard:
curl -XPOST 'localhost:9200/_cluster/reroute' -d
'{ "commands" :
[ { "allocate" :
{ "index" : "my_index", "shard" : 0, "node": "<NODE_NAME>",
"allow_primary": "true" } }]
}'
Warning! The caveat with forcing allocation of a primary shard is that you will be
assigning an “empty” shard. If the node that contained the original primary shard
data were to rejoin the cluster later, its data would be overwritten by the newly
created (empty) primary shard, because it would be considered a “newer”
version of the data.
Reason 5 – Low disk watermark
www.objectrocket.com
125
• Once a node has reached this level of disk usage, or what Elasticsearch calls a
“low disk watermark”, it will not be assigned more shards – default is 85%
• You can check the disk space on each node in your cluster (and see which shards
are stored on each of those nodes) by querying the _cat API:
curl -s -XGET 'localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
5 260b 47.3gb 43.4gb 100.7gb 46 127.0.0.1 127.0.0.1 CSUXak2
Example response:
Reason 5 – Low disk watermark
www.objectrocket.com
126
Resolutions:
- add more nodes
- increase disk size
- increase low watermark threshold, if safe:
PUT /_cluster/settings -d
'{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "90%"
}
}'
Reason 6 – Multiple ES versions
www.objectrocket.com
127
• Usually encountered when in the middle of a rolling upgrade
• The master node will not assign a primary shard’s replicas to any node running an
older major version (1.x -> 2.x -> 5.x).
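To verify which version each node is running, one option is the _cat/nodes endpoint (the column selection below is just one possibility):
curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version,node.role,master'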
When nothing works
www.objectrocket.com
128
… or restore the affected index from an old snapshot
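A sketch of restoring a single index from a snapshot, assuming a repository named my_repo and a snapshot named snapshot_1 already exist:
curl -XPOST 'localhost:9200/_snapshot/my_repo/snapshot_1/_restore' -H 'Content-Type: application/json' -d
'{
"indices": "my_index",
"rename_pattern": "my_index",
"rename_replacement": "restored_my_index"
}'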
Poor performance
www.objectrocket.com
129
This can be a long discussion – see more in the `Best practices` chapter
You want to start by:
• Enabling slow logging so you can identify long-running queries
• Running the identified searches through the Profile API to look at the timing of
individual components
• Filter, filter, filter
Enable slow log
www.objectrocket.com
130
• Send a put request to the _cluster API to define the level of slow log that you want
to turn on: warn, info, debug, and trace
PUT /_cluster/settings
'{ "transient" :
{ "logger.index.search.slowlog" : "DEBUG",
"logger.index.indexing.slowlog" : "DEBUG" }
}'
• All slow logging is enabled on the index level:
PUT /my_index/_settings
'{"index.search.slowlog.threshold.query.warn" : "50ms",
"index.search.slowlog.threshold.fetch.warn": "50ms",
"index.indexing.slowlog.threshold.index.warn": "50ms"
}'
Profile
www.objectrocket.com
131
• The Profile API provides detailed timing information about the execution of individual
components in a search request and it can be very verbose, especially for complex
requests executed across many shards
• Usage:
GET /my_index/_search
{
"profile": true,
"query" : {
"match" : { "speaker": "KING HENRY IV" }
}
}
Filters
www.objectrocket.com
132
• One way to improve the performance of your searches is with filters. The filtered query
can be your best friend. It's important to filter first because a filter does not affect
the document score, so you use very few resources to cut the search space down to size.
• A rule of thumb is to use filters when you can and queries when you must: when you
need the actual scoring from the queries.
• Also, filters can be cached.
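• A minimal sketch using the tutorial's shakespeare index (assuming the default keyword sub-field mapping): the bool query scores the match clause, while the filter clause only includes or excludes documents and can be cached:
GET shakespeare/_search?pretty -d '{
"query": {
"bool": {
"must": { "match": { "text_entry": "question" } },
"filter": { "term": { "play_name.keyword": "Hamlet" } }
}
}
}'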
Upgrade the
cluster ● General notes
● Upgrade path
● Before upgrading
● Rolling upgrades
● On rolling upgrades
● Full cluster restart upgrades
● Upgrades by re-indexing
● Re-indexing in place
● Moving through the versions
www.objectrocket.com
133
General notes
www.objectrocket.com
134
• Elasticsearch can read indices created in the previous major version. Older indices
must be re-indexed or deleted.
• From version 5.0 onwards, Elasticsearch can usually be upgraded using rolling restarts, so the
service is not interrupted.
• Upgrades across major versions before 6.0 require a full cluster restart
• Backup, backup, backup
• Nodes will fail to start if incompatible indexes are found
• You can reindex from a remote cluster, which lets you skip the backup/restore step
Upgrade path
www.objectrocket.com
135
• Any index created prior to 5.0 will need to be re-indexed into newer versions
Before upgrading
www.objectrocket.com
136
• Understand the changes that appeared in the new version by reviewing the Release
highlights and Release notes.
• Review the list of changes that can break your cluster.
• Check the deprecation log to see if any features you currently use have been deprecated.
• Check for updated versions of your current plugins and their compatibility with the new version.
• Upgrade your dev/QA/staging cluster before proceeding with the production cluster.
• Back up your data by taking a snapshot before upgrading. If you want to roll back, you will
need it: you can't roll back without a backup.
Rolling upgrades
www.objectrocket.com
137
1. As we've seen before, ES adjusts the balancing of shards based on topology. If we simply remove a
node, the cluster will think the node crashed and start redistributing the shards, then do it
once more when the node comes back. To avoid this, we need to disable shard allocation
• Shard recovery is faster if you stop indexing and issue a synced flush ("POST
_flush/synced", shown after the command below)
• At this point, the cluster is going to turn yellow: replica shards on the remaining nodes
get promoted to primary where needed, while other replicas become unavailable. This doesn't
hurt the operation of the cluster; as we've discussed, as long as one shard from a
replication group is available, the dataset is alive.
• Depending on the number of nodes you have left, be careful not to take out another :)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d
'{ "transient":
{ "cluster.routing.allocation.enable":
"none" }}'
Rolling upgrades
www.objectrocket.com
138
2. Stopping the node. This can be as easy as "service elasticsearch stop".
3. Carry out the needed maintenance (depending on the package manager, or the way ES has been
installed, you might want to run a yum update or replace the binaries).
Be careful with versions and plugins:
- A higher-version node will join a cluster made of lower-version nodes, but a lower-version node
won't join a cluster made of higher-version nodes;
- /usr/share/elasticsearch/bin/elasticsearch-plugin is a script provided by ES to handle plugins.
Upgrade these to the correct versions.
- During a rolling upgrade, primary shards assigned to a node running the new version cannot have
their replicas assigned to a node with the old version. The new version might have a different data
format that is not understood by the old version.
4. Starting the node;
4. Starting the node;
Rolling upgrades
www.objectrocket.com
139
5. Make sure that everything has started correctly. Check the node's logs for messages of this
sort:
6. Enable shard allocation (same command as in step 1, but use null — the JSON value, not the
string — to reset to the default instead of "none"; see the example after the log excerpt below)
7. Check the cluster status and make sure everything has recovered. It can take a while for the
shards to become available.
8. Move on to the next node!
curl -X GET http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "democluster", ß
"status" : "green", ß
…
}
[2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] initialized
[2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] starting ...
[2018-10-25T10:04:45,729][INFO ][o.e.t.TransportService ] [node2] publish_address {134.213.56.244:9300}, bound_addresses
{[::]:9300}
[2018-10-25T10:04:50,465][INFO ][o.e.n.Node ] [node2] started. <--
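Step 6 spelled out — note that null is the JSON literal, not a string:
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d
'{ "transient":
{ "cluster.routing.allocation.enable": null }}'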
On Rolling upgrades
www.objectrocket.com
140
• As mentioned before, in a yellow state, the cluster continues to operate normally.
• Because you might have a reduced number of replicas assigned, your performance might be
impacted. Plan this outside the normal working hours.
• New features will come into play when all the nodes are running the updated version.
• Again, we can’t rollback. Lower version nodes won’t join the cluster.
On Rolling upgrades
www.objectrocket.com
141
• If a network partition separates the newly upgraded nodes from the old ones, then once it is
resolved, the old nodes will fail to rejoin with a message of this sort:
• In this case, you have no choice but to stop those nodes and upgrade them. The upgrade won't be
rolling anymore and you might have a service interruption, but there is no alternative.
[2018-10-16T15:08:28,928][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master
[{node1}{bWKRUNFXTEy1kBgQ1y2LvA}{Gxzb3blaR86CUL3gKLhnXA}{134.213.56.107}{134.213.56.107:9300}{ml.machine_memory=8196317
184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason
[RemoteTransportException[[node1][134.213.56.107:9300][internal:discovery/zen/join]]; nested: IllegalStateException[node
{node3}{Nt4eKRkvR6-
SZ_gg22lqTQ}{dQRBgGDwSo2Zr7W866e64w}{162.13.188.164}{162.13.188.164:9300}{ml.machine_memory=8196317184,
ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is on version [6.3.2] that cannot deserialize the license format [4], upgrade node
to at least 6.4.0]; ]. <--
Full Cluster restart upgrade
www.objectrocket.com
142
• It was needed before version 6 whenever major version upgrades were involved.
• v5.6 -> v6 can be done with a rolling upgrade.
• It involves shutting down the cluster, upgrading the nodes, then starting the cluster up.
1. Disable shard allocation so we don't have unnecessary IO after the nodes are stopped.
2. As briefly mentioned before, stopping indexing and performing a "POST _flush/synced" will help
with shard recovery
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient":
{ "cluster.routing.allocation.enable": "none" }}'
Full Cluster restart upgrade
www.objectrocket.com
143
3. Shut down all nodes. "service elasticsearch stop" or whatever works :)
4. Use your package manager to update elasticsearch on each node.
5. Upgrade the plugins with “/usr/share/elasticsearch/bin/elasticsearch-plugin”
6. Start the nodes up
7. Wait for the nodes to join the cluster.
8. Enable shard allocation.
9. Check that the cluster is back to normal before re-enabling indexing.
curl -X GET http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "democluster",
"status" : "yellow", <--
…
"number_of_nodes" : 1, <--
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"unassigned_shards" : 5,
…
}
Upgrades by re-indexing
www.objectrocket.com
144
• Elasticsearch can read indices created in the previous major version.
• V6 will read V5 indices but not V2 or below. V5 will read V2 indices but not V1 or below
• Older indices will need to be re-indexed or dropped.
• If a node detects an incompatible index, it will fail to start.
• Based on the above, upgrading to a major version that is several releases ahead is a bit
tricky if you don't have a spare cluster. If you do, it's actually quite easy.
Upgrades by re-indexing
www.objectrocket.com
145
• The easiest way to move to a new version is to create a cluster with the target version and
use the remote reindexing feature, so the new index is created by the new version, for the
new version.
• To do list for remote indexing:
1. Add the host and port to the new cluster’s elasticsearch.yml under reindex.remote.whitelist:
2. Create an index on the new cluster with the correct mappings and settings.
• Using number_of_replicas of 0 and refresh_interval -1 will speed up the next operation.
reindex.remote.whitelist: oldhost:oldport
Upgrades by re-indexing
www.objectrocket.com
146
3. Reindex from remote. Example: curl -X POST "localhost:9200/_reindex" -H
'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "http://oldhost:9200",
"username": "user",
"password": "pass"
},
"index": "source",
"query": {
"match": {
"test": "data"
}
}
},
"dest": {
"index": "dest"
}
}
'
Re-indexing in place
www.objectrocket.com
147
• In order to make an older-version index work on a newer-version cluster, you will need to
reindex it into a new one. This is done with the _reindex API
curl -X POST "localhost:9200/_reindex" -H
'Content-Type: application/json' -d'
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
'
Re-indexing in place
www.objectrocket.com
148
1. If you want to maintain your mappings, create a new index and copy the mappings and
settings;
2. You can again disable the refresh_interval and number_of_replicas to make the operation
faster (see the sketch after this list);
3. Reindex the documents to the new index;
4. Reset the refresh_interval and number_of_replicas to the wanted values;
5. Wait for the index to turn green — it will do so once the replicas get allocated
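A sketch of steps 2 and 4, assuming the new index is named new_twitter:
curl -XPUT 'localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d
'{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 }}'
# ... run the reindex, then reset:
curl -XPUT 'localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d
'{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 }}'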
Re-indexing in place
www.objectrocket.com
149
6. In a single update, to avoid missed operations on the old index, you should:
• Delete the old index (let's call it old_index)
• Add an alias named old_index to the new index
• Add any aliases that existed on the old index to the new index. More aliases means more
"add" actions
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
"actions" : [
{ "add": { "index": "new_index", "alias": "old_index" } },
{ "remove_index": { "index": "old_index" } },
{ "add" : { "index" : "new_index", "alias" : "any_other_aliases" } }
]
}
'
Moving through the versions
www.objectrocket.com
150
ElasticSearch V2
-> Perform a full cluster restart to version 5.6
-> Re-index the V2 indexes in place so they work with 5.6 (now fully on V5)
-> Perform a rolling restart to 6.x
Moving through the versions
www.objectrocket.com
151
ElasticSearch V1
-> Perform a full cluster restart to V2.4.X
-> Re-index the 1.X indices in place so they work on V2.4.X (now fully on V2)
-> Perform a full cluster restart to V5.6
-> Re-index the V2 indices so they work on V5 (now fully on V5)
-> Perform a rolling restart to V6.X
www.objectrocket.com
152
Lab 4
Upgrading the cluster
Objectives:
Learn how to:
o Upgrade an elasticsearch cluster.
Steps:
1. Navigate to /Percona2018/Lab04
2. Read the instructions on Lab04.txt
3. Execute ./run_cluster.sh to begin
https://goo.gl/ddaVdS
Security
● Authentication
● Authorization
● Encryption
● Audit
www.objectrocket.com
153
Security
www.objectrocket.com
154
The Open Source version of ElasticSearch does not provide
- Authentication
- Authorization
- Encryption
To overcome this we will use open-source:
- Firewall
- Reverse proxy
- Encryption tools
Alternatively, you can buy X-Pack, which provides a different layer of security
Firewall
www.objectrocket.com
155
Client communication:
iptables -I INPUT 1 -p tcp --dport 9200:9300 -s IP_1,IP_2 -j ACCEPT
iptables -I INPUT 4 -p tcp --dport 9200:9300 -j REJECT
Intra-cluster communication:
iptables -I INPUT 1 -p tcp --dport 9300:9400 -s IP_1,IP_2 -j ACCEPT
iptables -I INPUT 4 -p tcp --dport 9300:9400 -j REJECT
Firewall
www.objectrocket.com
156
DNS:
iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp --sport 53 -m state --state ESTABLISHED -j ACCEPT
SSH:
iptables -A INPUT -p tcp --dport ssh -j ACCEPT
iptables -A OUTPUT -p tcp --sport ssh -j ACCEPT
Monitoring tools: allow whatever port your monitoring tool uses.
Reverse Proxy
www.objectrocket.com
157
[Diagram: clients send HTTP requests to an nginx reverse proxy, which applies rules and forwards them to the ES nodes, advertising port 9200 as 8080]
Authentication
www.objectrocket.com
158
We are going to use nginx: ngx_http_auth_basic_module
On nginx.conf
1) Listens on port 19200
2) Enables auth
3) Password file location
4) Proxies to ES <host>:<port>
server {
listen *:19200;
location / {
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx/.htpasswd;
proxy_pass http://localhost:9200;
proxy_read_timeout 90;
}
}
Authentication
www.objectrocket.com
159
Create users:
- htpasswd -c /var/data/nginx/.htpasswd <username>
- You will be prompted for the password
- Alternatively, use the -b flag and provide the password on the command line
Access Elasticsearch:
curl <host> #Returns 301
curl <host>:19200 #Returns 401 Authorization Required
curl <username>:<password>@<host>:<19200> #Returns Elasticsearch output
Adding SSL to the mix
www.objectrocket.com
160
Use nginx as reverse proxy to encrypt client communication
On nginx.conf
Certificates:
- Can be obtained from a commercial certificate authority
- Or self-generated (see the sketch below)
ssl on;
ssl_certificate /etc/ssl/certs/<cert>.crt;
ssl_certificate_key /etc/ssl/private/<key>.key;
ssl_session_cache shared:SSL:10m;
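A sketch of generating a self-signed certificate with openssl (file names, subject, and validity period are arbitrary choices):
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
 -keyout /etc/ssl/private/es-proxy.key \
 -out /etc/ssl/certs/es-proxy.crt \
 -subj "/CN=es-node1"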
Authorization
www.objectrocket.com
161
- Authentication alone is not enough.
- Once allowed access, the client can do whatever it wants in the cluster.
- Simplest way of authorization is to deny endpoints
location / {
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx-elastic/.htpasswd;
if ($request_filename ~ _shutdown) {
return 403;
break;
}
1) If the user requests a shutdown
2) Return 403
curl -X GET -k "esuser:esuser@es-node1-9200:19200/_cluster/nodes/_shutdown/"
Produces a 403 Forbidden
Authorization
www.objectrocket.com
162
Assign roles using nginx. For example, a user restricted to the search and analyze endpoints:
1) Listens on port 19500
2) Enables auth
3) Regex match for allowed endpoints
4) Forwards to ES <host>:<port>
server {
listen 19500;
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx/.htpasswd_users;
location / {
return 403;
}
location ~* ^(/_search|/_analyze) {
proxy_pass http://<es_node>;
proxy_redirect off;
}}
Encryption & Co
www.objectrocket.com
163
Protecting the data on disk is also essential.
LUKS (Linux Unified Key Setup)
- encrypts entire block devices
- CPUs with AES-NI (Advanced Encryption Standard Instruction Set) can accelerate dm-crypt
- supports a limited number of key slots (passphrases)
- keep the keys in a safe place
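A minimal LUKS sketch for a dedicated data volume (the device name /dev/sdb1 is an assumption, and luksFormat destroys any existing data on it):
cryptsetup luksFormat /dev/sdb1        # initialize the encrypted volume (prompts for a passphrase)
cryptsetup luksOpen /dev/sdb1 es_data  # map it to /dev/mapper/es_data
mkfs.xfs /dev/mapper/es_data           # create a filesystem
mount /dev/mapper/es_data /data        # mount it where path.data points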
Always audit:
- Access Logs
- Ports
- Backups
- Physical access
Working with
Data – Advanced
Operations
● Alias
● Bulk API
● Aggregations
● …
www.objectrocket.com
164
Pagination
www.objectrocket.com
165
• By default, Elasticsearch will return the first 10 hits of your query. The size
parameter is used to specify the number of hits.
GET shakespeare/_search?pretty
{
"size": 20,
"query": {
"match": {
"play_name": "Hamlet"}
}
}
But this is just the first page of
hits
Pagination - from
www.objectrocket.com
166
• Add the from parameter to a query to specify the offset from the first result you
want to fetch (it defaults to 0).
GET shakespeare/_search?pretty
{
"from": 20,
"size": 20,
"query": {
"match": {
"play_name": "Hamlet"}
}
}
Get the next page of hits
Pagination - Scroll
www.objectrocket.com
167
• While a search request returns a single “page” of results, the scroll API can be used to
retrieve large numbers of results (or even all results) from a single search request, in
much the same way as you would use a cursor on a traditional database
• To initiate a scroll search, add the scroll parameter to your search query
GET shakespeare/_search?scroll=1m {
"size": 1000,
"query": {
"match_all": {}
}
}
If the scroll is idle for more than 1
minute, the scroll context is deleted
Number of hits to return per batch
Pagination - Scroll
www.objectrocket.com
168
• The result from the above request includes the first page of results and
a _scroll_id, which should be passed to the scroll API in order to retrieve the next
batch of results.
POST /_search/scroll {
"scroll" : "1m",
"scroll_id" :
"DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9la
VYtZndUQlNsdDcwakFMNjU1QQ=="
}
Note that the URL should not
include the index name - this is
specified in the
original search request instead.
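When you finish before the timeout expires, you can release the scroll context explicitly (replace <scroll_id> with the value returned by the search):
DELETE /_search/scroll
{
"scroll_id" : ["<scroll_id>"]
}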
Search multiple fields
www.objectrocket.com
169
• The multi_match query provides a convenient shorthand for running a match query
against multiple fields ‒ by default, the _score from the best field is used (a
best_fields search)
GET shakespeare/_search?pretty -d '{
"query": { "multi_match": {
"query": "Hamlet", "fields": [
"play_name",
"speaker",
"text_entry"
],
"type": "best_fields"
}
}
}'
3 fields are queried (which
results in 3 scores) and the best
score is used
Search – per-field boosting
www.objectrocket.com
170
• If we want to add more weight to hits on a particular field ‒ in this example, let's say
we're more interested in the speaker field than play_name ‒ we can boost the score of
a field using the caret (^) symbol
GET shakespeare/_search?pretty -d '{
"query": { "multi_match": {
"query": "Hamlet", "fields": [
"play_name",
"speaker^2",
"text_entry"
],
"type": "best_fields"
}
}
}'
We get the same number of
hits, but the top hits are
different.
Misspelled words - fuzziness
www.objectrocket.com
171
• Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word
- Fuzziness is something that can be assigned a value
- It refers to the number of character modifications, known as edits, needed to make two words match
- Can be set to 0, 1 or 2, or to "auto"
Fuzziness = 1: "Hamled" -> "Hamlet" (one edit: d->t)
Fuzziness = 2: "Hamlled" -> "Hamlet" (two edits: drop an l, then d->t)
Add fuzziness to a query
www.objectrocket.com
172
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"play_name": "Hamled" }
}
}'
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"play_name": {
"query": "Hamled",
"fuzziness": 1 }}
}}'
0 hits 4244 hits
Search exact terms
www.objectrocket.com
173
• If we need to search for the exact text, we use a match query on the keyword
sub-field, which holds the text exactly as indexed, without analysis:
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"text_entry.keyword": "To be, or not to be: that is the question"
}
}
}'
Exactly 1 hit
Sorting
www.objectrocket.com
174
• The results of a query are returned in the order of relevancy, _score descending is the
default sorting for a query
• A query can contain a sort clause that specifies one or more fields to sort on, as well
as the order (asc or desc)
GET /shakespeare/_search?pretty -d '{
"query": {
"match": {
"text_entry": "question"
}
},
"sort": [
{"play_name": {"order": "desc"}
} ]
}'
"hits" : [
{
"_index" : "shakespeare",
"_type" : "doc",
"_id" : "55924",
"_score" : null,
"_source" : {.....}
If _score is not part of the sort clause, it is
not calculated => fewer compute resources
Highlighting
www.objectrocket.com
175
• A common use case for search results is to highlight the matched terms.
GET /shakespeare/_search?pretty -d
'{
"query": {
"match_phrase": {
"text_entry": "Hamlet" }
},
"highlight": {
"fields": {
"text_entry": {}
} }
}'
"_source" : {
"type" : "line",
"line_id" : 36184,
"play_name" : "Hamlet",
"speech_number" : 99,
"line_number" : "5.1.269",
"speaker" : "QUEEN GERTRUDE",
"text_entry" : "Hamlet, Hamlet!"
},
"highlight" : {
"text_entry" : [
"<em>Hamlet</em>, <em>Hamlet</em>!"
]
}
}
The response contains a
highlight section
Range query
www.objectrocket.com
176
• Matches documents with fields that have terms within a certain range. The type of the
Lucene query depends on the field type: for string fields it is a TermRangeQuery, while
for number/date fields it is a NumericRangeQuery
• The range query accepts the following parameters: gte, gt, lte, lt, boost
GET _search
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20 }
}
}
}
Exists query
www.objectrocket.com
177
• Returns documents that have at least one non-null value in the specified field:
• There is no missing query; instead, use the exists query inside a must_not clause
GET /_search
{
"query": {
"exists" : { "field" : "user" }
}
}
GET /_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "user" }
}
}
}
}
Wildcard query
www.objectrocket.com
178
• Matches documents that have fields matching a wildcard expression;
• Supported wildcards are *, which matches any character sequence (including the empty
one), and ?, which matches any single character.
• Note that this query can be slow, as it needs to iterate over many terms. In order to
prevent extremely slow wildcard queries, a wildcard term should not start with one of
the wildcards * or ?
GET shakespeare/_search?pretty -d
{
"query": {
"wildcard" : { "play_name" : "Henry*" }
}
}
Regexp query
www.objectrocket.com
179
• The regexp query allows you to use regular expression term queries
• The "term queries" in that first sentence means that Elasticsearch will apply the
regexp to the terms produced by the tokenizer for that field, and not to the original text
of the field
• Note: The performance of a regexp query heavily depends on the regular expression
chosen. Matching everything like .* is very slow as well as using lookaround regular
expressions.
GET shakespeare/_search?pretty -d
{
"query": {
"regexp":{
"play_name": "H.*t"}
}
}
Aggregations
www.objectrocket.com
180
• Aggregations are a way to perform analytics on your indexed data
• There are four main types of aggregations:
- Metric: aggregations that keep track and compute metrics over a set of documents.
- Bucketing: aggregations that build buckets, where each bucket is associated with
a key and a document criterion. When the aggregation is executed, all the buckets criteria
are evaluated on every document in the context and when a criterion matches, the
document is considered to "fall in" the relevant bucket.
- Pipeline: aggregations that aggregate the output of other aggregations and their
associated metrics
- Matrix: aggregations that operate on multiple fields and produce a matrix result
based on the values extracted from the requested document fields. Unlike metric and
bucket aggregations, this aggregation family does not yet support scripting and its
functionality is currently experimental
Aggregations - Metric
www.objectrocket.com
181
• Most metrics are mathematical operations that output a single value: avg, sum, min,
max, cardinality
• Some metrics output multiple values: stats, percentiles, percentile_ranks
• Example: what's the maximum value of the "age" field
GET account/_search?pretty -d '{
"size": 0,
"aggs": {
"max_age": { "max": {
"field": "age" }
}
}
}'
"aggregations" : {
"max_age" : {
"value" : 40.0
}
}
}
Aggregations - bucket
www.objectrocket.com
182
• Bucket aggregations don’t calculate metrics over fields like the metrics aggregations
do, but instead, they create buckets of documents
• Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations.
These sub-aggregations will be aggregated for the buckets created by their "parent"
bucket aggregation
• The terms aggregation is very handy: it will dynamically create a new bucket for every
unique term it encounters in the specified field, letting you get a feel for what your
data looks like
Aggregations
www.objectrocket.com
183
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": { "terms": {
"field": "play_name",
"size": 5 }
}
}
}'
• Example: What are the unique play names we have in our index
"size" - number of buckets to create
(default is 10)
Aggregations
www.objectrocket.com
184
"aggregations" : {
"play_names" : {
"doc_count_error_upper_bound" : 3045,
"sum_other_doc_count" : 91399,
"buckets" : [
{
"key" : "Hamlet",
"doc_count" : 4244
},
{
"key" : "Coriolanus",
"doc_count" : 3992
},
{
"key" : "Cymbeline",
"doc_count" : 3958
},
{
"key" : "Richard III",
"doc_count" : 3941
},
{
"key" : "Antony and Cleopatra",
"doc_count" : 3862
}
]}}}
• Notice each bucket has a "key"
that represents the distinct value
of "field",
• and a "doc_count" for the number of
docs in the bucket
Nesting buckets
www.objectrocket.com
185
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": {
"terms": {
"field": "play_name",
"size": 1
},
"aggs": {
"speakers": {
"terms": {
"field": "speaker",
"size": 5 } }
} }
} }'
The play names are bucketed, then,
within each play bucket, our documents
are bucketed by speaker.
Nesting buckets
www.objectrocket.com
186
"aggregations" : {
"play_names" : {
"doc_count_error_upper_bound" : 3395,
"sum_other_doc_count" : 107152,
"buckets" : [
{
"key" : "Hamlet",
"doc_count" : 4244,
"speakers" : {
"doc_count_error_upper_bound" : 48,
"sum_other_doc_count" : 1698,
"buckets" : [
{
"key" : "HAMLET",
"doc_count" : 1582
},
{
"key" : "KING CLAUDIUS",
"doc_count" : 594
},
{
"key" : "LORD POLONIUS",
"doc_count" : 370
The result of our nested aggregation
Notice two special values returned in a terms
aggregation:
- “doc_count_error_upper_bound”:
maximum number of missing documents that
could potentially have appeared in a bucket
- “sum_other_doc_count”: number of
documents that do not appear in any of the
buckets
Bucket sorting
www.objectrocket.com
187
• Sorting can be specified using
"order":
‒ _count sorts by their doc_count (default
in terms)
‒ _key sorts alphabetically (default in
histogram and date_histogram)
• Sorting can also be on a metric value in
a nested aggregation
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": { "terms": {
"field": "play_name",
"size": 5,
"order": {
"_count": "desc" } }
}
}
}'
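A sketch of sorting buckets by a nested metric: plays ordered by their highest line_id (the sub-aggregation name max_line is an arbitrary choice):
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": {
"terms": {
"field": "play_name",
"size": 5,
"order": { "max_line": "desc" }
},
"aggs": {
"max_line": { "max": { "field": "line_id" } }
}
}
}
}'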
www.objectrocket.com
188
Lab 5
Advanced Operations
Objectives:
Learn how to:
o Work with mappings
o Work with analyzers
Steps:
1. Navigate to /Percona2018/Lab05
2. Read the instructions on Lab05.txt
https://bit.ly/2D1tXL6
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code
 
Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.
 
Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)
 
Terraform
TerraformTerraform
Terraform
 
Kubernetes Security
Kubernetes SecurityKubernetes Security
Kubernetes Security
 
Manage Development in Your Org with Salesforce Governance Framework
Manage Development in Your Org with Salesforce Governance FrameworkManage Development in Your Org with Salesforce Governance Framework
Manage Development in Your Org with Salesforce Governance Framework
 
Devops and git basics
Devops and git basicsDevops and git basics
Devops and git basics
 
Apache Superset - open source data exploration and visualization (Conclusion ...
Apache Superset - open source data exploration and visualization (Conclusion ...Apache Superset - open source data exploration and visualization (Conclusion ...
Apache Superset - open source data exploration and visualization (Conclusion ...
 
Terraform modules and best-practices - September 2018
Terraform modules and best-practices - September 2018Terraform modules and best-practices - September 2018
Terraform modules and best-practices - September 2018
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
 
Azure DevOps Best Practices Webinar
Azure DevOps Best Practices WebinarAzure DevOps Best Practices Webinar
Azure DevOps Best Practices Webinar
 
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
 
Red Hat Openshift Fundamentals.pptx
Red Hat Openshift Fundamentals.pptxRed Hat Openshift Fundamentals.pptx
Red Hat Openshift Fundamentals.pptx
 
Getting started with Elasticsearch in .net
Getting started with Elasticsearch in .netGetting started with Elasticsearch in .net
Getting started with Elasticsearch in .net
 
Discover salesforce, dev ops and Copado CI/CD automations
Discover salesforce, dev ops and Copado CI/CD automationsDiscover salesforce, dev ops and Copado CI/CD automations
Discover salesforce, dev ops and Copado CI/CD automations
 
Terraform: Infrastructure as Code
Terraform: Infrastructure as CodeTerraform: Infrastructure as Code
Terraform: Infrastructure as Code
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipeline
 
Terraform Introduction
Terraform IntroductionTerraform Introduction
Terraform Introduction
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
 
Implementing an Application Security Pipeline in Jenkins
Implementing an Application Security Pipeline in JenkinsImplementing an Application Security Pipeline in Jenkins
Implementing an Application Security Pipeline in Jenkins
 

Similar a Elastic 101 tutorial - Percona Europe 2018

Chotot k8s experiences.pptx
Chotot k8s experiences.pptxChotot k8s experiences.pptx
Chotot k8s experiences.pptx
arptit
 

Similar a Elastic 101 tutorial - Percona Europe 2018 (20)

MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017
 
Sharded cluster tutorial
Sharded cluster tutorialSharded cluster tutorial
Sharded cluster tutorial
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster Tutorial
 
Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.
 
Percona Live 2017 ­- Sharded cluster tutorial
Percona Live 2017 ­- Sharded cluster tutorialPercona Live 2017 ­- Sharded cluster tutorial
Percona Live 2017 ­- Sharded cluster tutorial
 
Lecture 4 Cluster Computing
Lecture 4 Cluster ComputingLecture 4 Cluster Computing
Lecture 4 Cluster Computing
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Docker Security Paradigm
Docker Security ParadigmDocker Security Paradigm
Docker Security Paradigm
 
Namespaces and cgroups - the basis of Linux containers
Namespaces and cgroups - the basis of Linux containersNamespaces and cgroups - the basis of Linux containers
Namespaces and cgroups - the basis of Linux containers
 
Lab Manual Combaring Redis with Relational
Lab Manual Combaring Redis with RelationalLab Manual Combaring Redis with Relational
Lab Manual Combaring Redis with Relational
 
Artem Zhurbila - docker clusters (solit 2015)
Artem Zhurbila - docker clusters (solit 2015)Artem Zhurbila - docker clusters (solit 2015)
Artem Zhurbila - docker clusters (solit 2015)
 
Prosit google-cloud
Prosit google-cloudProsit google-cloud
Prosit google-cloud
 
Hands-on Lab - Combaring Redis with Relational
Hands-on Lab - Combaring Redis with RelationalHands-on Lab - Combaring Redis with Relational
Hands-on Lab - Combaring Redis with Relational
 
Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209
 
Hands-on Lab: Amazon ElastiCache
Hands-on Lab: Amazon ElastiCacheHands-on Lab: Amazon ElastiCache
Hands-on Lab: Amazon ElastiCache
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Prague
 
Chotot k8s experiences.pptx
Chotot k8s experiences.pptxChotot k8s experiences.pptx
Chotot k8s experiences.pptx
 
Environment for training models
Environment for training modelsEnvironment for training models
Environment for training models
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
DevOps Meetup ansible
DevOps Meetup   ansibleDevOps Meetup   ansible
DevOps Meetup ansible
 

Más de Antonios Giannopoulos

Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration VariablesAntonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos
 

Más de Antonios Giannopoulos (12)

Comparing Geospatial Implementation in MongoDB, Postgres, and Elastic
Comparing Geospatial Implementation in MongoDB, Postgres, and ElasticComparing Geospatial Implementation in MongoDB, Postgres, and Elastic
Comparing Geospatial Implementation in MongoDB, Postgres, and Elastic
 
Using MongoDB with Kafka - Use Cases and Best Practices
Using MongoDB with Kafka -  Use Cases and Best PracticesUsing MongoDB with Kafka -  Use Cases and Best Practices
Using MongoDB with Kafka - Use Cases and Best Practices
 
Sharding in MongoDB 4.2 #what_is_new
 Sharding in MongoDB 4.2 #what_is_new Sharding in MongoDB 4.2 #what_is_new
Sharding in MongoDB 4.2 #what_is_new
 
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
Upgrading to MongoDB 4.0 from older versions
Upgrading to MongoDB 4.0 from older versionsUpgrading to MongoDB 4.0 from older versions
Upgrading to MongoDB 4.0 from older versions
 
How to upgrade to MongoDB 4.0 - Percona Europe 2018
How to upgrade to MongoDB 4.0 - Percona Europe 2018How to upgrade to MongoDB 4.0 - Percona Europe 2018
How to upgrade to MongoDB 4.0 - Percona Europe 2018
 
Triggers in MongoDB
Triggers in MongoDBTriggers in MongoDB
Triggers in MongoDB
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration VariablesAntonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
 
Introduction to Polyglot Persistence
Introduction to Polyglot Persistence Introduction to Polyglot Persistence
Introduction to Polyglot Persistence
 
MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals
 

Último

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 

Último (20)

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - Kanchana
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 

Elastic 101 tutorial - Percona Europe 2018

  • 1. Elastic 101 Antonios Giannopoulos DBA @ Rackspace/ObjectRocket Alex Cercel DBA @ Rackspace/ObjectRocket Mihai Aldoiu CDE @ Rackspace/ObjectRocket linkedin.com/in/antonis | linkedin.com/in/alexcercel | linkedin.com/in/aldoiu 1
  • 3. Overview • Introduction • Working with data • Scaling the cluster • Operating the cluster • Troubleshooting the cluster • Upgrade the cluster • Security best practices • Working with data – Advanced operations • Best Practices www.objectrocket.com 3
  • 4. www.objectrocket.com 4 Labs 1. Unzip the provided .vmdk file 2. Install and or Open VirtualBox 3. Select New 4. Enter A Name 5. Select Type: Linux 6. Select Version: Red Hat (64-bit) 7. Set Memory to at least 4096 (more won’t hurt) 8. Select "Use an existing ... disk file", select the provided .vmdk file 9. Select Create 10. Select Start 11. Login with username: elasticuser , password: elasticuser 12. Navigate to /Percona2018/Lab01 for the first lab. https://bit.ly/2D1tXL6
  • 5. Introduction ● Key Terms ● Installation ● Configuration files ● JVM fundamentals ● Lucene basics www.objectrocket.com 5
  • 6. What is elasticsearch? www.objectrocket.com 6 Lucene: - A search engine library entirely written in Java - Developed in 1999 by Doug Cutting - Suitable for any application that requires full text indexing and searching capability But: - Challenging to use - Not originally designed for scaling Elasticsearch: - Built on top of Lucene - Provides scaling - Language independent
  • 7. What is ELK stack? www.objectrocket.com 7 ElasticSearch: - The main datastore - Provides distributed search capabilities Logstash: - Parse & transform data for ingestion - Ingests from multiple of sources simultaneously Kibana: - An analytics and visualization platform - Search, visualize & interact with Elasticsearch data
  • 8. Installing Elasticsearch www.objectrocket.com 8 Download: Latest Version: https://www.elastic.co/downloads/elasticsearch Older Version: Navigate to https://www.elastic.co/downloads/past-releases The simplest way: 1) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz 2) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz.sha512 3) shasum -a 512 -c elasticsearch-6.3.2.tar.gz.sha512 (it should return elasticsearch-6.3.2.tar.gz: OK) 4) tar -xzf elasticsearch-6.3.2.tar.gz
  • 9. Installing Java www.objectrocket.com 9 ElasticSearch requires JRE (JavaSE runtime environment) or JDK (Java Development Kit) - OpenJDK CentOS: yum install java-1.8.0-openjdk - OpenJDK Ubuntu: apt-get install openjdk-8-jre ES versions 6, requires Java8 or higher https://www.elastic.co/support/matrix set JAVA_HOME appropriately - Create a file under /etc/profile.d for example jdk.sh - Add the following lines: export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-*" export PATH=$JAVA_HOME/bin:$PATH
  • 10. Start the server www.objectrocket.com 10 Create a user elasticuser* Using elasticuser execute: bin/elasticsearch After some noise: [INFO ][o.e.n.Node] [name] started How I know is up and running? *You can’t start ES using root $ curl -X GET "localhost:9200/" { "name" : "KG-_6s9", "cluster_name" : "elasticsearch", "cluster_uuid" : "T9uHpto6QtWRmsjzNFrReA", "version" : { "number" : "6.3.2", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "053779d", "build_date" : "2018-07- 20T05:20:23.451332Z", "build_snapshot" : false, "lucene_version" : "7.3.1", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" }
  • 11. Explore the directories www.objectrocket.com 11 Folder Description Setting bin Contains the binary scipts, like elasticsearch config Contains the configuration files ES_PATH_CONF data Holds the data (shards/indexes) path.data lib Contains JAR files logs Contains the log files path.logs modules Contains the modules plugins Contains the plugins. Each plugin has its own subdirectory
  • 12. Configuration files www.objectrocket.com 12 elasticsearch.yml - The primary way of configuring a node. - Its a template which lists the most important settings for a production cluster jvm.options - JVM related options log4j2.properties - Elasticsearch uses Log4j 2 for logging Variables can be set either: -Using the configuration file: jvm.options: -Xms512mb - or, using command line ES_JAVA_OPTS="-Xms512m" ./bin/elasticsearch
  • 13. Elasticsearch.yml www.objectrocket.com 13 node.name - Every node should have a unique node.name - Set it to something meaningful (aws-zone1-objectrocket-es-01) cluster.name - A cluster is a set of nodes sharing the same cluster.name - Set it to something meaningful (production, qa, staging) path.data - Path to directory where to store the data (accepts multiple locations) path.logs - Path to log files
  • 14. Elasticsearch.yml www.objectrocket.com 14 cluster.name: production node.name: dc1-prd-es1 path.data: /data/es1 path.logs: /logs/es1 bin/elasticsearch -d -p 'elastic.pid' $ curl -X GET "localhost:9200/" { "name" : "dc1-prd-es1", "cluster_name" : "production", …
  • 15. jvm.Options www.objectrocket.com 15 Each Elasticsearch node runs on its own JVM instance JVM is a virtual machine that enables a computer to run Java programs The most important setting is the Heap Size: - Xms: Represents the initial size of total heap space - Xmx: Represents the maximum size of total heap space Best Practices - Set Xms and Xmx to the same size - Set Xmx to no more than 50% of your physical RAM - Do not set Xms and Xmx over 30ish GiB - Use the server version of OpenJDK - Lock the RAM for Heap bootstrap.memory_lock
  • 16. jvm.Options www.objectrocket.com 16 Heap Off Heap Indexing buffer Completion suggester Cluster state … and more Caches: - query cache (10%) - field data cache (unbounded) - …
  • 17. jvm.Options www.objectrocket.com 17 Garbage collector - It is a form of automatic memory management - Gets rid of objects which are not being used by a Java application anymore - Automatically reclaims memory for reuse Garbage collectors - ConcMarkSweepGC (CMS) - G1GC (has some Issues with JDK 8) Elasticsearch uses -XX:+UseConcMarkSweepGC GC threads -XX:ParallelGCThreads=N, where N varies on the platform -XX:ParallelCMSThreads=N , where N varies on the platform
  • 18. jvm.Options www.objectrocket.com 18 Eden s0 s1 Old Generation Perm New Gen -Xmn JVM Heap –Xms -Xmx XX: PermSize XX: MaxPermSize Minor GC Major GC or full GC 1) A new Page stored in Eden 2) After a GC if it survives it moves to s0 ,s1 3) After multiple GCs, s0 or s1 gets full then pages moves to Old Gen
  • 19. OS settings www.objectrocket.com 19 Disable swap - sysctl vm.swappiness=1 - Remove Swap File descriptors - Set nofile to 65536 - curl -X GET ”<host>:<port>/_nodes/stats/process?filter_path=**.max_file_descriptors” Virtual Memory - sysctl -w vm.max_map_count=262144 Max user process - nproc to 4096 DNS cache settings - networkaddress.cache.ttl=<timeout> - networkaddress.cache.negative.ttl=<timeout>
  • 20. Network settings www.objectrocket.com 20 Two network communication mechanisms in Elasticsearch - HTTP: which is how the Elasticsearch REST APIs are exposed - Transport: used for internal communication between nodes within the cluster Node 1 Client Node 2 HTTP Transport
  • 21. Network settings www.objectrocket.com 21 The REST APIs of Elasticsearch are exposed over HTTP - The HTTP module binds to localhost by default - Configure with http.host on elasticsearch.yml - Default port is the first available between 9200-9299 - Configure with http.port on elasticsearch.yml Each call that goes from one node to another uses the transport module - Transport binds to localhost by default - Configure with transport.host on elasticsearch.yml - Default port is the first available between 9300-9399 - Configure with transport.tcp.port on elasticsearch.yml
  • 22. Network settings www.objectrocket.com 22 network.host sets the bind host and the publish host at the same time network.publish_host - Defaults to network.host.network.publish_host. Multiple interfaces network.bind_host - Defaults to the “best” address from network.host. One interface only network.host value Description _[networkInterface]_ Addresses of a network interface, for example _en0_. _local_ Any loopback addresses on the system, for example 127.0.0.1. _site_ Any site-local addresses on the system, for example 192.168.0.1. _global_ Any globally-scoped addresses on the system, for example 8.8.8.8.
  • 23. Network settings www.objectrocket.com 23 Zen discovery - built in & default discovery module default - It provides unicast discovery, - Uses the transport module On elasticsearch.yml discovery.zen.ping.unicast.hosts: [”node1", ”node2"] Node 1 Node 2 Transport Node 3 1) Retrieves IP/ hostname from list of hosts 2) Tries all hosts until find a reachable one 3) If the cluster name matches, joins the cluster 4) If not, starts its own cluster
  • 24. Bootstrap tests www.objectrocket.com 24 Development mode: if it does not bind transport to an external interface (the default) Production mode: if it does bind transport to an external interface Bypass production mode: Set discovery.type to single-node Bootstrap Tests - Inspect a variety of Elasticsearch and system settings - A node in production mode must pass all Bootstrap tests to start - es.enforce.bootstrap.checks=true on jvm.options - Highly recommended to have this setting enabled
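  To enforce the checks even when a node would otherwise start in development mode, add the JVM system property to jvm.options:
  # fail startup if any bootstrap check fails, regardless of binding
  -Des.enforce.bootstrap.checks=true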
  • 25. Bootstrap tests www.objectrocket.com 25 List of Bootstrap Tests - Heap size check - File descriptor check - Memory lock check - Maximum number of threads check - Max file size check - Maximum size virtual memory check - Maximum map count check - Client JVM check - Use serial collector check - System call filter check - OnError and OnOutOfMemoryError checks - Early-access check - G1GC check - All permission check
  • 26. Lucene www.objectrocket.com 26 Lucene uses a data structure called an Inverted Index. An Inverted Index inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages). It allows fast full-text searches, at the cost of increased processing when a document is added to the database. 1) Give us your name 2) Give us your home number 3) Give us your home address
  Term      Frequency  Location
  give      3          1,2,3
  us        3          1,2,3
  your      3          1,2,3
  name      1          1
  number    1          2
  home      2          2,3
  address   1          3
  • 27. Lucene – Key Terms www.objectrocket.com 27 A Document is the unit of search and index. A Document consists of one or more Fields. A Field is simply a name-value pair. An index consists of one or more Documents. Indexing: involves adding Documents to an Index Searching: - involves retrieving Documents from an index. - Searching requires an index to have already been built - Returns a list of Hits
  • 28. Kibana www.objectrocket.com 28 Download: Latest Version: https://www.elastic.co/guide/en/kibana/current/targz.html Simplest way to install it: wget https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-linux-x86_64.tar.gz shasum -a 512 kibana-6.3.2-linux-x86_64.tar.gz tar -xzf kibana-6.3.2-linux-x86_64.tar.gz Run Kibana: kibana-6.3.2-linux-x86_64/bin/kibana Access Kibana: http://localhost:5601
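  A minimal kibana.yml pointing Kibana at the local node might look like the sketch below (these values match the 6.x defaults, so they only need changing when you deviate from them):
  server.host: "localhost"
  server.port: 5601
  elasticsearch.url: "http://localhost:9200"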
  • 30. www.objectrocket.com 30 Lab 1 Install and configure Elastic Objectives: Learn how to install and configure a standalone Elastic instance. Steps: 1. Navigate to /Percona2018/Lab01 2. Read the instructions on Lab01.txt https://bit.ly/2D1tXL6
  • 31. Working with Data ● Indexes ● Shards ● CRUD Operations ● Read Operations ● Mappings ● Analyzers www.objectrocket.com 31
  • 32. Working with Data - Index www.objectrocket.com 32 • An index in Elasticsearch is a logical way of grouping data: ‒ an index has a mapping that defines the fields in the index ‒ an index is a logical namespace that maps to where its contents are stored in the cluster • There are two different concepts in this definition: ‒ an index has some type of data schema mechanism ‒ an index has some type of mechanism to distribute data across a cluster
  • 33. An index means .... www.objectrocket.com 33 In the Elasticsearch world, index is used as a: ‒ Noun: a document is put into an index in Elasticsearch ‒ Verb: to index a document is to put the document into an index in Elasticsearch { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," } { "type":"line", "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" } { "type":"line", "line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils"} My_index Documents are indexed to an index
  • 34. Define an index www.objectrocket.com 34 • Clients communicate with a cluster using Elasticsearch’s REST APIs • An index is defined using the Create Index API, which can be accomplished with a simple PUT command # curl -XPUT 'http://localhost:9200/my_index' -i HTTP/1.1 200 OK content-type: application/json; charset=UTF-8 content-length: 48 {"acknowledged":true,"shards_acknowledged":true}
  • 35. Shard www.objectrocket.com 35 • A shard is a single piece of an Elasticsearch index ‒ Indexes are partitioned into shards so they can be distributed across multiple nodes • Each shard is a standalone Lucene index ‒ The default number of shards for an index is 5. Number of shards can be changed at index creation time. My_index 0 2 4 3 1 Node 1 Node 2
  • 36. Working with Data - Document www.objectrocket.com 36 Documents must be JSON objects. • A document can be any text or numeric data you want to search and/or analyze ‒ Specifically, a document is a top-level object that is serialized into JSON and stored in Elasticsearch • Every document has a unique ID ‒ which either you provide, or Elasticsearch generates one for you { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," }
  • 37. Index compression www.objectrocket.com 37 • Elasticsearch compresses your documents during indexing ‒ documents are grouped into blocks of 16KB, and then compressed together using LZ4 by default ‒ if your documents are larger than 16KB, you will have larger chunks that contain only one document • You can change the compression to DEFLATE using the index.codec setting: ‒ reduced storage size at slightly higher CPU usage PUT my_index { "settings": { "number_of_shards": 3, "number_of_replicas": 2, "index.codec" : "best_compression" } }
  • 38. Index a document www.objectrocket.com 38 The Index API is used to index a document ‒ use a PUT or a POST and add the document in the body request ‒ notice we specify the index, the type and an ID ‒ if no ID is provided, elasticsearch will generate one # curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d '{ "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" }' {"_index":"my_index","_type":"my_type","_id":"1","_version":1,"result":"created","_shar ds":{"total":2,"successful":2,"failed":0},"created":true}
  • 39. Index without specifying an ID www.objectrocket.com 39 You can leave off the id and let Elasticsearch generate one for you: ‒ But notice that only works with POST, not PUT ‒ The generated id comes back in the response # curl -XPOST 'http://localhost:9200/my_index/my_type/' -H 'Content-Type: application/json' -d ' {"line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils" }' {"_index":"my_index","_type":"my_type","_id":"AWZIq227Unvtccn4Vvrz","_version":1,"resul t":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
  • 40. Reindexing a document www.objectrocket.com 40 What do you think happens if we index another document with the same ID? curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d ' { "new_field" : "new_value" }'
  • 41. ...Overwrites the document www.objectrocket.com 41 • The old field/value pairs of the document are gone ‒ the old document is deleted, and the new one gets indexed • Notice every document has a _version that is incremented whenever the document is changed # curl -XGET http://localhost:9200/my_index/my_type/1?pretty -H 'Content-Type: application/json' { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 2, "found" : true, "_source" : { "new_field" : "new_value" } }
  • 42. The _create endpoint www.objectrocket.com 42 If you do not want a document to be overwritten if it already exists, use the _create endpoint ‒ no indexing occurs and returns a 409 error message: # curl -XPUT 'http://localhost:9200/my_index/my_type/1/_create' -H 'Content-Type: application/json' -d ' {"new_field" : "new_value"}' {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[my_type][ 1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"}],"type":"version_conflict_engine_exception","rea son":"[my_type][1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"},"status":409}
  • 43. Locking ? www.objectrocket.com 43 - Every indexed document has a version number - Elasticsearch uses Optimistic concurrency control without locking # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=3' -d '{ ... }' # 200 OK # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=2' -d '{ ... }' # 409 Conflict
  • 44. The _update endpoint www.objectrocket.com 44 To update fields in a document use the _update endpoint. - Make sure to add the “doc” context curl -XPOST 'http://localhost:9200/my_index/my_type/1/_update' -H 'Content-Type: application/json' -d ' { "doc": { "line_id":10, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.7", "speaker":"KING HENRY IV", "text_entry":"Nor more shall trenching war channel her fields" } }' {"_index":"my_index","_type":"my_type","_id":"1","_version":3,"result":"updated","_shar ds":{"total":2,"successful":2,"failed":0}}
  • 45. Retrieve a document www.objectrocket.com 45 Use GET to retrieve an indexed document ‒ Notice we specify the index, the type and an ID ‒ Returns a 200 code if document found or a 404 error if the document is not found # curl -XGET http://localhost:9200/my_index/my_type/1?pretty { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "line_id" : 5, "play_name" : "Henry IV", "speech_number" : 1, "line_number" : "1.1.2", "speaker" : "KING HENRY IV", "text_entry" : "Find we a time for frighted peace to pant" } }
  • 46. Deleting a document www.objectrocket.com 46 Use DELETE to delete an indexed document ‒ response code is 200 if the document is found, 404 if not # curl -XDELETE 'http://localhost:9200/my_index/my_type/1/' -H 'Content-Type: application/json' {"found":true,"_index":"my_index","_type":"my_type","_id":" 1","_version":7,"result":"deleted","_shards":{"total":2,"su ccessful":2,"failed":0}}
  • 47. A simple search www.objectrocket.com 47 Use a GET request sent to the _search endpoint ‒ every document is a hit for this search ‒ by default, Elasticsearch returns 10 hits curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json' { "took" : 1, "timed_out" : false, …. }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ ... ] } } Annotations: the request searches for all docs in my_index; "took" is the number of ms it took to process the query; "hits.total" is the number of documents that were hits for this query; "hits.hits" is the array containing the documents hit by the search criteria.
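  Beyond the match-all default, you can send a query DSL body to _search; a sketch against the sample data used in this chapter:
  curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json' -d '
  {
    "query": { "match": { "speaker": "KING HENRY IV" } },
    "size": 5
  }'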
  • 48. CRUD Operations Summary www.objectrocket.com 48 Index PUT my_index/my_type/4 Create PUT my_index/my_type/4/_create { "speaker":"KING HENRY IV", "text_entry":"To be commenced in strands afar remote." } Read GET my_index/my_type/4 Update POST my_index/my_type/4/_update { "doc" : { "text_entry":"No more the thirsty entrance of this soil" } } Delete DELETE my_index/my_type/4
  • 49. Mapping – what is it? www.objectrocket.com 49 • Elasticsearch will index any document without knowing its details (number of fields, their data types, etc.) - dynamic mapping ‒ However, behind-the-scenes Elasticsearch assigns data types to your fields in a mapping. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed A mapping is a schema definition that contains: ‒ names of fields ‒ data types of fields ‒ how the field should be indexed and stored by Lucene • Mappings map your complex JSON documents into the simple flat documents that Lucene expects.
  • 50. Defining a mapping www.objectrocket.com 50 • In most use cases you will want to define your own mappings, but this is not required. When you index a document, Elasticsearch dynamically creates or updates the mapping • Mappings are defined in the "mappings" section of an index. You can: ‒ define mappings at index creation, or ‒ add to a mapping of an existing index PUT my_index { "mappings": { define mapping here } }
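  As a sketch, defining a mapping at index creation for the Shakespeare-style documents used earlier (the field choices here are illustrative assumptions):
  curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '
  {
    "mappings": {
      "my_type": {
        "properties": {
          "line_id":    { "type": "long" },
          "speaker":    { "type": "keyword" },
          "text_entry": { "type": "text" }
        }
      }
    }
  }'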
  • 51. Let's view a mapping www.objectrocket.com 51 GET my_index/_mapping { "my_index" : { "mappings" : { "my_type" : { "properties" : { "line_id" : { "type" : "long" }, "line_number" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "play_name" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, ... The “properties” section contains the fields and data types in your documents
  • 52. Elasticsearch data types for fields www.objectrocket.com 52 • Simple types, including: ‒ text: for full text (analyzed) strings ‒ keyword: for exact value strings ‒ date: string formatted as dates, or numeric dates ‒ integer types: like byte, short, integer, long ‒ floating-point numbers: float, double, half_float, scaled_float ‒ boolean ‒ ip: for IPv4 or IPv6 addresses • Hierarchical Types: like object and nested • Specialized Types:geo_point, geo_shape and percolator • Range types and more
  • 53. Updating existing mapping www.objectrocket.com 53 • Existing field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. - Instead, you should create a new index with the correct mappings and reindex your data into that index. There are some exceptions to this rule: • new properties can be added to Object datatype fields. • new multi-fields can be added to existing fields. • the ignore_above parameter can be updated.
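  Adding a brand-new field to an existing mapping is one of the allowed changes; a hypothetical example (act_number is an invented field for illustration):
  curl -XPUT 'http://localhost:9200/my_index/_mapping/my_type' -H 'Content-Type: application/json' -d '
  {
    "properties": {
      "act_number": { "type": "integer" }
    }
  }'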
  • 54. Prevent mapping explosion www.objectrocket.com 54 • Defining too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. - For example, when using dynamic mapping, every newly inserted document can introduce new fields. • The following settings allow you to limit the number of field mappings that can be created manually or dynamically index.mapping.total_fields.limit - maximum number of fields in an index, defaults to 1000 index.mapping.depth.limit - maximum depth for a field, measured as the number of inner objects, defaults to 20 index.mapping.nested_fields.limit - maximum number of nested fields in an index, defaults to 50
  • 55. Analysis www.objectrocket.com 55 • Analysis is the process of converting full text into terms (tokens) which are added to the inverted index for searching. - Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer defined per index. For example, at index time the built-in standard analyzer will first convert the sentence into distinct tokens: "Welcome to Percona Live - Open Source Database Conference 2018" [ welcome to percona live open source database conference 2018 ] The analyzer then lowercases each token; note that the standard analyzer does not remove stop words unless configured to ("to" survives in the output above).
  • 56. The analyze api www.objectrocket.com 56 • The _analyze api can be used to test what an analyzer will do to your text curl -s -XGET "localhost:$ES_PORT/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "standard", "text": "Welcome to Percona Live - Open Source Database Conference 2018"}' | python -m json.tool | grep token "tokens": [ "token": "welcome", "token": "to", "token": "percona", "token": "live", "token": "open", "token": "source", "token": "database", "token": "conference", "token": "2018",
  • 57. Built-in analyzers www.objectrocket.com 57 • Standard - the default analyzer • Simple – breaks text into terms whenever it encounters a character which is not a letter • Keyword – simply indexes the text exactly as is • Others include: ‒ whitespace, stop, pattern, language, and more are described in the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html - custom analyzers built by you
  • 58. Analyzer components www.objectrocket.com 58 • An analyzer consists of three parts: 1. Character Filters 2. Tokenizer 3. Token Filters Pipeline: input string -> Character Filters -> string -> Tokenizer -> tokens -> Token Filters -> output tokens
  • 59. Specifying an analyzer www.objectrocket.com 59 • At index time: PUT my_index { "mappings": { "_doc": { "properties": { "title": { "type": "text", "analyzer": "standard" } } } } } • At search time: Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index. By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting: PUT my_index { "mappings": { "_doc": { "properties": { "text": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" }}}}}
  • 60. Custom analyzer www.objectrocket.com 60 • Best described with an example, let's create a custom analyzer based on standard one, but which also removes stop words PUT my_index { "settings": { "analysis": { "filter": { "my_stopwords": { "type": "stop", "stopwords": ["to", "and", "or", "is", "the"] } }, "analyzer": { "my_content_analyzer": { "type": "custom", "char_filter": [], "tokenizer": "standard", "filter": ["lowercase","my_stopwords"] } } }}}
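  You can sanity-check the custom analyzer with the _analyze API scoped to the index; with the stop filter above, "to" and "the" should disappear from the output:
  curl -s -XGET 'http://localhost:9200/my_index/_analyze' -H 'Content-Type: application/json' -d '
  {
    "analyzer": "my_content_analyzer",
    "text": "Welcome to Percona Live - Open Source Database Conference 2018"
  }'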
  • 61. Scaling the cluster ● 10 000ft view on scaling ● Node roles ● Adding a node to a cluster ● Understanding shards ● Replicas ● Read/Write model ● Sample Architectures www.objectrocket.com 61
  • 62. 10 000ft view on scaling www.objectrocket.com 62 • ElasticSearch has the potential to be always available, as long as we take advantage of its scaling features. • With vertical scaling (better hardware) having its limitations, we'll take a look at horizontal scaling (more nodes in the same cluster). • With other datastores, horizontal scaling has its challenges - for example sharding in MongoDB (Antonios has written an amazing tutorial on managing a sharded cluster; you must check it out). ElasticSearch, however, is designed to be distributed by nature, so as long as replicas are being used, the application development and administration overhead of scaling out the cluster is minimal.
  • 63. 10 000ft view on scaling www.objectrocket.com 63 • We defined a shard as an element that composes an index and is, itself, a Lucene index. • By default, ElasticSearch will create 5 per index. But what if everything lives on one node and that node goes down? We face disaster. This is where replicas come in. • A replica of a shard is an exact copy of that element that lives on another node. • A node is simply an ElasticSearch process. One or more nodes with the same value for the "cluster.name" directive in the config file make up a cluster.
  • 64. 10 000ft view on scaling www.objectrocket.com 64 • All nodes know about all others in the cluster and can direct a request to another node if needed. • Nodes can handle both HTTP (external) traffic and transport (intra-cluster) traffic. If one node receives an HTTP request that should have been directed at another, it forwards it to the right node over the transport layer. • Nodes can have one or more roles in the cluster.
  • 65. Node Roles www.objectrocket.com 65 • Master-eligible node: A node that has "node.master" set to true (default), which makes it eligible to be elected as the master node, which controls the cluster and carries out administrative functions such as deleting and creating indexes. • Data node: A node that has "node.data" set to true (default). Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations. • Ingest node: A node that has "node.ingest" set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich it, such as adding a field that wasn't there before, before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes with "node.ingest: false"
  • 66. Node Roles www.objectrocket.com 66 • Tribe node: A tribe node, configured via the tribe.* settings, is a special type of coordinating-only node that can connect to multiple clusters and perform search and other operations across all connected clusters. In later versions of Elastic this role became obsolete. • Kibana node: In case Kibana is used at a large scale with many users running complex queries, you can dedicate a node or nodes to it. • To summarize: by default, any node is master-eligible and acts as a data node as well as handling ingestion, including ingest pipelines. As the cluster grows, in order to separate the overhead of different operations (maintaining the cluster, ingest pipelines, connecting clusters etc.), it makes sense to define roles.
  • 67. Adding a node to a cluster www.objectrocket.com 67 • To add a node or start a cluster, we need to set the "cluster.name" directive to a descriptive value in /etc/elasticsearch/elasticsearch.yml; all nodes need to have the same cluster.name. • By default, ElasticSearch binds to the loopback interface, so we must edit the networking section of the config file and bind the daemon to a specific IP, or use 0.0.0.0 for all. • We must name our nodes, again, with descriptive values (node.name). • Nodes running on the same host will be auto-discovered, but remote nodes use zen discovery, which takes a list of IPs that will assemble the cluster. The firewall must allow communication on ports 9200 and 9300. The settings are collected in the sketch below.
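  Collected into one elasticsearch.yml sketch (names and IPs are placeholders):
  cluster.name: democluster
  node.name: node2
  network.host: 0.0.0.0
  discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"]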
  • 68. Adding a node to a cluster www.objectrocket.com 68 • Of course, there are more options that you can configure, but for the sake of this exercise these will be enough. • Once these are set, restart the daemon and a /_cluster/health?pretty should return something like: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", …. "number_of_nodes" : 2, "number_of_data_nodes" : 2, …… }
  • 69. Understanding shards www.objectrocket.com 69 A shard is a worker unit that holds data, can be assigned to nodes, and is itself a Lucene index. Think of it as a self-contained search engine that handles a portion of the data. ‒ An index is merely a virtual namespace which points to a number of shards My_index My_cluster Node1 Node2 shard shard shard shard shard An index is "split" into shards before any documents are indexed
  • 70. Primary and Replica www.objectrocket.com 70 • There are two types of shards: - primary: the original shards of an index - replicas: copies of the primary • Documents are replicated between a primary and its replicas - a primary and all its replicas are guaranteed to be on different nodes My_cluster Node1 Node2 P0 P2 R3 P3 P1 R1 R4 P4 R0 R2
  • 71. Number of Primary shards www.objectrocket.com 71 • Is fixed – the default number of primary shards for an index is 5 • You can specify a different number of shards when you create the index. • Changing the number of shards after the index has been created can be done with the split or shrink index API, but it is NOT a trivial operation. It is basically the same as reindexing. Plan accordingly. PUT my_new_index { "settings": { "number_of_shards": 3 } }
  • 72. Replicas are good for www.objectrocket.com 72 • High availability - We can lose a node and still have all the data available - After losing a primary, Elasticsearch will automatically promote a replica to a primary and start replicating unassigned replicas • Read throughput - Replicas can handle query/read requests from client applications - Allows you to scale your data and better utilize cluster resources You can change the number of replicas for an index at any time using the _settings endpoint: PUT my_index/_settings { "number_of_replicas": 2 }
  • 73. Replicas www.objectrocket.com 73 • Let's play a bit with replicas. In this example I've indexed Shakespeare's work again. Here is the cluster and the index: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … } curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size pri.store.size yellow open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 22.4mb 22.4mb Yellow indicates a problem. What do you think the problem is? What would be a solution here?
  • 74. Replicas www.objectrocket.com 74 • Replicas will get automatically assigned if the topology permits it. All I've done was to start a second node and: • We can change the replica count dynamically in the index settings. This is a trivial operation, unlike changing the number of shards. curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size pri.store.size green open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 44.9mb 22.4mb curl -X PUT "localhost:9200/shakespeare/_settings" -H 'Content-Type: application/json' -d' > { > "index" : { > "number_of_replicas" : 0 > } > } > ' {"acknowledged":true}
  • 75. Write Path www.objectrocket.com 75 • The process of keeping a primary shard in sync with its replicas is called a data replication model. ElasticSearch's data replication model is based on the primary-backup model: one primary and n backups. • This model runs on top of replication groups. We've seen before that by default we have 5 primary shards and each of these shards has 1 replica. In the example pictured here, we have two replication groups and each primary has 3 replicas. • Within a replication group, the primary shard is responsible for indexing and for keeping the replicas up to date. At any point some replicas might be offline, so the master node keeps an in-sync list of the copies that are online and have received all the writes acknowledged to the user. (Diagram: two replication groups, each with one primary and three replicas.)
  • 76. Write Path www.objectrocket.com 76 • The primary follows this flow: validate the incoming operation and the documents, execute it locally, forward the operation to all replicas in the in-sync list, and acknowledge the write once all replicas in the list have run the operation. • Some notes about failure handling: in case a primary fails, indexing stops for up to one minute while the master promotes a new primary. The primary checks with its replicas to make sure it is still the primary and wasn't demoted for whatever reason. An operation coming from a stale primary will be declined by the replicas.
  • 77. Read Path www.objectrocket.com 77 • The node that received the query (called the coordinating node) will find the relevant shards for the read request, select an active copy of the data (primary or replica, round-robin) from each replication group, send the read request to the selected copies, combine the results and respond. • Requests to each shard are single-threaded, but more than one shard can be queried in parallel.
  • 78. Read Path www.objectrocket.com 78 • Because the active copy is picked round-robin, this is where adding more replicas helps: each new request hits a different replica, so the work is spread out. • Failure handling is much easier here. If for some reason a response is not received, the coordinating node resubmits the read request to the relevant replication group, picks a different replica, and the same flow reapplies.
  • 79. Sample Architectures www.objectrocket.com 79 • For lightweight searches, and where the data can be reindexed without suffering loss, the single-node cluster is not unseen. • A basic deployment with data resilience is the two-node cluster. Most SaaS providers start with this deployment. • The two-node model can be scaled as much as needed, but is usually recommended only when running basic indexing/search operations. In case more granularity is needed, the data can be reindexed with a higher number of shards and replicas. • In case the number of nodes in the cluster gets really high or the operations get complex, it is time to separate the roles. Separating the nodes also needs to take into consideration the cases where you would lose one or more nodes of a specific role. For instance, if you are using ingest-only nodes, data-only nodes and master-only nodes, you need to consider what happens if you lose one or more of each.
  • 80. Sample Architectures www.objectrocket.com 80 • ObjectRocket starts with 4 ingest nodes, 2 Kibana, 2 data and 3 master nodes. • We don't care how many client nodes we lose as long as we have 1 remaining. • The master nodes pick an active master based on quorum. This helps with split brain. • Of the data nodes, we can lose at most one. • Consider redundant components as much as possible. • We will cover security in a later chapter. By default, in the community version, there is no built-in security. In this case, firewall limitations are a must-have.
  • 81. www.objectrocket.com 81 Lab 2 Scaling the cluster Objectives: Adding nodes to your cluster Change the number of Replicas Steps: 1. Navigate to /Percona2018/Lab02/ 2. Read the instructions on Lab02.txt https://bit.ly/2D1tXL6
  • 82. Operating the cluster ● Working with nodes ● Working with shards ● Reindex ● Backup/Restore ● Plugins www.objectrocket.com 82
  • 83. Cheatsheet www.objectrocket.com 83
  curl -X GET "<host>:<port>/_cluster/settings"
  curl -X GET "<host>:<port>/_cluster/settings?include_defaults=true"
  curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "persistent" : { "name of the setting" : value }}'
  curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient" : { "name of the setting" : null }}'
  • 84. Shard Allocation www.objectrocket.com 84 Allow control over how and where shards are allocated Shard Allocation settings (cluster.routing.allocation) - enable - node_concurrent_incoming_recoveries - node_concurrent_outgoing_recoveries - node_concurrent_recoveries - same_shard.host Shard Rebalancing settings (cluster.routing.rebalance) - enable - allow_rebalance - cluster_concurrent_rebalance
  • 85. Shard Allocation - Disk www.objectrocket.com 85 cluster.routing.allocation.disk.threshold_enabled: Defaults to true Low: Do not allocate new shards. Defaults to 85% High: Try to relocate shards. Defaults to 90% Flood_stage: Enforces a read-only index block. Must be released manually. Defaults to 95% cluster.info.update.interval: How often Elasticsearch should check disk usage (defaults to 30s) cluster.routing.allocation.disk.include_relocations: Defaults to true – could lead to false alerts curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.allocation.disk.watermark.low": "100gb", "cluster.routing.allocation.disk.watermark.high": "50gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m" }}'
  • 86. Shard Allocation – Rack/Zone www.objectrocket.com 86 Make Elasticsearch aware of the topology - so it can ensure that the primary shard and its replica shards are spread across different - Physical servers (node.attr.phy_host) - Racks (node.attr.rack_id) - Availability zones (node.attr.zone) - Minimize the risk of losing all shard copies at the same time - Minimize latency Configuration: cluster.routing.allocation.awareness.attributes: zone,rack_id Force awareness: cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 together with cluster.routing.allocation.awareness.attributes: zone (see the sketch below)
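  In elasticsearch.yml terms, zone awareness could be wired up as in this sketch (zone names are placeholders):
  # on every node: tag the node with the zone it lives in
  node.attr.zone: zone1
  # cluster-wide: allocate primaries and replicas across zones, forced for zone1/zone2
  cluster.routing.allocation.awareness.attributes: zone
  cluster.routing.allocation.awareness.force.zone.values: zone1,zone2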
  • 87. Restart node(s) www.objectrocket.com 87 Elasticsearch wants your data to be fully replicated and evenly balanced. When a node goes down: - The cluster immediately recognizes the change - Rebalancing begins - Rebalancing takes time and can become costly During a planned maintenance you should hold off on rebalancing
  • 88. Restart node(s) www.objectrocket.com 88 Steps: 1) Flush pending indexing operations POST /_flush/synced 2) Disable shard allocation 3) Shut down a single node 4) Perform the maintenance PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "none" } }
  • 89. Restart node(s) www.objectrocket.com 89 5) Restart the node, and confirm that it joins the cluster. 6) Re-enable shard allocation as follows: 7) Check the cluster health PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "all" } }
  • 90. Restart node(s) www.objectrocket.com 90 You can also make Elastic less sensitive to changes. By default, the master waits 1 minute before reassigning the shards of a node that has left. During restarts we can raise that threshold. This is also a useful setting for slow or unreliable networks. PUT _all/_settings { "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }
  • 91. Remove a node www.objectrocket.com 91 Elastic automatically detects topology changes. In order to remove a node you need to drain it and then stop it. Where attribute: _name: Match nodes by node names _ip: Match nodes by IP addresses (the IP address associated with the hostname) _host: Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "<value>" } }
  • 92. Remove a node www.objectrocket.com 92 Additional considerations: - Master-eligible node - Seed nodes - Space considerations - Performance considerations - If possible stop writes - Do not allow new allocations ("cluster.routing.allocation.enable" : "none") - Overhead from the shard drains - Throttle (indices.recovery.max_bytes_per_sec) - One node at a time (cluster.routing.allocation.disk.watermark) Move shards manually (Reroute API) - Flush and if possible stop writes - Safe for Replicas, not recommended for Primaries (may lead to data loss)
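  For the manual option mentioned above, the reroute API can move a single shard between nodes; a sketch with placeholder index and node names:
  curl -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
  {
    "commands": [
      { "move": { "index": "my_index", "shard": 0, "from_node": "node1", "to_node": "node2" } }
    ]
  }'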
  • 93. Remove a node www.objectrocket.com 93 Cancel the drain of a node by removing the node or resetting the attribute. Where attribute: _name: Match nodes by node names _ip: Match nodes by IP addresses (the IP address associated with the hostname) _host: Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "" } }
  • 94. Replace a node www.objectrocket.com 94 Similar to remove a node with the difference that you need to add a node as well. Simplest approach: add a new node and then drain the old node Additional considerations: - Master-eligible/Seed nodes - Do not allow new allocations (cluster.routing.allocation.exclude._name) - Overhead from drain/throttle (indices.recovery.max_bytes_per_sec) - Space considerations - Max amount of data each node can get. Watermark Alternatively use the reroute API to drain the node
  • 95. Working with Shards www.objectrocket.com 95 Number of Shards/Replicas - Defined on index creation - Number of replicas can change dynamically - Number of shards can change using: - shrink API - split API - reindex API Why increase the number of shards: - Index size - Performance considerations - Hard limits (LUCENE-5843) Almost the same reasons apply when decreasing the number of shards
  • 96. Shrink API www.objectrocket.com 96 Shrinks an existing index into a new one with fewer primary shards: - The number of shards in the target index must be a factor of the number of shards in the source index - If the source has a prime number of shards, it can only be shrunk into a single primary shard - Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node Works as follows: - First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards. - Then it hard-links segments from the source index into the target index. - Finally, it recovers the target index as though it were a closed index which had just been re-opened.
  • 97. Shrink API www.objectrocket.com 97 In order to shrink an index, the index must be marked as read-only, and a copy of every shard in the index must be relocated to the same node and have health green. Note that it may take a while… Check progress using GET _cat/recovery?v curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.routing.allocation.require._name": "shrink_node_name", "index.blocks.write": true }}'
  • 98. Shrink API www.objectrocket.com 98 Finally it's time to shrink the index: It is similar to the create index API – almost the same arguments. Some constraints apply curl -X POST "<host>:<port>/my_source_index/_shrink/my_target_index" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_replicas": <number>, "index.number_of_shards": <number>, "index.routing.allocation.require._name": null, "index.blocks.write": null }}'
  • 99. Split API www.objectrocket.com 99 Splits an existing index into a new index: - The original primary shard is split into two or more primary shards. - The number of splits is determined by the index.number_of_routing_shards setting The _split API requires the source index to be created with a specific number_of_routing_shards in order to be split in the future. This requirement has been removed in Elasticsearch 7.0 Works as follows: - First, it creates a new target index with a larger number of primary shards. - Then it hard-links segments from the source index into the target index. - Once the low level files are created all documents will be hashed again to delete documents that belong to a different shard. - Finally, it recovers the target index as though it were a closed index which had just been re- opened.
  • 100. Split API www.objectrocket.com 100 In order to split an index, the index must be marked as read-only (assuming the index has number_of_routing_shards set) curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.blocks.write": true }}' Split the index: curl -X POST "<host>:<port>/my_source_index/_split/my_target_index?copy_settings=true" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_shards": 2 }}'
  • 101. Reindex API - Definition www.objectrocket.com 101 - Does not copy the settings of the source index - version_type: internal/external - source supports "query", multi-indexes & a remote location - URL parameters: refresh, wait_for_completion, wait_for_active_shards, timeout, scroll and requests_per_second - Supports painless scripts to manipulate indexing curl -X POST "<host>:<port>/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "<source index>" }, "dest": { "index": "<destination index>" }}'
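  Since source accepts a query, a sketch that copies only one speaker's lines into a new index (henry_lines is an invented name for illustration):
  curl -X POST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
  {
    "source": {
      "index": "shakespeare",
      "query": { "match": { "speaker": "KING HENRY IV" } }
    },
    "dest": { "index": "henry_lines" }
  }'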
  • 102. Reindex API – Response Body www.objectrocket.com 102 "took": 1200, "timed_out": false, "total": 10, "updated": 0, "created": 10, "deleted": 0, "batches": 1, "noops": 0, "version_conflicts": 2, "retries": { "bulk": 0, "search": 0}, "throttled_millis": 0, "requests_per_second": 1, "throttled_until_millis": 0, "failures": [ ] Total milliseconds the entire operation took The number of documents that were successfully processed Summary of the operation counts The number of version conflicts that reindex hit Throttling Statistics
  • 103. Reindex API www.objectrocket.com 103 Active reindex jobs: GET _tasks?detailed=true&actions=*reindex Cancel a reindex job: POST _tasks/<id of the reindex>/_cancel Re-throttle: POST _reindex/<id of the reindex>/_rethrottle?requests_per_second=-1 Reindexing from a remote server: - Uses an on-heap buffer that defaults to a maximum size of 100mb - May need to use a smaller batch size - Configure socket_timeout and connect_timeout. Both default to 30 seconds
  • 104. Snapshots - Backup www.objectrocket.com 104 A snapshot is a backup taken from a running Elasticsearch cluster. Snapshots are taken incrementally. Version compatibility – one major version behind. You must register a snapshot repository before you can perform snapshots. path.repo must exist in elasticsearch.yml curl -X GET "<host>:<port>/_snapshot/_all" curl -X PUT "<host>:<port>/_snapshot/my_backup" -H 'Content-Type: application/json' -d' { "type": "fs", "settings": { "location": "backup location" }}'
  • 105. Snapshots - Backup www.objectrocket.com 105 Shared location: On elasticsearch.yml: path.repo: ["/mount/backups0", "/mount/backups1"] Don’t forget to register it!!! Registration options location: Location of the snapshots compress: Turns on compression of the snapshot files. Defaults to true. chunk_size: Big files can be broken down into chunks. Defaults to null (unlimited chunk size) max_restore_bytes_per_sec: Throttles per node restore rate. Defaults to 40mb/second max_snapshot_bytes_per_sec: Throttles per node snapshot rate. Defaults to 40mb/second readonly: Makes repository read-only. Defaults to false
  • 106. Snapshots - Backup www.objectrocket.com 106 wait_for_completion: whether the request should wait for the snapshot to finish (true) or return immediately after snapshot initialization (false, the default) ignore_unavailable: Ignores indexes that don't exist include_global_state: Set to false to prevent the cluster global state from being stored as part of the snapshot curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2,index_3", "ignore_unavailable": true, "include_global_state": false }' curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
  • 107. Snapshots - Backup www.objectrocket.com 107 IN_PROGRESS: The snapshot is currently running. SUCCESS: The snapshot finished and all shards were stored successfully. FAILED: The snapshot finished with an error and failed to store any data. PARTIAL: The global cluster state was stored, but data of at least one shard wasn't stored successfully. INCOMPATIBLE: The snapshot was created with an old version of ES incompatible with the current version of the cluster. Snapshot info: curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1" Delete snapshot: curl -X DELETE "<host>:<port>/_snapshot/my_backup/snapshot_2" Unregister repo: curl -X DELETE "<host>:<port>/_snapshot/my_backup"
  • 108. Snapshots - Restore www.objectrocket.com 108 Restore a snapshot: curl -X POST "<host>:<port>/_snapshot/my_backup/snapshot_1/_restore" Check the progress: curl -X GET "<host>:<port>/_snapshot/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1/_status" Also supported: - Partial restore - Restore with different settings - Restore to a different cluster
  • 109. Snapshots - Restore www.objectrocket.com 109 Restore with different settings: "index_settings": {"index.number_of_replicas": 0} "ignore_index_settings": ["index.refresh_interval"] Select the indices that should be restored, rename indices on restore using a regular expression that supports referencing the original text, and restore the global state: curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": true, "rename_pattern": "index_(.+)", "rename_replacement": "restored_index_$1" }'
  • 110. Plugins www.objectrocket.com 110 A way to enhance the basic Elasticsearch functionality in a custom manner. They range across: - Mapping and analysis - Discovery - Security - Management - Alerting - And many many more… Installation: bin/elasticsearch-plugin install [plugin_name] Considerations: - Security - Maintainability between versions We make heavy use of Cerebro (https://github.com/lmenezes/cerebro) in this tutorial
  • 111. www.objectrocket.com 111 Lab 3 Operating the cluster Objectives: Learn how to: o Remove a node from a cluster. o Use the ReIndex API Steps: 1. Navigate to /Percona2018/Lab03 2. Read the instructions on Lab03.txt 3. Execute ./run_cluster.sh to begin https://bit.ly/2D1tXL6
  • 112. Troubleshooting ● Cluster health ● Improving Performance ● Diagnostics www.objectrocket.com 112
  • 113. Cluster health www.objectrocket.com 113 • The cluster health API allows to get a very simple status on the health of the cluster • The health status is either green, yellow or red and exists at three levels: shard, index, and cluster • Shard health ‒ red: at least one primary shard is not allocated in the cluster ‒ yellow: all primaries are allocated but at least one replica is not ‒ green: all shards are allocated • Index health ‒ status of the worst shard in that index • Cluster health ‒ status of the worst index in the cluster
  • 114. Cluster health www.objectrocket.com 114 { "cluster_name" : "my_cluster", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 5, "delayed_unassigned_shards": 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 50.0 } GET _cluster/health
  • 115. Green status www.objectrocket.com 115 • The state your cluster should have – All of your primary and replica shards are allocated and active My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 2 } }
  • 116. Yellow status www.objectrocket.com 116 It means all your primary shards are allocated, but one or more replicas are not. - you may not have enough nodes in the cluster, or a node may have failed My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 3 } } R0 Unassigned
  • 117. Red status www.objectrocket.com 117 • At least one primary shard is missing - searches will return partial results and indexing might fail PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } } My_cluster Node1 Node2 P0 R0 Node3
  • 118. Resolve unassigned shards www.objectrocket.com 118 Causes: • Shard allocation is purposefully delayed • Too many shards, not enough nodes • You need to re-enable shard allocation • Shard data no longer exists in the cluster • Low disk watermark • Multiple Elasticsearch versions The _cat endpoint will tell you which shards are unassigned, and why: curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,u nassigned.reason| grep UNASSIGNED
  • 119. Resolve unassigned shards 119 • You can also use the cluster allocation explain API to get more information about shard allocation issues: curl -XGET localhost:9200/_cluster/allocation/explain?pretty { "index" : "testing", "shard" : 0, "primary" : false, "current_state" : "unassigned", … "can_allocate" : "no", "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions" : [ { … { "decider" : "same_shard", "decision" : "NO", "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists" }]}]}
  • 120. Reason 1 – Shard allocation delayed www.objectrocket.com 120 • When a node leaves the cluster, the master node temporarily delays shard reallocation to avoid needlessly wasting resources on rebalancing shards, in the event the original node is able to recover within a certain period of time (one minute, by default) Modify the delay dynamically: curl -XPUT 'localhost:9200/my_index/_settings' -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "30s" } }'
  • 121. Reason 2 – Not enough nodes www.objectrocket.com 121 • As nodes join and leave the cluster, the master node reassigns shards automatically, ensuring that multiple copies of a shard aren’t assigned to the same node • A shard may linger in an unassigned state if there are not enough nodes to distribute the shards accordingly. • Make sure that every index in your cluster is initialized with fewer replicas per primary shard than the number of nodes in your cluster
  • 122. Reason 3 – re-enable shard allocation www.objectrocket.com 122 • Shard allocation is enabled by default on all nodes, but you may have disabled shard allocation at some point (for example, in order to perform a rolling restart) and forgotten to re-enable it. • To enable shard allocation, update the _cluster settings API: curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'
  • 123. Reason 4 – Shard data no longer exists www.objectrocket.com 123 • Primary shard is not available anymore because the index may have been created on a node without any replicas (a technique used to speed up the initial indexing process), and the node left the cluster before the data could be replicated. • Another possibility is that a node may have encountered an issue while rebooting or has storage issues • In this scenario, you have to decide how to proceed: try to get the original node to recover and rejoin the cluster (and do not force allocate the primary shard), or force allocate the shard using the _reroute API and reindex the missing data using the original data source, or from a backup.
  • 124. Reason 4 – Shard data no longer exists www.objectrocket.com 124 • To allocate an unassigned primary shard: curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ { "allocate" : { "index" : "my_index", "shard" : 0, "node": "<NODE_NAME>", "allow_primary": "true" } }] }' Warning! The caveat with forcing allocation of a primary shard is that you will be assigning an “empty” shard. If the node that contained the original primary shard data were to rejoin the cluster later, its data would be overwritten by the newly created (empty) primary shard, because it would be considered a “newer” version of the data.
  • 125. Reason 5 – Low disk watermark www.objectrocket.com 125 • Once a node has reached this level of disk usage, or what Elasticsearch calls a "low disk watermark", it will not be assigned more shards – default is 85% • You can check the disk space on each node in your cluster (and see which shards are stored on each of those nodes) by querying the _cat API: curl -s -XGET 'localhost:9200/_cat/allocation?v' Example response: shards disk.indices disk.used disk.avail disk.total disk.percent host ip node 5 260b 47.3gb 43.4gb 100.7gb 46 127.0.0.1 127.0.0.1 CSUXak2
  • 126. Reason 5 – Low disk watermark www.objectrocket.com 126 Resolutions: - add more nodes - increase disk size - increase low watermark threshold, if safe: PUT /_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.watermark.low": "90%" } }'
  • 127. Reason 6 – Multiple ES versions www.objectrocket.com 127 • Usually encountered when in the middle of a rolling upgrade • The master node will not assign a primary shard’s replicas to any node running an older major version (1.x -> 2.x -> 5.x).
  • 128. When nothing works www.objectrocket.com 128 … or restore the affected index from an old snapshot
  • 129. Poor performance www.objectrocket.com 129 This can be a long discussion – see more in the `Best practices` chapter You want to start by: • Enable slow logging so you can Identify long running queries • Run identified searches through the _profiling API to look at timing of individual components • Filter, filter, filter
  • 130. Enable slow log www.objectrocket.com 130 • Send a PUT request to the _cluster API to define the level of slow log that you want to turn on: warn, info, debug, and trace PUT /_cluster/settings '{ "transient" : { "logger.index.search.slowlog" : "DEBUG", "logger.index.indexing.slowlog" : "DEBUG" } }' • All slow logging is enabled on the index level: PUT /my_index/_settings '{"index.search.slowlog.threshold.query.warn" : "50ms", "index.search.slowlog.threshold.fetch.warn": "50ms", "index.indexing.slowlog.threshold.index.warn": "50ms" }'
  • 131. Profile www.objectrocket.com 131 • The Profile API provides detailed timing information about the execution of individual components in a search request and it can be very verbose, especially for complex requests executed across many shards • Usage: GET /my_index/_search { "profile": true, "query" : { "match" : { "speaker": "KING HENRY IV" } } }
  • 132. Filters www.objectrocket.com 132 • One way to improve the performance of your searches is with filters. The filtered query can be your best friend. It’s important to filter first because filter in a search does not affect the outcome of the document score, so you use very little in terms of resources to cut the search field down to size. • A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries. • Also, filters can be cached.
  • 133. Upgrade the cluster ● Generals ● Upgrade path ● Before upgrading ● Rolling upgrades ● On rolling upgrades ● Full cluster restart upgrades ● Upgrades by re-indexing ● Re-indexing in place ● Moving through the versions www.objectrocket.com 133
  • 134. Generals www.objectrocket.com 134 • Elasticsearch can read indices created in the previous major version. Older indices must be re-indexed or deleted. • From versions 5.0 ElasticSearch can usually be upgraded using rolling restarts so that the service is not interrupted. • Upgrades across major versions before 6.0 require a full cluster restart • Backup backup backup • Nodes will fail to start if incompatible indexes are being found • You can reindex from a remote location so that you skip the backup/restore option
  • 135. Upgrade path www.objectrocket.com 135 • Any index created prior to 5.0 will need to be re-indexed into newer versions
  • 136. Before upgrading www.objectrocket.com 136 • Understand the changes that appeared in the new version by reviewing the Release highlights and Release notes. • Review the list of changes that can break your cluster. • Check the deprecation log to see if any of your current features became obsolete. • Check for updated versions of your current plugins or compatibility with the new version. • Upgrade your dev/QA/staging cluster before proceeding with the production cluster. • Back up your data by taking a snapshot before upgrading. If you want to roll back, you will need it. You can't roll back unless you have a backup.
  • 137. Rolling upgrades www.objectrocket.com 137 1. As we've seen before, ES adjusts the balancing of shards based on topology. If we remove a node just like that, it will think the node crashed and it will start redistributing the shards - then once more when we get the node back. For this, we need to disable shard allocation • The shard recovery process is helped by stopping indexing and using "POST _flush/synced" • At this point the cluster is going to turn yellow: replica shards on other nodes get promoted to primary for the shards that became unavailable, and some replicas will be left unassigned. This doesn't hurt the operation of the cluster - as we've discussed, as long as one shard from a replication group is available, the dataset is alive. • Depending on the number of nodes you have left, be careful not to take out another :) curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
  • 138. Rolling upgrades www.objectrocket.com 138 2. Stopping the node. This can be as easy as "service elasticsearch stop". 3. Carry out the needed maintenance (depending on the package manager, or the way ES has been installed, you might want to run a yum update or replace the binaries). Be careful about versions and plugins: - A higher-version node will join a cluster made of lower-version nodes, but a lower-version node won't join a cluster made of higher-version nodes. - /usr/share/elasticsearch/bin/elasticsearch-plugin is a script provided by ES to handle plugins. Upgrade these to the correct versions. - During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas assigned to a node with the old version. The new version might have a different data format that is not understood by the old version. 4. Starting the node.
  • 139. Rolling upgrades www.objectrocket.com 139 5. Make sure that everything has started correctly. Check the node's logs for messages of the sort: 6. Enable shard allocation (same command as at step 1, but use "null" - the value, not the string - to reset to the default, instead of "none") 7. Check the cluster status and make sure everything has recovered. It can take a bit for the shards to become available. 8. NEEEEEXT! curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "green", … } [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] initialized [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] starting ... [2018-10-25T10:04:45,729][INFO ][o.e.t.TransportService ] [node2] publish_address {134.213.56.244:9300}, bound_addresses {[::]:9300} [2018-10-25T10:04:50,465][INFO ][o.e.n.Node ] [node2] started.
• 140. On Rolling upgrades www.objectrocket.com 140 • As mentioned before, in a yellow state the cluster continues to operate normally. • Because you might have a reduced number of replicas assigned, performance might be impacted. Plan the upgrade outside normal working hours. • New features only come into play once all the nodes are running the updated version. • Again, we can't roll back: lower-version nodes won't join the cluster.
• 141. On Rolling upgrades www.objectrocket.com 141 • If a network partition separates the newly updated nodes from the old ones, then once it is resolved, the old nodes will fail with a message of this sort: • In this case, you have no choice but to stop the old nodes and upgrade them. It won't be rolling and you might have a service interruption, but there is no alternative. [2018-10-16T15:08:28,928][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master [{node1}{bWKRUNFXTEy1kBgQ1y2LvA}{Gxzb3blaR86CUL3gKLhnXA}{134.213.56.107}{134.213.56.107:9300}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[node1][134.213.56.107:9300][internal:discovery/zen/join]]; nested: IllegalStateException[node {node3}{Nt4eKRkvR6-SZ_gg22lqTQ}{dQRBgGDwSo2Zr7W866e64w}{162.13.188.164}{162.13.188.164:9300}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is on version [6.3.2] that cannot deserialize the license format [4], upgrade node to at least 6.4.0]; ].
• 142. Full Cluster restart upgrade www.objectrocket.com 142 • It was required before version 6 whenever major version upgrades were involved. • v5.6 → v6 can be done with a rolling upgrade. • It involves shutting down the cluster, upgrading the nodes, then starting the cluster up. 1. Disable shard allocation so we don't have unnecessary IO after the nodes are stopped. 2. As briefly mentioned before, stopping indexing and performing a "POST _flush/synced" will help with shard recovery. curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
• 143. Full Cluster restart upgrade www.objectrocket.com 143 3. Shut down all nodes. "service elasticsearch stop" or whatever works :) 4. Use your package manager to update elasticsearch on each node. 5. Upgrade the plugins with "/usr/share/elasticsearch/bin/elasticsearch-plugin". 6. Start the nodes up. 7. Wait for the nodes to join the cluster. 8. Enable shard allocation. 9. Check that the cluster is back to normal before enabling indexing. curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … }
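Instead of polling step 9 manually, the health API can block until the cluster reaches a given state; a small sketch, assuming localhost access:
curl -X GET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty'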
• 144. Upgrades by re-indexing www.objectrocket.com 144 • Elasticsearch can read indices created in the previous major version. • V6 will read V5 indices but not V2 or below; V5 will read V2 indices but not V1 or below. • Older indices need to be re-indexed or dropped. • If a node detects an incompatible index, it will fail to start. • Because of this, upgrading to a major version that is far ahead is tricky if you don't have a spare cluster. If you do, it's actually quite easy.
• 145. Upgrades by re-indexing www.objectrocket.com 145 • The easiest way to move to a new version is to create a cluster running that version and use the remote reindexing feature. That way, the new index is created by the new version, for the new version. • To-do list for remote reindexing: 1. Add the old cluster's host and port to the new cluster's elasticsearch.yml under reindex.remote.whitelist: 2. Create an index on the new cluster with the correct mappings and settings. • Using a number_of_replicas of 0 and a refresh_interval of -1 will speed up the next operation. reindex.remote.whitelist: oldhost:oldport
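A minimal sketch of step 2; the index name "dest" is chosen to match the reindex example on the next slide, and real mappings would be added to the body:
curl -X PUT 'http://localhost:9200/dest' -H 'Content-Type: application/json' -d '{ "settings": { "number_of_replicas": 0, "refresh_interval": "-1" } }'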
  • 146. Upgrades by re-indexing www.objectrocket.com 146 3. Reindex from remote. Example: curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "remote": { "host": "http://oldhost:9200", "username": "user", "password": "pass" }, "index": "source", "query": { "match": { "test": "data" } } }, "dest": { "index": "dest" } } '
• 147. Re-indexing in place www.objectrocket.com 147 • In order to make an older-version index work on a newer-version cluster, you need to reindex it into a new one. This is done with the reindex API. curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "twitter" }, "dest": { "index": "new_twitter" } } '
• 148. Re-indexing in place www.objectrocket.com 148 1. If you want to maintain your mappings, create a new index and copy the mappings and settings; 2. You can again lower refresh_interval and number_of_replicas to make the operation faster; 3. Reindex the documents to the new index; 4. Reset refresh_interval and number_of_replicas to the wanted values (see the sketch after this slide); 5. Wait for the index to turn green - it will do so when the replicas get allocated
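Step 4 might look like this - a sketch assuming the new index is new_twitter and the usual defaults of a 1s refresh and 1 replica are wanted back:
curl -X PUT 'http://localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }'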
• 149. Re-indexing in place www.objectrocket.com 149 6. In a single update, to avoid missed operations on the old index, you should: • Delete the old index (let's call it old_index) • Add an alias named after the old index to the new index • Add any aliases that existed on the old index to the new index. More aliases mean more "add" actions curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d' { "actions" : [ { "add": { "index": "new_index", "alias": "old_index" } }, { "remove_index": { "index": "old_index" } }, { "add" : { "index" : "new_index", "alias" : "any_other_aliases" } } ] } '
• 150. Moving through the versions www.objectrocket.com 150 ElasticSearch V2: perform a full cluster restart to version 5.6 → re-index the V2 indices in place so they work with 5.6 (now fully on V5) → perform a rolling restart to 6.x
• 151. Moving through the versions www.objectrocket.com 151 ElasticSearch V1: perform a full cluster restart to V2.4.x → re-index the 1.x indices in place so they work on V2.4.x (now fully on V2) → perform a full cluster restart to V5.6 → re-index the V2 indices so they work on V5 (now fully on V5) → perform a rolling restart to V6.x
  • 152. www.objectrocket.com 152 Lab 4 Upgrading the cluster Objectives: Learn how to: o Upgrade an elasticsearch cluster. Steps: 1. Navigate to /Percona2018/Lab04 2. Read the instructions on Lab04.txt 3. Execute ./run_cluster.sh to begin https://goo.gl/ddaVdS
  • 153. Security ● Authentication ● Authorization ● Encryption ● Audit www.objectrocket.com 153
• 154. Security www.objectrocket.com 154 The Open Source version of ElasticSearch does not provide: - Authentication - Authorization - Encryption To overcome this we will use open-source tools: - Firewall - Reverse proxy - Encryption tools Alternatively, you can buy X-Pack, which provides these features as an integrated security layer.
• 155. Firewall www.objectrocket.com 155 Client communication: iptables -I INPUT 1 -p tcp --dport 9200:9300 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9200:9300 -j REJECT Intra-cluster communication: iptables -I INPUT 1 -p tcp --dport 9300:9400 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9300:9400 -j REJECT
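To confirm the rules landed in the intended positions (the REJECT rules must come after the ACCEPTs), listing with line numbers helps:
iptables -L INPUT -n --line-numbers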
  • 156. Firewall www.objectrocket.com 156 DNS SSH Monitoring tools Allow whatever port your monitoring tool uses. iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A OUTPUT -p tcp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --dport ssh -j ACCEPT iptables -A OUTPUT -p tcp --sport ssh -j ACCEPT
• 157. Reverse Proxy www.objectrocket.com 157 [Diagram: clients send HTTP requests to an Nginx reverse proxy, which applies its rules and forwards the requests to the ES nodes - ES listens on 9200, while the proxy advertises port 8080 to the clients]
• 158. Authentication www.objectrocket.com 158 We are going to use nginx: ngx_http_auth_basic_module On nginx.conf 1 2 3 4 1) Listens on port 19200 2) Enables basic auth 3) Password file location 4) Proxies to ES <host>:<port> server { listen *:19200; location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd; proxy_pass http://localhost:9200; proxy_read_timeout 90; } }
• 159. Authentication www.objectrocket.com 159 Create users: - htpasswd -c /var/data/nginx/.htpasswd <username> - You will be prompted for the password - Alternatively, use the -b flag and provide the password on the command line Access Elasticsearch: curl <host> #Returns 301 curl <host>:19200 #Returns 401 Authorization Required curl <username>:<password>@<host>:19200 #Returns Elasticsearch output
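A quick sketch of the -b variant plus a test call, with purely hypothetical credentials:
htpasswd -b /var/data/nginx/.htpasswd esuser 's3cretpass'
curl -u esuser:s3cretpass http://localhost:19200/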
• 160. Adding SSL to the mix www.objectrocket.com 160 Use nginx as a reverse proxy to encrypt client communication On nginx.conf Certificates: - Can be obtained from a commercial certificate authority - Or self-signed ssl on; ssl_certificate /etc/ssl/certs/<cert>.crt; ssl_certificate_key /etc/ssl/private/<key>.key; ssl_session_cache shared:SSL:10m;
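If you go the self-signed route, a certificate can be generated with openssl; a minimal sketch with hypothetical file names and CN:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/ssl/private/elastic.key -out /etc/ssl/certs/elastic.crt -subj '/CN=es-proxy.example.com'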
• 161. Authorization www.objectrocket.com 161 - Authentication alone is not enough. - Once allowed access, the client can do whatever it wants in the cluster. - The simplest form of authorization is to deny specific endpoints location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx-elastic/.htpasswd; if ($request_filename ~ _shutdown) { return 403; break; } } 1 2 1) If the user requests the shutdown endpoint 2) Return 403 curl -X GET -k "esuser:esuser@es-node1-9200:19200/_cluster/nodes/_shutdown/" Produces a 403 Forbidden
• 162. Authorization www.objectrocket.com 162 Assign roles using nginx - for example, a read-only user: 1 2 3 4 1) Listens on port 19500 2) Enables basic auth 3) Regex match for the allowed endpoints 4) Forwards to ES <host>:<port> server { listen 19500; auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd_users; location / { return 403; } location ~* ^(/_search|/_analyze) { proxy_pass http://<es_node>; proxy_redirect off; }}
• 163. Encryption & Co www.objectrocket.com 163 Protecting the data on disk is also essential. LUKS (Linux Unified Key Setup): - encrypts entire block devices - CPUs with AES-NI (Advanced Encryption Standard Instruction Set) can accelerate dm-crypt - supports a limited number of passphrases (key slots) - keep the keys in a safe place Always audit: - Access logs - Ports - Backups - Physical access
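A minimal LUKS sketch, assuming a dedicated (hypothetical) data device /dev/sdb and the default Elasticsearch data path:
cryptsetup luksFormat /dev/sdb                        # writes the LUKS header, prompts for a passphrase
cryptsetup luksOpen /dev/sdb es_data                  # maps the decrypted device to /dev/mapper/es_data
mkfs.xfs /dev/mapper/es_data
mount /dev/mapper/es_data /var/lib/elasticsearch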
  • 164. Working with Data – Advanced Operations ● Alias ● Bulk API ● Aggregations ● … www.objectrocket.com 164
• 165. Pagination www.objectrocket.com 165 • By default, Elasticsearch returns the first 10 hits of your query. The size parameter is used to specify the number of hits. GET shakespeare/_search?pretty { "size": 20, "query": { "match": { "play_name": "Hamlet"} } } But this is just the first page of hits
  • 166. Pagination - from www.objectrocket.com 166 • Add the from parameter to a query to specify the offset from the first result you want to fetch (it defaults to 0). GET shakespeare/_search?pretty { "from": 20, "size": 20, "query": { "match": { "play_name": "Hamlet"} } } Get the next page of hits
• 167. Pagination - Scroll www.objectrocket.com 167 • While a search request returns a single "page" of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database • To initiate a scroll search, add the scroll parameter to your search query GET shakespeare/_search?scroll=1m { "size": 1000, "query": { "match_all": {} } } If the scroll is idle for more than 1 minute, it is deleted Number of hits to return per batch
  • 168. Pagination - Scroll www.objectrocket.com 168 • The result from the above request includes the first page of results and a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results. POST /_search/scroll { "scroll" : "1m", "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9la VYtZndUQlNsdDcwakFMNjU1QQ==" } Note that the URL should not include the index name - this is specified in the original search request instead.
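When you finish iterating early, the scroll can be cleared explicitly to free its resources instead of waiting for the timeout; <scroll_id> below is a placeholder for the id returned by the search:
DELETE /_search/scroll { "scroll_id" : ["<scroll_id>"] }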
• 169. Search multiple fields www.objectrocket.com 169 • The multi_match query provides a convenient shorthand for running a match query against multiple fields ‒ by default, the _score from the best field is used (a best_fields search) GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker", "text_entry" ], "type": "best_fields" } } }' 3 fields are queried (which results in 3 scores) and the best score is used
• 170. Search – per-field boosting www.objectrocket.com 170 • If we want to add more weight to hits on a particular field - in this example, let's say we're more interested in the speaker field than in play_name - we can boost the score of a field using the caret (^) symbol GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker^2", "text_entry" ], "type": "best_fields" } } }' We get the same number of hits, but the top hits are different.
• 171. Misspelled words - fuzziness www.objectrocket.com 171 • Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word - Fuzziness is something that can be assigned a value - It refers to the number of character modifications, known as edits, needed to make two words match - Can be set to 0, 1 or 2, or to "auto" Fuzziness = 1: "Hamled" → "Hamlet" (one edit: d→t) Fuzziness = 2: "Hamlled" → "Hamlet" (two edits: drop the extra l, d→t)
  • 172. Add fuzziness to a query www.objectrocket.com 172 GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": "Hamled" } } }' GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": { "query": "Hamled", "fuzziness": 1 }} }}' 0 hits 4244 hits
• 173. Search exact terms www.objectrocket.com 173 • If we need to search for the exact text, we use a match query on the keyword sub-field, which holds the original, non-analyzed string: GET shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry.keyword": "To be, or not to be: that is the question" } } }' Exactly 1 hit
• 174. Sorting www.objectrocket.com 174 • The results of a query are returned in order of relevancy; _score descending is the default sorting for a query • A query can contain a sort clause that specifies one or more fields to sort on, as well as the order (asc or desc) GET /shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry": "question" } }, "sort": [ {"play_name": {"order": "desc"} } ] }' "hits" : [ { "_index" : "shakespeare", "_type" : "doc", "_id" : "55924", "_score" : null, "_source" : {.....} If _score is not part of the sort clause, it is not calculated => fewer compute resources
  • 175. Highlighting www.objectrocket.com 175 • A common use case for search results is to highlight the matched terms. GET /shakespeare/_search?pretty -d '{ "query": { "match_phrase": { "text_entry": "Hamlet" } }, "highlight": { "fields": { "text_entry": {} } } }' "_source" : { "type" : "line", "line_id" : 36184, "play_name" : "Hamlet", "speech_number" : 99, "line_number" : "5.1.269", "speaker" : "QUEEN GERTRUDE", "text_entry" : "Hamlet, Hamlet!" }, "highlight" : { "text_entry" : [ "<em>Hamlet</em>, <em>Hamlet</em>!" ] } } The response contains a highlight section
  • 176. Range query www.objectrocket.com 176 • Matches documents with fields that have terms within a certain range. The type of the Lucene query depends on the field type, for string fields, the TermRangeQuery, while for number/date fields, the query is a NumericRangeQuery • The range query accepts the following parameters: gte, gt, lte, lt, boost GET _search { "query": { "range" : { "age" : { "gte" : 10, "lte" : 20 } } } }
  • 177. Exists query www.objectrocket.com 177 • Returns documents that have at least one non-null value in the original field: • There isn't a missing query, instead use the exists query inside a must_not clause GET /_search { "query": { "exists" : { "field" : "user" } } } GET /_search { "query": { "bool": { "must_not": { "exists": { "field": "user" } } } } }
  • 178. Wildcard query www.objectrocket.com 178 • Matches documents that have fields matching a wildcard expression; • Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. • Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ? GET shakespeare/_search?pretty -d { "query": { "wildcard" : { "play_name" : "Henry*" } } }
  • 179. Regexp query www.objectrocket.com 179 • The regexp query allows you to use regular expression term queries • The "term queries" in that first sentence means that Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field • Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow as well as using lookaround regular expressions. GET shakespeare/_search?pretty -d { "query": { "regexp":{ "play_name": "H.*t"} } }
  • 180. Aggregations www.objectrocket.com 180 • Aggregations are a way to perform analytics on your indexed data • There are four main types of aggregations: - Metric: aggregations that keep track and compute metrics over a set of documents. - Bucketing: aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. - Pipeline: aggregations that aggregate the output of other aggregations and their associated metrics - Matrix: aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting and its functionality is currently experimental
  • 181. Aggregations - Metric www.objectrocket.com 181 • Most metrics are mathematical operations that output a single value: avg, sum, min, max, cardinality • Some metrics output multiple values: stats, percentiles, percentile_ranks • Example: what's the maximum value of the "age" field GET account/_search?pretty -d '{ "size": 0, "aggs": { "max_age": { "max": { "field": "age" } } } }' "aggregations" : { "max_age" : { "value" : 40.0 } } }
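Since stats returns several of these metrics at once, a quick sketch against the same (assumed) account index and age field:
GET account/_search?pretty -d '{ "size": 0, "aggs": { "age_stats": { "stats": { "field": "age" } } } }'
This returns count, min, max, avg and sum in a single response.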
• 182. Aggregations - bucket www.objectrocket.com 182 • Bucket aggregations don't calculate metrics over fields like the metrics aggregations do; instead, they create buckets of documents • Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their "parent" bucket aggregation • The terms aggregation is very handy: it dynamically creates a new bucket for every unique term it encounters in the specified field, and is a good way to get a feel for what your data looks like
  • 183. Aggregations www.objectrocket.com 183 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5 } } } }' • Example: What are the unique play names we have in our index "size" - number of buckets to create (default is 10)
• 184. Aggregations www.objectrocket.com 184 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3045, "sum_other_doc_count" : 91399, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244 }, { "key" : "Coriolanus", "doc_count" : 3992 }, { "key" : "Cymbeline", "doc_count" : 3958 }, { "key" : "Richard III", "doc_count" : 3941 }, { "key" : "Antony and Cleopatra", "doc_count" : 3862 } ]}}} • Notice each bucket has a "key" that represents the distinct value of "field", and a "doc_count" for the number of docs in the bucket
  • 185. Nesting buckets www.objectrocket.com 185 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 1 }, "aggs": { "speakers": { "terms": { "field": "speaker", "size": 5 } } } } } }' The play names are bucketed, then, within each play bucket, our documents are bucketed by speaker.
  • 186. Nesting buckets www.objectrocket.com 186 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3395, "sum_other_doc_count" : 107152, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244, "speakers" : { "doc_count_error_upper_bound" : 48, "sum_other_doc_count" : 1698, "buckets" : [ { "key" : "HAMLET", "doc_count" : 1582 }, { "key" : "KING CLAUDIUS", "doc_count" : 594 }, { "key" : "LORD POLONIUS", "doc_count" : 370 The result of our nested aggregation Notice two special values returned in a terms aggregation: - “doc_count_error_upper_bound”: maximum number of missing documents that could potentially have appeared in a bucket - “sum_other_doc_count”: number of documents that do not appear in any of the buckets
• 187. Bucket sorting www.objectrocket.com 187 • Sorting can be specified using "order": ‒ _count sorts by doc_count (default in terms) ‒ _key sorts alphabetically (default in histogram and date_histogram) • Sorting can also be done on a metric value in a nested aggregation, as shown in the sketch after this slide GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5, "order": { "_count": "desc" } } } } }'
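A sketch of the nested-metric case: buckets ordered by a sub-aggregation, here a hypothetical avg named avg_line on the numeric line_id field, purely for illustration:
GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5, "order": { "avg_line": "desc" } }, "aggs": { "avg_line": { "avg": { "field": "line_id" } } } } } }'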
  • 188. www.objectrocket.com 188 Lab 5 Advanced Operation Objectives: Learn how to: o Work with mappings o Work with analyzers Steps: 1. Navigate to /Percona2018/Lab05 2. Read the instructions on Lab05.txt https://bit.ly/2D1tXL6