Elastic 101
Antonios Giannopoulos DBA @ Rackspace/ObjectRocket
Alex Cercel DBA @ Rackspace/ObjectRocket
Mihai Aldoiu CDE @ Rackspace/ObjectRocket
linkedin.com/in/antonis | linkedin.com/in/alexcercel | linkedin.com/in/aldoiu
Overview
• Introduction
• Working with data
• Scaling the cluster
• Operating the cluster
• Troubleshooting the cluster
• Upgrade the cluster
• Security best practices
• Working with data – Advanced operations
• Best Practices
Labs
1. Unzip the provided .vmdk file
2. Install and/or open VirtualBox
3. Select New
4. Enter A Name
5. Select Type: Linux
6. Select Version: Red Hat (64-bit)
7. Set Memory to at least 4096 (more won’t hurt)
8. Select "Use an existing ... disk file", select the provided .vmdk file
9. Select Create
10. Select Start
11. Login with username: elasticuser , password: elasticuser
12. Navigate to /Percona2018/Lab01 for the first lab.
https://bit.ly/2D1tXL6
Introduction
● Key Terms
● Installation
● Configuration files
● JVM fundamentals
● Lucene basics
What is elasticsearch?
Lucene:
- A search engine library entirely written in Java
- Developed in 1999 by Doug Cutting
- Suitable for any application that requires full text indexing and searching capability
But:
- Challenging to use
- Not originally designed for scaling
Elasticsearch:
- Built on top of Lucene
- Provides scaling
- Language independent
What is ELK stack?
ElasticSearch:
- The main datastore
- Provides distributed search capabilities
Logstash:
- Parse & transform data for ingestion
- Ingests from multiple sources simultaneously
Kibana:
- An analytics and visualization platform
- Search, visualize & interact with Elasticsearch data
Installing Elasticsearch
Download:
Latest Version: https://www.elastic.co/downloads/elasticsearch
Older Version: Navigate to https://www.elastic.co/downloads/past-releases
The simplest way:
1) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz
2) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz.sha512
3) shasum -a 512 -c elasticsearch-6.3.2.tar.gz.sha512 (it should return elasticsearch-6.3.2.tar.gz: OK)
4) tar -xzf elasticsearch-6.3.2.tar.gz
Installing Java
ElasticSearch requires JRE (JavaSE runtime environment) or JDK (Java
Development Kit)
- OpenJDK CentOS: yum install java-1.8.0-openjdk
- OpenJDK Ubuntu: apt-get install openjdk-8-jre
ES version 6 requires Java 8 or higher https://www.elastic.co/support/matrix
set JAVA_HOME appropriately
- Create a file under /etc/profile.d for example jdk.sh
- Add the following lines:
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-*"
export PATH=$JAVA_HOME/bin:$PATH
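A quick way to verify the setup, assuming OpenJDK 8 was installed via the package manager as above (the exact JVM path may differ on your system):
source /etc/profile.d/jdk.sh
echo $JAVA_HOME    # should print the OpenJDK 8 directory
java -version      # should report openjdk version "1.8.0_..."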
Start the server
Create a user elasticuser*
Using elasticuser execute:
bin/elasticsearch
After some noise:
[INFO ][o.e.n.Node] [name] started
How do I know it is up and running?
*You can’t start ES using root
$ curl -X GET "localhost:9200/"
{
  "name" : "KG-_6s9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "T9uHpto6QtWRmsjzNFrReA",
  "version" : {
    "number" : "6.3.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "053779d",
    "build_date" : "2018-07-20T05:20:23.451332Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Explore the directories
Folder    Description                                                  Setting
bin       Contains the binary scripts, like elasticsearch
config    Contains the configuration files                             ES_PATH_CONF
data      Holds the data (shards/indexes)                              path.data
lib       Contains JAR files
logs      Contains the log files                                       path.logs
modules   Contains the modules
plugins   Contains the plugins. Each plugin has its own subdirectory
Configuration files
elasticsearch.yml
- The primary way of configuring a node.
- It's a template that lists the most important settings for a production cluster
jvm.options
- JVM related options
log4j2.properties
- Elasticsearch uses Log4j 2 for logging
Variables can be set either:
- using the configuration file jvm.options: -Xms512m
- or using the command line: ES_JAVA_OPTS="-Xms512m" ./bin/elasticsearch
Elasticsearch.yml
node.name
- Every node should have a unique node.name
- Set it to something meaningful (aws-zone1-objectrocket-es-01)
cluster.name
- A cluster is a set of nodes sharing the same cluster.name
- Set it to something meaningful (production, qa, staging)
path.data
- Path to directory where to store the data (accepts multiple locations)
path.logs
- Path to log files
Elasticsearch.yml
cluster.name: production
node.name: dc1-prd-es1
path.data: /data/es1
path.logs: /logs/es1
bin/elasticsearch -d -p 'elastic.pid'
$ curl -X GET "localhost:9200/"
{
"name" : "dc1-prd-es1",
"cluster_name" : "production",
…
jvm.Options
Each Elasticsearch node runs on its own JVM instance
JVM is a virtual machine that enables a computer to run Java programs
The most important setting is the Heap Size:
- Xms: Represents the initial size of total heap space
- Xmx: Represents the maximum size of total heap space
Best Practices
- Set Xms and Xmx to the same size
- Set Xmx to no more than 50% of your physical RAM
- Do not set Xms and Xmx over 30ish GiB
- Use the server version of OpenJDK
- Lock the RAM for the heap with bootstrap.memory_lock
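As a minimal sketch of these practices, on a host with 64 GiB of RAM the relevant settings might look like this (the 26g figure is an example, not a rule):
# jvm.options – fixed heap, identical min/max, below the ~30 GiB cutoff
-Xms26g
-Xmx26g
# elasticsearch.yml – lock the heap in RAM (requires a raised memlock ulimit)
bootstrap.memory_lock: true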
jvm.Options
Heap vs. off heap
On the heap live, among other things:
- Indexing buffer
- Completion suggester
- Cluster state
- … and more
- Caches:
  - query cache (10%)
  - field data cache (unbounded)
  - …
jvm.Options
Garbage collector
- It is a form of automatic memory management
- Gets rid of objects which are not being used by a Java application anymore
- Automatically reclaims memory for reuse
Garbage collectors
- ConcMarkSweepGC (CMS)
- G1GC (has some issues with JDK 8)
Elasticsearch uses -XX:+UseConcMarkSweepGC
GC threads
-XX:ParallelGCThreads=N, where N varies on the platform
-XX:ParallelCMSThreads=N , where N varies on the platform
jvm.Options
[Diagram: JVM heap layout – the New Generation (Eden, S0, S1; sized with -Xmn, cleaned by minor GCs), the Old Generation (cleaned by major/full GCs) and PermGen (-XX:PermSize / -XX:MaxPermSize); the total heap is bounded by -Xms/-Xmx]
1) A new object is stored in Eden
2) If it survives a GC, it moves to S0/S1
3) After multiple GCs, when S0 or S1 gets full, objects move to the Old Generation
OS settings
Disable swap
- sysctl vm.swappiness=1
- Remove Swap
File descriptors
- Set nofile to 65536
- curl -X GET "<host>:<port>/_nodes/stats/process?filter_path=**.max_file_descriptors"
Virtual Memory
- sysctl -w vm.max_map_count=262144
Max user process
- nproc to 4096
DNS cache settings
- networkaddress.cache.ttl=<timeout>
- networkaddress.cache.negative.ttl=<timeout>
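A hedged example of applying these settings on a CentOS-style host, assuming the elasticuser account from the labs runs Elasticsearch (paths and limits may differ in your environment):
# runtime kernel settings (persist them in /etc/sysctl.conf or /etc/sysctl.d/)
sudo sysctl -w vm.swappiness=1
sudo sysctl -w vm.max_map_count=262144
# /etc/security/limits.conf – per-user limits for the Elasticsearch user
elasticuser  -  nofile   65536
elasticuser  -  nproc    4096
elasticuser  -  memlock  unlimited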
Network settings
Two network communication mechanisms in Elasticsearch
- HTTP: which is how the Elasticsearch REST APIs are exposed
- Transport: used for internal communication between nodes within the cluster
Node 1 Client
Node 2
HTTP
Transport
Network settings
The REST APIs of Elasticsearch are exposed over HTTP
- The HTTP module binds to localhost by default
- Configure with http.host on elasticsearch.yml
- Default port is the first available between 9200-9299
- Configure with http.port on elasticsearch.yml
Each call that goes from one node to another uses the transport module
- Transport binds to localhost by default
- Configure with transport.host on elasticsearch.yml
- Default port is the first available between 9300-9399
- Configure with transport.tcp.port on elasticsearch.yml
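Putting it together, a sketch of the relevant elasticsearch.yml lines for a node that should be reachable from other machines (0.0.0.0 and the ports are example values):
http.host: 0.0.0.0        # REST API, first free port in 9200-9299
http.port: 9200
transport.host: 0.0.0.0   # node-to-node traffic, first free port in 9300-9399
transport.tcp.port: 9300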
Network settings
network.host sets the bind host and the publish host at the same time
network.publish_host
- The single address the node publishes to the cluster. Defaults to the "best" address from network.host. One interface only
network.bind_host
- The address(es) the node binds to. Defaults to network.host. Multiple interfaces allowed
network.host value Description
_[networkInterface]_ Addresses of a network interface, for example _en0_.
_local_ Any loopback addresses on the system, for example 127.0.0.1.
_site_ Any site-local addresses on the system, for example 192.168.0.1.
_global_ Any globally-scoped addresses on the system, for example 8.8.8.8.
Network settings
Zen discovery
- built-in & default discovery module
- Provides unicast discovery
- Uses the transport module
On elasticsearch.yml:
discovery.zen.ping.unicast.hosts: ["node1", "node2"]
[Diagram: Node 1, Node 2 and Node 3 discovering each other over the transport layer]
1) Retrieves IPs/hostnames from the list of hosts
2) Tries all hosts until it finds a reachable one
3) If the cluster name matches, joins the cluster
4) If not, starts its own cluster
Bootstrap tests
Development mode: if it does not bind transport to an external interface (the default)
Production mode: if it does bind transport to an external interface
Bypass production mode: Set discovery.type to single-node
Bootstrap Tests
- Inspect a variety of Elasticsearch and system settings
- A node in production mode must pass all Bootstrap tests to start
- es.enforce.bootstrap.checks=true on jvm.options
- Highly recommended to have this setting enabled
Bootstrap tests
List of Bootstrap Tests
- Heap size check
- File descriptor check
- Memory lock check
- Maximum number of threads check
- Max file size check
- Maximum size virtual memory check
- Maximum map count check
- Client JVM check
- Use serial collector check
- System call filter check
- OnError and OnOutOfMemoryError checks
- Early-access check
- G1GC check
- All permission check
Lucene
Lucene uses a data structure called Inverted Index.
An Inverted Index inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages).
It allows fast full text searches, at the cost of increased processing when a document is added to the database.
1) Give us your name
2) Give us your home number
3) Give us your home address
Word      Frequency  Location
give      3          1,2,3
us        3          1,2,3
your      3          1,2,3
name      1          1
number    1          2
home      2          2,3
address   1          3
Lucene – Key Terms
A Document is the unit of search and index.
A Document consists of one or more Fields. A Field is simply a name-value pair.
An index consists of one or more Documents.
Indexing: involves adding Documents to an Index
Searching:
- involves retrieving Documents from an index.
- Searching requires an index to have already been built
- Returns a list of Hits
Kibana
Download:
Latest Version: https://www.elastic.co/guide/en/kibana/current/targz.html
Simplest way to install it:
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-linux-x86_64.tar.gz
shasum -a 512 kibana-6.3.2-linux-x86_64.tar.gz
tar -xzf kibana-6.3.2-linux-x86_64.tar.gz
Run Kibana:
kibana-6.3.2-linux-x86_64/bin/kibana
Access Kibana:
http://localhost:5601
Kibana - Devtools
Lab 1
Install and configure Elastic
Objectives:
Learn how to install and configure a standalone Elastic instance.
Steps:
1. Navigate to /Percona2018/Lab01
2. Read the instructions on Lab01.txt
https://bit.ly/2D1tXL6
Working with Data
● Indexes
● Shards
● CRUD Operations
● Read Operations
● Mappings
● Analyzers
Working with Data - Index
• An index in Elasticsearch is a logical way of grouping data:
‒ an index has a mapping that defines the fields in the index
‒ an index is a logical namespace that maps to where its contents are stored in the
cluster
• There are two different concepts in this definition:
‒ an index has some type of data schema mechanism
‒ an index has some type of mechanism to distribute data across a cluster
An index means ....
In the Elasticsearch world, index is used as a:
‒ Noun: a document is put into an index in Elasticsearch
‒ Verb: to index a document is to put the document into an index in Elasticsearch
{
  "type":"line",
  "line_id":4,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.1",
  "speaker":"KING HENRY IV",
  "text_entry":"So shaken as we are, so wan with care,"
}
{
  "type":"line",
  "line_id":5,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.2",
  "speaker":"KING HENRY IV",
  "text_entry":"Find we a time for frighted peace to pant"
}
{
  "type":"line",
  "line_id":6,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.3",
  "speaker":"KING HENRY IV",
  "text_entry":"And breathe short-winded accents of new broils"
}
These documents are indexed to the index my_index.
Define an index
• Clients communicate with a cluster using Elasticsearch’s REST APIs
• An index is defined using the Create Index API, which can be accomplished with a
simple PUT command
# curl -XPUT 'http://localhost:9200/my_index' -i
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 48
{"acknowledged":true,"shards_acknowledged":true}
Shard
• A shard is a single piece of an Elasticsearch index
‒ Indexes are partitioned into shards so they can be distributed across multiple nodes
• Each shard is a standalone Lucene index
‒ The default number of shards for an index is 5. Number of shards can be changed at index
creation time.
[Diagram: my_index partitioned into shards 0–4, distributed across Node 1 and Node 2]
Working with Data - Document
Documents must be JSON objects.
• A document can be any text or numeric data you want to search and/or analyze
‒ Specifically, a document is a top-level object that is serialized into JSON and
stored in Elasticsearch
• Every document has a unique ID
‒ which either you provide, or Elasticsearch generates one for you
{
"type":"line",
"line_id":4,
"play_name":"Henry IV",
"speech_number":1,
"line_number":"1.1.1",
"speaker":"KING HENRY IV",
"text_entry":"So shaken as we are, so wan with care,"
}
Index compression
• Elasticsearch compresses your documents during indexing
‒ documents are grouped into blocks of 16KB, and then compressed together using LZ4
by default
‒ if your documents are larger than 16KB, you will have larger chunks that contain only
one document
• You can change the compression to DEFLATE using the index.codec setting:
‒ reduced storage size at slightly higher CPU usage
PUT my_index
{ "settings": {
"number_of_shards": 3,
"number_of_replicas": 2,
"index.codec" : "best_compression"
}
}
Index a document
The Index API is used to index a document
‒ use a PUT or a POST and add the document in the body request
‒ notice we specify the index, the type and an ID
‒ if no ID is provided, elasticsearch will generate one
# curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d
'{
  "line_id":5,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.2",
  "speaker":"KING HENRY IV",
  "text_entry":"Find we a time for frighted peace to pant"
}'
{"_index":"my_index","_type":"my_type","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
Index without specifying an ID
You can leave off the id and let Elasticsearch generate one for you:
‒ But notice that only works with POST, not PUT
‒ The generated id comes back in the response
# curl -XPOST 'http://localhost:9200/my_index/my_type/' -H 'Content-Type: application/json' -d '
{
  "line_id":6,
  "play_name":"Henry IV",
  "speech_number":1,
  "line_number":"1.1.3",
  "speaker":"KING HENRY IV",
  "text_entry":"And breathe short-winded accents of new broils"
}'
{"_index":"my_index","_type":"my_type","_id":"AWZIq227Unvtccn4Vvrz","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
Reindexing a document
What do you think happens if we add another document with the same ID?
curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H
'Content-Type: application/json' -d '
{
"new_field" : "new_value"
}'
...Overwrites the document
• The old field/value pairs of the document are gone
‒ the old document is deleted, and the new one gets indexed
• Notice every document has a _version that is incremented whenever the document is changed
# curl -XGET http://localhost:9200/my_index/my_type/1?pretty -H 'Content-Type: application/json'
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_version" : 2,
"found" : true,
"_source" : {
"new_field" : "new_value"
}
}
The _create endpoint
If you do not want a document to be overwritten if it already exists, use the _create
endpoint
‒ if the document already exists, no indexing occurs and a 409 error message is returned:
# curl -XPUT 'http://localhost:9200/my_index/my_type/1/_create' -H 'Content-Type:
application/json' -d '
{"new_field" : "new_value"}'
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[my_type][
1]: version conflict, document already exists (current version
[2])","index_uuid":"JGY3Q_9NRjWe-wU-
MlK44Q","shard":"3","index":"my_index"}],"type":"version_conflict_engine_exception","rea
son":"[my_type][1]: version conflict, document already exists (current version
[2])","index_uuid":"JGY3Q_9NRjWe-wU-
MlK44Q","shard":"3","index":"my_index"},"status":409}
Locking ?
- Every indexed document has a version number
- Elasticsearch uses Optimistic concurrency control without locking
# curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=3' -d
'{
...
}'
# 200 OK
# curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=2' -d
'{
...
}'
# 409 Conflict
The _update endpoint
To update fields in a document use the _update endpoint.
- Make sure to add the “doc” context
curl -XPOST 'http://localhost:9200/my_index/my_type/1/_update' -H 'Content-Type:
application/json' -d '
{ "doc":
{
"line_id":10,
"play_name":"Henry IV",
"speech_number":1,
"line_number":"1.1.7",
"speaker":"KING HENRY IV",
"text_entry":"Nor more shall trenching war channel her fields"
}
}'
{"_index":"my_index","_type":"my_type","_id":"1","_version":3,"result":"updated","_shar
ds":{"total":2,"successful":2,"failed":0}}
Retrieve a document
Use GET to retrieve an indexed document
‒ Notice we specify the index, the type and an ID
‒ Returns a 200 code if document found or a 404 error if the document is not found
# curl -XGET http://localhost:9200/my_index/my_type/1?pretty
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"line_id" : 5,
"play_name" : "Henry IV",
"speech_number" : 1,
"line_number" : "1.1.2",
"speaker" : "KING HENRY IV",
"text_entry" : "Find we a time for frighted peace to pant"
}
}
Deleting a document
Use DELETE to delete an indexed document
‒ response code is 200 if the document is found, 404 if not
# curl -XDELETE 'http://localhost:9200/my_index/my_type/1/' -H 'Content-Type: application/json'
{"found":true,"_index":"my_index","_type":"my_type","_id":"1","_version":7,"result":"deleted","_shards":{"total":2,"successful":2,"failed":0}}
A simple search
Use a GET request sent to the _search endpoint
‒ every document is a hit for this search
‒ by default, Elasticsearch returns 10 hits
curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json'
{
  "took" : 1,          <- number of ms it took to process the query
  "timed_out" : false,
  ….
  },
  "hits" : {
    "total" : 2,       <- number of documents that were hits for this query
    "max_score" : 1.0,
    "hits" : [ ...     <- array containing the documents hit by the search criteria
    ]
  } }
This GET with no body searches for all docs of my_type in my_index.
CRUD Operations Summary
Index:  PUT my_index/my_type/4
Create: PUT my_index/my_type/4/_create
        { "speaker":"KING HENRY IV",
          "text_entry":"To be commenced in strands afar remote." }
Read:   GET my_index/my_type/4
Update: POST my_index/my_type/4/_update
        { "doc" : { "text_entry":"No more the thirsty entrance of this soil" } }
Delete: DELETE my_index/my_type/4
Mapping – what is it?
• Elasticsearch will index any document without knowing its details (number of fields,
their data types, etc.) - dynamic mapping
‒ However, behind-the-scenes Elasticsearch assigns data types to your fields in a
mapping. Mapping is the process of defining how a document, and the fields it contains,
are stored and indexed
A mapping is a schema definition that contains:
‒ names of fields
‒ data types of fields
‒ how the field should be indexed and stored by Lucene
• Mappings map your complex JSON documents into the simple flat documents that
Lucene expects.
Defining a mapping
• In most use cases you will want to define your own mappings, but it is not required. When you index a document, Elasticsearch dynamically creates or updates the mapping
• Mappings are defined in the "mappings" section of an index. You can:
‒ define mappings at index creation, or
‒ add to a mapping of an existing index
PUT my_index
{
"mappings": {
define mapping here
}
}
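For instance, a minimal sketch of a mapping defined at index creation for the Shakespeare-style documents used earlier (the field choices are illustrative, not the dataset's official mapping):
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "speaker":    { "type": "keyword" },
        "line_id":    { "type": "integer" },
        "text_entry": { "type": "text" }
      }
    }
  }
}'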
Let's view a mapping
GET my_index/_mapping
{
"my_index" : {
"mappings" : {
"my_type" : {
"properties" : {
"line_id" : {
"type" : "long"
},
"line_number" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"play_name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
...
The “properties” section
contains the fields and data
types in your documents
Elasticsearch data types for fields
• Simple types, including:
‒ text: for full text (analyzed) strings
‒ keyword: for exact value strings
‒ date: string formatted as dates, or numeric dates
‒ integer types: like byte, short, integer, long
‒ floating-point numbers: float, double, half_float, scaled_float
‒ boolean
‒ ip: for IPv4 or IPv6 addresses
• Hierarchical Types: like object and nested
• Specialized types: geo_point, geo_shape and percolator
• Range types and more
Updating existing mapping
• Existing field mappings cannot be updated. Changing the mapping would mean
invalidating already indexed documents.
- Instead, you should create a new index with the correct mappings
and reindex your data into that index.
There are some exceptions to this rule:
• new properties can be added to Object datatype fields.
• new multi-fields can be added to existing fields.
• the ignore_above parameter can be updated.
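As an example of the first exception, a new field can be added to an existing mapping with the Put Mapping API; the act_number field below is hypothetical:
curl -X PUT "localhost:9200/my_index/_mapping/my_type" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "act_number": { "type": "integer" }
  }
}'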
Prevent mapping explosion
• Defining too many fields in an index is a condition that can lead to a mapping
explosion, which can cause out of memory errors and difficult situations to recover
from.
- For example, when using dynamic mapping and every newly inserted document introduces new fields.
• The following settings allow you to limit the number of field mappings that can be
created manually or dynamically
index.mapping.total_fields.limit - maximum number of fields in an index, defaults to 1000
index.mapping.depth.limit - maximum depth for a field, which is measured as the number of
inner objects, defaults to 20
index.mapping.nested_fields.limit - maximum number of nested fields in an index,
defaults to 50
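These are index settings, so a sketch of raising the field limit on an existing index might look like this (2000 is an arbitrary example value):
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.mapping.total_fields.limit": 2000
}'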
Analysis
• Analysis is the process of converting full text into terms (tokens) which are added to
the inverted index for searching.
- Analysis is performed by an analyzer which can be either a built-in analyzer or
a custom analyzer defined per index.
For example, at index time the built-in standard analyzer will first convert the sentence
into distinct tokens:
"Welcome to Percona Live - Open Source Database Conference 2018"
[ welcome to percona live open source database conference 2018 ]
The analyzer then lowercases each token; depending on the analyzer it may also remove frequent stopwords
The analyze api
• The _analyze api can be used to test what an analyzer will do to your text
curl -s -XGET "localhost:$ES_PORT/_analyze" -H 'Content-Type: application/json' -d'
{"analyzer": "standard", "text": "Welcome to Percona Live - Open Source Database Conference 2018"}' |
python -m json.tool | grep token
"tokens": [
"token": "welcome",
"token": "to",
"token": "percona",
"token": "live",
"token": "open",
"token": "source",
"token": "database",
"token": "conference",
"token": "2018",
Built-in analyzers
• Standard - the default analyzer
• Simple – breaks text into terms whenever it encounters a character which is not a letter
• Keyword – simply indexes the text exactly as is
• Others include:
‒ whitespace, stop, pattern, language, and more are described in the docs at
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
- custom analyzers built by you
Analyzer components
• An analyzer consists of three parts:
1. Character Filters
2. Tokenizer
3. Token Filters
[Diagram: Input string → Character Filters → string → Tokenizer → tokens → Token Filters → tokens → Output]
Specifying an analyzer
• At index time:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
• At search time:
Usually, the same analyzer should be applied at index
time and at search time, to ensure that the terms in the
query are in the same format as the terms in the inverted
index.
By default, queries will use the analyzer defined in
the field mapping, but this can be overridden with
the search_analyzer setting:
PUT my_index {
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" }}}}}
Custom analyzer
• Best described with an example: let's create a custom analyzer based on the standard one, but which also removes stop words
PUT my_index {
"settings": {
"analysis": {
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": ["to", "and", "or", "is", "the"]
} },
"analyzer": { "my_content_analyzer": {
"type": "custom",
"char_filter": [],
"tokenizer": "standard",
"filter": ["lowercase","my_stopwords"] } } }}}
Scaling the cluster
● 10 000ft view on scaling
● Node roles
● Adding a node to a cluster
● Understanding shards
● Replicas
● Read/Write model
● Sample Architectures
10 000ft view on scaling
• ElasticSearch has the potential to be always available as long as we take advantage of its scaling features.
• With vertical scaling (better hardware) having its limitations, we'll take a look at horizontal scaling (more nodes in the same cluster).
• While horizontal scaling has its challenges with other datastores, such as sharding for MongoDB (Antonios has written an amazing tutorial on managing a sharded cluster; you must check it out), ElasticSearch is designed to be distributed by nature. So as long as replicas are being used, the application development and administration overhead of scaling out the cluster is minimal.
10 000ft view on scaling
• We defined shards as the elements that compose an index; each shard is itself a Lucene index.
• By default, ElasticSearch will create 5 per index. But what if everything lives on one node and that node goes down? We face disaster. This is where replicas come in.
• A replica of a shard is an exact copy of that shard that lives on another node.
• A node is simply an ElasticSearch process. One or more nodes sharing the same value for the "cluster.name" directive in the config file make up a cluster.
10 000ft view on scaling
• All nodes know about all the others in the cluster and can direct a request to another node if needed.
• Nodes handle both HTTP (external) traffic and transport (intra-cluster) traffic. If a node receives an HTTP request for data that lives on another node, it forwards the request over the transport layer.
• Nodes can have one or more roles in the cluster.
Node Roles
• Master-eligible node: A node that has "node.master" set to true (default), which makes it eligible to be elected as the master node, which controls the cluster and carries out administrative functions such as deleting and creating indexes.
• Data node: A node that has "node.data" set to true (default). Data nodes hold data and perform data related operations such as CRUD, search, and aggregations.
• Ingest node: A node that has "node.ingest" set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing, such as adding a field that wasn't there before. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as "node.ingest: false"
Node Roles
• Tribe node: A tribe node, configured via the tribe.* settings, is a special type of coordinating
only node that can connect to multiple clusters and perform search and other operations across all
connected clusters. In later versions of Elastic, this role became obsolete
• Kibana node: In case Kibana is being used on a large scale with many users running
complex queries, you can have a dedicated node or nodes for it.
• To summarize: by default any node is master-eligible, acts as a data node and handles ingestion, including ingest pipelines. As the cluster grows, it makes sense to define roles in order to separate the overhead of different operations (maintaining the cluster, ingest pipelines, connecting clusters, etc).
Adding a node to a cluster
• To add a node or start a cluster, we need to set the directive "cluster.name" to a descriptive value in /etc/elasticsearch/elasticsearch.yml; all nodes need to have the same cluster.name.
• By default, ElasticSearch binds to the loopback interface, so we must edit the networking section of the config file and bind the daemon to a specific IP, or use 0.0.0.0 for all interfaces.
• We must name our nodes, again, with descriptive values (node.name).
• Nodes running on the same host will be auto-discovered, but remote nodes will use zen discovery, which takes a list of IPs that will assemble the cluster. The firewall must allow communication on ports 9200 and 9300.
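A minimal sketch of the resulting elasticsearch.yml for the second node of a two node cluster (node name and IPs are placeholders):
cluster.name: democluster
node.name: demo-node-2
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"]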
Adding a node to a cluster
• Of course, there are more options that you can configure but for the sake of this exercise, these
will be enough.
• Once these are set, restart the daemon and a /_cluster/health?pretty should return something like:
curl -X GET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "democluster",
  ….
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  ……
}
Understanding shards
A shard is a worker unit that holds data, can be assigned to a node, and is itself a Lucene index. Think of it as a self-contained search engine that handles a portion of the data.
‒ An index is merely a virtual namespace which points to a number of shards
[Diagram: my_index in my_cluster – five shards spread across Node1 and Node2]
An index is "split" into shards before any documents are indexed
Primary and Replica
• There are two types of shards:
- primary: the original shards of an index
- replicas: copies of the primary
• Documents are replicated between a primary and its replicas
- a primary and all replicas are guaranteed to be on different nodes
[Diagram: my_cluster – primaries P0–P4 and replicas R0–R4 spread across Node1 and Node2, with a primary and its replica never on the same node]
Number of Primary shards
• The number of primary shards is fixed – the default for an index is 5
• You can specify a different number of shards when you create the index.
• Changing the number of shards after the index has been created can be done with the
split or shrink index API but it’s NOT a trivial operation. It’s basically the same as
reindexing. Plan accordingly.
PUT my_new_index
{
"settings": {
"number_of_shards": 3
}
}
Replicas are good for
• High availability
- We can lose a node and still have all the data available
- After losing a primary, Elasticsearch will automatically promote a replica to a
primary and start replicating unassigned replicas
• Read throughput
- Replicas can handle query/read requests from client applications
- Allows you to scale your data and better utilize cluster resources
You can change the number
of replicas for an index at
any time using the
_settings endpoint:
PUT my_index/_settings
{
"number_of_replicas": 2
}
Replicas
• Let’s play a bit with replicas. In this example I’ve indexed Shakespeare’s work again. Here is the
cluster and the index:
curl -X GET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "democluster",
  "status" : "yellow",
  ….
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "unassigned_shards" : 5,
  …
}
curl -XGET localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 22.4mb 22.4mb
Yellow indicates a problem. What do you think the
problem is? What would be a solution here?
Replicas
• Replicas will get automatically assigned if the topology permits it. All I did was start a second node:
• We can change the number of replicas dynamically in the index settings. This is a trivial operation, unlike changing the number of shards.
curl -XGET localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 44.9mb 22.4mb
curl -X PUT "localhost:9200/shakespeare/_settings" -H 'Content-
Type: application/json' -d'
> {
> "index" : {
> "number_of_replicas" : 0
> }
> }
> '
{"acknowledged”}
Write Path
• The process of keeping the primary shard in sync with its replicas is called a data replication model. ElasticSearch's data replication model is based on the primary-backup model: one primary and n backups.
• This model runs on top of replication groups. We've seen that, by default, we have 5 primary shards and each of these shards has 1 replica. In the diagram below, we have two replication groups and each primary has 3 replicas.
• Within a replication group, the primary shard is responsible for indexing and for keeping the replicas up to date. At any point some replicas might be offline, so the master node keeps an in-sync group of the copies that are online and have received all the writes the user has acknowledged.
[Diagram: two replication groups, each consisting of one primary and three replicas]
Write Path
• The primary validates the incoming operation and its documents, executes the operation locally, forwards it to all replicas in the in-sync list, and acknowledges the write once all replicas in the list have run the operation.
• Some notes about failure handling: in case a primary fails, indexing will stop for up to 1 minute while the master promotes a new primary. A primary will also check with its replicas to make sure it is still primary and wasn't demoted for whatever reason; an operation coming from a stale primary will be declined by the replicas.
[Diagram: write flow – 1) primary validates and executes locally, 2) forwards to the in-sync replicas, 3) acknowledges the write]
Read Path
• The node that receives the query (called the coordinating node) will find the relevant shards for the read request, select an active copy of the data (primary or replica, round-robin) from each replication group, send the read request to the selected copies, combine the results and respond.
• The request to each shard is single-threaded, but multiple shards can be queried in parallel.
Read Path
• Because the active copy is selected round-robin, this is where adding more replicas helps: each new request can hit a different replica, so the work is spread out.
• Failure handling is much easier here. If for some reason a response is not received, the coordinating node will resubmit the read request to the relevant replication group, pick a different replica, and the same flow reapplies.
Sample Architectures
• For lightweight searches, and where the data can be reindexed without suffering loss, single node clusters are not unheard of.
• A basic deployment with data resilience is the two node cluster. Most SaaS providers start with this deployment.
• The two node model can be scaled as much as needed, but is usually recommended only if you are running basic indexing/search operations. If more granularity is needed, the data can be reindexed with a higher number of shards and replicas.
• When the number of nodes in the cluster gets really high or the operations get complex, it's time to separate the roles. Separating the nodes also needs to take into consideration the cases where you would lose one or more nodes of a specific role. For instance, if you're using ingest-only, data-only and master-only nodes, you need to consider what happens if you lose one or more of each.
Sample Architectures
• ObjectRocket starts with 4 ingest nodes, 2 Kibana nodes, 2 data nodes and 3 master nodes.
• We don't care how many client nodes we lose as long as we have 1 remaining.
• The master nodes pick an active master based on quorum. This helps with split brain.
• Data nodes, of course, we can lose at most one.
• Consider redundant components as much as possible.
• We will cover security in a later chapter. By default, in the community version, there is no built-in security. In this case, firewall limitations are a must-have.
Lab 2
Scaling the cluster
Objectives:
Adding nodes to your cluster
Change the number of Replicas
Steps:
1. Navigate to /Percona2018/Lab02/
2. Read the instructions on Lab02.txt
https://bit.ly/2D1tXL6
Operating the cluster
● Working with nodes
● Working with shards
● Reindex
● Backup/Restore
● Plugins
Cheatsheet
curl -X GET "<host>:<port>/_cluster/settings"
curl -X GET "<host>:<port>/_cluster/settings?include_defaults=true"
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent" : {
    "name of the setting" : value
  }}'
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "name of the setting" : null
  }}'
Shard Allocation
Allow control over how and where shards are allocated
Shard Allocation settings (cluster.routing.allocation)
- enable
- node_concurrent_incoming_recoveries
- node_concurrent_outgoing_recoveries
- node_concurrent_recoveries
- same_shard.host
Shard Rebalancing settings (cluster.routing.rebalance)
- enable
- allow_rebalance
- cluster_concurrent_rebalance
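For example, temporarily disabling allocation before maintenance and re-enabling it afterwards is a matter of toggling cluster.routing.allocation.enable (a sketch using the transient scope):
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }}'
# ... perform maintenance, then restore the default:
curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }}'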
Shard Allocation - Disk
cluster.routing.allocation.disk.threshold_enabled: Defaults to true
Low: Do not allocate new shards. Defaults to 85%
High: Try to relocate shards. Defaults to 90%
Flood_stage: Enforces a read-only index block. Must be released manually. Defaults to 95%
cluster.info.update.interval How often Elasticsearch should check on disk usage (Defaults to 30s)
cluster.routing.allocation.disk.include_relocations: Defaults to true – Could lead to false alerts
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }}'
Shard Allocation – Rack/Zone
Make Elasticsearch aware of the topology
- it can ensure that the primary shard and its replica shards are spread across different:
  - physical servers (node.attr.phy_host)
  - racks (node.attr.rack_id)
  - availability zones (node.attr.zone)
- Minimize the risk of losing all shard copies at the same time
- Minimize latency
Configuration:
cluster.routing.allocation.awareness.attributes: zone, rack_id
Force awareness:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
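Each node also needs to advertise its own attribute value; a sketch of the per-node side of this configuration (zone1 is a placeholder):
# elasticsearch.yml on a node located in zone1
node.attr.zone: zone1
cluster.routing.allocation.awareness.attributes: zone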
Restart node(s)
Elasticsearch wants your data to be fully replicated and evenly balanced.
When a node goes down:
- The cluster immediately recognizes the change
- Rebalancing begins
- Rebalancing takes time and can become costly
During a planned maintenance you should hold off on rebalancing
Restart node(s)
Steps:
1) Flush pending indexing operations POST /_flush/synced
2) Disable shard allocation
3) Shut down a single node
4) Perform maintenance
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}
Restart node(s)
5) Restart the node, and confirm that it joins the cluster.
6) Re-enable shard allocation as follows:
7) Check the cluster health
PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.enable" : "all"
}
}
Restart node(s)
You can also make Elastic less sensitive to changes.
By default, the master waits 1m after a node leaves before it instructs shard relocations.
During planned restarts we can raise this threshold.
Useful setting for slow or unreliable networks.
PUT _all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m"
}
}
Remove a node
Elastic automatically detects topology changes.
In order to remove a node you need to drain it and then stop it
Where attribute:
_name: Match nodes by node names
_ip: Match nodes by IP addresses (the IP address associated with the hostname)
_host: Match nodes by hostnames
PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude.{attribute}" : "<value>"
  }
}
Remove a node
Additional considerations:
- Master-eligible node
- Seed nodes
- Space considerations
- Performance considerations
- If possible stop writes
- Do not allow new allocations ("cluster.routing.allocation.enable" : "none")
- Overhead from the shard drains
- Throttle (indices.recovery.max_bytes_per_sec)
- One node at a time (cluster.routing.allocation.disk.watermark)
Move shards manually (Reroute API) – see the sketch below
- Flush and if possible stop writes
- Safe for replicas, not recommended for primaries (may lead to data loss)
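A sketch of such a manual move with the reroute API (index, shard number and node names are placeholders):
curl -X POST "<host>:<port>/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index": "my_index",
        "shard": 0,
        "from_node": "node-to-drain",
        "to_node": "node-to-keep"
      }
    }
  ]}'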
Remove a node
Cancel the drain of a node by removing the exclusion, i.e. resetting the attribute:
Where attribute:
_name: Match nodes by node names
_ip: Match nodes by IP addresses (the IP address associated with the hostname)
_host: Match nodes by hostnames
PUT _cluster/settings
{
"transient" : {
"cluster.routing.allocation.exclude.{attribute}": ""
}
}
Replace a node
Similar to removing a node, with the difference that you need to add a node as well.
Simplest approach: add a new node and then drain the old node
Additional considerations:
- Master-eligible/Seed nodes
- Do not allow new allocations (cluster.routing.allocation.exclude._name)
- Overhead from drain/throttle (indices.recovery.max_bytes_per_sec)
- Space considerations
- Max amount of data each node can get. Watermark
Alternatively use the reroute API to drain the node
Working with Shards
Number of Shards/Replicas
- Defined on Index creation
- Number of Replicas changes dynamically
- Number of Shards can change using:
- shrink API
- split API
- reindex API
Why increase the number of shards:
- Index size
- Performance considerations
- Hard limits (LUCENE-5843)
Almost the same reasons apply when decreasing the number of shards
Shrink API
Shrinks an existing index into a new one with fewer primary shards:
- The number of shards in the target index must be a factor of the number of shards in the source index
- If the number of shards is prime, the index can only be shrunk into a single primary shard
- Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node
Works as follows:
- First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards.
- Then it hard-links segments from the source index into the target index.
- Finally, it recovers the target index as though it were a closed index which had just been re-opened.
Shrink API
In order to shrink an index, the index must be marked as read-only, and a copy of every shard in the
index must be relocated to the same node and have health green
Note that it may take a while…
Check progress using GET _cat/recovery?v
curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }}'
Shrink API
Finally it's time to shrink the index:
It is similar to the create index API – almost the same arguments
Some constraints apply
curl -X POST "<host>:<port>/my_source_index/_shrink/my_target_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_replicas": <number>,
    "index.number_of_shards": <number>,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }}'
Split API
Splits an existing index into a new index:
- The original primary shard is split into two or more primary shards.
- The number of splits is determined by the index.number_of_routing_shards setting
The _split API requires the source index to be created with a specific number_of_routing_shards in
order to be split in the future. This requirement has been removed in Elasticsearch 7.0
Works as follows:
- First, it creates a new target index with a larger number of primary shards.
- Then it hard-links segments from the source index into the target index.
- Once the low level files are created all documents will be hashed again to delete documents that
belong to a different shard.
- Finally, it recovers the target index as though it were a closed index which had just been re-opened.
Split API
In order to split an index, the index must be marked as read-only (assuming the index has number_of_routing_shards set)
curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.blocks.write": true
  }}'
Split the index:
curl -X POST "<host>:<port>/my_source_index/_split/my_target_index?copy_settings=true" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 2
  }}'
Reindex API - Definition
- Does not copy the settings of the source index
- version_type : internal/external
- source supports “query”, multi-indexes & remote location
- URL parameters: refresh, wait_for_completion, wait_for_active_shards, timeout, scroll and
requests_per_second
- Supports painless scripts to manipulate indexing
curl -X POST "<host>:<port>/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "<source index>"
  },
  "dest": {
    "index": "<destination index>"
  }}'
Reindex API – Response Body
"took": 1200,
"timed_out": false,
"total": 10,
"updated": 0,
"created": 10,
"deleted": 0,
"batches": 1,
"noops": 0,
"version_conflicts": 2,
"retries": {
"bulk": 0,
"search": 0},
"throttled_millis": 0,
"requests_per_second": 1,
"throttled_until_millis": 0,
"failures": [ ]
Total milliseconds the entire operation took
The number of documents that were successfully processed
Summary of the operation counts
The number of version conflicts that reindex hit
Throttling Statistics
Reindex API
Active Reindex jobs:
GET _tasks?detailed=true&actions=*reindex
Cancel a Reindex job:
POST _tasks/<id of the reindex>/_cancel
Re-Throttle:
POST _reindex/<id of the reindex>/_rethrottle?requests_per_second=-1
Reindexing from a remote server:
- Uses an on-heap buffer that defaults to a maximum size of 100mb
- May need to use a smaller batch size
- Configure socket_timeout and connect_timeout. Both default to 30 seconds
Snapshots - Backup
A snapshot is a backup taken from a running Elasticsearch cluster
Snapshots are taken incrementally
Version compatibility – one major version behind
You must register a snapshot repository before you can perform snapshot operations
path.repo must exist in elasticsearch.yml
curl -X GET "<host>:<port>/_snapshot/_all"
curl -X PUT "<host>:<port>/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "backup location"
  }}'
Snapshots - Backup
Shared location: On elasticsearch.yml: path.repo: ["/mount/backups0", "/mount/backups1"]
Don’t forget to register it!!!
Registration options
location: Location of the snapshots
compress: Turns on compression of the snapshot files. Defaults to true.
chunk_size: Big files can be broken down into chunks. Defaults to null (unlimited chunk size)
max_restore_bytes_per_sec: Throttles per node restore rate. Defaults to 40mb/second
max_snapshot_bytes_per_sec: Throttles per node snapshot rate. Defaults to 40mb/second
readonly: Makes repository read-only. Defaults to false
Snapshots - Backup
wait_for_completion: whether the request should return immediately after snapshot initialization (default) or wait until the snapshot completes
ignore_unavailable: Ignores indexes that don't exist
include_global_state: Set to false to prevent the cluster global state from being stored as part of the snapshot
curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2,index_3",
  "ignore_unavailable": true,
  "include_global_state": false
}'
curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
Snapshots - Backup
IN_PROGRESS: The snapshot is currently running.
SUCCESS: The snapshot finished and all shards were stored successfully.
FAILED: The snapshot finished with an error and failed to store any data.
PARTIAL: The global cluster state was stored, but data of at least one shard wasn’t stored
successfully.
INCOMPATIBLE: The snapshot was created with an old version of ES incompatible with the current
version of the cluster.
Delete snapshot:
Unregister Repo:
curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1"
curl -X DELETE "<host>:<port>/_snapshot/my_backup/snapshot_2"
curl -X DELETE "<host>:<port>/_snapshot/my_backup"
Snapshots - Restore
Check the progress:
Also supported:
- Partial restore
- Restore with different settings
- Restore to a different cluster
curl -X POST "<host>:<port>/_snapshot/my_backup/snapshot_1/_restore"
curl -X GET "<host>:<port>/_snapshot/_status"
curl -X GET "<host>:<port>/_snapshot/my_backup/_status"
curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1/_status"
Snapshots - Restore
Restore with different settings
Select indices that should be restored
Renames indices on restore using a regular expression that supports referencing the original text.
Restore global state
"index_settings": {"index.number_of_replicas": 0}
"ignore_index_settings": ["index.refresh_interval”]
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": true,
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}'
Plugins
A way to enhance the basic Elasticsearch functionality in a custom manner.
They range from:
- Mapping and analysis
- Discovery
- Security
- Management
- Alerting
- And many many more…
Installation: bin/elasticsearch-plugin install [plugin_name]
Considerations:
- Security
- Maintainability between versions
We make heavy use of Cerebro (https://github.com/lmenezes/cerebro) in this tutorial
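For illustration, installing, listing and removing a core plugin (analysis-icu is just an example; a node restart is needed for the change to take effect):
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin list
bin/elasticsearch-plugin remove analysis-icu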
Lab 3
Operating the cluster
Objectives:
Learn how to:
o Remove a node from a cluster.
o Use the ReIndex API
Steps:
1. Navigate to /Percona2018/Lab03
2. Read the instructions on Lab03.txt
3. Execute ./run_cluster.sh to begin
https://bit.ly/2D1tXL6
Troubleshooting
● Cluster health
● Improving Performance
● Diagnostics
Cluster health
• The cluster health API allows you to get a very simple status on the health of the cluster
• The health status is either green, yellow or red and exists at three levels: shard,
index, and cluster
• Shard health
‒ red: at least one primary shard is not allocated in the cluster
‒ yellow: all primaries are allocated but at least one replica is not
‒ green: all shards are allocated
• Index health
‒ status of the worst shard in that index
• Cluster health
‒ status of the worst index in the cluster
Cluster health
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 5,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50.0
}
GET _cluster/health
Green status
• The state your cluster should have
– All of your primary and replica shards
are allocated and active
[Diagram: my_cluster – P0 on Node1, R0 on Node2 and R0 on Node3; every shard copy is allocated]
PUT my_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2
}
}
Yellow status
It means all your primary shards are allocated,
but one or more replicas are not.
- you may not have enough nodes in the
cluster, or a node may have failed
PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 3
  }
}
[Diagram: P0 on Node1, R0 on Node2 and R0 on Node3; the fourth copy (R0) stays unassigned because there are only three nodes]
Red status
• At least one primary shard is missing
- searches will return partial results and
indexing might fail
PUT my_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}
[Diagram: my_cluster – with one shard and one replica, losing the nodes that hold both P0 and R0 leaves the primary missing]
Resolve unassigned shards
Causes:
• Shard allocation is purposefully delayed
• Too many shards, not enough nodes
• You need to re-enable shard allocation
• Shard data no longer exists in the cluster
• Low disk watermark
• Multiple Elasticsearch versions
The _cat endpoint will tell you which shards are unassigned, and
why:
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
Resolve unassigned shards
• You can also use the cluster allocation explain API to get more
information about shard allocation issues:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"index" : "testing",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
…
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
… {
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard
already exists" }]}]}
Reason 1 – Shard allocation delayed
• When a node leaves the cluster, the
master node temporarily delays
shard reallocation to avoid
needlessly wasting resources on
rebalancing shards, in the event the
original node is able to recover within
a certain period of time (one minute,
by default)
Modify the delay dynamically:
curl -XPUT 'localhost:9200/my_index/_settings' -d
'{
"settings": {
"index.unassigned.node_left.delayed_timeout":
"30s"
}
}'
Reason 2 – Not enough nodes
• As nodes join and leave the cluster, the master node reassigns shards
automatically, ensuring that multiple copies of a shard aren’t assigned to
the same node
• A shard may linger in an unassigned state if there are not enough nodes
to distribute the shards accordingly.
• Make sure that every index in your cluster is initialized with fewer
replicas per primary shard than the number of nodes in your cluster
Reason 3 – re-enable shard allocation
• Shard allocation is enabled by default on all nodes, but you may have disabled
shard allocation at some point (for example, in order to perform a rolling
restart) and forgotten to re-enable it.
• To enable shard allocation, update the _cluster settings API:
curl -XPUT 'localhost:9200/_cluster/settings' -d
'{ "transient":
{ "cluster.routing.allocation.enable" : "all"
}
}'
Reason 4 – Shard data no longer exists
• Primary shard is not available anymore because the index may have been created on
a node without any replicas (a technique used to speed up the initial
indexing process), and the node left the cluster before the data could be replicated.
• Another possibility is that a node may have encountered an issue while rebooting or
has storage issues
• In this scenario, you have to decide how to proceed: try to get the original node to
recover and rejoin the cluster (and do not force allocate the primary shard), or force
allocate the shard using the _reroute API and reindex the missing data using the
original data source, or from a backup.
Reason 4 – Shard data no longer exists
www.objectrocket.com
124
• To allocate an unassigned primary shard:
curl -XPOST 'localhost:9200/_cluster/reroute' -d
'{ "commands" :
[ { "allocate" :
{ "index" : "my_index", "shard" : 0, "node": "<NODE_NAME>",
"allow_primary": "true" } }]
}'
Warning! The caveat with forcing allocation of a primary shard is that you will be
assigning an “empty” shard. If the node that contained the original primary shard
data were to rejoin the cluster later, its data would be overwritten by the newly
created (empty) primary shard, because it would be considered a “newer”
version of the data.
Reason 5 – Low disk watermark
www.objectrocket.com
125
• Once a node has reached this level of disk usage, or what Elasticsearch calls a
“low disk watermark”, it will not be assigned more shards – default is 85%
• You can check the disk space on each node in your cluster (and see which shards
are stored on each of those nodes) by querying the _cat API:
curl -s -XGET 'localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
5 260b 47.3gb 43.4gb 100.7gb 46 127.0.0.1 127.0.0.1 CSUXak2
Example response:
Reason 5 – Low disk watermark
www.objectrocket.com
126
Resolutions:
- add more nodes
- increase disk size
- increase low watermark threshold, if safe:
PUT /_cluster/settings -d
'{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "90%"
}
}'
Reason 6 – Multiple ES versions
www.objectrocket.com
127
• Usually encountered when in the middle of a rolling upgrade
• The master node will not assign a primary shard’s replicas to any node running an
older major version (1.x -> 2.x -> 5.x).
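To verify which version each node is running, one option is the _cat/nodes endpoint (the column selection below is just one possibility):
curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version,node.role,master'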
When nothing works
www.objectrocket.com
128
… or restore the affected index from an old snapshot
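A sketch of restoring a single index from a snapshot, assuming a repository named my_repo and a snapshot named snapshot_1 already exist:
curl -XPOST 'localhost:9200/_snapshot/my_repo/snapshot_1/_restore' -H 'Content-Type: application/json' -d
'{
"indices": "my_index",
"rename_pattern": "my_index",
"rename_replacement": "restored_my_index"
}'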
Poor performance
www.objectrocket.com
129
This can be a long discussion – see more in the `Best practices` chapter
You want to start by:
• Enabling slow logging so you can identify long-running queries
• Running the identified searches through the Profile API to look at the timing of
individual components
• Filter, filter, filter
Enable slow log
www.objectrocket.com
130
• Send a put request to the _cluster API to define the level of slow log that you want
to turn on: warn, info, debug, and trace
PUT /_cluster/settings
'{ "transient" :
{ "logger.index.search.slowlog" : "DEBUG",
"logger.index.indexing.slowlog" : "DEBUG" }
}'
• All slow logging is enabled on the index level:
PUT /my_index/_settings
'{"index.search.slowlog.threshold.query.warn" : "50ms",
"index.search.slowlog.threshold.fetch.warn": "50ms",
"index.indexing.slowlog.threshold.index.warn": "50ms"
}'
Profile
www.objectrocket.com
131
• The Profile API provides detailed timing information about the execution of individual
components in a search request and it can be very verbose, especially for complex
requests executed across many shards
• Usage:
GET /my_index/_search
{
"profile": true,
"query" : {
"match" : { "speaker": "KING HENRY IV" }
}
}
Filters
www.objectrocket.com
132
• One way to improve the performance of your searches is with filters. The filtered query
can be your best friend. It's important to filter first because a filter does not affect
the document score, so you use very few resources to cut the search space down to size.
• A rule of thumb is to use filters when you can and queries when you must: when you
need the actual scoring from the queries.
• Also, filters can be cached.
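• A minimal sketch using the tutorial's shakespeare index (assuming the default keyword sub-field mapping): the bool query scores the match clause, while the filter clause only includes or excludes documents and can be cached:
GET shakespeare/_search?pretty -d '{
"query": {
"bool": {
"must": { "match": { "text_entry": "question" } },
"filter": { "term": { "play_name.keyword": "Hamlet" } }
}
}
}'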
Upgrade the
cluster ● General notes
● Upgrade path
● Before upgrading
● Rolling upgrades
● On rolling upgrades
● Full cluster restart upgrades
● Upgrades by re-indexing
● Re-indexing in place
● Moving through the versions
www.objectrocket.com
133
General notes
www.objectrocket.com
134
• Elasticsearch can read indices created in the previous major version. Older indices
must be re-indexed or deleted.
• From version 5.0 onwards, Elasticsearch can usually be upgraded using rolling restarts, so the
service is not interrupted.
• Upgrades across major versions before 6.0 require a full cluster restart
• Backup, backup, backup
• Nodes will fail to start if incompatible indexes are found
• You can reindex from a remote cluster, which lets you skip the backup/restore step
Upgrade path
www.objectrocket.com
135
• Any index created prior to 5.0 will need to be re-indexed into newer versions
Before upgrading
www.objectrocket.com
136
• Understand the changes that appeared in the new version by reviewing the Release
highlights and Release notes.
• Review the list of changes that can break your cluster.
• Check the deprecation log to see if any features you currently use have been deprecated.
• Check for updated versions of your current plugins and their compatibility with the new version.
• Upgrade your dev/QA/staging cluster before proceeding with the production cluster.
• Back up your data by taking a snapshot before upgrading. If you want to roll back, you will
need it: you can't roll back without a backup.
Rolling upgrades
www.objectrocket.com
137
1. As we've seen before, ES adjusts the balancing of shards based on topology. If we simply remove a
node, the cluster will think the node crashed and start redistributing the shards, then do it
once more when the node comes back. To avoid this, we need to disable shard allocation
• Shard recovery is faster if you stop indexing and issue a synced flush ("POST
_flush/synced", shown after the command below)
• At this point, the cluster is going to turn yellow: replica shards on the remaining nodes
get promoted to primary where needed, while other replicas become unavailable. This doesn't
hurt the operation of the cluster; as we've discussed, as long as one shard from a
replication group is available, the dataset is alive.
• Depending on the number of nodes you have left, be careful not to take out another :)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d
'{ "transient":
{ "cluster.routing.allocation.enable":
"none" }}'
Rolling upgrades
www.objectrocket.com
138
2. Stopping the node. This can be as easy as "service elasticsearch stop".
3. Carry out the needed maintenance (depending on the package manager, or the way ES has been
installed, you might want to run a yum update or replace the binaries).
Be careful with versions and plugins:
- A higher-version node will join a cluster made of lower-version nodes, but a lower-version node
won't join a cluster made of higher-version nodes;
- /usr/share/elasticsearch/bin/elasticsearch-plugin is a script provided by ES to handle plugins.
Upgrade these to the correct versions.
- During a rolling upgrade, primary shards assigned to a node running the new version cannot have
their replicas assigned to a node with the old version. The new version might have a different data
format that is not understood by the old version.
4. Starting the node;
4. Starting the node;
Rolling upgrades
www.objectrocket.com
139
5. Make sure that everything has started correctly. Check the node's logs for messages of this
sort:
6. Enable shard allocation (same command as in step 1, but use null — the JSON value, not the
string — to reset to the default instead of "none"; see the example after the log excerpt below)
7. Check the cluster status and make sure everything has recovered. It can take a while for the
shards to become available.
8. Move on to the next node!
curl -X GET http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "democluster", ß
"status" : "green", ß
…
}
[2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] initialized
[2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] starting ...
[2018-10-25T10:04:45,729][INFO ][o.e.t.TransportService ] [node2] publish_address {134.213.56.244:9300}, bound_addresses
{[::]:9300}
[2018-10-25T10:04:50,465][INFO ][o.e.n.Node ] [node2] started. <--
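Step 6 spelled out — note that null is the JSON literal, not a string:
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d
'{ "transient":
{ "cluster.routing.allocation.enable": null }}'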
On Rolling upgrades
www.objectrocket.com
140
• As mentioned before, in a yellow state, the cluster continues to operate normally.
• Because you might have a reduced number of replicas assigned, your performance might be
impacted. Plan this outside the normal working hours.
• New features will come into play when all the nodes are running the updated version.
• Again, we can’t rollback. Lower version nodes won’t join the cluster.
On Rolling upgrades
www.objectrocket.com
141
• If a network partition separates the newly upgraded nodes from the old ones, then once it is
resolved, the old nodes will fail to rejoin with a message of this sort:
• In this case, you have no choice but to stop those nodes and upgrade them. The upgrade won't be
rolling anymore and you might have a service interruption, but there is no alternative.
[2018-10-16T15:08:28,928][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master
[{node1}{bWKRUNFXTEy1kBgQ1y2LvA}{Gxzb3blaR86CUL3gKLhnXA}{134.213.56.107}{134.213.56.107:9300}{ml.machine_memory=8196317
184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason
[RemoteTransportException[[node1][134.213.56.107:9300][internal:discovery/zen/join]]; nested: IllegalStateException[node
{node3}{Nt4eKRkvR6-
SZ_gg22lqTQ}{dQRBgGDwSo2Zr7W866e64w}{162.13.188.164}{162.13.188.164:9300}{ml.machine_memory=8196317184,
ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is on version [6.3.2] that cannot deserialize the license format [4], upgrade node
to at least 6.4.0]; ]. <--
Full Cluster restart upgrade
www.objectrocket.com
142
• It was needed before version 6 whenever major version upgrades were involved.
• v5.6 -> v6 can be done with a rolling upgrade.
• It involves shutting down the cluster, upgrading the nodes, then starting the cluster up.
1. Disable shard allocation so we don't have unnecessary IO after the nodes are stopped.
2. As briefly mentioned before, stopping indexing and performing a "POST _flush/synced" will help
with shard recovery
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient":
{ "cluster.routing.allocation.enable": "none" }}'
Full Cluster restart upgrade
www.objectrocket.com
143
3. Shut down all nodes. "service elasticsearch stop" or whatever works :)
4. Use your package manager to update elasticsearch on each node.
5. Upgrade the plugins with “/usr/share/elasticsearch/bin/elasticsearch-plugin”
6. Start the nodes up
7. Wait for the nodes to join the cluster.
8. Enable shard allocation.
9. Check that the cluster is back to normal before re-enabling indexing.
curl -X GET http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "democluster",
"status" : "yellow", <--
…
"number_of_nodes" : 1, <--
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"unassigned_shards" : 5,
…
}
Upgrades by re-indexing
www.objectrocket.com
144
• Elasticsearch can read indices created in the previous major version.
• V6 will read V5 indices but not V2 or below. V5 will read V2 indices but not V1 or below
• Older indices will need to be re-indexed or dropped.
• If a node detects an incompatible index, it will fail to start.
• Based on the above, upgrading to a major version that is several releases ahead is a bit
tricky if you don't have a spare cluster. If you do, it's actually quite easy.
Upgrades by re-indexing
www.objectrocket.com
145
• The easiest way to move to a new version is to create a cluster with the target version and
use the remote reindexing feature, so the new index is created by the new version, for the
new version.
• To do list for remote indexing:
1. Add the host and port to the new cluster’s elasticsearch.yml under reindex.remote.whitelist:
2. Create an index on the new cluster with the correct mappings and settings.
• Using number_of_replicas of 0 and refresh_interval -1 will speed up the next operation.
reindex.remote.whitelist: oldhost:oldport
Upgrades by re-indexing
www.objectrocket.com
146
3. Reindex from remote. Example: curl -X POST "localhost:9200/_reindex" -H
'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "http://oldhost:9200",
"username": "user",
"password": "pass"
},
"index": "source",
"query": {
"match": {
"test": "data"
}
}
},
"dest": {
"index": "dest"
}
}
'
Re-indexing in place
www.objectrocket.com
147
• In order to make an older-version index work on a newer-version cluster, you will need to
reindex it into a new one. This is done with the _reindex API
curl -X POST "localhost:9200/_reindex" -H
'Content-Type: application/json' -d'
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
'
Re-indexing in place
www.objectrocket.com
148
1. If you want to maintain your mappings, create a new index and copy the mappings and
settings;
2. You can again disable the refresh_interval and number_of_replicas to make the operation
faster (see the sketch after this list);
3. Reindex the documents to the new index;
4. Reset the refresh_interval and number_of_replicas to the wanted values;
5. Wait for the index to turn green — it will do so once the replicas get allocated
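A sketch of steps 2 and 4, assuming the new index is named new_twitter:
curl -XPUT 'localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d
'{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 }}'
# ... run the reindex, then reset:
curl -XPUT 'localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d
'{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 }}'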
Re-indexing in place
www.objectrocket.com
149
6. In a single update, to avoid missed operations on the old index, you should:
• Delete the old index (let's call it old_index)
• Add an alias named old_index to the new index
• Add any aliases that existed on the old index to the new index. More aliases means more
"add" actions
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
"actions" : [
{ "add": { "index": "new_index", "alias": "old_index" } },
{ "remove_index": { "index": "old_index" } },
{ "add" : { "index" : "new_index", "alias" : "any_other_aliases" } }
]
}
'
Moving through the versions
www.objectrocket.com
150
ElasticSearch V2
-> Perform a full cluster restart to version 5.6
-> Re-index the V2 indexes in place so they work with 5.6 (now fully on V5)
-> Perform a rolling restart to 6.x
Moving through the versions
www.objectrocket.com
151
ElasticSearch V1
-> Perform a full cluster restart to V2.4.X
-> Re-index the 1.X indices in place so they work on V2.4.X (now fully on V2)
-> Perform a full cluster restart to V5.6
-> Re-index the V2 indices so they work on V5 (now fully on V5)
-> Perform a rolling restart to V6.X
www.objectrocket.com
152
Lab 4
Upgrading the cluster
Objectives:
Learn how to:
o Upgrade an elasticsearch cluster.
Steps:
1. Navigate to /Percona2018/Lab04
2. Read the instructions on Lab04.txt
3. Execute ./run_cluster.sh to begin
https://goo.gl/ddaVdS
Security
● Authentication
● Authorization
● Encryption
● Audit
www.objectrocket.com
153
Security
www.objectrocket.com
154
The Open Source version of ElasticSearch does not provide
- Authentication
- Authorization
- Encryption
To overcome this we will use open-source:
- Firewall
- Reverse proxy
- Encryption tools
Alternatively, you can buy X-Pack, which provides a different layer of security
Firewall
www.objectrocket.com
155
Client communication:
iptables -I INPUT 1 -p tcp --dport 9200:9300 -s IP_1,IP_2 -j ACCEPT
iptables -I INPUT 4 -p tcp --dport 9200:9300 -j REJECT
Intra-cluster communication:
iptables -I INPUT 1 -p tcp --dport 9300:9400 -s IP_1,IP_2 -j ACCEPT
iptables -I INPUT 4 -p tcp --dport 9300:9400 -j REJECT
Firewall
www.objectrocket.com
156
DNS:
iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp --sport 53 -m state --state ESTABLISHED -j ACCEPT
SSH:
iptables -A INPUT -p tcp --dport ssh -j ACCEPT
iptables -A OUTPUT -p tcp --sport ssh -j ACCEPT
Monitoring tools: allow whatever port your monitoring tool uses.
Reverse Proxy
www.objectrocket.com
157
[Diagram: clients send HTTP requests to an nginx reverse proxy, which applies rules and forwards them to the ES nodes, advertising port 9200 as 8080]
Authentication
www.objectrocket.com
158
We are going to use nginx: ngx_http_auth_basic_module
On nginx.conf
1) Listens on port 19200
2) Enables auth
3) Password file location
4) Proxies to ES <host>:<port>
server {
listen *:19200;
location / {
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx/.htpasswd;
proxy_pass http://localhost:9200;
proxy_read_timeout 90;
}
}
Authentication
www.objectrocket.com
159
Create users:
- htpasswd -c /var/data/nginx/.htpasswd <username>
- You will be prompted for the password
- Alternatively, use the -b flag and provide the password on the command line
Access Elasticsearch:
curl <host> #Returns 301
curl <host>:19200 #Returns 401 Authorization Required
curl <username>:<password>@<host>:<19200> #Returns Elasticsearch output
Adding SSL to the mix
www.objectrocket.com
160
Use nginx as reverse proxy to encrypt client communication
On nginx.conf
Certificates:
- Can be obtained from a commercial certificate authority
- Or self-generated (see the sketch below)
ssl on;
ssl_certificate /etc/ssl/certs/<cert>.crt;
ssl_certificate_key /etc/ssl/private/<key>.key;
ssl_session_cache shared:SSL:10m;
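A sketch of generating a self-signed certificate with openssl (file names, subject, and validity period are arbitrary choices):
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
 -keyout /etc/ssl/private/es-proxy.key \
 -out /etc/ssl/certs/es-proxy.crt \
 -subj "/CN=es-node1"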
Authorization
www.objectrocket.com
161
- Authentication alone is not enough.
- Once allowed access, the client can do whatever it wants in the cluster.
- Simplest way of authorization is to deny endpoints
location / {
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx-elastic/.htpasswd;
if ($request_filename ~ _shutdown) {
return 403;
break;
}
1) If the user requests a shutdown
2) Return 403
curl -X GET -k "esuser:esuser@es-node1-9200:19200/_cluster/nodes/_shutdown/"
Produces a 403 Forbidden
Authorization
www.objectrocket.com
162
Assign roles using nginx. For example, a user restricted to the search and analyze endpoints:
1) Listens on port 19500
2) Enables auth
3) Regex match for allowed endpoints
4) Forwards to ES <host>:<port>
server {
listen 19500;
auth_basic "Restricted";
auth_basic_user_file /var/data/nginx/.htpasswd_users;
location / {
return 403;
}
location ~* ^(/_search|/_analyze) {
proxy_pass http://<es_node>;
proxy_redirect off;
}}
Encryption & Co
www.objectrocket.com
163
Protecting the data on disk is also essential.
LUKS (Linux Unified Key Setup)
- encrypts entire block devices
- CPUs with AES-NI (Advanced Encryption Standard Instruction Set) can accelerate dm-crypt
- supports a limited number of key slots (passphrases)
- keep the keys in a safe place
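A minimal LUKS sketch for a dedicated data volume (the device name /dev/sdb1 is an assumption, and luksFormat destroys any existing data on it):
cryptsetup luksFormat /dev/sdb1        # initialize the encrypted volume (prompts for a passphrase)
cryptsetup luksOpen /dev/sdb1 es_data  # map it to /dev/mapper/es_data
mkfs.xfs /dev/mapper/es_data           # create a filesystem
mount /dev/mapper/es_data /data        # mount it where path.data points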
Always audit:
- Access Logs
- Ports
- Backups
- Physical access
Working with
Data – Advanced
Operations
● Alias
● Bulk API
● Aggregations
● …
www.objectrocket.com
164
Pagination
www.objectrocket.com
165
• By default, Elasticsearch will return the first 10 hits of your query. The size
parameter is used to specify the number of hits.
GET shakespeare/_search?pretty
{
"size": 20,
"query": {
"match": {
"play_name": "Hamlet"}
}
}
But this is just the first page of
hits
Pagination - from
www.objectrocket.com
166
• Add the from parameter to a query to specify the offset from the first result you
want to fetch (it defaults to 0).
GET shakespeare/_search?pretty
{
"from": 20,
"size": 20,
"query": {
"match": {
"play_name": "Hamlet"}
}
}
Get the next page of hits
Pagination - Scroll
www.objectrocket.com
167
• While a search request returns a single “page” of results, the scroll API can be used to
retrieve large numbers of results (or even all results) from a single search request, in
much the same way as you would use a cursor on a traditional database
• To initiate a scroll search, add the scroll parameter to your search query
GET shakespeare/_search?scroll=1m {
"size": 1000,
"query": {
"match_all": {}
}
}
If the scroll is idle for more than 1
minute, the scroll context is deleted
Number of hits to return per batch
Pagination - Scroll
www.objectrocket.com
168
• The result from the above request includes the first page of results and
a _scroll_id, which should be passed to the scroll API in order to retrieve the next
batch of results.
POST /_search/scroll {
"scroll" : "1m",
"scroll_id" :
"DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9la
VYtZndUQlNsdDcwakFMNjU1QQ=="
}
Note that the URL should not
include the index name - this is
specified in the
original search request instead.
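When you finish before the timeout expires, you can release the scroll context explicitly (replace <scroll_id> with the value returned by the search):
DELETE /_search/scroll
{
"scroll_id" : ["<scroll_id>"]
}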
Search multiple fields
www.objectrocket.com
169
• The multi_match query provides a convenient shorthand for running a match query
against multiple fields ‒ by default, the _score from the best field is used (a
best_fields search)
GET shakespeare/_search?pretty -d '{
"query": { "multi_match": {
"query": "Hamlet", "fields": [
"play_name",
"speaker",
"text_entry"
],
"type": "best_fields"
}
}
}'
3 fields are queried (which
results in 3 scores) and the best
score is used
Search – per-field boosting
www.objectrocket.com
170
• If we want to add more weight to hits on a particular field ‒ in this example, let's say
we're more interested in the speaker field than play_name ‒ we can boost the score of
a field using the caret (^) symbol
GET shakespeare/_search?pretty -d '{
"query": { "multi_match": {
"query": "Hamlet", "fields": [
"play_name",
"speaker^2",
"text_entry"
],
"type": "best_fields"
}
}
}'
We get the same number of
hits, but the top hits are
different.
Misspelled words - fuzziness
www.objectrocket.com
171
• Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word
- Fuzziness is something that can be assigned a value
- It refers to the number of character modifications, known as edits, needed to make two words match
- Can be set to 0, 1 or 2, or to "auto"
Fuzziness = 1: "Hamled" -> "Hamlet" (one edit: d->t)
Fuzziness = 2: "Hamlled" -> "Hamlet" (two edits: drop an l, then d->t)
Add fuzziness to a query
www.objectrocket.com
172
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"play_name": "Hamled" }
}
}'
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"play_name": {
"query": "Hamled",
"fuzziness": 1 }}
}}'
0 hits 4244 hits
Search exact terms
www.objectrocket.com
173
• If we need to search for the exact text, we use a match query on the keyword
sub-field, which holds the text exactly as indexed, without analysis:
GET shakespeare/_search?pretty -d
'{
"query": {
"match": {
"text_entry.keyword": "To be, or not to be: that is the question"
}
}
}'
Exactly 1 hit
Sorting
www.objectrocket.com
174
• The results of a query are returned in the order of relevancy, _score descending is the
default sorting for a query
• A query can contain a sort clause that specifies one or more fields to sort on, as well
as the order (asc or desc)
GET /shakespeare/_search?pretty -d '{
"query": {
"match": {
"text_entry": "question"
}
},
"sort": [
{"play_name": {"order": "desc"}
} ]
}'
"hits" : [
{
"_index" : "shakespeare",
"_type" : "doc",
"_id" : "55924",
"_score" : null,
"_source" : {.....}
If _score is not part of the sort clause, it is
not calculated => fewer compute resources
Highlighting
www.objectrocket.com
175
• A common use case for search results is to highlight the matched terms.
GET /shakespeare/_search?pretty -d
'{
"query": {
"match_phrase": {
"text_entry": "Hamlet" }
},
"highlight": {
"fields": {
"text_entry": {}
} }
}'
"_source" : {
"type" : "line",
"line_id" : 36184,
"play_name" : "Hamlet",
"speech_number" : 99,
"line_number" : "5.1.269",
"speaker" : "QUEEN GERTRUDE",
"text_entry" : "Hamlet, Hamlet!"
},
"highlight" : {
"text_entry" : [
"<em>Hamlet</em>, <em>Hamlet</em>!"
]
}
}
The response contains a
highlight section
Range query
www.objectrocket.com
176
• Matches documents with fields that have terms within a certain range. The type of the
Lucene query depends on the field type: for string fields it is a TermRangeQuery, while
for number/date fields it is a NumericRangeQuery
• The range query accepts the following parameters: gte, gt, lte, lt, boost
GET _search
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20 }
}
}
}
Exists query
www.objectrocket.com
177
• Returns documents that have at least one non-null value in the specified field:
• There is no missing query; instead, use the exists query inside a must_not clause
GET /_search
{
"query": {
"exists" : { "field" : "user" }
}
}
GET /_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "user" }
}
}
}
}
Wildcard query
www.objectrocket.com
178
• Matches documents that have fields matching a wildcard expression;
• Supported wildcards are *, which matches any character sequence (including the empty
one), and ?, which matches any single character.
• Note that this query can be slow, as it needs to iterate over many terms. In order to
prevent extremely slow wildcard queries, a wildcard term should not start with one of
the wildcards * or ?
GET shakespeare/_search?pretty -d
{
"query": {
"wildcard" : { "play_name" : "Henry*" }
}
}
Regexp query
www.objectrocket.com
179
• The regexp query allows you to use regular expression term queries
• The "term queries" in that first sentence means that Elasticsearch will apply the
regexp to the terms produced by the tokenizer for that field, and not to the original text
of the field
• Note: The performance of a regexp query heavily depends on the regular expression
chosen. Matching everything like .* is very slow as well as using lookaround regular
expressions.
GET shakespeare/_search?pretty -d
{
"query": {
"regexp":{
"play_name": "H.*t"}
}
}
Aggregations
www.objectrocket.com
180
• Aggregations are a way to perform analytics on your indexed data
• There are four main types of aggregations:
- Metric: aggregations that keep track and compute metrics over a set of documents.
- Bucketing: aggregations that build buckets, where each bucket is associated with
a key and a document criterion. When the aggregation is executed, all the buckets criteria
are evaluated on every document in the context and when a criterion matches, the
document is considered to "fall in" the relevant bucket.
- Pipeline: aggregations that aggregate the output of other aggregations and their
associated metrics
- Matrix: aggregations that operate on multiple fields and produce a matrix result
based on the values extracted from the requested document fields. Unlike metric and
bucket aggregations, this aggregation family does not yet support scripting and its
functionality is currently experimental
Aggregations - Metric
www.objectrocket.com
181
• Most metrics are mathematical operations that output a single value: avg, sum, min,
max, cardinality
• Some metrics output multiple values: stats, percentiles, percentile_ranks
• Example: what's the maximum value of the "age" field
GET account/_search?pretty -d '{
"size": 0,
"aggs": {
"max_age": { "max": {
"field": "age" }
}
}
}'
"aggregations" : {
"max_age" : {
"value" : 40.0
}
}
}
Aggregations - bucket
www.objectrocket.com
182
• Bucket aggregations don’t calculate metrics over fields like the metrics aggregations
do, but instead, they create buckets of documents
• Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations.
These sub-aggregations will be aggregated for the buckets created by their "parent"
bucket aggregation
• The terms aggregation is very handy: it will dynamically create a new bucket for every
unique term it encounters in the specified field, letting you get a feel for what your
data looks like
Aggregations
www.objectrocket.com
183
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": { "terms": {
"field": "play_name",
"size": 5 }
}
}
}'
• Example: What are the unique play names we have in our index
"size" - number of buckets to create
(default is 10)
Aggregations
www.objectrocket.com
184
"aggregations" : {
"play_names" : {
"doc_count_error_upper_bound" : 3045,
"sum_other_doc_count" : 91399,
"buckets" : [
{
"key" : "Hamlet",
"doc_count" : 4244
},
{
"key" : "Coriolanus",
"doc_count" : 3992
},
{
"key" : "Cymbeline",
"doc_count" : 3958
},
{
"key" : "Richard III",
"doc_count" : 3941
},
{
"key" : "Antony and Cleopatra",
"doc_count" : 3862
}
]}}}
• Notice each bucket has a "key"
that represents the distinct value
of "field",
• and a "doc_count" for the number of
docs in the bucket
Nesting buckets
www.objectrocket.com
185
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": {
"terms": {
"field": "play_name",
"size": 1
},
"aggs": {
"speakers": {
"terms": {
"field": "speaker",
"size": 5 } }
} }
} }'
The play names are bucketed, then,
within each play bucket, our documents
are bucketed by speaker.
Nesting buckets
www.objectrocket.com
186
"aggregations" : {
"play_names" : {
"doc_count_error_upper_bound" : 3395,
"sum_other_doc_count" : 107152,
"buckets" : [
{
"key" : "Hamlet",
"doc_count" : 4244,
"speakers" : {
"doc_count_error_upper_bound" : 48,
"sum_other_doc_count" : 1698,
"buckets" : [
{
"key" : "HAMLET",
"doc_count" : 1582
},
{
"key" : "KING CLAUDIUS",
"doc_count" : 594
},
{
"key" : "LORD POLONIUS",
"doc_count" : 370
The result of our nested aggregation
Notice two special values returned in a terms
aggregation:
- “doc_count_error_upper_bound”:
maximum number of missing documents that
could potentially have appeared in a bucket
- “sum_other_doc_count”: number of
documents that do not appear in any of the
buckets
Bucket sorting
www.objectrocket.com
187
• Sorting can be specified using
"order":
‒ _count sorts by their doc_count (default
in terms)
‒ _key sorts alphabetically (default in
histogram and date_histogram)
• Sorting can also be on a metric value in
a nested aggregation
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": { "terms": {
"field": "play_name",
"size": 5,
"order": {
"_count": "desc" } }
}
}
}'
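A sketch of sorting buckets by a nested metric: plays ordered by their highest line_id (the sub-aggregation name max_line is an arbitrary choice):
GET shakespeare/_search?pretty -d '{
"size": 0,
"aggs": {
"play_names": {
"terms": {
"field": "play_name",
"size": 5,
"order": { "max_line": "desc" }
},
"aggs": {
"max_line": { "max": { "field": "line_id" } }
}
}
}
}'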
www.objectrocket.com
188
Lab 5
Advanced Operations
Objectives:
Learn how to:
o Work with mappings
o Work with analyzers
Steps:
1. Navigate to /Percona2018/Lab05
2. Read the instructions on Lab05.txt
https://bit.ly/2D1tXL6
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code(ARC307) Infrastructure as Code
(ARC307) Infrastructure as Code
 
Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.
 
Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)
 
Terraform
TerraformTerraform
Terraform
 
Kubernetes Security
Kubernetes SecurityKubernetes Security
Kubernetes Security
 
Manage Development in Your Org with Salesforce Governance Framework
Manage Development in Your Org with Salesforce Governance FrameworkManage Development in Your Org with Salesforce Governance Framework
Manage Development in Your Org with Salesforce Governance Framework
 
Devops and git basics
Devops and git basicsDevops and git basics
Devops and git basics
 
Apache Superset - open source data exploration and visualization (Conclusion ...
Apache Superset - open source data exploration and visualization (Conclusion ...Apache Superset - open source data exploration and visualization (Conclusion ...
Apache Superset - open source data exploration and visualization (Conclusion ...
 
Terraform modules and best-practices - September 2018
Terraform modules and best-practices - September 2018Terraform modules and best-practices - September 2018
Terraform modules and best-practices - September 2018
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
 
Azure DevOps Best Practices Webinar
Azure DevOps Best Practices WebinarAzure DevOps Best Practices Webinar
Azure DevOps Best Practices Webinar
 
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
 
Red Hat Openshift Fundamentals.pptx
Red Hat Openshift Fundamentals.pptxRed Hat Openshift Fundamentals.pptx
Red Hat Openshift Fundamentals.pptx
 
Getting started with Elasticsearch in .net
Getting started with Elasticsearch in .netGetting started with Elasticsearch in .net
Getting started with Elasticsearch in .net
 
Discover salesforce, dev ops and Copado CI/CD automations
Discover salesforce, dev ops and Copado CI/CD automationsDiscover salesforce, dev ops and Copado CI/CD automations
Discover salesforce, dev ops and Copado CI/CD automations
 
Terraform: Infrastructure as Code
Terraform: Infrastructure as CodeTerraform: Infrastructure as Code
Terraform: Infrastructure as Code
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipeline
 
Terraform Introduction
Terraform IntroductionTerraform Introduction
Terraform Introduction
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
 
Implementing an Application Security Pipeline in Jenkins
Implementing an Application Security Pipeline in JenkinsImplementing an Application Security Pipeline in Jenkins
Implementing an Application Security Pipeline in Jenkins
 

Similar a Elastic 101 tutorial - Percona Europe 2018

Chotot k8s experiences.pptx
Chotot k8s experiences.pptxChotot k8s experiences.pptx
Chotot k8s experiences.pptx
arptit
 

Similar a Elastic 101 tutorial - Percona Europe 2018 (20)

MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017
 
Sharded cluster tutorial
Sharded cluster tutorialSharded cluster tutorial
Sharded cluster tutorial
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster Tutorial
 
Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.
 
Percona Live 2017 ­- Sharded cluster tutorial
Percona Live 2017 ­- Sharded cluster tutorialPercona Live 2017 ­- Sharded cluster tutorial
Percona Live 2017 ­- Sharded cluster tutorial
 
Lecture 4 Cluster Computing
Lecture 4 Cluster ComputingLecture 4 Cluster Computing
Lecture 4 Cluster Computing
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Docker Security Paradigm
Docker Security ParadigmDocker Security Paradigm
Docker Security Paradigm
 
Namespaces and cgroups - the basis of Linux containers
Namespaces and cgroups - the basis of Linux containersNamespaces and cgroups - the basis of Linux containers
Namespaces and cgroups - the basis of Linux containers
 
Lab Manual Combaring Redis with Relational
Lab Manual Combaring Redis with RelationalLab Manual Combaring Redis with Relational
Lab Manual Combaring Redis with Relational
 
Artem Zhurbila - docker clusters (solit 2015)
Artem Zhurbila - docker clusters (solit 2015)Artem Zhurbila - docker clusters (solit 2015)
Artem Zhurbila - docker clusters (solit 2015)
 
Prosit google-cloud
Prosit google-cloudProsit google-cloud
Prosit google-cloud
 
Hands-on Lab - Combaring Redis with Relational
Hands-on Lab - Combaring Redis with RelationalHands-on Lab - Combaring Redis with Relational
Hands-on Lab - Combaring Redis with Relational
 
Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209
 
Hands-on Lab: Amazon ElastiCache
Hands-on Lab: Amazon ElastiCacheHands-on Lab: Amazon ElastiCache
Hands-on Lab: Amazon ElastiCache
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Prague
 
Chotot k8s experiences.pptx
Chotot k8s experiences.pptxChotot k8s experiences.pptx
Chotot k8s experiences.pptx
 
Environment for training models
Environment for training modelsEnvironment for training models
Environment for training models
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
DevOps Meetup ansible
DevOps Meetup   ansibleDevOps Meetup   ansible
DevOps Meetup ansible
 

Más de Antonios Giannopoulos

Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration VariablesAntonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos
 

Más de Antonios Giannopoulos (12)

Comparing Geospatial Implementation in MongoDB, Postgres, and Elastic
Comparing Geospatial Implementation in MongoDB, Postgres, and ElasticComparing Geospatial Implementation in MongoDB, Postgres, and Elastic
Comparing Geospatial Implementation in MongoDB, Postgres, and Elastic
 
Using MongoDB with Kafka - Use Cases and Best Practices
Using MongoDB with Kafka -  Use Cases and Best PracticesUsing MongoDB with Kafka -  Use Cases and Best Practices
Using MongoDB with Kafka - Use Cases and Best Practices
 
Sharding in MongoDB 4.2 #what_is_new
 Sharding in MongoDB 4.2 #what_is_new Sharding in MongoDB 4.2 #what_is_new
Sharding in MongoDB 4.2 #what_is_new
 
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
Upgrading to MongoDB 4.0 from older versions
Upgrading to MongoDB 4.0 from older versionsUpgrading to MongoDB 4.0 from older versions
Upgrading to MongoDB 4.0 from older versions
 
How to upgrade to MongoDB 4.0 - Percona Europe 2018
How to upgrade to MongoDB 4.0 - Percona Europe 2018How to upgrade to MongoDB 4.0 - Percona Europe 2018
How to upgrade to MongoDB 4.0 - Percona Europe 2018
 
Triggers in MongoDB
Triggers in MongoDBTriggers in MongoDB
Triggers in MongoDB
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration VariablesAntonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
Antonios Giannopoulos Percona 2016 WiredTiger Configuration Variables
 
Introduction to Polyglot Persistence
Introduction to Polyglot Persistence Introduction to Polyglot Persistence
Introduction to Polyglot Persistence
 
MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals
 

Último

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 

Último (20)

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - Kanchana
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 

Elastic 101 tutorial - Percona Europe 2018

  • 1. Elastic 101 Antonios Giannopoulos DBA @ Rackspace/ObjectRocket Alex Cercel DBA @ Rackspace/ObjectRocket Mihai Aldoiu CDE @ Rackspace/ObjectRocket linkedin.com/in/antonis | linkedin.com/in/alexcercel | linkedin.com/in/aldoiu 1
  • 3. Overview • Introduction • Working with data • Scaling the cluster • Operating the cluster • Troubleshooting the cluster • Upgrade the cluster • Security best practices • Working with data – Advanced operations • Best Practices www.objectrocket.com 3
  • 4. www.objectrocket.com 4 Labs 1. Unzip the provided .vmdk file 2. Install and or Open VirtualBox 3. Select New 4. Enter A Name 5. Select Type: Linux 6. Select Version: Red Hat (64-bit) 7. Set Memory to at least 4096 (more won’t hurt) 8. Select "Use an existing ... disk file", select the provided .vmdk file 9. Select Create 10. Select Start 11. Login with username: elasticuser , password: elasticuser 12. Navigate to /Percona2018/Lab01 for the first lab. https://bit.ly/2D1tXL6
  • 5. Introduction ● Key Terms ● Installation ● Configuration files ● JVM fundamentals ● Lucene basics www.objectrocket.com 5
  • 6. What is elasticsearch? www.objectrocket.com 6 Lucene: - A search engine library entirely written in Java - Developed in 1999 by Doug Cutting - Suitable for any application that requires full text indexing and searching capability But: - Challenging to use - Not originally designed for scaling Elasticsearch: - Built on top of Lucene - Provides scaling - Language independent
  • 7. What is ELK stack? www.objectrocket.com 7 ElasticSearch: - The main datastore - Provides distributed search capabilities Logstash: - Parse & transform data for ingestion - Ingests from multiple of sources simultaneously Kibana: - An analytics and visualization platform - Search, visualize & interact with Elasticsearch data
  • 8. Installing Elasticsearch www.objectrocket.com 8 Download: Latest Version: https://www.elastic.co/downloads/elasticsearch Older Version: Navigate to https://www.elastic.co/downloads/past-releases The simplest way: 1) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz 2) wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.tar.gz.sha512 3) shasum -a 512 -c elasticsearch-6.3.2.tar.gz.sha512 (it should return elasticsearch-6.3.2.tar.gz: OK) 4) tar -xzf elasticsearch-6.3.2.tar.gz
  • 9. Installing Java www.objectrocket.com 9 ElasticSearch requires JRE (JavaSE runtime environment) or JDK (Java Development Kit) - OpenJDK CentOS: yum install java-1.8.0-openjdk - OpenJDK Ubuntu: apt-get install openjdk-8-jre ES versions 6, requires Java8 or higher https://www.elastic.co/support/matrix set JAVA_HOME appropriately - Create a file under /etc/profile.d for example jdk.sh - Add the following lines: export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-*" export PATH=$JAVA_HOME/bin:$PATH
  • 10. Start the server www.objectrocket.com 10 Create a user elasticuser* Using elasticuser execute: bin/elasticsearch After some noise: [INFO ][o.e.n.Node] [name] started How I know is up and running? *You can’t start ES using root $ curl -X GET "localhost:9200/" { "name" : "KG-_6s9", "cluster_name" : "elasticsearch", "cluster_uuid" : "T9uHpto6QtWRmsjzNFrReA", "version" : { "number" : "6.3.2", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "053779d", "build_date" : "2018-07- 20T05:20:23.451332Z", "build_snapshot" : false, "lucene_version" : "7.3.1", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" }
  • 11. Explore the directories www.objectrocket.com 11 Folder Description Setting bin Contains the binary scipts, like elasticsearch config Contains the configuration files ES_PATH_CONF data Holds the data (shards/indexes) path.data lib Contains JAR files logs Contains the log files path.logs modules Contains the modules plugins Contains the plugins. Each plugin has its own subdirectory
  • 12. Configuration files www.objectrocket.com 12 elasticsearch.yml - The primary way of configuring a node. - Its a template which lists the most important settings for a production cluster jvm.options - JVM related options log4j2.properties - Elasticsearch uses Log4j 2 for logging Variables can be set either: -Using the configuration file: jvm.options: -Xms512mb - or, using command line ES_JAVA_OPTS="-Xms512m" ./bin/elasticsearch
  • 13. Elasticsearch.yml www.objectrocket.com 13 node.name - Every node should have a unique node.name - Set it to something meaningful (aws-zone1-objectrocket-es-01) cluster.name - A cluster is a set of nodes sharing the same cluster.name - Set it to something meaningful (production, qa, staging) path.data - Path to directory where to store the data (accepts multiple locations) path.logs - Path to log files
  • 14. Elasticsearch.yml www.objectrocket.com 14 cluster.name: production node.name: dc1-prd-es1 path.data: /data/es1 path.logs: /logs/es1 bin/elasticsearch -d -p 'elastic.pid' $ curl -X GET "localhost:9200/" { "name" : "dc1-prd-es1", "cluster_name" : "production", …
  • 15. jvm.Options www.objectrocket.com 15 Each Elasticsearch node runs on its own JVM instance JVM is a virtual machine that enables a computer to run Java programs The most important setting is the Heap Size: - Xms: Represents the initial size of total heap space - Xmx: Represents the maximum size of total heap space Best Practices - Set Xms and Xmx to the same size - Set Xmx to no more than 50% of your physical RAM - Do not set Xms and Xmx over 30ish GiB - Use the server version of OpenJDK - Lock the RAM for Heap bootstrap.memory_lock
  • 16. jvm.Options www.objectrocket.com 16 Heap Off Heap Indexing buffer Completion suggester Cluster state … and more Caches: - query cache (10%) - field data cache (unbounded) - …
  • 17. jvm.Options www.objectrocket.com 17 Garbage collector - It is a form of automatic memory management - Gets rid of objects which are not being used by a Java application anymore - Automatically reclaims memory for reuse Garbage collectors - ConcMarkSweepGC (CMS) - G1GC (has some Issues with JDK 8) Elasticsearch uses -XX:+UseConcMarkSweepGC GC threads -XX:ParallelGCThreads=N, where N varies on the platform -XX:ParallelCMSThreads=N , where N varies on the platform
  • 18. jvm.Options www.objectrocket.com 18 Eden s0 s1 Old Generation Perm New Gen -Xmn JVM Heap –Xms -Xmx XX: PermSize XX: MaxPermSize Minor GC Major GC or full GC 1) A new Page stored in Eden 2) After a GC if it survives it moves to s0 ,s1 3) After multiple GCs, s0 or s1 gets full then pages moves to Old Gen
  • 19. OS settings www.objectrocket.com 19 Disable swap - sysctl vm.swappiness=1 - Remove Swap File descriptors - Set nofile to 65536 - curl -X GET ”<host>:<port>/_nodes/stats/process?filter_path=**.max_file_descriptors” Virtual Memory - sysctl -w vm.max_map_count=262144 Max user process - nproc to 4096 DNS cache settings - networkaddress.cache.ttl=<timeout> - networkaddress.cache.negative.ttl=<timeout>
  • 20. Network settings www.objectrocket.com 20 Two network communication mechanisms in Elasticsearch - HTTP: which is how the Elasticsearch REST APIs are exposed - Transport: used for internal communication between nodes within the cluster Node 1 Client Node 2 HTTP Transport
  • 21. Network settings www.objectrocket.com 21 The REST APIs of Elasticsearch are exposed over HTTP - The HTTP module binds to localhost by default - Configure with http.host on elasticsearch.yml - Default port is the first available between 9200-9299 - Configure with http.port on elasticsearch.yml Each call that goes from one node to another uses the transport module - Transport binds to localhost by default - Configure with transport.host on elasticsearch.yml - Default port is the first available between 9300-9399 - Configure with transport.tcp.port on elasticsearch.yml
  • 22. Network settings www.objectrocket.com 22 network.host sets the bind host and the publish host at the same time network.publish_host - Defaults to network.host.network.publish_host. Multiple interfaces network.bind_host - Defaults to the “best” address from network.host. One interface only network.host value Description _[networkInterface]_ Addresses of a network interface, for example _en0_. _local_ Any loopback addresses on the system, for example 127.0.0.1. _site_ Any site-local addresses on the system, for example 192.168.0.1. _global_ Any globally-scoped addresses on the system, for example 8.8.8.8.
  • 23. Network settings www.objectrocket.com 23 Zen discovery - built in & default discovery module default - It provides unicast discovery, - Uses the transport module On elasticsearch.yml discovery.zen.ping.unicast.hosts: [”node1", ”node2"] Node 1 Node 2 Transport Node 3 1) Retrieves IP/ hostname from list of hosts 2) Tries all hosts until find a reachable one 3) If the cluster name matches, joins the cluster 4) If not, starts its own cluster
  • 24. Bootstrap tests www.objectrocket.com 24 Development mode: if it does not bind transport to an external interface (the default) Production mode: if it does bind transport to an external interface Bypass production mode: Set discovery.type to single-node Bootstrap Tests - Inspect a variety of Elasticsearch and system settings - A node in production mode must pass all Bootstrap tests to start - es.enforce.bootstrap.checks=true on jvm.options - Highly recommended to have this setting enabled
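  To enforce the checks even when a node would otherwise start in development mode, add the JVM system property to jvm.options:
  # fail startup if any bootstrap check fails, regardless of binding
  -Des.enforce.bootstrap.checks=true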
  • 25. Bootstrap tests www.objectrocket.com 25 List of Bootstrap Tests - Heap size check - File descriptor check - Memory lock check - Maximum number of threads check - Max file size check - Maximum size virtual memory check - Maximum map count check - Client JVM check - Use serial collector check - System call filter check - OnError and OnOutOfMemoryError checks - Early-access check - G1GC check - All permission check
  • 26. Lucene www.objectrocket.com 26 Lucene uses a data structure called an Inverted Index. An Inverted Index inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages). It allows fast full-text searches, at the cost of increased processing when a document is added to the database. 1) Give us your name 2) Give us your home number 3) Give us your home address
  Term      Frequency  Location
  give      3          1,2,3
  us        3          1,2,3
  your      3          1,2,3
  name      1          1
  number    1          2
  home      2          2,3
  address   1          3
  • 27. Lucene – Key Terms www.objectrocket.com 27 A Document is the unit of search and index. A Document consists of one or more Fields. A Field is simply a name-value pair. An index consists of one or more Documents. Indexing: involves adding Documents to an Index Searching: - involves retrieving Documents from an index. - Searching requires an index to have already been built - Returns a list of Hits
  • 28. Kibana www.objectrocket.com 28 Download: Latest Version: https://www.elastic.co/guide/en/kibana/current/targz.html Simplest way to install it: wget https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-linux-x86_64.tar.gz shasum -a 512 kibana-6.3.2-linux-x86_64.tar.gz tar -xzf kibana-6.3.2-linux-x86_64.tar.gz Run Kibana: kibana-6.3.2-linux-x86_64/bin/kibana Access Kibana: http://localhost:5601
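  A minimal kibana.yml pointing Kibana at the local node might look like the sketch below (these values match the 6.x defaults, so they only need changing when you deviate from them):
  server.host: "localhost"
  server.port: 5601
  elasticsearch.url: "http://localhost:9200"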
  • 30. www.objectrocket.com 30 Lab 1 Install and configure Elastic Objectives: Learn how to install and configure a standalone Elastic instance. Steps: 1. Navigate to /Percona2018/Lab01 2. Read the instructions on Lab01.txt https://bit.ly/2D1tXL6
  • 31. Working with Data ● Indexes ● Shards ● CRUD Operations ● Read Operations ● Mappings ● Analyzers www.objectrocket.com 31
  • 32. Working with Data - Index www.objectrocket.com 32 • An index in Elasticsearch is a logical way of grouping data: ‒ an index has a mapping that defines the fields in the index ‒ an index is a logical namespace that maps to where its contents are stored in the cluster • There are two different concepts in this definition: ‒ an index has some type of data schema mechanism ‒ an index has some type of mechanism to distribute data across a cluster
  • 33. An index means .... www.objectrocket.com 33 In the Elasticsearch world, index is used as a: ‒ Noun: a document is put into an index in Elasticsearch ‒ Verb: to index a document is to put the document into an index in Elasticsearch { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," } { "type":"line", "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" } { "type":"line", "line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils"} My_index Documents are indexed to an index
  • 34. Define an index www.objectrocket.com 34 • Clients communicate with a cluster using Elasticsearch’s REST APIs • An index is defined using the Create Index API, which can be accomplished with a simple PUT command # curl -XPUT 'http://localhost:9200/my_index' -i HTTP/1.1 200 OK content-type: application/json; charset=UTF-8 content-length: 48 {"acknowledged":true,"shards_acknowledged":true}
  • 35. Shard www.objectrocket.com 35 • A shard is a single piece of an Elasticsearch index ‒ Indexes are partitioned into shards so they can be distributed across multiple nodes • Each shard is a standalone Lucene index ‒ The default number of shards for an index is 5. Number of shards can be changed at index creation time. My_index 0 2 4 3 1 Node 1 Node 2
  • 36. Working with Data - Document www.objectrocket.com 36 Documents must be JSON objects. • A document can be any text or numeric data you want to search and/or analyze ‒ Specifically, a document is a top-level object that is serialized into JSON and stored in Elasticsearch • Every document has a unique ID ‒ which either you provide, or Elasticsearch generates one for you { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," }
  • 37. Index compression www.objectrocket.com 37 • Elasticsearch compresses your documents during indexing ‒ documents are grouped into blocks of 16KB, and then compressed together using LZ4 by default ‒ if your documents are larger than 16KB, you will have larger chunks that contain only one document • You can change the compression to DEFLATE using the index.codec setting: ‒ reduced storage size at slightly higher CPU usage PUT my_index { "settings": { "number_of_shards": 3, "number_of_replicas": 2, "index.codec" : "best_compression" } }
  • 38. Index a document www.objectrocket.com 38 The Index API is used to index a document ‒ use a PUT or a POST and add the document in the body request ‒ notice we specify the index, the type and an ID ‒ if no ID is provided, elasticsearch will generate one # curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d '{ "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" }' {"_index":"my_index","_type":"my_type","_id":"1","_version":1,"result":"created","_shar ds":{"total":2,"successful":2,"failed":0},"created":true}
  • 39. Index without specifying an ID www.objectrocket.com 39 You can leave off the id and let Elasticsearch generate one for you: ‒ But notice that only works with POST, not PUT ‒ The generated id comes back in the response # curl -XPOST 'http://localhost:9200/my_index/my_type/' -H 'Content-Type: application/json' -d ' {"line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils" }' {"_index":"my_index","_type":"my_type","_id":"AWZIq227Unvtccn4Vvrz","_version":1,"resul t":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
  • 40. Reindexing a document www.objectrocket.com 40 What do you think happens if we index another document with the same ID? curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d ' { "new_field" : "new_value" }'
  • 41. ...Overwrites the document www.objectrocket.com 41 • The old field/value pairs of the document are gone ‒ the old document is deleted, and the new one gets indexed • Notice every document has a _version that is incremented whenever the document is changed # curl -XGET http://localhost:9200/my_index/my_type/1?pretty -H 'Content-Type: application/json' { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 2, "found" : true, "_source" : { "new_field" : "new_value" } }
  • 42. The _create endpoint www.objectrocket.com 42 If you do not want a document to be overwritten if it already exists, use the _create endpoint ‒ no indexing occurs and returns a 409 error message: # curl -XPUT 'http://localhost:9200/my_index/my_type/1/_create' -H 'Content-Type: application/json' -d ' {"new_field" : "new_value"}' {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[my_type][ 1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"}],"type":"version_conflict_engine_exception","rea son":"[my_type][1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"},"status":409}
  • 43. Locking ? www.objectrocket.com 43 - Every indexed document has a version number - Elasticsearch uses Optimistic concurrency control without locking # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=3' -d '{ ... }' # 200 OK # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=2' -d '{ ... }' # 409 Conflict
  • 44. The _update endpoint www.objectrocket.com 44 To update fields in a document use the _update endpoint. - Make sure to add the “doc” context curl -XPOST 'http://localhost:9200/my_index/my_type/1/_update' -H 'Content-Type: application/json' -d ' { "doc": { "line_id":10, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.7", "speaker":"KING HENRY IV", "text_entry":"Nor more shall trenching war channel her fields" } }' {"_index":"my_index","_type":"my_type","_id":"1","_version":3,"result":"updated","_shar ds":{"total":2,"successful":2,"failed":0}}
  • 45. Retrieve a document www.objectrocket.com 45 Use GET to retrieve an indexed document ‒ Notice we specify the index, the type and an ID ‒ Returns a 200 code if document found or a 404 error if the document is not found # curl -XGET http://localhost:9200/my_index/my_type/1?pretty { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "line_id" : 5, "play_name" : "Henry IV", "speech_number" : 1, "line_number" : "1.1.2", "speaker" : "KING HENRY IV", "text_entry" : "Find we a time for frighted peace to pant" } }
  • 46. Deleting a document www.objectrocket.com 46 Use DELETE to delete an indexed document ‒ response code is 200 if the document is found, 404 if not # curl -XDELETE 'http://localhost:9200/my_index/my_type/1/' -H 'Content-Type: application/json' {"found":true,"_index":"my_index","_type":"my_type","_id":" 1","_version":7,"result":"deleted","_shards":{"total":2,"su ccessful":2,"failed":0}}
  • 47. A simple search www.objectrocket.com 47 Use a GET request sent to the _search endpoint ‒ every document is a hit for this search ‒ by default, Elasticsearch returns 10 hits curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json' { "took" : 1, "timed_out" : false, …. }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ ... ] } } Annotations: the request searches for all docs in my_index; "took" is the number of ms it took to process the query; "hits.total" is the number of documents that were hits for this query; "hits.hits" is the array containing the documents hit by the search criteria.
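  Beyond the match-all default, you can send a query DSL body to _search; a sketch against the sample data used in this chapter:
  curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json' -d '
  {
    "query": { "match": { "speaker": "KING HENRY IV" } },
    "size": 5
  }'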
  • 48. CRUD Operations Summary www.objectrocket.com 48 Index PUT my_index/my_type/4 Create PUT my_index/my_type/4/_create { "speaker":"KING HENRY IV", "text_entry":"To be commenced in strands afar remote." } Read GET my_index/my_type/4 Update POST my_index/my_type/4/_update { "doc" : { "text_entry":"No more the thirsty entrance of this soil" } } Delete DELETE my_index/my_type/4
  • 49. Mapping – what is it? www.objectrocket.com 49 • Elasticsearch will index any document without knowing its details (number of fields, their data types, etc.) - dynamic mapping ‒ However, behind-the-scenes Elasticsearch assigns data types to your fields in a mapping. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed A mapping is a schema definition that contains: ‒ names of fields ‒ data types of fields ‒ how the field should be indexed and stored by Lucene • Mappings map your complex JSON documents into the simple flat documents that Lucene expects.
  • 50. Defining a mapping www.objectrocket.com 50 • In most use cases you will want to define your own mappings, but this is not required. When you index a document, Elasticsearch dynamically creates or updates the mapping • Mappings are defined in the "mappings" section of an index. You can: ‒ define mappings at index creation, or ‒ add to a mapping of an existing index PUT my_index { "mappings": { define mapping here } }
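  As a sketch, defining a mapping at index creation for the Shakespeare-style documents used earlier (the field choices here are illustrative assumptions):
  curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '
  {
    "mappings": {
      "my_type": {
        "properties": {
          "line_id":    { "type": "long" },
          "speaker":    { "type": "keyword" },
          "text_entry": { "type": "text" }
        }
      }
    }
  }'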
  • 51. Let's view a mapping www.objectrocket.com 51 GET my_index/_mapping { "my_index" : { "mappings" : { "my_type" : { "properties" : { "line_id" : { "type" : "long" }, "line_number" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "play_name" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, ... The “properties” section contains the fields and data types in your documents
  • 52. Elasticsearch data types for fields www.objectrocket.com 52 • Simple types, including: ‒ text: for full text (analyzed) strings ‒ keyword: for exact value strings ‒ date: string formatted as dates, or numeric dates ‒ integer types: like byte, short, integer, long ‒ floating-point numbers: float, double, half_float, scaled_float ‒ boolean ‒ ip: for IPv4 or IPv6 addresses • Hierarchical Types: like object and nested • Specialized Types:geo_point, geo_shape and percolator • Range types and more
  • 53. Updating existing mapping www.objectrocket.com 53 • Existing field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. - Instead, you should create a new index with the correct mappings and reindex your data into that index. There are some exceptions to this rule: • new properties can be added to Object datatype fields. • new multi-fields can be added to existing fields. • the ignore_above parameter can be updated.
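  Adding a brand-new field to an existing mapping is one of the allowed changes; a hypothetical example (act_number is an invented field for illustration):
  curl -XPUT 'http://localhost:9200/my_index/_mapping/my_type' -H 'Content-Type: application/json' -d '
  {
    "properties": {
      "act_number": { "type": "integer" }
    }
  }'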
  • 54. Prevent mapping explosion www.objectrocket.com 54 • Defining too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. - For example, when using dynamic mapping, every newly inserted document can introduce new fields. • The following settings allow you to limit the number of field mappings that can be created manually or dynamically index.mapping.total_fields.limit - maximum number of fields in an index, defaults to 1000 index.mapping.depth.limit - maximum depth for a field, measured as the number of inner objects, defaults to 20 index.mapping.nested_fields.limit - maximum number of nested fields in an index, defaults to 50
  • 55. Analysis www.objectrocket.com 55 • Analysis is the process of converting full text into terms (tokens) which are added to the inverted index for searching. - Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer defined per index. For example, at index time the built-in standard analyzer will first convert the sentence into distinct tokens: "Welcome to Percona Live - Open Source Database Conference 2018" [ welcome to percona live open source database conference 2018 ] The analyzer then lowercases each token; note that the standard analyzer does not remove stop words unless configured to ("to" survives in the output above).
  • 56. The analyze api www.objectrocket.com 56 • The _analyze api can be used to test what an analyzer will do to your text curl -s -XGET "localhost:$ES_PORT/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "standard", "text": "Welcome to Percona Live - Open Source Database Conference 2018"}' | python -m json.tool | grep token "tokens": [ "token": "welcome", "token": "to", "token": "percona", "token": "live", "token": "open", "token": "source", "token": "database", "token": "conference", "token": "2018",
  • 57. Built-in analyzers www.objectrocket.com 57 • Standard - the default analyzer • Simple – breaks text into terms whenever it encounters a character which is not a letter • Keyword – simply indexes the text exactly as is • Others include: ‒ whitespace, stop, pattern, language, and more are described in the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html - custom analyzers built by you
  • 58. Analyzer components www.objectrocket.com 58 • An analyzer consists of three parts: 1. Character Filters 2. Tokenizer 3. Token Filters Pipeline: input string -> Character Filters -> string -> Tokenizer -> tokens -> Token Filters -> output tokens
  • 59. Specifying an analyzer www.objectrocket.com 59 • At index time: PUT my_index { "mappings": { "_doc": { "properties": { "title": { "type": "text", "analyzer": "standard" } } } } } • At search time: Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index. By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting: PUT my_index { "mappings": { "_doc": { "properties": { "text": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" }}}}}
  • 60. Custom analyzer www.objectrocket.com 60 • Best described with an example, let's create a custom analyzer based on standard one, but which also removes stop words PUT my_index { "settings": { "analysis": { "filter": { "my_stopwords": { "type": "stop", "stopwords": ["to", "and", "or", "is", "the"] } }, "analyzer": { "my_content_analyzer": { "type": "custom", "char_filter": [], "tokenizer": "standard", "filter": ["lowercase","my_stopwords"] } } }}}
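  You can sanity-check the custom analyzer with the _analyze API scoped to the index; with the stop filter above, "to" and "the" should disappear from the output:
  curl -s -XGET 'http://localhost:9200/my_index/_analyze' -H 'Content-Type: application/json' -d '
  {
    "analyzer": "my_content_analyzer",
    "text": "Welcome to Percona Live - Open Source Database Conference 2018"
  }'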
  • 61. Scaling the cluster ● 10 000ft view on scaling ● Node roles ● Adding a node to a cluster ● Understanding shards ● Replicas ● Read/Write model ● Sample Architectures www.objectrocket.com 61
  • 62. 10 000ft view on scaling www.objectrocket.com 62 • ElasticSearch has the potential to be always available, as long as we take advantage of its scaling features. • With vertical scaling (better hardware) having its limitations, we'll take a look at horizontal scaling (more nodes in the same cluster). • With other datastores, horizontal scaling has its challenges - for example sharding in MongoDB (Antonios has written an amazing tutorial on managing a sharded cluster; you must check it out). ElasticSearch, however, is designed to be distributed by nature, so as long as replicas are being used, the application development and administration overhead of scaling out the cluster is minimal.
  • 63. 10 000ft view on scaling www.objectrocket.com 63 • We defined a shard as an element that composes an index and is, itself, a Lucene index. • By default, ElasticSearch will create 5 per index. But what if everything lives on one node and that node goes down? We face disaster. This is where replicas come in. • A replica of a shard is an exact copy of that element that lives on another node. • A node is simply an ElasticSearch process. One or more nodes with the same value for the "cluster.name" directive in the config file make up a cluster.
  • 64. 10 000ft view on scaling www.objectrocket.com 64 • All nodes know about all others in the cluster and can direct a request to another node if needed. • Nodes can handle both HTTP (external) traffic and transport (intra-cluster) traffic. If one node receives an HTTP request that should have been directed at another, it forwards it to the right node over the transport layer. • Nodes can have one or more roles in the cluster.
  • 65. Node Roles www.objectrocket.com 65 • Master-eligible node: A node that has "node.master" set to true (default), which makes it eligible to be elected as the master node, which controls the cluster and carries out administrative functions such as deleting and creating indexes. • Data node: A node that has "node.data" set to true (default). Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations. • Ingest node: A node that has "node.ingest" set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich it, such as adding a field that wasn't there before, before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes with "node.ingest: false"
  • 66. Node Roles www.objectrocket.com 66 • Tribe node: A tribe node, configured via the tribe.* settings, is a special type of coordinating-only node that can connect to multiple clusters and perform search and other operations across all connected clusters. In later versions of Elastic this role became obsolete. • Kibana node: In case Kibana is used at a large scale with many users running complex queries, you can dedicate a node or nodes to it. • To summarize: by default, any node is master-eligible and acts as a data node as well as handling ingestion, including ingest pipelines. As the cluster grows, in order to separate the overhead of different operations (maintaining the cluster, ingest pipelines, connecting clusters etc.), it makes sense to define roles.
  • 67. Adding a node to a cluster www.objectrocket.com 67 • To add a node or start a cluster, we need to set the "cluster.name" directive to a descriptive value in /etc/elasticsearch/elasticsearch.yml; all nodes need to have the same cluster.name. • By default, ElasticSearch binds to the loopback interface, so we must edit the networking section of the config file and bind the daemon to a specific IP, or use 0.0.0.0 for all. • We must name our nodes, again, with descriptive values (node.name). • Nodes running on the same host will be auto-discovered, but remote nodes use zen discovery, which takes a list of IPs that will assemble the cluster. The firewall must allow communication on ports 9200 and 9300. The settings are collected in the sketch below.
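  Collected into one elasticsearch.yml sketch (names and IPs are placeholders):
  cluster.name: democluster
  node.name: node2
  network.host: 0.0.0.0
  discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"]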
  • 68. Adding a node to a cluster www.objectrocket.com 68 • Of course, there are more options that you can configure, but for the sake of this exercise these will be enough. • Once these are set, restart the daemon and a /_cluster/health?pretty should return something like: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", …. "number_of_nodes" : 2, "number_of_data_nodes" : 2, …… }
  • 69. Understanding shards www.objectrocket.com 69 A shard is a worker unit that holds data, can be assigned to nodes, and is itself a Lucene index. Think of it as a self-contained search engine that handles a portion of the data. ‒ An index is merely a virtual namespace which points to a number of shards My_index My_cluster Node1 Node2 shard shard shard shard shard An index is "split" into shards before any documents are indexed
  • 70. Primary and Replica www.objectrocket.com 70 • There are two types of shards: - primary: the original shards of an index - replicas: copies of the primary • Documents are replicated between a primary and its replicas - a primary and all its replicas are guaranteed to be on different nodes My_cluster Node1 Node2 P0 P2 R3 P3 P1 R1 R4 P4 R0 R2
  • 71. Number of Primary shards www.objectrocket.com 71 • Is fixed – the default number of primary shards for an index is 5 • You can specify a different number of shards when you create the index. • Changing the number of shards after the index has been created can be done with the split or shrink index API, but it is NOT a trivial operation. It is basically the same as reindexing. Plan accordingly. PUT my_new_index { "settings": { "number_of_shards": 3 } }
  • 72. Replicas are good for www.objectrocket.com 72 • High availability - We can lose a node and still have all the data available - After losing a primary, Elasticsearch will automatically promote a replica to a primary and start replicating unassigned replicas • Read throughput - Replicas can handle query/read requests from client applications - Allows you to scale your data and better utilize cluster resources You can change the number of replicas for an index at any time using the _settings endpoint: PUT my_index/_settings { "number_of_replicas": 2 }
  • 73. Replicas www.objectrocket.com 73 • Let's play a bit with replicas. In this example I've indexed Shakespeare's work again. Here is the cluster and the index: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … } curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size pri.store.size yellow open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 22.4mb 22.4mb Yellow indicates a problem. What do you think the problem is? What would be a solution here?
  • 74. Replicas www.objectrocket.com 74 • Replicas will get automatically assigned if the topology permits it. All I've done was to start a second node and: • We can change the replica count dynamically in the index settings. This is a trivial operation, unlike changing the number of shards. curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size pri.store.size green open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 44.9mb 22.4mb curl -X PUT "localhost:9200/shakespeare/_settings" -H 'Content-Type: application/json' -d' > { > "index" : { > "number_of_replicas" : 0 > } > } > ' {"acknowledged":true}
  • 75. Write Path www.objectrocket.com 75 • The process of keeping a primary shard in sync with its replicas is called a data replication model. ElasticSearch's data replication model is based on the primary-backup model: one primary and n backups. • This model runs on top of replication groups. We've seen before that by default we have 5 primary shards and each of these shards has 1 replica. In the example pictured here, we have two replication groups and each primary has 3 replicas. • Within a replication group, the primary shard is responsible for indexing and for keeping the replicas up to date. At any point some replicas might be offline, so the master node keeps an in-sync list of the copies that are online and have received all the writes acknowledged to the user. (Diagram: two replication groups, each with one primary and three replicas.)
  • 76. Write Path www.objectrocket.com 76 • The primary follows this flow: validate the incoming operation and the documents, execute it locally, forward the operation to all replicas in the in-sync list, and acknowledge the write once all replicas in the list have run the operation. • Some notes about failure handling: in case a primary fails, indexing stops for up to one minute while the master promotes a new primary. The primary checks with its replicas to make sure it is still the primary and wasn't demoted for whatever reason. An operation coming from a stale primary will be declined by the replicas.
  • 77. Read Path www.objectrocket.com 77 • The node that received the query (called the coordinating node) will find the relevant shards for the read request, select an active copy of the data (primary or replica, round-robin) from each replication group, send the read request to the selected copies, combine the results and respond. • Requests to each shard are single-threaded, but more than one shard can be queried in parallel.
  • 78. Read Path www.objectrocket.com 78 • Because the active copy is picked round-robin, this is where adding more replicas helps: each new request hits a different replica, so the work is spread out. • Failure handling is much easier here. If for some reason a response is not received, the coordinating node resubmits the read request to the relevant replication group, picks a different replica, and the same flow reapplies.
  • 79. Sample Architectures www.objectrocket.com 79 • For lightweight searches, and where the data can be reindexed without suffering loss, the single-node cluster is not unseen. • A basic deployment with data resilience is the two-node cluster. Most SaaS providers start with this deployment. • The two-node model can be scaled as much as needed, but is usually recommended only when running basic indexing/search operations. In case more granularity is needed, the data can be reindexed with a higher number of shards and replicas. • In case the number of nodes in the cluster gets really high or the operations get complex, it is time to separate the roles. Separating the nodes also needs to take into consideration the cases where you would lose one or more nodes of a specific role. For instance, if you are using ingest-only nodes, data-only nodes and master-only nodes, you need to consider what happens if you lose one or more of each.
  • 80. Sample Architectures www.objectrocket.com 80 • ObjectRocket starts with 4 ingest nodes, 2 Kibana, 2 data and 3 master nodes. • We don't care how many client nodes we lose as long as we have 1 remaining. • The master nodes pick an active master based on quorum. This helps with split brain. • Of the data nodes, we can lose at most one. • Consider redundant components as much as possible. • We will cover security in a later chapter. By default, in the community version, there is no built-in security. In this case, firewall limitations are a must-have.
  • 81. www.objectrocket.com 81 Lab 2 Scaling the cluster Objectives: Adding nodes to your cluster Change the number of Replicas Steps: 1. Navigate to /Percona2018/Lab02/ 2. Read the instructions on Lab02.txt https://bit.ly/2D1tXL6
  • 82. Operating the cluster ● Working with nodes ● Working with shards ● Reindex ● Backup/Restore ● Plugins www.objectrocket.com 82
  • 83. Cheatsheet www.objectrocket.com 83
  curl -X GET "<host>:<port>/_cluster/settings"
  curl -X GET "<host>:<port>/_cluster/settings?include_defaults=true"
  curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "persistent" : { "name of the setting" : value }}'
  curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient" : { "name of the setting" : null }}'
  • 84. Shard Allocation www.objectrocket.com 84 Allow control over how and where shards are allocated Shard Allocation settings (cluster.routing.allocation) - enable - node_concurrent_incoming_recoveries - node_concurrent_outgoing_recoveries - node_concurrent_recoveries - same_shard.host Shard Rebalancing settings (cluster.routing.rebalance) - enable - allow_rebalance - cluster_concurrent_rebalance
  • 85. Shard Allocation - Disk www.objectrocket.com 85 cluster.routing.allocation.disk.threshold_enabled: Defaults to true Low: Do not allocate new shards. Defaults to 85% High: Try to relocate shards. Defaults to 90% Flood_stage: Enforces a read-only index block. Must be released manually. Defaults to 95% cluster.info.update.interval: How often Elasticsearch should check disk usage (defaults to 30s) cluster.routing.allocation.disk.include_relocations: Defaults to true – could lead to false alerts curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.allocation.disk.watermark.low": "100gb", "cluster.routing.allocation.disk.watermark.high": "50gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m" }}'
  • 86. Shard Allocation – Rack/Zone www.objectrocket.com 86 Make Elasticsearch aware of the topology - so it can ensure that the primary shard and its replica shards are spread across different - Physical servers (node.attr.phy_host) - Racks (node.attr.rack_id) - Availability zones (node.attr.zone) - Minimize the risk of losing all shard copies at the same time - Minimize latency Configuration: cluster.routing.allocation.awareness.attributes: zone,rack_id Force awareness: cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 together with cluster.routing.allocation.awareness.attributes: zone (see the sketch below)
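  In elasticsearch.yml terms, zone awareness could be wired up as in this sketch (zone names are placeholders):
  # on every node: tag the node with the zone it lives in
  node.attr.zone: zone1
  # cluster-wide: allocate primaries and replicas across zones, forced for zone1/zone2
  cluster.routing.allocation.awareness.attributes: zone
  cluster.routing.allocation.awareness.force.zone.values: zone1,zone2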
  • 87. Restart node(s) www.objectrocket.com 87 Elasticsearch wants your data to be fully replicated and evenly balanced. When a node goes down: - The cluster immediately recognizes the change - Rebalancing begins - Rebalancing takes time and can become costly During a planned maintenance you should hold off on rebalancing
  • 88. Restart node(s) www.objectrocket.com 88 Steps: 1) Flush pending indexing operations POST /_flush/synced 2) Disable shard allocation 3) Shut down a single node 4) Perform the maintenance PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "none" } }
  • 89. Restart node(s) www.objectrocket.com 89 5) Restart the node, and confirm that it joins the cluster. 6) Re-enable shard allocation as follows: 7) Check the cluster health PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "all" } }
  • 90. Restart node(s) www.objectrocket.com 90 You can also make Elastic less sensitive to changes. By default, the master waits 1 minute before reassigning the shards of a node that has left. During restarts we can raise that threshold. This is also a useful setting for slow or unreliable networks. PUT _all/_settings { "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }
  • 91. Remove a node www.objectrocket.com 91 Elastic automatically detects topology changes. In order to remove a node you need to drain it and then stop it. Where attribute: _name: Match nodes by node names _ip: Match nodes by IP addresses (the IP address associated with the hostname) _host: Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "<value>" } }
  • 92. Remove a node www.objectrocket.com 92 Additional considerations: - Master-eligible node - Seed nodes - Space considerations - Performance considerations - If possible stop writes - Do not allow new allocations ("cluster.routing.allocation.enable" : "none") - Overhead from the shard drains - Throttle (indices.recovery.max_bytes_per_sec) - One node at a time (cluster.routing.allocation.disk.watermark) Move shards manually (Reroute API) - Flush and if possible stop writes - Safe for Replicas, not recommended for Primaries (may lead to data loss)
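  For the manual option mentioned above, the reroute API can move a single shard between nodes; a sketch with placeholder index and node names:
  curl -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
  {
    "commands": [
      { "move": { "index": "my_index", "shard": 0, "from_node": "node1", "to_node": "node2" } }
    ]
  }'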
  • 93. Remove a node www.objectrocket.com 93 Cancel the drain of a node by removing the node or resetting the attribute. Where attribute: _name: Match nodes by node names _ip: Match nodes by IP addresses (the IP address associated with the hostname) _host: Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "" } }
  • 94. Replace a node www.objectrocket.com 94 Similar to remove a node with the difference that you need to add a node as well. Simplest approach: add a new node and then drain the old node Additional considerations: - Master-eligible/Seed nodes - Do not allow new allocations (cluster.routing.allocation.exclude._name) - Overhead from drain/throttle (indices.recovery.max_bytes_per_sec) - Space considerations - Max amount of data each node can get. Watermark Alternatively use the reroute API to drain the node
  • 95. Working with Shards www.objectrocket.com 95 Number of Shards/Replicas - Defined on index creation - Number of replicas can change dynamically - Number of shards can change using: - shrink API - split API - reindex API Why increase the number of shards: - Index size - Performance considerations - Hard limits (LUCENE-5843) Almost the same reasons apply when decreasing the number of shards
  • 96. Shrink API www.objectrocket.com 96 Shrinks an existing index into a new one with fewer primary shards: - The number of shards in the target index must be a factor of the number of shards in the source index - If the source has a prime number of shards, it can only be shrunk into a single primary shard - Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node Works as follows: - First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards. - Then it hard-links segments from the source index into the target index. - Finally, it recovers the target index as though it were a closed index which had just been re-opened.
  • 97. Shrink API www.objectrocket.com 97 In order to shrink an index, the index must be marked as read-only, and a copy of every shard in the index must be relocated to the same node and have health green. Note that it may take a while… Check progress using GET _cat/recovery?v curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.routing.allocation.require._name": "shrink_node_name", "index.blocks.write": true }}'
  • 98. Shrink API www.objectrocket.com 98 Finally it's time to shrink the index: It is similar to the create index API – almost the same arguments. Some constraints apply curl -X POST "<host>:<port>/my_source_index/_shrink/my_target_index" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_replicas": <number>, "index.number_of_shards": <number>, "index.routing.allocation.require._name": null, "index.blocks.write": null }}'
  • 99. Split API www.objectrocket.com 99 Splits an existing index into a new index: - The original primary shard is split into two or more primary shards. - The number of splits is determined by the index.number_of_routing_shards setting The _split API requires the source index to be created with a specific number_of_routing_shards in order to be split in the future. This requirement has been removed in Elasticsearch 7.0 Works as follows: - First, it creates a new target index with a larger number of primary shards. - Then it hard-links segments from the source index into the target index. - Once the low level files are created all documents will be hashed again to delete documents that belong to a different shard. - Finally, it recovers the target index as though it were a closed index which had just been re- opened.
  • 100. Split API www.objectrocket.com 100 In order to split an index, the index must be marked as read-only (assuming the index has number_of_routing_shards set) curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.blocks.write": true }}' Split the index: curl -X POST "<host>:<port>/my_source_index/_split/my_target_index?copy_settings=true" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_shards": 2 }}'
  • 101. Reindex API - Definition www.objectrocket.com 101 - Does not copy the settings of the source index - version_type: internal/external - source supports "query", multi-indexes & a remote location - URL parameters: refresh, wait_for_completion, wait_for_active_shards, timeout, scroll and requests_per_second - Supports painless scripts to manipulate indexing curl -X POST "<host>:<port>/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "<source index>" }, "dest": { "index": "<destination index>" }}'
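  Since source accepts a query, a sketch that copies only one speaker's lines into a new index (henry_lines is an invented name for illustration):
  curl -X POST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
  {
    "source": {
      "index": "shakespeare",
      "query": { "match": { "speaker": "KING HENRY IV" } }
    },
    "dest": { "index": "henry_lines" }
  }'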
  • 102. Reindex API – Response Body www.objectrocket.com 102 "took": 1200, "timed_out": false, "total": 10, "updated": 0, "created": 10, "deleted": 0, "batches": 1, "noops": 0, "version_conflicts": 2, "retries": { "bulk": 0, "search": 0}, "throttled_millis": 0, "requests_per_second": 1, "throttled_until_millis": 0, "failures": [ ] Total milliseconds the entire operation took The number of documents that were successfully processed Summary of the operation counts The number of version conflicts that reindex hit Throttling Statistics
  • 103. Reindex API www.objectrocket.com 103 Active reindex jobs: GET _tasks?detailed=true&actions=*reindex Cancel a reindex job: POST _tasks/<id of the reindex>/_cancel Re-throttle: POST _reindex/<id of the reindex>/_rethrottle?requests_per_second=-1 Reindexing from a remote server: - Uses an on-heap buffer that defaults to a maximum size of 100mb - May need to use a smaller batch size - Configure socket_timeout and connect_timeout. Both default to 30 seconds
  • 104. Snapshots - Backup www.objectrocket.com 104 A snapshot is a backup taken from a running Elasticsearch cluster. Snapshots are taken incrementally. Version compatibility – one major version behind. You must register a snapshot repository before you can perform snapshots. path.repo must exist in elasticsearch.yml curl -X GET "<host>:<port>/_snapshot/_all" curl -X PUT "<host>:<port>/_snapshot/my_backup" -H 'Content-Type: application/json' -d' { "type": "fs", "settings": { "location": "backup location" }}'
  • 105. Snapshots - Backup www.objectrocket.com 105 Shared location: On elasticsearch.yml: path.repo: ["/mount/backups0", "/mount/backups1"] Don’t forget to register it!!! Registration options location: Location of the snapshots compress: Turns on compression of the snapshot files. Defaults to true. chunk_size: Big files can be broken down into chunks. Defaults to null (unlimited chunk size) max_restore_bytes_per_sec: Throttles per node restore rate. Defaults to 40mb/second max_snapshot_bytes_per_sec: Throttles per node snapshot rate. Defaults to 40mb/second readonly: Makes repository read-only. Defaults to false
  • 106. Snapshots - Backup www.objectrocket.com 106 wait_for_completion: whether the request should wait for the snapshot to finish (true) or return immediately after snapshot initialization (false, the default) ignore_unavailable: Ignores indexes that don't exist include_global_state: Set to false to prevent the cluster global state from being stored as part of the snapshot curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2,index_3", "ignore_unavailable": true, "include_global_state": false }' curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
  • 107. Snapshots - Backup www.objectrocket.com 107 IN_PROGRESS: The snapshot is currently running. SUCCESS: The snapshot finished and all shards were stored successfully. FAILED: The snapshot finished with an error and failed to store any data. PARTIAL: The global cluster state was stored, but data of at least one shard wasn't stored successfully. INCOMPATIBLE: The snapshot was created with an old version of ES incompatible with the current version of the cluster. Snapshot info: curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1" Delete snapshot: curl -X DELETE "<host>:<port>/_snapshot/my_backup/snapshot_2" Unregister repo: curl -X DELETE "<host>:<port>/_snapshot/my_backup"
  • 108. Snapshots - Restore www.objectrocket.com 108 Restore a snapshot: curl -X POST "<host>:<port>/_snapshot/my_backup/snapshot_1/_restore" Check the progress: curl -X GET "<host>:<port>/_snapshot/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1/_status" Also supported: - Partial restore - Restore with different settings - Restore to a different cluster
  • 109. Snapshots - Restore www.objectrocket.com 109 Restore with different settings: "index_settings": {"index.number_of_replicas": 0} "ignore_index_settings": ["index.refresh_interval"] Select the indices that should be restored, rename indices on restore using a regular expression that supports referencing the original text, and restore the global state: curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": true, "rename_pattern": "index_(.+)", "rename_replacement": "restored_index_$1" }'
  • 110. Plugins www.objectrocket.com 110 A way to enhance the basic Elasticsearch functionality in a custom manner. They range across: - Mapping and analysis - Discovery - Security - Management - Alerting - And many many more… Installation: bin/elasticsearch-plugin install [plugin_name] Considerations: - Security - Maintainability between versions We make heavy use of Cerebro (https://github.com/lmenezes/cerebro) in this tutorial
  • 111. www.objectrocket.com 111 Lab 3 Operating the cluster Objectives: Learn how to: o Remove a node from a cluster. o Use the ReIndex API Steps: 1. Navigate to /Percona2018/Lab03 2. Read the instructions on Lab03.txt 3. Execute ./run_cluster.sh to begin https://bit.ly/2D1tXL6
  • 112. Troubleshooting ● Cluster health ● Improving Performance ● Diagnostics www.objectrocket.com 112
  • 113. Cluster health www.objectrocket.com 113 • The cluster health API allows to get a very simple status on the health of the cluster • The health status is either green, yellow or red and exists at three levels: shard, index, and cluster • Shard health ‒ red: at least one primary shard is not allocated in the cluster ‒ yellow: all primaries are allocated but at least one replica is not ‒ green: all shards are allocated • Index health ‒ status of the worst shard in that index • Cluster health ‒ status of the worst index in the cluster
  • 114. Cluster health www.objectrocket.com 114 { "cluster_name" : "my_cluster", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 5, "delayed_unassigned_shards": 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 50.0 } GET _cluster/health
  • 115. Green status www.objectrocket.com 115 • The state your cluster should have – All of your primary and replica shards are allocated and active My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 2 } }
  • 116. Yellow status www.objectrocket.com 116 It means all your primary shards are allocated, but one or more replicas are not. - you may not have enough nodes in the cluster, or a node may have failed My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 3 } } R0 Unassigned
  • 117. Red status www.objectrocket.com 117 • At least one primary shard is missing - searches will return partial results and indexing might fail PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } } My_cluster Node1 Node2 P0 R0 Node3
  • 118. Resolve unassigned shards www.objectrocket.com 118 Causes: • Shard allocation is purposefully delayed • Too many shards, not enough nodes • You need to re-enable shard allocation • Shard data no longer exists in the cluster • Low disk watermark • Multiple Elasticsearch versions The _cat endpoint will tell you which shards are unassigned, and why: curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,u nassigned.reason| grep UNASSIGNED
  • 119. Resolve unassigned shards 119 • You can also use the cluster allocation explain API to get more information about shard allocation issues: curl -XGET localhost:9200/_cluster/allocation/explain?pretty { "index" : "testing", "shard" : 0, "primary" : false, "current_state" : "unassigned", … "can_allocate" : "no", "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions" : [ { … { "decider" : "same_shard", "decision" : "NO", "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists" }]}]}
  • 120. Reason 1 – Shard allocation delayed www.objectrocket.com 120 • When a node leaves the cluster, the master node temporarily delays shard reallocation to avoid needlessly wasting resources on rebalancing shards, in the event the original node is able to recover within a certain period of time (one minute, by default) Modify the delay dynamically: curl -XPUT 'localhost:9200/my_index/_settings' -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "30s" } }'
  • 121. Reason 2 – Not enough nodes www.objectrocket.com 121 • As nodes join and leave the cluster, the master node reassigns shards automatically, ensuring that multiple copies of a shard aren’t assigned to the same node • A shard may linger in an unassigned state if there are not enough nodes to distribute the shards accordingly. • Make sure that every index in your cluster is initialized with fewer replicas per primary shard than the number of nodes in your cluster
  • 122. Reason 3 – re-enable shard allocation www.objectrocket.com 122 • Shard allocation is enabled by default on all nodes, but you may have disabled shard allocation at some point (for example, in order to perform a rolling restart) and forgotten to re-enable it. • To enable shard allocation, update the _cluster settings API: curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'
  • 123. Reason 4 – Shard data no longer exists www.objectrocket.com 123 • Primary shard is not available anymore because the index may have been created on a node without any replicas (a technique used to speed up the initial indexing process), and the node left the cluster before the data could be replicated. • Another possibility is that a node may have encountered an issue while rebooting or has storage issues • In this scenario, you have to decide how to proceed: try to get the original node to recover and rejoin the cluster (and do not force allocate the primary shard), or force allocate the shard using the _reroute API and reindex the missing data using the original data source, or from a backup.
  • 124. Reason 4 – Shard data no longer exists www.objectrocket.com 124 • To allocate an unassigned primary shard: curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ { "allocate" : { "index" : "my_index", "shard" : 0, "node": "<NODE_NAME>", "allow_primary": "true" } }] }' Warning! The caveat with forcing allocation of a primary shard is that you will be assigning an “empty” shard. If the node that contained the original primary shard data were to rejoin the cluster later, its data would be overwritten by the newly created (empty) primary shard, because it would be considered a “newer” version of the data.
  • 125. Reason 5 – Low disk watermark www.objectrocket.com 125 • Once a node has reached this level of disk usage, or what Elasticsearch calls a "low disk watermark", it will not be assigned more shards – default is 85% • You can check the disk space on each node in your cluster (and see which shards are stored on each of those nodes) by querying the _cat API: curl -s -XGET 'localhost:9200/_cat/allocation?v' Example response: shards disk.indices disk.used disk.avail disk.total disk.percent host ip node 5 260b 47.3gb 43.4gb 100.7gb 46 127.0.0.1 127.0.0.1 CSUXak2
  • 126. Reason 5 – Low disk watermark www.objectrocket.com 126 Resolutions: - add more nodes - increase disk size - increase low watermark threshold, if safe: PUT /_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.watermark.low": "90%" } }'
  • 127. Reason 6 – Multiple ES versions www.objectrocket.com 127 • Usually encountered when in the middle of a rolling upgrade • The master node will not assign a primary shard’s replicas to any node running an older major version (1.x -> 2.x -> 5.x).
  • 128. When nothing works www.objectrocket.com 128 … or restore the affected index from an old snapshot
  • 129. Poor performance www.objectrocket.com 129 This can be a long discussion – see more in the `Best practices` chapter You want to start by: • Enable slow logging so you can Identify long running queries • Run identified searches through the _profiling API to look at timing of individual components • Filter, filter, filter
  • 130. Enable slow log www.objectrocket.com 130 • Send a PUT request to the _cluster API to define the level of slow log that you want to turn on: warn, info, debug, and trace PUT /_cluster/settings '{ "transient" : { "logger.index.search.slowlog" : "DEBUG", "logger.index.indexing.slowlog" : "DEBUG" } }' • All slow logging is enabled on the index level: PUT /my_index/_settings '{"index.search.slowlog.threshold.query.warn" : "50ms", "index.search.slowlog.threshold.fetch.warn": "50ms", "index.indexing.slowlog.threshold.index.warn": "50ms" }'
  • 131. Profile www.objectrocket.com 131 • The Profile API provides detailed timing information about the execution of individual components in a search request and it can be very verbose, especially for complex requests executed across many shards • Usage: GET /my_index/_search { "profile": true, "query" : { "match" : { "speaker": "KING HENRY IV" } } }
  • 132. Filters www.objectrocket.com 132 • One way to improve the performance of your searches is with filters. The filtered query can be your best friend. It’s important to filter first because filter in a search does not affect the outcome of the document score, so you use very little in terms of resources to cut the search field down to size. • A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries. • Also, filters can be cached.
  • 133. Upgrade the cluster ● Generals ● Upgrade path ● Before upgrading ● Rolling upgrades ● On rolling upgrades ● Full cluster restart upgrades ● Upgrades by re-indexing ● Re-indexing in place ● Moving through the versions www.objectrocket.com 133
  • 134. Generals www.objectrocket.com 134 • Elasticsearch can read indices created in the previous major version. Older indices must be re-indexed or deleted. • From versions 5.0 ElasticSearch can usually be upgraded using rolling restarts so that the service is not interrupted. • Upgrades across major versions before 6.0 require a full cluster restart • Backup backup backup • Nodes will fail to start if incompatible indexes are being found • You can reindex from a remote location so that you skip the backup/restore option
  • 135. Upgrade path www.objectrocket.com 135 • Any index created prior to 5.0 will need to be re-indexed into newer versions
  • 136. Before upgrading www.objectrocket.com 136 • Understand the changes that appeared in the new version by reviewing the Release highlights and Release notes. • Review the list of changes that can break your cluster. • Check the deprecation log to see if any of your current features became obsolete. • Check for updated versions of your current plugins or compatibility with the new version. • Upgrade your dev/QA/staging cluster before proceeding with the production cluster. • Back up your data by taking a snapshot before upgrading. If you want to roll back, you will need it. You can't roll back unless you have a backup.
  • 137. Rolling upgrades www.objectrocket.com 137 1. As we've seen before, ES adjusts the balancing of shards based on topology. If we remove a node just like that, it will think the node crashed and it will start redistributing the shards - then once more when we get the node back. For this, we need to disable shard allocation • The shard recovery process is helped by stopping indexing and using "POST _flush/synced" • At this point the cluster is going to turn yellow: replica shards on other nodes get promoted to primary for the shards that became unavailable, and some replicas will be left unassigned. This doesn't hurt the operation of the cluster - as we've discussed, as long as one shard from a replication group is available, the dataset is alive. • Depending on the number of nodes you have left, be careful not to take out another :) curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
  • 138. Rolling upgrades www.objectrocket.com 138 2. Stopping the node. This can be as easy as "service elasticsearch stop". 3. Carry out the needed maintenance (depending on the package manager, or the way ES has been installed, you might want to run a yum update or replace the binaries). Be careful about versions and plugins: - A higher-version node will join a cluster made of lower-version nodes, but a lower-version node won't join a cluster made of higher-version nodes. - /usr/share/elasticsearch/bin/elasticsearch-plugin is a script provided by ES to handle plugins. Upgrade these to the correct versions. - During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas assigned to a node with the old version. The new version might have a different data format that is not understood by the old version. 4. Starting the node.
  • 139. Rolling upgrades www.objectrocket.com 139 5. Make sure that everything has started correctly. Check the node's logs for messages of the sort: 6. Enable shard allocation (same command as at step 1, but use "null" - the value, not the string - to reset to the default, instead of "none") 7. Check the cluster status and make sure everything has recovered. It can take a bit for the shards to become available. 8. NEEEEEXT! curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "green", … } [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] initialized [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] starting ... [2018-10-25T10:04:45,729][INFO ][o.e.t.TransportService ] [node2] publish_address {134.213.56.244:9300}, bound_addresses {[::]:9300} [2018-10-25T10:04:50,465][INFO ][o.e.n.Node ] [node2] started.
• 140. On Rolling upgrades www.objectrocket.com 140 • As mentioned before, in a yellow state the cluster continues to operate normally. • Because you might have a reduced number of replicas assigned, performance might be impacted. Plan the upgrade outside normal working hours. • New features only come into play once all the nodes are running the updated version. • Again, we can't roll back: lower-version nodes won't join the cluster.
• 141. On Rolling upgrades www.objectrocket.com 141 • If a network partition separates the newly updated nodes from the old ones, then once it is resolved, the old nodes will fail with a message of this sort: • In this case, you have no choice but to stop the old nodes and upgrade them. It won't be rolling and you might have a service interruption, but there is no alternative. [2018-10-16T15:08:28,928][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master [{node1}{bWKRUNFXTEy1kBgQ1y2LvA}{Gxzb3blaR86CUL3gKLhnXA}{134.213.56.107}{134.213.56.107:9300}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[node1][134.213.56.107:9300][internal:discovery/zen/join]]; nested: IllegalStateException[node {node3}{Nt4eKRkvR6-SZ_gg22lqTQ}{dQRBgGDwSo2Zr7W866e64w}{162.13.188.164}{162.13.188.164:9300}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is on version [6.3.2] that cannot deserialize the license format [4], upgrade node to at least 6.4.0]; ].
• 142. Full Cluster restart upgrade www.objectrocket.com 142 • It was required before version 6 whenever major version upgrades were involved. • v5.6 → v6 can be done with a rolling upgrade. • It involves shutting down the cluster, upgrading the nodes, then starting the cluster up. 1. Disable shard allocation so we don't have unnecessary IO after the nodes are stopped. 2. As briefly mentioned before, stopping indexing and performing a "POST _flush/synced" will help with shard recovery. curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
• 143. Full Cluster restart upgrade www.objectrocket.com 143 3. Shut down all nodes. "service elasticsearch stop" or whatever works :) 4. Use your package manager to update elasticsearch on each node. 5. Upgrade the plugins with "/usr/share/elasticsearch/bin/elasticsearch-plugin". 6. Start the nodes up. 7. Wait for the nodes to join the cluster. 8. Enable shard allocation. 9. Check that the cluster is back to normal before enabling indexing. curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … }
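Instead of polling step 9 manually, the health API can block until the cluster reaches a given state; a small sketch, assuming localhost access:
curl -X GET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty'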
• 144. Upgrades by re-indexing www.objectrocket.com 144 • Elasticsearch can read indices created in the previous major version. • V6 will read V5 indices but not V2 or below; V5 will read V2 indices but not V1 or below. • Older indices need to be re-indexed or dropped. • If a node detects an incompatible index, it will fail to start. • Because of this, upgrading to a major version that is far ahead is tricky if you don't have a spare cluster. If you do, it's actually quite easy.
• 145. Upgrades by re-indexing www.objectrocket.com 145 • The easiest way to move to a new version is to create a cluster running that version and use the remote reindexing feature. That way, the new index is created by the new version, for the new version. • To-do list for remote reindexing: 1. Add the old cluster's host and port to the new cluster's elasticsearch.yml under reindex.remote.whitelist: 2. Create an index on the new cluster with the correct mappings and settings. • Using a number_of_replicas of 0 and a refresh_interval of -1 will speed up the next operation. reindex.remote.whitelist: oldhost:oldport
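A minimal sketch of step 2; the index name "dest" is chosen to match the reindex example on the next slide, and real mappings would be added to the body:
curl -X PUT 'http://localhost:9200/dest' -H 'Content-Type: application/json' -d '{ "settings": { "number_of_replicas": 0, "refresh_interval": "-1" } }'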
  • 146. Upgrades by re-indexing www.objectrocket.com 146 3. Reindex from remote. Example: curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "remote": { "host": "http://oldhost:9200", "username": "user", "password": "pass" }, "index": "source", "query": { "match": { "test": "data" } } }, "dest": { "index": "dest" } } '
• 147. Re-indexing in place www.objectrocket.com 147 • In order to make an older-version index work on a newer-version cluster, you need to reindex it into a new one. This is done with the reindex API. curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "twitter" }, "dest": { "index": "new_twitter" } } '
• 148. Re-indexing in place www.objectrocket.com 148 1. If you want to maintain your mappings, create a new index and copy the mappings and settings; 2. You can again lower refresh_interval and number_of_replicas to make the operation faster; 3. Reindex the documents to the new index; 4. Reset refresh_interval and number_of_replicas to the wanted values (see the sketch after this slide); 5. Wait for the index to turn green - it will do so when the replicas get allocated
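Step 4 might look like this - a sketch assuming the new index is new_twitter and the usual defaults of a 1s refresh and 1 replica are wanted back:
curl -X PUT 'http://localhost:9200/new_twitter/_settings' -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }'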
• 149. Re-indexing in place www.objectrocket.com 149 6. In a single update, to avoid missed operations on the old index, you should: • Delete the old index (let's call it old_index) • Add an alias named after the old index to the new index • Add any aliases that existed on the old index to the new index. More aliases mean more "add" actions curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d' { "actions" : [ { "add": { "index": "new_index", "alias": "old_index" } }, { "remove_index": { "index": "old_index" } }, { "add" : { "index" : "new_index", "alias" : "any_other_aliases" } } ] } '
• 150. Moving through the versions www.objectrocket.com 150 ElasticSearch V2: perform a full cluster restart to version 5.6 → re-index the V2 indices in place so they work with 5.6 (now fully on V5) → perform a rolling restart to 6.x
• 151. Moving through the versions www.objectrocket.com 151 ElasticSearch V1: perform a full cluster restart to V2.4.x → re-index the 1.x indices in place so they work on V2.4.x (now fully on V2) → perform a full cluster restart to V5.6 → re-index the V2 indices so they work on V5 (now fully on V5) → perform a rolling restart to V6.x
  • 152. www.objectrocket.com 152 Lab 4 Upgrading the cluster Objectives: Learn how to: o Upgrade an elasticsearch cluster. Steps: 1. Navigate to /Percona2018/Lab04 2. Read the instructions on Lab04.txt 3. Execute ./run_cluster.sh to begin https://goo.gl/ddaVdS
  • 153. Security ● Authentication ● Authorization ● Encryption ● Audit www.objectrocket.com 153
• 154. Security www.objectrocket.com 154 The Open Source version of ElasticSearch does not provide: - Authentication - Authorization - Encryption To overcome this we will use open-source tools: - Firewall - Reverse proxy - Encryption tools Alternatively, you can buy X-Pack, which provides these features as an integrated security layer.
• 155. Firewall www.objectrocket.com 155 Client communication: iptables -I INPUT 1 -p tcp --dport 9200:9300 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9200:9300 -j REJECT Intra-cluster communication: iptables -I INPUT 1 -p tcp --dport 9300:9400 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9300:9400 -j REJECT
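To confirm the rules landed in the intended positions (the REJECT rules must come after the ACCEPTs), listing with line numbers helps:
iptables -L INPUT -n --line-numbers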
  • 156. Firewall www.objectrocket.com 156 DNS SSH Monitoring tools Allow whatever port your monitoring tool uses. iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A OUTPUT -p tcp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --dport ssh -j ACCEPT iptables -A OUTPUT -p tcp --sport ssh -j ACCEPT
• 157. Reverse Proxy www.objectrocket.com 157 [Diagram: clients send HTTP requests to an Nginx reverse proxy, which applies its rules and forwards the requests to the ES nodes - ES listens on 9200, while the proxy advertises port 8080 to the clients]
• 158. Authentication www.objectrocket.com 158 We are going to use nginx: ngx_http_auth_basic_module On nginx.conf 1 2 3 4 1) Listens on port 19200 2) Enables basic auth 3) Password file location 4) Proxies to ES <host>:<port> server { listen *:19200; location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd; proxy_pass http://localhost:9200; proxy_read_timeout 90; } }
• 159. Authentication www.objectrocket.com 159 Create users: - htpasswd -c /var/data/nginx/.htpasswd <username> - You will be prompted for the password - Alternatively, use the -b flag and provide the password on the command line Access Elasticsearch: curl <host> #Returns 301 curl <host>:19200 #Returns 401 Authorization Required curl <username>:<password>@<host>:19200 #Returns Elasticsearch output
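A quick sketch of the -b variant plus a test call, with purely hypothetical credentials:
htpasswd -b /var/data/nginx/.htpasswd esuser 's3cretpass'
curl -u esuser:s3cretpass http://localhost:19200/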
• 160. Adding SSL to the mix www.objectrocket.com 160 Use nginx as a reverse proxy to encrypt client communication On nginx.conf Certificates: - Can be obtained from a commercial certificate authority - Or self-signed ssl on; ssl_certificate /etc/ssl/certs/<cert>.crt; ssl_certificate_key /etc/ssl/private/<key>.key; ssl_session_cache shared:SSL:10m;
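If you go the self-signed route, a certificate can be generated with openssl; a minimal sketch with hypothetical file names and CN:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/ssl/private/elastic.key -out /etc/ssl/certs/elastic.crt -subj '/CN=es-proxy.example.com'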
• 161. Authorization www.objectrocket.com 161 - Authentication alone is not enough. - Once allowed access, the client can do whatever it wants in the cluster. - The simplest form of authorization is to deny specific endpoints location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx-elastic/.htpasswd; if ($request_filename ~ _shutdown) { return 403; break; } } 1 2 1) If the user requests the shutdown endpoint 2) Return 403 curl -X GET -k "esuser:esuser@es-node1-9200:19200/_cluster/nodes/_shutdown/" Produces a 403 Forbidden
• 162. Authorization www.objectrocket.com 162 Assign roles using nginx - for example, a read-only user: 1 2 3 4 1) Listens on port 19500 2) Enables basic auth 3) Regex match for the allowed endpoints 4) Forwards to ES <host>:<port> server { listen 19500; auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd_users; location / { return 403; } location ~* ^(/_search|/_analyze) { proxy_pass http://<es_node>; proxy_redirect off; }}
• 163. Encryption & Co www.objectrocket.com 163 Protecting the data on disk is also essential. LUKS (Linux Unified Key Setup): - encrypts entire block devices - CPUs with AES-NI (Advanced Encryption Standard Instruction Set) can accelerate dm-crypt - supports a limited number of passphrases (key slots) - keep the keys in a safe place Always audit: - Access logs - Ports - Backups - Physical access
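A minimal LUKS sketch, assuming a dedicated (hypothetical) data device /dev/sdb and the default Elasticsearch data path:
cryptsetup luksFormat /dev/sdb                        # writes the LUKS header, prompts for a passphrase
cryptsetup luksOpen /dev/sdb es_data                  # maps the decrypted device to /dev/mapper/es_data
mkfs.xfs /dev/mapper/es_data
mount /dev/mapper/es_data /var/lib/elasticsearch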
  • 164. Working with Data – Advanced Operations ● Alias ● Bulk API ● Aggregations ● … www.objectrocket.com 164
• 165. Pagination www.objectrocket.com 165 • By default, Elasticsearch returns the first 10 hits of your query. The size parameter is used to specify the number of hits. GET shakespeare/_search?pretty { "size": 20, "query": { "match": { "play_name": "Hamlet"} } } But this is just the first page of hits
  • 166. Pagination - from www.objectrocket.com 166 • Add the from parameter to a query to specify the offset from the first result you want to fetch (it defaults to 0). GET shakespeare/_search?pretty { "from": 20, "size": 20, "query": { "match": { "play_name": "Hamlet"} } } Get the next page of hits
• 167. Pagination - Scroll www.objectrocket.com 167 • While a search request returns a single "page" of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database • To initiate a scroll search, add the scroll parameter to your search query GET shakespeare/_search?scroll=1m { "size": 1000, "query": { "match_all": {} } } If the scroll is idle for more than 1 minute, it is deleted Number of hits to return per batch
  • 168. Pagination - Scroll www.objectrocket.com 168 • The result from the above request includes the first page of results and a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results. POST /_search/scroll { "scroll" : "1m", "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9la VYtZndUQlNsdDcwakFMNjU1QQ==" } Note that the URL should not include the index name - this is specified in the original search request instead.
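When you finish iterating early, the scroll can be cleared explicitly to free its resources instead of waiting for the timeout; <scroll_id> below is a placeholder for the id returned by the search:
DELETE /_search/scroll { "scroll_id" : ["<scroll_id>"] }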
• 169. Search multiple fields www.objectrocket.com 169 • The multi_match query provides a convenient shorthand for running a match query against multiple fields ‒ by default, the _score from the best field is used (a best_fields search) GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker", "text_entry" ], "type": "best_fields" } } }' 3 fields are queried (which results in 3 scores) and the best score is used
• 170. Search – per-field boosting www.objectrocket.com 170 • If we want to add more weight to hits on a particular field - in this example, let's say we're more interested in the speaker field than in play_name - we can boost the score of a field using the caret (^) symbol GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker^2", "text_entry" ], "type": "best_fields" } } }' We get the same number of hits, but the top hits are different.
• 171. Misspelled words - fuzziness www.objectrocket.com 171 • Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word - Fuzziness is something that can be assigned a value - It refers to the number of character modifications, known as edits, needed to make two words match - Can be set to 0, 1 or 2, or to "auto" Fuzziness = 1: "Hamled" → "Hamlet" (one edit: d→t) Fuzziness = 2: "Hamlled" → "Hamlet" (two edits: drop the extra l, d→t)
  • 172. Add fuzziness to a query www.objectrocket.com 172 GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": "Hamled" } } }' GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": { "query": "Hamled", "fuzziness": 1 }} }}' 0 hits 4244 hits
• 173. Search exact terms www.objectrocket.com 173 • If we need to search for the exact text, we use a match query on the keyword sub-field, which holds the original, non-analyzed string: GET shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry.keyword": "To be, or not to be: that is the question" } } }' Exactly 1 hit
• 174. Sorting www.objectrocket.com 174 • The results of a query are returned in order of relevancy; _score descending is the default sorting for a query • A query can contain a sort clause that specifies one or more fields to sort on, as well as the order (asc or desc) GET /shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry": "question" } }, "sort": [ {"play_name": {"order": "desc"} } ] }' "hits" : [ { "_index" : "shakespeare", "_type" : "doc", "_id" : "55924", "_score" : null, "_source" : {.....} If _score is not part of the sort clause, it is not calculated => fewer compute resources
  • 175. Highlighting www.objectrocket.com 175 • A common use case for search results is to highlight the matched terms. GET /shakespeare/_search?pretty -d '{ "query": { "match_phrase": { "text_entry": "Hamlet" } }, "highlight": { "fields": { "text_entry": {} } } }' "_source" : { "type" : "line", "line_id" : 36184, "play_name" : "Hamlet", "speech_number" : 99, "line_number" : "5.1.269", "speaker" : "QUEEN GERTRUDE", "text_entry" : "Hamlet, Hamlet!" }, "highlight" : { "text_entry" : [ "<em>Hamlet</em>, <em>Hamlet</em>!" ] } } The response contains a highlight section
  • 176. Range query www.objectrocket.com 176 • Matches documents with fields that have terms within a certain range. The type of the Lucene query depends on the field type, for string fields, the TermRangeQuery, while for number/date fields, the query is a NumericRangeQuery • The range query accepts the following parameters: gte, gt, lte, lt, boost GET _search { "query": { "range" : { "age" : { "gte" : 10, "lte" : 20 } } } }
  • 177. Exists query www.objectrocket.com 177 • Returns documents that have at least one non-null value in the original field: • There isn't a missing query, instead use the exists query inside a must_not clause GET /_search { "query": { "exists" : { "field" : "user" } } } GET /_search { "query": { "bool": { "must_not": { "exists": { "field": "user" } } } } }
  • 178. Wildcard query www.objectrocket.com 178 • Matches documents that have fields matching a wildcard expression; • Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. • Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ? GET shakespeare/_search?pretty -d { "query": { "wildcard" : { "play_name" : "Henry*" } } }
  • 179. Regexp query www.objectrocket.com 179 • The regexp query allows you to use regular expression term queries • The "term queries" in that first sentence means that Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field • Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow as well as using lookaround regular expressions. GET shakespeare/_search?pretty -d { "query": { "regexp":{ "play_name": "H.*t"} } }
  • 180. Aggregations www.objectrocket.com 180 • Aggregations are a way to perform analytics on your indexed data • There are four main types of aggregations: - Metric: aggregations that keep track and compute metrics over a set of documents. - Bucketing: aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. - Pipeline: aggregations that aggregate the output of other aggregations and their associated metrics - Matrix: aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting and its functionality is currently experimental
  • 181. Aggregations - Metric www.objectrocket.com 181 • Most metrics are mathematical operations that output a single value: avg, sum, min, max, cardinality • Some metrics output multiple values: stats, percentiles, percentile_ranks • Example: what's the maximum value of the "age" field GET account/_search?pretty -d '{ "size": 0, "aggs": { "max_age": { "max": { "field": "age" } } } }' "aggregations" : { "max_age" : { "value" : 40.0 } } }
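Since stats returns several of these metrics at once, a quick sketch against the same (assumed) account index and age field:
GET account/_search?pretty -d '{ "size": 0, "aggs": { "age_stats": { "stats": { "field": "age" } } } }'
This returns count, min, max, avg and sum in a single response.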
• 182. Aggregations - bucket www.objectrocket.com 182 • Bucket aggregations don't calculate metrics over fields like the metrics aggregations do; instead, they create buckets of documents • Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their "parent" bucket aggregation • The terms aggregation is very handy: it dynamically creates a new bucket for every unique term it encounters in the specified field, and is a good way to get a feel for what your data looks like
  • 183. Aggregations www.objectrocket.com 183 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5 } } } }' • Example: What are the unique play names we have in our index "size" - number of buckets to create (default is 10)
• 184. Aggregations www.objectrocket.com 184 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3045, "sum_other_doc_count" : 91399, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244 }, { "key" : "Coriolanus", "doc_count" : 3992 }, { "key" : "Cymbeline", "doc_count" : 3958 }, { "key" : "Richard III", "doc_count" : 3941 }, { "key" : "Antony and Cleopatra", "doc_count" : 3862 } ]}}} • Notice each bucket has a "key" that represents the distinct value of "field", and a "doc_count" for the number of docs in the bucket
  • 185. Nesting buckets www.objectrocket.com 185 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 1 }, "aggs": { "speakers": { "terms": { "field": "speaker", "size": 5 } } } } } }' The play names are bucketed, then, within each play bucket, our documents are bucketed by speaker.
  • 186. Nesting buckets www.objectrocket.com 186 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3395, "sum_other_doc_count" : 107152, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244, "speakers" : { "doc_count_error_upper_bound" : 48, "sum_other_doc_count" : 1698, "buckets" : [ { "key" : "HAMLET", "doc_count" : 1582 }, { "key" : "KING CLAUDIUS", "doc_count" : 594 }, { "key" : "LORD POLONIUS", "doc_count" : 370 The result of our nested aggregation Notice two special values returned in a terms aggregation: - “doc_count_error_upper_bound”: maximum number of missing documents that could potentially have appeared in a bucket - “sum_other_doc_count”: number of documents that do not appear in any of the buckets
• 187. Bucket sorting www.objectrocket.com 187 • Sorting can be specified using "order": ‒ _count sorts by doc_count (default in terms) ‒ _key sorts alphabetically (default in histogram and date_histogram) • Sorting can also be done on a metric value in a nested aggregation, as shown in the sketch after this slide GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5, "order": { "_count": "desc" } } } } }'
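A sketch of the nested-metric case: buckets ordered by a sub-aggregation, here a hypothetical avg named avg_line on the numeric line_id field, purely for illustration:
GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5, "order": { "avg_line": "desc" } }, "aggs": { "avg_line": { "avg": { "field": "line_id" } } } } } }'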
  • 188. www.objectrocket.com 188 Lab 5 Advanced Operation Objectives: Learn how to: o Work with mappings o Work with analyzers Steps: 1. Navigate to /Percona2018/Lab05 2. Read the instructions on Lab05.txt https://bit.ly/2D1tXL6