by Darin Briskman, Technical Evangelist, AWS
SQL is a powerful tool to query data, but it doesn't cover everything you might need. Sometimes the precision of SQL is a limitation that can be overcome with the flexibility and inherent ranking of search. Learn how to use AWS services to create fully managed solutions with Amazon Aurora and Amazon Elasticsearch Service, combining the power of query and search. Level: 200
2. AWS Data Services to Accelerate Your Move to the Cloud
Databases to Elevate your Apps (Relational, Non-Relational & In-Memory):
• RDS Open Source, RDS Commercial, Amazon Aurora
• DynamoDB & DAX, ElastiCache
Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake):
• EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
Amazon AI to Drive the Future:
• Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)
Migration for DB Freedom:
• Database Migration, Schema Conversion
4. Amazon RDS: Cheaper, Easier, Better
• Multi-engine support
  – Open Source
  – Commercial
  – Amazon Aurora
• Automated provisioning, patching, scaling, backup/restore, failover
• Use with General Purpose SSD or Provisioned IOPS SSD storage
• High availability with RDS Multi-AZ
5. High Availability Multi-AZ Deployments
• Enterprise-grade fault-tolerant solution for production databases
• Automatic failover
• Synchronous replication
• Inexpensive & enabled with one click
6. Amazon Aurora
Speed and Availability of Commercial, Cost-Effectiveness of Open Source
• Up to 5x the performance of high-end MySQL
• Highly available and durable
• MySQL and PostgreSQL compatible
• 1/10th the cost of commercial-grade databases
• Fastest growing AWS service, ever
7. Write Architecture: MySQL vs. Amazon Aurora
[Diagram: MySQL with a replica writes binlog, data, double-write buffer, log, and FRM files to mirrored storage in two data centers (DC 1, DC 2). Amazon Aurora runs a primary instance in AZ 1 with replica instances in AZ 2 and AZ 3, using asynchronous 4/6 quorum distributed writes across the storage layer and backup to Amazon S3.]
Aurora: Faster Because It Is Built for AWS
MySQL I/O profile for 30 min. Sysbench run:
• 780K transactions
• 7,388K I/Os per million txns (excludes mirroring, standby)
• Average 7.4 I/Os per transaction
Aurora I/O profile for 30 min. Sysbench run:
• 27,378K transactions (35x more)
• 0.95 I/Os per transaction (7.7x less)
10. Amazon Elasticsearch Service
[Diagram: Data flows through Amazon Route 53 and Elastic Load Balancing to the Elasticsearch API, with AWS IAM for access control, Amazon CloudWatch for monitoring, and AWS CloudTrail for auditing.]
11. Ways and means
• All data eventually enters at the domain endpoint
• Data can come in single documents (PUT) or batches (_bulk)
• Some services have direct integration
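As a sketch of the two ingestion styles above, the following builds the request URL for a single-document PUT and the NDJSON body for a _bulk request. The hostname, index, and document values are placeholders, not from the deck:

```python
import json

def single_doc_request(index, doc_type, doc_id, doc):
    """Build the URL and body for a single-document PUT."""
    url = f"http://hostname/{index}/{doc_type}/{doc_id}"
    return url, json.dumps(doc)

def bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk body: an action line, then a source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

docs = [{"make": "acme", "inventory": 3}, {"make": "globex", "inventory": 7}]
body = bulk_body("product", "cellphone", docs)
```

The body would then be POSTed to the domain endpoint's /_bulk path with a Content-Type of application/x-ndjson.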
13. Kinesis Firehose delivery architecture with transformations
[Diagram: Source records flow from a data source into a Firehose delivery stream, which invokes a data transformation function. Transformed records are delivered to Amazon Elasticsearch Service; transformation failures and delivery failures are written to an S3 bucket.]
16. Elasticsearch works with structured JSON
{
  "name" : {
    "first" : "Jon",
    "last" : "Smith"
  },
  "age" : 26,
  "city" : "palo alto",
  "years_employed" : 4,
  "interests" : [
    "guitar",
    "sports"
  ]
}
• Documents contain fields – name/value pairs
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single or array
• When you send documents to Elasticsearch, they should arrive as JSON*
*ES 5 can work with unstructured documents
17. If your data is not already in structured JSON, you must transform it, creating structured JSON that Elasticsearch "understands"
18. The most basic way to transform data
• Run a script in Amazon EC2, Lambda, etc. that reads data from your data source, creates JSON documents, and ships them to Amazon Elasticsearch Service directly
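A minimal sketch of such a transform script, parsing an Apache common-log line (the same sample line the deck uses later) into a structured JSON document. The regex and field names are this sketch's assumptions:

```python
import re

# Apache common log format, matching the sample line used later in the deck.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" (?P<status>\d+) (?P<size>\d+)'
)

def to_document(line):
    """Parse one log line into a structured document dict, or None on no match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])  # keep numerics numeric for mapping types
    doc["size"] = int(doc["size"])
    return doc

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
doc = to_document(line)
```

Each resulting document would then be PUT (or batched via _bulk) to the Amazon Elasticsearch Service domain endpoint.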
19. Logstash simplifies transformation
• Logstash is open-source ETL over streams. Run colocated with your application or read from your source
• Many input plugins and output plugins make it easy to connect to Logstash
• Grok pattern matching to pull out values and re-write
20. Elasticsearch 5 ingest processors
When you index documents, you can specify a pipeline. The pipeline can have a series of processors that pre-process the data before indexing.
Twenty processors are available; some are simple:
{ "append":
  { "field": "field1",
    "value": ["item2", "item3", "item4"] } }
Others are more complex, like the Grok processor for regex with aliased expressions.
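A sketch of the request body that would register such a pipeline via PUT /_ingest/pipeline/my-pipeline (the pipeline id and description are placeholders):

```python
import json

# Body for PUT /_ingest/pipeline/my-pipeline; the id and description are
# placeholders, the append processor mirrors the slide's example.
pipeline = {
    "description": "Append extra items to field1 before indexing",
    "processors": [
        {
            "append": {
                "field": "field1",
                "value": ["item2", "item3", "item4"],
            }
        }
    ],
}
body = json.dumps(pipeline)
```

Documents indexed with ?pipeline=my-pipeline would then pass through the append processor before being stored.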
21. Firehose transformations add robust delivery
[Diagram: Source records flow from a data source into a Firehose delivery stream, which calls a data transformation function; transformed records are delivered to Amazon Elasticsearch Service, while transformation failures and delivery failures are written to an S3 bucket.]
• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms tracked and delivered to S3
22. Firehose transformations add robust delivery
[Diagram: Source records flow from a data source through a Firehose delivery stream and a data transformation function; transformed records are delivered both to Amazon Elasticsearch Service and to an intermediate Amazon S3 bucket, while transformation failures and delivery failures are written to a backup S3 bucket.]
• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms tracked and delivered to S3
24. Cluster is a collection of nodes
[Diagram: An Amazon ES cluster with dedicated master nodes and three data-node instances; each instance holds a mix of primary and replica shards (1, 2, 3).]
• Dedicated master nodes
• Data nodes: queries and updates
25. Data pattern
[Diagram: An Amazon ES cluster holding daily indices logs_01.21.2017 through logs_01.27.2017, each split into Shard 1–3, with documents containing fields such as host, ident, auth, timestamp, etc.]
• One index per day
• Each index has multiple shards
• Each shard contains a set of documents
• Each document contains a set of fields and values
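The one-index-per-day pattern above can be sketched as a small naming helper that matches the logs_MM.DD.YYYY names in the diagram (the prefix and date are placeholders):

```python
from datetime import datetime, timezone

def daily_index(prefix, when):
    """Name a daily index the way the deck's examples do: prefix_MM.DD.YYYY."""
    return f"{prefix}_{when.strftime('%m.%d.%Y')}"

# Each day's documents are written to that day's index.
name = daily_index("logs", datetime(2017, 1, 21, tzinfo=timezone.utc))
```

Daily indices make retention simple: dropping old data is deleting an old index rather than deleting individual documents.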
26. Indices and Mappings
Index: product
• Type: cellphone – documentId; Fields: make (keyword), inventory (int), location (geo point)
• Type: reviews – documentId; Fields: make (keyword), review (text), rating (float), date (date)
http://hostname/product/cellphone/1
http://hostname/product/reviews/1
28. Shards
- Indexes are split into multiple shards
- Primary shards are defined at index creation
- Defaults to 5 primary shards and 1 replica per primary
- Shards allow
  - Horizontal scale
  - Distributed, parallelized operations to increase throughput
  - Replicas to provide high availability in case of failures
29. Shards … contd
- A shard is a Lucene index
- The number of replica shards can be changed on the fly, but not the number of primary shards
- To change the number of primary shards, the index needs to be re-created
- Shards are automatically balanced when the cluster is re-sized
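The fixed-primaries vs. adjustable-replicas distinction above can be sketched as the two request bodies involved (the index name is a placeholder; the settings keys are the standard Elasticsearch ones):

```python
import json

# Body for PUT /product at creation time: the primary count is fixed here.
create_body = json.dumps({
    "settings": {
        "index": {
            "number_of_shards": 5,    # cannot be changed after creation
            "number_of_replicas": 1,  # can be changed on the fly
        }
    }
})

# Body for PUT /product/_settings later: only the replica count can change.
update_body = json.dumps({
    "index": {"number_of_replicas": 2}
})
```

Changing number_of_shards requires creating a new index with the desired count and reindexing into it.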
30. 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
Document fields: host, ident, auth, timestamp, verb, request, status, size
Field index for host (decomposed values):
199.72.81.55
unicomp6.unicomp.net
199.120.110.21
burger.letters.com
205.212.115.106
d104.aa.net
Postings: 1, 4, 8, 12, 30, 42, 58, 100...
Elasticsearch creates an index for each field, containing the decomposed values of those fields
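The per-field index described above can be sketched as an inverted index: each field value maps to the sorted list of document IDs (postings) containing it. The sample documents here are invented for illustration:

```python
from collections import defaultdict

def build_field_index(docs, field):
    """Build an inverted index for one field: value -> sorted list of doc IDs."""
    postings = defaultdict(list)
    for doc_id, doc in docs.items():
        postings[doc[field]].append(doc_id)
    return {value: sorted(ids) for value, ids in postings.items()}

# Hypothetical documents keyed by ID, shaped like the deck's log example.
docs = {
    1: {"host": "199.72.81.55", "verb": "GET"},
    2: {"host": "unicomp6.unicomp.net", "verb": "GET"},
    4: {"host": "199.72.81.55", "verb": "GET"},
    9: {"host": "d104.aa.net", "verb": "POST"},
}
host_index = build_field_index(docs, "host")
```

Lookups against such an index return postings lists directly, which is what makes the merge step on the next slide fast.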
31. host:199.72.81.55 AND verb:GET
Look up:
  199.72.81.55 → 1, 4, 8, 12, 30, 42, 58, 100...
  GET → 1, 4, 9, 50, 58, 75, 90, 103...
AND merge → 1, 4, 58
Score → 1.2, 3.7, 0.4
Sort → 4, 1, 58
The index data structures support fast retrieval and merging. Scoring and sorting support best-match retrieval
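The merge-then-rank flow above can be sketched with a two-pointer intersection over sorted postings lists, using the slide's own numbers. The scores here are the slide's illustrative values, not real relevance scores:

```python
def and_merge(postings_a, postings_b):
    """Intersect two sorted postings lists with a two-pointer merge."""
    i, j, merged = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            merged.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return merged

host_postings = [1, 4, 8, 12, 30, 42, 58, 100]
verb_postings = [1, 4, 9, 50, 58, 75, 90, 103]
matches = and_merge(host_postings, verb_postings)

# Illustrative scores from the slide; real scores come from Elasticsearch's ranking.
scores = {1: 1.2, 4: 3.7, 58: 0.4}
ranked = sorted(matches, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because both lists are sorted, the merge is linear in their combined length, which is why AND queries over large indices stay fast.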
32. Index and Document Command Examples
Create an index called product:
$ curl -XPUT 'http://hostname/product/'
Get a list of indices:
$ curl 'http://hostname/_cat/indices'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open product 95SQ4TS 5 1 0 0 260b 260b
34. What happens at Index Operation
http PUT – http://hostname/product/cellphone/1
[Diagram: A three-instance Elasticsearch cluster, each instance holding a mix of primary and replica shards]
1. Indexing operation arrives at a node
2. The target shard is determined by hashing the document ID
3. The current node forwards the document to the node holding the primary shard
4. The primary shard ensures all replica shards replay the same indexing operation
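Step 2 above can be sketched as hash-modulo routing. Elasticsearch actually uses a Murmur3 hash of the routing value (the document ID by default); zlib.crc32 here is just a stand-in for the sketch:

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    """Pick a primary shard from the document ID.

    Elasticsearch hashes the routing value (the document ID by default)
    with Murmur3; zlib.crc32 is a stand-in hash for this sketch.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards

shard = route_to_shard(1, 5)
```

Because routing is a pure function of the document ID and the primary count, the same ID always lands on the same shard; this is also why the primary count cannot change without re-creating the index.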
35. Mappings
1. Mappings are used to define types of documents
2. Mappings define the various fields in a document
3. Mapping types:
   1. Core
      1. Text or keyword
      2. Numeric
      3. Date
      4. Boolean
   2. Arrays and multi-fields
      1. Arrays – "tags": ["blue", "red"]
      2. Multi-fields – index the same data with different settings
   3. Pre-defined fields
      1. _ttl, _size
      2. _uid, _id, _type, _index
      3. _all, _source
36. Mapping command examples
Create an index called product with mapping, cellphone, and field make as type text:
curl -XPUT 'http://hostname/product' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "cellphone": {
      "properties": {
        "make": {
          "type": "text"
        }
      }
    }
  }
}'
37. Mapping command examples
Add a new mapping, reviews, with fields review, as text, and rating, as integer, to the existing index, product:
curl -XPUT 'http://hostname/product/_mapping/reviews' -H 'Content-Type: application/json' -d '
{
  "properties": {
    "review": {
      "type": "text"
    },
    "rating": {
      "type": "integer"
    }
  }
}'
38. Mapping command examples
Add a new field, inventory, as integer, to the existing mapping, cellphone, in index product:
curl -XPUT 'http://hostname/product/_mapping/cellphone' -H 'Content-Type: application/json' -d '
{
  "properties": {
    "inventory": {
      "type": "integer"
    }
  }
}'