by Darin Briskman, Technical Evangelist, AWS
SQL is a powerful tool to query data, but it doesn't cover everything you might need. Sometimes the precision of SQL is a limitation that can be overcome with the flexibility and inherent ranking of search. Learn how to use AWS services to create fully managed solutions with Amazon Aurora and Amazon Elasticsearch Service, combining the power of query and search. Level: 200
2. AWS Data Services to Accelerate Your Move to the Cloud
Databases to Elevate your Apps (Relational, Non-Relational & In-Memory):
• RDS Open Source, RDS Commercial, Amazon Aurora
• DynamoDB & DAX, ElastiCache
Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake):
• EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
Amazon AI to Drive the Future:
• Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)
Migration for DB Freedom:
• Database Migration, Schema Conversion
4. Amazon RDS: Cheaper, Easier, Better
• Multi-engine support
  – Open Source
  – Commercial
  – Amazon Aurora
• Automated provisioning, patching, scaling, backup/restore, failover
• Use with General Purpose SSD or Provisioned IOPS SSD storage
• High availability with RDS Multi-AZ
5. High Availability Multi-AZ Deployments
• Enterprise-grade fault-tolerant solution for production databases
• Automatic failover
• Synchronous replication
• Inexpensive & enabled with one click
6. Amazon Aurora
Speed and Availability of Commercial, Cost-Effectiveness of Open Source
• Up to 5x the performance of high-end MySQL
• Highly available and durable
• MySQL and PostgreSQL compatible
• 1/10th the cost of commercial-grade databases
• Fastest growing AWS service, ever
7. Write Architecture: MySQL vs. Amazon Aurora
[Diagram: MySQL with a replica writes binlog, data, double-write buffer, log, and FRM files to mirrored storage in two data centers (DC 1, DC 2). Amazon Aurora runs a primary instance in AZ 1 with replica instances in AZ 2 and AZ 3, using asynchronous 4/6 quorum distributed writes across the storage layer and backup to Amazon S3.]
Aurora: Faster Because It Is Built for AWS
MySQL I/O profile for 30 min. Sysbench run:
• 780K transactions
• 7,388K I/Os per million txns (excludes mirroring, standby)
• Average 7.4 I/Os per transaction
Aurora I/O profile for 30 min. Sysbench run:
• 27,378K transactions (35x more)
• 0.95 I/Os per transaction (7.7x less)
10. Amazon Elasticsearch Service
[Diagram: Data flows through Amazon Route 53 and Elastic Load Balancing to the Elasticsearch API, with AWS IAM for access control, Amazon CloudWatch for monitoring, and AWS CloudTrail for auditing.]
11. Ways and means
• All data eventually enters at the domain endpoint
• Data can come in single documents (PUT) or batches (_bulk)
• Some services have direct integration
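As a sketch of the two ingestion styles above, the following builds the request URL for a single-document PUT and the NDJSON body for a _bulk request. The hostname, index, and document values are placeholders, not from the deck:

```python
import json

def single_doc_request(index, doc_type, doc_id, doc):
    """Build the URL and body for a single-document PUT."""
    url = f"http://hostname/{index}/{doc_type}/{doc_id}"
    return url, json.dumps(doc)

def bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk body: an action line, then a source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

docs = [{"make": "acme", "inventory": 3}, {"make": "globex", "inventory": 7}]
body = bulk_body("product", "cellphone", docs)
```

The body would then be POSTed to the domain endpoint's /_bulk path with a Content-Type of application/x-ndjson.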
13. Kinesis Firehose delivery architecture with transformations
[Diagram: Source records flow from a data source into a Firehose delivery stream, which invokes a data transformation function. Transformed records are delivered to Amazon Elasticsearch Service; transformation failures and delivery failures are written to an S3 bucket.]
16. Elasticsearch works with structured JSON
{
  "name" : {
    "first" : "Jon",
    "last" : "Smith"
  },
  "age" : 26,
  "city" : "palo alto",
  "years_employed" : 4,
  "interests" : [
    "guitar",
    "sports"
  ]
}
• Documents contain fields – name/value pairs
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single or array
• When you send documents to Elasticsearch, they should arrive as JSON*
*ES 5 can work with unstructured documents
17. If your data is not already in structured JSON, you must transform it, creating structured JSON that Elasticsearch "understands"
18. The most basic way to transform data
• Run a script in Amazon EC2, Lambda, etc. that reads data from your data source, creates JSON documents, and ships them to Amazon Elasticsearch Service directly
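A minimal sketch of such a transform script, parsing an Apache common-log line (the same sample line the deck uses later) into a structured JSON document. The regex and field names are this sketch's assumptions:

```python
import re

# Apache common log format, matching the sample line used later in the deck.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" (?P<status>\d+) (?P<size>\d+)'
)

def to_document(line):
    """Parse one log line into a structured document dict, or None on no match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])  # keep numerics numeric for mapping types
    doc["size"] = int(doc["size"])
    return doc

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
doc = to_document(line)
```

Each resulting document would then be PUT (or batched via _bulk) to the Amazon Elasticsearch Service domain endpoint.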
19. Logstash simplifies transformation
• Logstash is open-source ETL over streams. Run colocated with your application or read from your source
• Many input plugins and output plugins make it easy to connect to Logstash
• Grok pattern matching to pull out values and re-write
20. Elasticsearch 5 ingest processors
When you index documents, you can specify a pipeline. The pipeline can have a series of processors that pre-process the data before indexing.
Twenty processors are available; some are simple:
{ "append":
  { "field": "field1",
    "value": ["item2", "item3", "item4"] } }
Others are more complex, like the Grok processor for regex with aliased expressions.
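A sketch of the request body that would register such a pipeline via PUT /_ingest/pipeline/my-pipeline (the pipeline id and description are placeholders):

```python
import json

# Body for PUT /_ingest/pipeline/my-pipeline; the id and description are
# placeholders, the append processor mirrors the slide's example.
pipeline = {
    "description": "Append extra items to field1 before indexing",
    "processors": [
        {
            "append": {
                "field": "field1",
                "value": ["item2", "item3", "item4"],
            }
        }
    ],
}
body = json.dumps(pipeline)
```

Documents indexed with ?pipeline=my-pipeline would then pass through the append processor before being stored.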
21. Firehose transformations add robust delivery
[Diagram: Source records flow from a data source into a Firehose delivery stream, which calls a data transformation function; transformed records are delivered to Amazon Elasticsearch Service, while transformation failures and delivery failures are written to an S3 bucket.]
• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms tracked and delivered to S3
22. Firehose transformations add robust delivery
[Diagram: Source records flow from a data source through a Firehose delivery stream and a data transformation function; transformed records are delivered both to Amazon Elasticsearch Service and to an intermediate Amazon S3 bucket, while transformation failures and delivery failures are written to a backup S3 bucket.]
• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms tracked and delivered to S3
24. Cluster is a collection of nodes
[Diagram: An Amazon ES cluster with dedicated master nodes and three data-node instances; each instance holds a mix of primary and replica shards (1, 2, 3).]
• Dedicated master nodes
• Data nodes: queries and updates
25. Data pattern
[Diagram: An Amazon ES cluster holding daily indices logs_01.21.2017 through logs_01.27.2017, each split into Shard 1–3, with documents containing fields such as host, ident, auth, timestamp, etc.]
• One index per day
• Each index has multiple shards
• Each shard contains a set of documents
• Each document contains a set of fields and values
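The one-index-per-day pattern above can be sketched as a small naming helper that matches the logs_MM.DD.YYYY names in the diagram (the prefix and date are placeholders):

```python
from datetime import datetime, timezone

def daily_index(prefix, when):
    """Name a daily index the way the deck's examples do: prefix_MM.DD.YYYY."""
    return f"{prefix}_{when.strftime('%m.%d.%Y')}"

# Each day's documents are written to that day's index.
name = daily_index("logs", datetime(2017, 1, 21, tzinfo=timezone.utc))
```

Daily indices make retention simple: dropping old data is deleting an old index rather than deleting individual documents.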
26. Indices and Mappings
Index: product
• Type: cellphone – documentId; Fields: make (keyword), inventory (int), location (geo point)
• Type: reviews – documentId; Fields: make (keyword), review (text), rating (float), date (date)
http://hostname/product/cellphone/1
http://hostname/product/reviews/1
28. Shards
- Indexes are split into multiple shards
- Primary shards are defined at index creation
- Defaults to 5 primary shards and 1 replica per primary
- Shards allow
  - Horizontal scale
  - Distributed, parallelized operations to increase throughput
  - Replicas to provide high availability in case of failures
29. Shards … contd
- A shard is a Lucene index
- The number of replica shards can be changed on the fly, but not the number of primary shards
- To change the number of primary shards, the index needs to be re-created
- Shards are automatically balanced when the cluster is re-sized
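The fixed-primaries vs. adjustable-replicas distinction above can be sketched as the two request bodies involved (the index name is a placeholder; the settings keys are the standard Elasticsearch ones):

```python
import json

# Body for PUT /product at creation time: the primary count is fixed here.
create_body = json.dumps({
    "settings": {
        "index": {
            "number_of_shards": 5,    # cannot be changed after creation
            "number_of_replicas": 1,  # can be changed on the fly
        }
    }
})

# Body for PUT /product/_settings later: only the replica count can change.
update_body = json.dumps({
    "index": {"number_of_replicas": 2}
})
```

Changing number_of_shards requires creating a new index with the desired count and reindexing into it.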
30. 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
Document fields: host, ident, auth, timestamp, verb, request, status, size
Field index for host (decomposed values):
199.72.81.55
unicomp6.unicomp.net
199.120.110.21
burger.letters.com
205.212.115.106
d104.aa.net
Postings: 1, 4, 8, 12, 30, 42, 58, 100...
Elasticsearch creates an index for each field, containing the decomposed values of those fields
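The per-field index described above can be sketched as an inverted index: each field value maps to the sorted list of document IDs (postings) containing it. The sample documents here are invented for illustration:

```python
from collections import defaultdict

def build_field_index(docs, field):
    """Build an inverted index for one field: value -> sorted list of doc IDs."""
    postings = defaultdict(list)
    for doc_id, doc in docs.items():
        postings[doc[field]].append(doc_id)
    return {value: sorted(ids) for value, ids in postings.items()}

# Hypothetical documents keyed by ID, shaped like the deck's log example.
docs = {
    1: {"host": "199.72.81.55", "verb": "GET"},
    2: {"host": "unicomp6.unicomp.net", "verb": "GET"},
    4: {"host": "199.72.81.55", "verb": "GET"},
    9: {"host": "d104.aa.net", "verb": "POST"},
}
host_index = build_field_index(docs, "host")
```

Lookups against such an index return postings lists directly, which is what makes the merge step on the next slide fast.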
31. host:199.72.81.55 AND verb:GET
Look up:
  199.72.81.55 → 1, 4, 8, 12, 30, 42, 58, 100...
  GET → 1, 4, 9, 50, 58, 75, 90, 103...
AND merge → 1, 4, 58
Score → 1.2, 3.7, 0.4
Sort → 4, 1, 58
The index data structures support fast retrieval and merging. Scoring and sorting support best-match retrieval
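The merge-then-rank flow above can be sketched with a two-pointer intersection over sorted postings lists, using the slide's own numbers. The scores here are the slide's illustrative values, not real relevance scores:

```python
def and_merge(postings_a, postings_b):
    """Intersect two sorted postings lists with a two-pointer merge."""
    i, j, merged = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            merged.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return merged

host_postings = [1, 4, 8, 12, 30, 42, 58, 100]
verb_postings = [1, 4, 9, 50, 58, 75, 90, 103]
matches = and_merge(host_postings, verb_postings)

# Illustrative scores from the slide; real scores come from Elasticsearch's ranking.
scores = {1: 1.2, 4: 3.7, 58: 0.4}
ranked = sorted(matches, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because both lists are sorted, the merge is linear in their combined length, which is why AND queries over large indices stay fast.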
32. Index and Document Command Examples
Create an index called product:
$ curl -XPUT 'http://hostname/product/'
Get a list of indices:
$ curl 'http://hostname/_cat/indices'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open product 95SQ4TS 5 1 0 0 260b 260b
34. What happens at Index Operation
http PUT – http://hostname/product/cellphone/1
[Diagram: A three-instance Elasticsearch cluster, each instance holding a mix of primary and replica shards]
1. Indexing operation arrives at a node
2. The target shard is determined by hashing the document ID
3. The current node forwards the document to the node holding the primary shard
4. The primary shard ensures all replica shards replay the same indexing operation
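Step 2 above can be sketched as hash-modulo routing. Elasticsearch actually uses a Murmur3 hash of the routing value (the document ID by default); zlib.crc32 here is just a stand-in for the sketch:

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    """Pick a primary shard from the document ID.

    Elasticsearch hashes the routing value (the document ID by default)
    with Murmur3; zlib.crc32 is a stand-in hash for this sketch.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards

shard = route_to_shard(1, 5)
```

Because routing is a pure function of the document ID and the primary count, the same ID always lands on the same shard; this is also why the primary count cannot change without re-creating the index.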
35. Mappings
1. Mappings are used to define types of documents
2. Mappings define the various fields in a document
3. Mapping types:
   1. Core
      1. Text or keyword
      2. Numeric
      3. Date
      4. Boolean
   2. Arrays and multi-fields
      1. Arrays – "tags": ["blue", "red"]
      2. Multi-fields – index the same data with different settings
   3. Pre-defined fields
      1. _ttl, _size
      2. _uid, _id, _type, _index
      3. _all, _source
36. Mapping command examples
Create an index called product with mapping, cellphone, and field make as type text:
curl -XPUT 'http://hostname/product' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "cellphone": {
      "properties": {
        "make": {
          "type": "text"
        }
      }
    }
  }
}'
37. Mapping command examples
Add a new mapping, reviews, with fields review, as text, and rating, as integer, to the existing index, product:
curl -XPUT 'http://hostname/product/_mapping/reviews' -H 'Content-Type: application/json' -d '
{
  "properties": {
    "review": {
      "type": "text"
    },
    "rating": {
      "type": "integer"
    }
  }
}'
38. Mapping command examples
Add a new field, inventory, as integer, to the existing mapping, cellphone, in index product:
curl -XPUT 'http://hostname/product/_mapping/cellphone' -H 'Content-Type: application/json' -d '
{
  "properties": {
    "inventory": {
      "type": "integer"
    }
  }
}'