© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Adding Search to Amazon DynamoDB

Darin Briskman
AWS Technical Evangelist
briskman@amazon.com
AWS Data Services to Accelerate Your Move to the Cloud

Databases to Elevate your Apps
• Relational: RDS Open Source, RDS Commercial, Aurora
• Non-Relational & In-Memory: DynamoDB & DAX, ElastiCache

Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake)
• EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue

Amazon AI to Drive the Future
• Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)

Migration for DB Freedom
• Database Migration, Schema Conversion
DynamoDB: Non-Relational Managed Database Service

• Schemaless data model
• Consistent low-latency performance
• Predictable provisioned throughput
• Seamless scalability with no storage limits
• High durability & availability (replication across 3 facilities)
• Easy administration – we scale for you!
• Low cost

DynamoDB Accelerator (DAX) offers caching without coding for sub-millisecond read latency and up to 10x throughput (App → DAX → DynamoDB).
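The read-through pattern that DAX automates can be sketched in plain Python. This is illustrative only: the table and cache below are stand-in dicts, not the DynamoDB or DAX APIs (real DAX is a drop-in client, with no cache code to write).

```python
# Read-through caching pattern that DAX automates (illustrative sketch).
table = {"order#1": {"OrderId": 1, "CustomerId": 1}}  # stand-in for a DynamoDB table
cache = {}                                            # stand-in for the DAX item cache

def get_item(key):
    """Serve from cache when possible; fall back to the table and populate."""
    if key in cache:
        return cache[key], "cache-hit"
    item = table.get(key)
    if item is not None:
        cache[key] = item  # populate on miss so the next read is served from memory
    return item, "cache-miss"

first = get_item("order#1")   # miss: reads the table, warms the cache
second = get_item("order#1")  # hit: served from the cache
```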
Highly available and durable

Data is always replicated to three Availability Zones (3-way replication).

[Diagram: the CustomerOrders table spread across Availability Zones A, B, and C. Each zone holds a copy of partitions A, B, and C on hosts 1–9; an item such as {OrderId: 1, CustomerId: 1, ASIN: [B00X4WHP5E]} is placed by hashing its partition key (Hash(1) = 7B) and replicated to all three zones.]
Consistently fast at any scale

[Chart: requests (millions) vs. latency (milliseconds), showing consistent single-digit millisecond latency as request volume grows.]
Scales throughput automatically (Auto Scaling)

Specify: 1) target utilization in percent, 2) upper and lower bounds.
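The target-tracking rule can be sketched as a pure function. This is a simplification of our own: the real service also smooths consumed capacity over time and rate-limits how often it adjusts.

```python
# Simplified sketch of DynamoDB Auto Scaling's target-tracking rule.
def desired_capacity(consumed, target_pct, lower, upper):
    """Propose provisioned throughput so utilization trends toward
    target_pct, clamped to the configured lower/upper bounds."""
    proposed = consumed * 100.0 / target_pct
    return max(lower, min(upper, proposed))

# Consuming 350 units at a 70% utilization target suggests 500 provisioned units.
suggestion = desired_capacity(consumed=350, target_pct=70, lower=100, upper=1000)
```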
DynamoDB table

Partition key (mandatory)
• Key-value access pattern
• Determines data distribution

Sort key (optional)
• Models 1:N relationships
• Enables rich query capabilities

[Diagram: table items, each with a partition key attribute (A1), a sort key attribute (A2), and further attributes (A3–A7).]
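The composite-key model above can be sketched as a toy in-memory table: the partition key picks an item collection, the sort key orders items within it. This is a model of the access pattern, not the DynamoDB API; the key names and the second ASIN are hypothetical.

```python
# Toy model of a table keyed by (partition key, sort key).
table = {}  # {partition_key: {sort_key: attributes}}

def put_item(pk, sk, attrs):
    table.setdefault(pk, {})[sk] = attrs

def query(pk, sk_prefix=""):
    """Items in one partition, sorted by sort key (models Query with begins_with)."""
    items = table.get(pk, {})
    return [items[sk] for sk in sorted(items) if sk.startswith(sk_prefix)]

put_item("customer#1", "order#2017-01-22", {"ASIN": "B00X4WHP5E"})
put_item("customer#1", "order#2017-01-25", {"ASIN": "B01ABCDXYZ"})  # hypothetical ASIN
orders = query("customer#1", sk_prefix="order#")  # 1:N — all orders for one customer
```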
Local secondary indexes

• Alternate sort key attribute
• Index is local to a partition key
• 10 GB max per partition key value, i.e. LSIs limit the number of sort keys!

[Diagram: the same items re-sorted with A3, A4, or A5 as the alternate sort key while A1 remains the partition key.]
Global secondary indexes

• Alternate partition (+ sort) key
• Sparse
• Can be added or removed anytime
• Reads and writes are provisioned separately for GSIs
• Projection options: KEYS_ONLY, INCLUDE (e.g. INCLUDE A2), ALL

[Diagram: index items keyed by A3 as partition key, with A1 as the table key and projected attributes (A2, A4, A7) depending on the chosen projection.]
DynamoDB Streams

• Ordered stream of item changes
• Exactly once, strictly ordered by key
• Highly durable, scalable
• 24-hour retention
• Sub-second latency
• Compatible with the Kinesis Client Library (KCL)

Shards have a lineage and automatically close after time or when the associated DynamoDB partition splits.

[Diagram: the table's partitions A, B, and C feed updates into DynamoDB stream shards; KCL workers in an application call GetRecords to consume them.]
DynamoDB Streams and Triggers

Triggers:
• Implemented as AWS Lambda functions
• Scale automatically
• C#, Java, Node.js, Python
• Fan out to targets such as Amazon SNS, Amazon ES, and Amazon ElastiCache
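This is the heart of adding search to DynamoDB: a Streams-triggered Lambda that turns stream records into documents for Elasticsearch. The sketch below follows the DynamoDB Streams record shape ({"S": ...}/{"N": ...} attribute values) but handles only string and integer types for brevity; the bulk-index call to the Amazon ES domain is deliberately omitted.

```python
# Sketch of a DynamoDB Streams -> Elasticsearch Lambda (indexing call omitted).
def to_document(record):
    """Flatten one stream record's NewImage into a plain JSON-ready dict."""
    image = record["dynamodb"].get("NewImage", {})
    doc = {}
    for field, typed in image.items():
        (dtype, value), = typed.items()  # e.g. {"S": "Apple"} or {"N": "100"}
        doc[field] = int(value) if dtype == "N" else value  # ints only, for brevity
    return doc

def handler(event, context=None):
    docs = [to_document(r) for r in event["Records"]
            if r["eventName"] in ("INSERT", "MODIFY")]
    # here you would bulk-index `docs` into your Amazon ES domain
    return docs

sample = {"Records": [{"eventName": "INSERT",
                       "dynamodb": {"NewImage": {"make": {"S": "Apple"},
                                                 "inventory": {"N": "100"}}}}]}
```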
Integration with Amazon S3
Integration with Amazon DynamoDB
Integration with Amazon EMR
The Elasticsearch-Hadoop (ES-Hadoop) connector enables Hadoop-stack applications running on EMR or EC2 to power real-time search and analytics with Amazon Elasticsearch Service, as well as visualizations with Kibana.

• Seamlessly moves data between Hadoop and Elasticsearch, allowing Hadoop data (HDFS/EMRFS) to be indexed into, and queried from, Amazon Elasticsearch Service.

[Diagram: Amazon EMR ↔ ES-Hadoop ↔ Amazon ES]
ES-Hadoop Connector – for Spark & Friends

[Diagram: Hadoop applications on EMR/EC2 (Spark, Storm, etc.) use the ES-Hadoop connector to index data into an Amazon Elasticsearch cluster and query it back, in order to analyze, search, visualize, and discover.]

* With Spark SQL, at runtime Spark SQL translates to Query DSL, so data is filtered at the source.
ES-Hadoop Connector – considerations

Performance:
• Since Amazon Elasticsearch cluster nodes are not collocated with EMR cluster nodes, node discovery should be disabled so the ES-Hadoop connector only connects through the declared es.nodes during all operations, including reads and writes: set es.nodes.wan.only to true.
• Since partition-to-partition parallelism cannot be achieved, performance may be impacted at scale, and ES-Hadoop connector tasks should be tested for bottlenecks.
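In practice these settings become connector configuration passed to the job. A minimal sketch as a Python dict of ES-Hadoop properties; the domain endpoint below is a placeholder, not a real one.

```python
# ES-Hadoop connector settings for EMR talking to an Amazon ES domain
# (endpoint is a placeholder; substitute your own domain's endpoint).
es_conf = {
    "es.nodes": "search-mydomain.us-east-1.es.amazonaws.com",  # hypothetical endpoint
    "es.port": "443",
    "es.nodes.wan.only": "true",  # skip node discovery; connect only via es.nodes
}
```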
ES-Hadoop Connector – considerations (contd.)

Security:
• For an EMR cluster in a public subnet, use an IP-based access policy with Amazon Elasticsearch Service to whitelist the EMR IPs.
• For an EMR cluster in a private subnet, use an identity-based access policy with Amazon Elasticsearch Service and install the AWS ES/Kibana proxy on EMR nodes via a bootstrap action.
Kinesis Firehose delivery architecture with transformations

[Diagram: a data source sends source records to a Firehose delivery stream; a data transformation Lambda function produces transformed records for Amazon Elasticsearch Service. Transformation failures and delivery failures are written to an S3 bucket.]
Integration with AWS Lambda

Many AWS event sources can be routed through Lambda into Amazon Elasticsearch Service:

• VPC Flow Logs
• CloudTrail audit logs
• S3 access logs
• ELB access logs
• CloudFront access logs
• SNS notifications
• DynamoDB Streams
• SES inbound email
• Cognito events
• Kinesis Streams
• CloudWatch Events & Alarms
• Config Rules
Elasticsearch works with structured JSON

{
  "name": {
    "first": "Jon",
    "last": "Smith"
  },
  "age": 26,
  "city": "palo alto",
  "years_employed": 4,
  "interests": [
    "guitar",
    "sports"
  ]
}

• Documents contain fields – name/value pairs
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single or array
• When you send documents to Elasticsearch they should arrive as JSON*

*ES 5 can work with unstructured documents

If your data is not already structured JSON, you must transform it, creating structured JSON that Elasticsearch "understands".
The most basic way to transform data

• Run a script in Amazon EC2, Lambda, etc. that reads data from your data source, creates JSON documents, and ships them to Amazon Elasticsearch Service directly

Logstash simplifies transformation

• Logstash is open-source ETL over streams. Run it colocated with your application or read from your source
• Many input and output plugins make it easy to connect to Logstash
• Grok pattern matching to pull out values and rewrite
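As a sketch of the first approach (a script that reads raw records and emits JSON), here is a minimal parser for Apache common-log lines. The regex and field names are our own choices, and shipping the result to Amazon ES is omitted.

```python
import json
import re

# Parse Apache common-log lines into JSON documents ready for indexing.
LOG = re.compile(r'(?P<host>\S+) (?P<ident>\S+) (?P<auth>\S+) '
                 r'\[(?P<timestamp>[^\]]+)\] "(?P<verb>\S+) (?P<request>\S+) [^"]*" '
                 r'(?P<status>\d+) (?P<size>\d+)')

def to_json(line):
    """Return a JSON document for one log line, or None if it doesn't parse."""
    m = LOG.match(line)
    return json.dumps(m.groupdict()) if m else None

doc = to_json('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
              '"GET /history/apollo/ HTTP/1.0" 200 6245')
```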
Elasticsearch 5 ingest processors

When you index documents, you can specify a pipeline. The pipeline can have a series of processors that pre-process the data before indexing.

Twenty processors are available; some are simple:

{ "append":
  { "field": "field1",
    "value": ["item2", "item3", "item4"] } }

Others are more complex, like the Grok processor for regex with aliased expressions.
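To see what the append processor above does, here is a small simulation: the pipeline definition mirrors the ingest-pipeline JSON, and apply_append (our own helper, not an Elasticsearch API) mimics its effect on a document.

```python
# Simulating the effect of the "append" ingest processor on a document.
pipeline = {"processors": [
    {"append": {"field": "field1", "value": ["item2", "item3", "item4"]}}
]}

def apply_append(doc, processor):
    """Append the processor's values to the named field, creating it if absent."""
    spec = processor["append"]
    doc.setdefault(spec["field"], [])
    doc[spec["field"]].extend(spec["value"])
    return doc

doc = apply_append({"field1": ["item1"]}, pipeline["processors"][0])
```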
Firehose transformations add robust delivery

[Diagram: source records flow through the delivery stream and a data transformation Lambda function into Amazon Elasticsearch Service, with transformation and delivery failures landing in an S3 bucket.]

• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms are tracked and delivered to S3
Firehose transformations add robust delivery (contd.)

[Diagram: the same flow, with transformed records staged in an intermediate Amazon S3 bucket and a backup S3 bucket; transformation and delivery failures are captured.]

• Inline calls to Lambda for free-form changes to the underlying data
• Failed transforms are tracked and delivered to S3
Common transformations

• Rewrite to JSON format
• Decorate documents with data from other sources
• Rectify dates
Cluster is a collection of nodes

[Diagram: an Amazon ES cluster with dedicated master nodes and three data-node instances; primary and replica copies of shards 1, 2, and 3 are distributed across instances 1–3.]

Data nodes serve queries and updates.
Data pattern

• One index per day (logs_01.21.2017 … logs_01.27.2017)
• Each index has multiple shards (shard 1, shard 2, shard 3)
• Each shard contains a set of documents
• Each document contains a set of fields and values (host, ident, auth, timestamp, etc.)
Indices and Mappings

Index: product
• Type: cellphone, with fields make (keyword), inventory (integer), location (geo_point)
• Type: reviews, with fields make (keyword), review (text), rating (float), date (date)

Documents are addressed as http://hostname/index/type/id, e.g. http://hostname/product/cellphone/1 and http://hostname/product/reviews/1.
Physical Layout

Cluster:
• 3 instances
• 3 primary shards
• 1 replica per primary

[Diagram: an Elasticsearch cluster of three instances; indexing /product/cellphone/1, /2, and /3 spreads the documents across primary shards 1–3, each with a replica on a different instance.]

An index operation on documents spreads them across shards.
Shards

• Indexes are split into multiple shards
• Primary shards are defined at index creation
• Defaults to 5 primary shards and 1 replica per primary
• Shards allow:
  • Horizontal scale
  • Distributing and parallelizing operations to increase throughput
  • Replicas to provide high availability in case of failures
Shards (contd.)

• A shard is a Lucene index
• The number of replica shards can be changed on the fly, but not the number of primary shards
• To change the number of primary shards, the index needs to be re-created
• Shards are automatically rebalanced when the cluster is resized
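The reason primary count is fixed is the routing formula: a document's shard is hash(routing) modulo the number of primaries (routing defaults to the document id), so changing the primary count would re-route every existing document. A sketch, with md5 standing in for the murmur3 hash real Elasticsearch uses:

```python
import hashlib

# Shard routing: shard = hash(routing) % number_of_primary_shards.
# md5 is a stand-in here purely to get a stable hash; ES uses murmur3.
def route(doc_id, num_primaries=5):
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_primaries

# Every document lands on one of shards 0-4 when there are 5 primaries,
# and the same id always routes to the same shard.
shards = {route(str(i)) for i in range(100)}
```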
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

This log line becomes a document whose fields are host, ident, auth, timestamp, verb, request, status, and size.

Elasticsearch creates an index for each field, containing the decomposed values of those fields. The host field's index, for example, maps each value (199.72.81.55, unicomp6.unicomp.net, 199.120.110.21, burger.letters.com, 205.212.115.106, d104.aa.net, …) to its postings list of matching document ids (1, 4, 8, 12, 30, 42, 58, 100, …).
host:199.72.81.55 AND verb:GET

• Look up the postings list for each term:
  199.72.81.55 → 1, 4, 8, 12, 30, 42, 58, 100, …
  GET → 1, 4, 9, 50, 58, 75, 90, 103, …
• AND-merge the lists → 1, 4, 58
• Score the matches → 1.2, 3.7, 0.4
• Sort by score → 4, 1, 58

The index data structures support fast retrieval and merging. Scoring and sorting support best-match retrieval.
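The lookup-and-merge step above can be sketched directly: intersect two sorted postings lists of document ids the way the AND of two term queries does, using the exact lists from the example.

```python
# Linear merge of two sorted postings lists (the AND of two term queries).
def intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

host_postings = [1, 4, 8, 12, 30, 42, 58, 100]  # host:199.72.81.55
get_postings = [1, 4, 9, 50, 58, 75, 90, 103]   # verb:GET
hits = intersect(host_postings, get_postings)    # the AND-merge step
```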
Index and Document Command Examples

Create an index called product:

$ curl -XPUT 'http://hostname/product/'

Get the list of indices:

$ curl 'http://hostname/_cat/indices'
health status index   uuid    pri rep docs.count docs.deleted store.size pri.store.size
yellow open   product 95SQ4TS 5   1   0          0            260b       260b
Index and Document Command Examples (contd.)

Indexing a document:

$ curl -XPUT 'http://hostname/product/cellphone/1' -H 'Content-Type: application/json' -d'
{
  "make": "Apple",
  "inventory": 100
}'

Retrieving a document:

$ curl -XGET 'http://hostname/product/cellphone/1'
{
  "_index" : "product",
  "_type" : "cellphone",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : { "make": "Apple", "inventory": 100 }
}
What happens at an Index Operation

HTTP PUT http://hostname/product/cellphone/1

1. The indexing operation arrives at a node.
2. The target shard is determined by hashing the document ID.
3. The current node forwards the document to the node holding the primary shard.
4. The primary shard ensures all replica shards replay the same indexing operation.
Mappings

1. Mappings are used to define types of documents.
2. They define the various fields in a document.
3. Mapping types:
   1. Core
      1. Text or keyword
      2. Numeric
      3. Date
      4. Boolean
   2. Arrays and multi-fields
      1. Arrays – "tags": ["blue", "red"]
      2. Multi-fields – index the same data with different settings
   3. Pre-defined fields
      1. _ttl, _size
      2. _uid, _id, _type, _index
      3. _all, _source
Mapping command examples

Create an index called product with a mapping, cellphone, and field make of type text:

curl -XPUT 'http://hostname/product' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "cellphone": {
      "properties": {
        "make": {
          "type": "text"
        }
      }
    }
  }
}'
Mapping command examples (contd.)

Add a new mapping, reviews, with fields review, as text, and rating, as integer, to the existing index, product:

curl -XPUT 'http://hostname/product/_mapping/reviews' -H 'Content-Type: application/json' -d'
{
  "properties": {
    "review": {
      "type": "text"
    },
    "rating": {
      "type": "integer"
    }
  }
}'
Mapping command examples (contd.)

Add a new field, inventory, as integer, to the existing mapping, cellphone, in index product:

curl -XPUT 'http://hostname/product/_mapping/cellphone' -H 'Content-Type: application/json' -d'
{
  "properties": {
    "inventory": {
      "type": "integer"
    }
  }
}'

Más contenido relacionado

La actualidad más candente

ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
Amazon Web Services
 
Virtual AWSome Day Training Sept 2017
Virtual AWSome Day Training Sept 2017Virtual AWSome Day Training Sept 2017
Virtual AWSome Day Training Sept 2017
Amazon Web Services
 
AWSome Day 2016 - Module 1: AWS Introduction and History
AWSome Day 2016 - Module 1: AWS Introduction and HistoryAWSome Day 2016 - Module 1: AWS Introduction and History
AWSome Day 2016 - Module 1: AWS Introduction and History
Amazon Web Services
 

La actualidad más candente (20)

Big Data and Alexa_Voice-Enabled Analytics
Big Data and Alexa_Voice-Enabled Analytics Big Data and Alexa_Voice-Enabled Analytics
Big Data and Alexa_Voice-Enabled Analytics
 
Getting Started with AWS Lambda and Serverless
Getting Started with AWS Lambda and ServerlessGetting Started with AWS Lambda and Serverless
Getting Started with AWS Lambda and Serverless
 
AWS Services for Data Migration - AWS Online Tech Talks
AWS Services for Data Migration - AWS Online Tech TalksAWS Services for Data Migration - AWS Online Tech Talks
AWS Services for Data Migration - AWS Online Tech Talks
 
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
 
Analyzing Your Web and Application Logs
Analyzing Your Web and Application Logs Analyzing Your Web and Application Logs
Analyzing Your Web and Application Logs
 
Amazon Aurora_Deep Dive
Amazon Aurora_Deep DiveAmazon Aurora_Deep Dive
Amazon Aurora_Deep Dive
 
SRV205 Architectures and Strategies for Building Modern Applications on AWS
 SRV205 Architectures and Strategies for Building Modern Applications on AWS SRV205 Architectures and Strategies for Building Modern Applications on AWS
SRV205 Architectures and Strategies for Building Modern Applications on AWS
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies
 
Building High Availability Apps on Lightsail: Load Balancing and Block Storag...
Building High Availability Apps on Lightsail: Load Balancing and Block Storag...Building High Availability Apps on Lightsail: Load Balancing and Block Storag...
Building High Availability Apps on Lightsail: Load Balancing and Block Storag...
 
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser... SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 
Amazon S3_Updates and Best Practices
Amazon S3_Updates and Best Practices Amazon S3_Updates and Best Practices
Amazon S3_Updates and Best Practices
 
SID304 Threat Detection and Remediation with Amazon GuardDuty
 SID304 Threat Detection and Remediation with Amazon GuardDuty SID304 Threat Detection and Remediation with Amazon GuardDuty
SID304 Threat Detection and Remediation with Amazon GuardDuty
 
Virtual AWSome Day Training Sept 2017
Virtual AWSome Day Training Sept 2017Virtual AWSome Day Training Sept 2017
Virtual AWSome Day Training Sept 2017
 
AWSome Day 2016 - Module 1: AWS Introduction and History
AWSome Day 2016 - Module 1: AWS Introduction and HistoryAWSome Day 2016 - Module 1: AWS Introduction and History
AWSome Day 2016 - Module 1: AWS Introduction and History
 
AWS 101
AWS 101AWS 101
AWS 101
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 

Similar a Adding Search to Amazon DynamoDB

Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 

Similar a Adding Search to Amazon DynamoDB (20)

Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
Connecting Your Data Analytics Pipeline
Connecting Your Data Analytics PipelineConnecting Your Data Analytics Pipeline
Connecting Your Data Analytics Pipeline
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
HSBC and AWS Day - Database Options on AWS
HSBC and AWS Day - Database Options on AWSHSBC and AWS Day - Database Options on AWS
HSBC and AWS Day - Database Options on AWS
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
AWS Webcast - Build high-scale applications with Amazon DynamoDB
AWS Webcast - Build high-scale applications with Amazon DynamoDBAWS Webcast - Build high-scale applications with Amazon DynamoDB
AWS Webcast - Build high-scale applications with Amazon DynamoDB
 
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech TalksMigrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
 

Más de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Adding Search to Amazon DynamoDB

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Darin Briskman AWS Technical Evangelist briskman@amazon.com Adding Search to Amazon DynamoDB
  • 2. AWS Data Services to Accelerate Your Move to the Cloud RDS Open Source RDS Commercial Aurora Migration for DB Freedom DynamoDB & DAX ElastiCache EMR Amazon Redshift Redshift Spectrum AthenaElasticsearch Service QuickSightGlue Databases to Elevate your Apps Relational Non-Relational & In-Memory Analytics to Engage your Data Inline Data Warehousing Reporting Data Lake Amazon AI to Drive the Future Lex Polly Rekognition Machine Learning Deep Learning, MXNet Database Migration Schema Conversion
  • 3. AWS Data Services to Accelerate Your Move to the Cloud RDS Open Source RDS Commercial Aurora Migration for DB Freedom DynamoDB & DAX ElastiCache EMR Amazon Redshift Redshift Spectrum AthenaElasticsearch Service QuickSightGlue Lex Polly Rekognition Machine Learning Databases to Elevate your Apps Relational Non-Relational & In-Memory Analytics to Engage your Data Inline Data Warehousing Reporting Data Lake Amazon AI to Drive the Future Deep Learning, MXNet Database Migration Schema Conversion
  • 4. Schemaless data model Consistent low latency performance Predictable provisioned throughput Seamless scalability with no storage limits High durability & availability (replication across 3 facilities) Easy administration – we scale for you! Low cost DynamoDB DAXApp DynamoDB Accelerator (DAX) offers caching without coding for sub-millisecond read latency and up to 10x throughput DynamoDB: Non-Relational Managed Database Service
  • 5. Availability Zone A Partition A Host 4 Host 6 Availability Zone B Availability Zone C Partition APartition A Partition CPartition C Partition C Host 5 Partition B Host 1 Host 3Host 2 Partition B Host 7 Host 9Host 8 Partition B CustomerOrdersTable Data is always replicated to three Availability Zones 3-way replication OrderId: 1 CustomerId: 1 ASIN: [B00X4WHP5E] Hash(1) = 7B Highly available and durable Partition A
  • 6. Availability Zone A Partition A Host 4 Host 6 Availability Zone B Availability Zone C Partition APartition A Partition CPartition C Partition C Host 5 Partition B Host 1 Host 3Host 2 Partition B Host 7 Host 9Host 8 Partition B CustomerOrdersTable Data is always replicated to three Availability Zones 3-way replication OrderId: 1 CustomerId: 1 ASIN: [B00X4WHP5E] Hash(1) = 7B Highly available and durable Partition A
  • 7. Consistently fast at any scale ConsistentSingle-Digit Millisecond Latency Requests(millions) Latency(milliseconds)
  • 8. Scales throughput automatically (Auto Scaling) Specify: 1) Target capacity in percent 2) Upper and lower bound
  • 9. Partition Key Mandatory Key-value access pattern Determines data distribution Optional Model 1:N relationships Enables rich query capabilities DynamoDB table A1 (partition key) A2 (sort key) A3 A4 A7 A1 (partition key) A2 (sort key) A6 A4 A5 A1 (partition key) A2 (sort key) A1 (partition key) A2 (sort key) A3 A4 A5 SortKey Table Items
  • 10. Local secondary indexes • Alternate sort key attribute (A3, A4, or A5 in place of A2) • Index is local to a partition key • 10 GB max per partition key, i.e., LSIs limit the number of sort keys
  • 11. Global secondary indexes • Alternate partition (+sort) key, with the table key (A1) carried along • Sparse • Can be added or removed anytime • Reads and writes provisioned separately for GSIs • Projection options: KEYS_ONLY, INCLUDE (e.g., A2), or ALL
  • 12. DynamoDB Streams Partition A Partition B Partition C Ordered stream of item changes Exactly once, strictly ordered by key Highly durable, scalable 24 hour retention Sub-second latency Compatible with Kinesis Client Library DynamoDB Streams 1 Shards have a lineage and automatically close after time or when the associated DynamoDB partition splits 2 3 Updates KCL Worker Amazon Kinesis Client Library Application KCL Worker KCL Worker GetRecords DynamoDB Table DynamoDB Stream Shards
  • 13. DynamoDB Streams and Triggers AWS Lambda function Amazon SNS  Implemented as AWS Lambda functions  Scale automatically  C#, Java, Node.js, Python Triggers Amazon ES Amazon ElastiCache
  • 16. Integration with Amazon EMR The Elasticsearch-Hadoop (ES-Hadoop) connector enables several Hadoop stack applications running on EMR or EC2 to power real-time search and analytics with Amazon Elasticsearch, as well as visualizations with Kibana. • Seamlessly moves data between Hadoop and Elasticsearch: Hadoop data (HDFS/EMRFS) can be indexed into, and queried from, Amazon Elasticsearch.
  • 17. ES-Hadoop Connector – for Spark & Friends Hadoop applications on EMR/EC2 (Spark, Hive, Storm, …) use the ES-Hadoop connector to index data to, and query data from, the Amazon Elasticsearch cluster, then analyze, search, visualize, and discover. * With Spark SQL, at runtime Spark SQL translates to Query DSL, so data is filtered at the source.
  • 18. ES-Hadoop Connector – considerations • Performance: Since Amazon Elasticsearch cluster nodes are not collocated with EMR cluster nodes, node discovery should be disabled so the ES-Hadoop connector connects only through the declared es.nodes during all operations, including reads and writes: es.nodes.wan.only should be set to true. Because partition-to-partition parallelism cannot be achieved, performance may be impacted at scale, and ES-Hadoop connector tasks should be tested for bottlenecks.
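The es.nodes.wan.only recommendation above can be sketched as an ES-Hadoop configuration fragment (the domain endpoint is a hypothetical placeholder):

```properties
# Connect only through the declared nodes; needed because Amazon ES
# data nodes are not directly discoverable from the EMR cluster.
es.nodes.wan.only = true
es.nodes = https://my-domain.us-east-1.es.amazonaws.com
es.port = 443
```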
  • 19. ES-Hadoop Connector – considerations (contd.) ES-Hadoop • Security: • For EMR Cluster in a public subnet, use IP-based access policy with Amazon Elasticsearch to whitelist EMR IPs. • For EMR Cluster in a private subnet, use Identity-based access policy with Amazon Elasticsearch and install AWS ES/Kibana Proxy on EMR nodes via bootstrap action.
  • 20. Kinesis Firehose delivery architecture with transformations S3 bucket source records data source source records Amazon Elasticsearch Service Firehose delivery stream transformed records delivery failure Data transformation function transformation failure
  • 21. Integration with AWS Lambda VPC Flow Logs CloudTrail Audit Logs S3 Access Logs ELB Access Logs CloudFront Access Logs SNS Notifications DynamoDB Streams SES Inbound Email Cognito Events Kinesis Streams CloudWatch Events & Alarms Config Rules Amazon Elasticsearch Service
  • 22. Elasticsearch works with structured JSON { "name" : { "first" : "Jon", "last" : "Smith" }, "age": 26, "city" : "palo alto", "years_employed" : 4, "interests" : [ "guitar", "sports" ] } • Documents contain fields – name/value pairs • Fields can nest • Value types include text, numerics, dates, and geo objects • Field values can be single or array • When you send documents to Elasticsearch they should arrive as JSON* *ES 5 can work with unstructured documents
  • 23. If your data is not already in structured JSON, you must transform it, creating structured JSON that Elasticsearch "understands"
  • 24. The most basic way to transform data • Run a script in Amazon EC2, Lambda, etc. that reads data from your data source, creates JSON documents, and ships to Amazon Elasticsearch Service directly
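A minimal sketch of this "basic" approach: turn one raw Apache log line into the JSON document you would ship to Amazon Elasticsearch Service. The field names mirror the deck's later log example; the final curl (commented out) uses the placeholder hostname from the later slides.

```shell
line='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

doc=$(printf '%s\n' "$line" | awk '{
  gsub(/"/, "", $6);   # strip the opening quote from the verb
  gsub(/\[/, "", $4);  # strip the bracket from the timestamp
  printf "{\"host\":\"%s\",\"timestamp\":\"%s\",\"verb\":\"%s\",\"request\":\"%s\",\"status\":%s,\"size\":%s}", $1, $4, $6, $7, $9, $10
}')

echo "$doc"
# curl -XPUT 'http://hostname/logs/access/1' -H 'Content-Type: application/json' -d "$doc"
```

In practice a script like this would loop over the log file and use the bulk API rather than one PUT per document.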
  • 25. Logstash simplifies transformation • Logstash is open-source ETL over streams. Run colocated with your application or read from your source • Many input plugins and output plugins make it easy to connect to Logstash • Grok pattern matching to pull out values and re-write Application Instance
  • 26. Elasticsearch 5 ingest processors When you index documents, you can specify a pipeline. The pipeline can have a series of processors that pre-process the data before indexing. Twenty processors are available; some are simple: { "append": { "field": "field1", "value": ["item2", "item3", "item4"] } } Others are more complex, like the Grok processor for regex with aliased expressions.
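A sketch of such a pipeline definition, combining a simple processor with a grok processor (the field names and pattern here are illustrative, not from the deck):

```json
{
  "description": "lowercase a field, then parse a log line with grok",
  "processors": [
    { "lowercase": { "field": "city" } },
    { "grok": {
        "field": "message",
        "patterns": ["%{IP:host} %{WORD:verb} %{NUMBER:status}"]
    } }
  ]
}
```

The pipeline is registered with PUT _ingest/pipeline/&lt;name&gt; and applied by adding ?pipeline=&lt;name&gt; to index requests.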
  • 27. Firehose transformations add robust delivery S3 bucket source records data source source records Amazon Elasticsearch Service Firehose delivery stream transformed records delivery failure Data transformation function transformation failure • Inline calls to Lambda for free-form changes to the underlying data • Failed transforms tracked and delivered to S3
  • 28. Firehose transformations add robust delivery intermediate Amazon S3 bucket backup S3 bucket source records data source source records Amazon Elasticsearch Service Firehose delivery stream transformed records transformed records transformation failure delivery failure • Inline calls to Lambda for free-form changes to the underlying data • Failed transforms tracked and delivered to S3
  • 29. Common transformations • Rewrite to JSON format • Decorate documents with data from other sources • Rectify dates
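The "rectify dates" transformation can be sketched in a few lines: rewrite the Apache-style dd/Mon/yyyy date from the deck's log example into the ISO yyyy-mm-dd form that Elasticsearch's default date detection understands. Pure awk; no GNU date(1) flags assumed.

```shell
rectify_date() {
  printf '%s' "$1" | awk -F'/' '{
    # position of the month abbreviation maps to the month number
    m = index("JanFebMarAprMayJunJulAugSepOctNovDec", $2);
    printf "%s-%02d-%s", $3, (m + 2) / 3, $1
  }'
}

rectify_date '01/Jul/1995'   # prints 1995-07-01
```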
  • 30. Cluster is a collection of nodes Amazon ES cluster: dedicated master nodes, plus data nodes that serve queries and updates. Shards 1–3 and their replicas are spread across Instance 1, Instance 2, and Instance 3.
  • 31. Data pattern Amazon ES cluster logs_01.21.2017 logs_01.22.2017 logs_01.23.2017 logs_01.24.2017 logs_01.25.2017 logs_01.26.2017 logs_01.27.2017 Shard 1 Shard 2 Shard 3 host ident auth timestamp etc. Each index has multiple shards Each shard contains a set of documents Each document contains a set of fields and values One index per day
  • 32. Indices and Mappings Index: product Type: cellphone documentId Fields: make (keyword), inventory (int), location (geo point) Type: reviews documentId Fields: make(keyword), review (text), rating (float), date (date) http://hostname/product/cellphone/1 http://hostname/product/reviews/1
  • 33. Physical Layout Elasticsearch Cluster /product/cellphone/1, /product/cellphone/2, /product/cellphone/3 Cluster: 3 instances, 3 primary shards, 1 replica per primary. Primaries 1–3 and their replicas are spread across Instances 1–3. Index operations on documents spread them across the shards.
  • 34. Shards - Indexes are split into multiple shards - Primary shards are defined at index creation - Defaults to 5 Primaries and 1 Replica Shard - Shards allow - Horizontal scale - Distribute and parallelize the operations to increase throughput - Create replicas to provide high availability in case of failures
  • 35. Shards (contd.) - A shard is a Lucene index - The number of replica shards can be changed on the fly, but not the primary shards - To change the number of primary shards, the index needs to be re-created - Shards are automatically balanced when the cluster is re-sized
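Since primaries are fixed at creation, "changing" them means creating a new index with the desired settings and reindexing into it. A sketch, with the cluster calls left commented out (hostname is the same placeholder as the deck's other curl examples; _reindex is available from Elasticsearch 5.x):

```shell
# Settings body for the replacement index; the shard counts are the point here.
settings='{ "settings": { "number_of_shards": 10, "number_of_replicas": 1 } }'
echo "$settings"

# The calls you would actually run against the cluster:
# curl -XPUT 'http://hostname/product_v2' -H 'Content-Type: application/json' -d "$settings"
# curl -XPOST 'http://hostname/_reindex' -H 'Content-Type: application/json' \
#   -d '{ "source": { "index": "product" }, "dest": { "index": "product_v2" } }'
```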
  • 36. 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245 Document Fields host ident auth timestamp verb request status size Field indexes 199.72.81.55 unicomp6.unicomp.net 199.120.110.21 burger.letters.com 199.120.110.21 205.212.115.106 d104.aa.net 1, 4, 8, 12, 30, 42, 58, 100... Postings Elasticsearch creates an index for each field, containing the decomposed values of those fields
  • 37. host:199.72.81.55 AND verb:GET 1, 4, 8, 12, 30, 42, 58, 100 ... Look up 199.72.81.55 GET 1, 4, 9, 50, 58, 75, 90, 103 ... AND Merge 1, 4, 58 Score 1.2, 3.7, 0.4 Sort 4, 1, 58 The index data structures support fast retrieval and merging. Scoring and sorting support best match retrieval
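The AND merge above can be imitated locally as a set intersection of two sorted postings lists, using the document IDs from the slide. Elasticsearch does this over compressed in-memory structures; comm(1) over sorted text files shows the same idea.

```shell
# Postings for host:199.72.81.55 and verb:GET (doc IDs from the slide).
# comm needs lexicographically sorted input; set intersection does not
# care about order, so we sort first and restore numeric order after.
printf '%s\n' 1 4 8 12 30 42 58 100 | sort > /tmp/host_postings.txt
printf '%s\n' 1 4 9 50 58 75 90 103 | sort > /tmp/verb_postings.txt

comm -12 /tmp/host_postings.txt /tmp/verb_postings.txt | sort -n
# prints 1, 4, 58 (one per line) – the merged result from the slide
```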
  • 38. Index and Document Command Examples - Create an index called product: $ curl -XPUT 'http://hostname/product/' - Get the list of indices: $ curl 'http://hostname/_cat/indices' health status index uuid pri rep docs.count docs.deleted store.size pri.store.size yellow open product 95SQ4TS 5 1 0 0 260b 260b
  • 39. Index and Document Command Examples .. - Indexing a document - Retrieving a document $ curl -XPUT 'http://hostname/product/cellphone/1' -H 'Content-Type: application/json' -d' { "make": "Apple", "inventory": 100 }' $ curl -XGET 'http://hostname/product/cellphone/1' { "_index" : "product", "_type" : "cellphone", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "make": "Apple", "inventory": 100 } }
  • 40. What happens at Index Operation http PUT – http://hostname/product/cellphone/1 Elasticsearch Cluster (Instance 1, Instance 2, Instance 3) 1. Indexing operation 2. The target shard is determined by hashing the document ID 3. The current node forwards the document to the node holding the primary shard 4. The primary shard ensures all replica shards replay the same indexing operation
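Step 2 can be made concrete with a stand-in hash. Real Elasticsearch routes with murmur3 over the _routing value (the document ID by default) modulo the number of primary shards; cksum is used here only to illustrate the modulo step.

```shell
num_primaries=3
doc_id='1'

# Hash the ID (cksum prints "<crc> <byte-count>"; take the CRC) and reduce
# it modulo the primary shard count to pick the target shard.
hash=$(printf '%s' "$doc_id" | cksum | awk '{ print $1 }')
shard=$(( hash % num_primaries ))

echo "document $doc_id -> shard $shard"
```

Because the routing is a pure function of the ID, a GET for the same document lands on the same shard without any lookup table.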
  • 41. Mappings 1. Mappings are used to define types of documents 2. Mappings define the various fields in a document 3. Mapping types – 1. Core: text or keyword, numeric, date, boolean 2. Arrays and multi-fields: arrays – "tags" : ["blue", "red"]; multi-fields – index the same data with different settings 3. Pre-defined fields: _ttl, _size, _uid, _id, _type, _index, _all, _source
  • 42. Mapping command examples Create an index called product with a mapping, cellphone, and a field, make, of type text: curl -XPUT 'http://hostname/product' -H 'Content-Type: application/json' -d' { "mappings": { "cellphone": { "properties": { "make": { "type": "text" } } } } }'
  • 43. Mapping command examples Add a new mapping, reviews, with fields review, as text, and rating, as integer, to the existing index, product: curl -XPUT 'http://hostname/product/_mapping/reviews' -H 'Content-Type: application/json' -d' { "properties": { "review": { "type": "text" }, "rating": { "type": "integer" } } }'
  • 44. Mapping command examples Add a new field, inventory, as integer, to the existing mapping, cellphone, in index product: curl -XPUT 'http://hostname/product/_mapping/cellphone' -H 'Content-Type: application/json' -d' { "properties": { "inventory": { "type": "integer" } } }'