Database and Analytics on the AWS Cloud - AWS Innovate Toronto

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Antoine Généreux, AWS Solutions Architect
May 10, 2017
Databases and Analytics on
the AWS Cloud

What to expect from the session
• Challenges and architectural principles
• How to simplify big data processing
• What database technologies should you use?
• Why and When?
• How?
• Architectural patterns and Customer Examples

Key Ideas
• Data is your organization’s most valuable resource
• All data has the potential to be big data
• Databases are no longer the center of analytics*
*but they play a critical role!

Evolution of Analytics
Batch analytics
Real-time analytics
Predictive/Adaptive analytics

Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Amazon Kinesis
Streams app
Lambda Amazon ML
SQS
ElastiCache
DynamoDB
Streams
Amazon Elasticsearch
Service
Amazon Kinesis
Analytics
A plethora of tools

Challenges
• Is there a reference architecture?
• What tools should I use?
• How?
• Why?

Architectural Principles
Build decoupled systems
• Data → Store → Process → Store →Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost

Simplify Big Data Processing
Time to answer (Latency)
Throughput
Cost
Ingest/
Collect
Consume/
visualize
Store Process/
analyze
Data
1 4
0 9
5
Answers &
insights

COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
In-memory data structures
Database records
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Search documents
Log files
Messaging
Message MESSAGES
Messaging
Messages
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
Data streams
Transactions
Files
Events
Types of Data

Hot Warm Cold
Volume MB–GB GB–TB PB–EB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–high High Very high
Request rate Very high High Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Data Temperature

STORE
Database SQL & NoSQL databases
Search Search engines
File store File systems
Queue Message queues
Stream
storage
Pub/sub message queues
In-memory Caches, data structure servers
Types of Data Stores
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT

In-memory
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Apache Kafka
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
• Managed message queue service
Apache Kafka
• High throughput distributed streaming
platform
Amazon Kinesis Streams
• Managed stream storage + processing
Amazon Kinesis Firehose
• Managed data delivery
Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
Applications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Search
File store
LoggingTransport
Messaging
Message MESSAGES
Messaging
Message
Stream
Message & Stream Storage

Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Parallel consumption
Streaming MapReduce
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
Producer 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
Producer 2
Producer 3
Producer n
Key = violet
DynamoDB stream Amazon Kinesis stream Kafka topic

In-memory
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Search
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Streams
Hot
Stream
Amazon SQS
Message
Amazon S3
File
LoggingIoTApplicationsTransportMessaging
File Storage
Amazon S3

Why Is Amazon S3 Good for Analytics?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters can use the same data
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Secure – SSL, client/server-side encryption at rest
• Low cost

In-memory
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Search
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Streams
Hot
Stream
Amazon SQS
Message
Amazon S3
File
File Storage

Database Access
Data Tier
Anti-Pattern

Best Practice: Use the Right Tool for the Job
Search
Service
In-memory
Amazon ElastiCache
Redis
Memcached
SQL
Amazon Aurora
Amazon RDS
MySQL
PostgreSQL
Oracle
SQL Server
NoSQL
Amazon DynamoDB
Cassandra
HBase
MongoDB

Microsecond Real-Time Performance
Fully Managed
Redis Automatic Failover = NoOps
Enhanced Redis Engine
No Cross-AZ Data Transfer Costs
Easy to Deploy, Use and Monitor
Open-Source Compatible
Amazon
ElastiCache

Fully managed NoSQL
Document / Key-Value store
Single-digit millisecond latency
Massive and seamless scalability
Event-driven programming
Amazon
DynamoDB

• Fully managed, highly available: handles all software management,
fault tolerant, replication across multi-AZs within a region
• DynamoDB API compatible: seamlessly caches DynamoDB API
calls, no application re-writes required
• Write-through: DAX handles caching for writes
• Flexible: Configure DAX for one table or many
• Scalable: scales-out to any workload with up to 10 read replicas
• Manageability: fully integrated AWS service: Amazon CloudWatch,
Tagging for DynamoDB, AWS Console
• Security: Amazon VPC, AWS IAM, AWS CloudTrail, AWS
Organizations
Features
DynamoDB Accelerator (DAX)
DynamoDB
Your Applications
DynamoDB Accelerator
Preview

Amazon
RDS
Automated backups (with point-in-time recovery)
Cross-region snapshot copies
Automated patch management
Automated Multi-AZ replication
Scale up / Scale down instance types
Scalable storage on demand
“License included” and BYOL models

Amazon Aurora: MySQL andPostgreSQL-compatible
SQL
Trans-
actions
AZ 1 AZ 2 AZ 3
Caching
Amazon
S3
• 5x faster than MySQL on
same hardware
• SysBench: 100 K writes/sec
and 500 K reads/sec
• Designed for 99.99%
availability
• 6-way replicated storage
across 3 AZs
• Scale to 64 TB and 15 Read
Replicas

Distributed search and analytics engine
Managed service using Elasticsearch and Kibana
Fully managed – zero admin
Highly available and reliable
Tightly integrated with other AWS services
Amazon
Elasticsearch
Service

Which Data Store Should I Use?
Data structure
Fixed schema → SQL, NOSQL
Schema-Free (JSON) → NoSQL, Search
(Key, Value) → In-Memory, NoSQL
Access patterns → Store data in the format you will access it
Put/Get (Key, Value) → In-Memory, NoSQL
Simple relationships (1:N, M:N) → NoSQL
Complex relationships (Multi-table joins, transactional) → SQL
Faceting, Search → Search
Data characteristics → Hot, warm, cold
Cost → Right cost

Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena,
Amazon EMR (Presto, Spark)
Message
Takes milliseconds to seconds
Example: Message processing
Amazon SQS applications on Amazon EC2
Stream
Takes milliseconds to seconds
Example: Fraud alerts, 1 minute metrics
Amazon EMR (Spark Streaming, Flink), Amazon Kinesis
Analytics, KCL, Storm, AWS Lambda
Machine Learning
Takes milliseconds to minutes
Example: Fraud detection, forecast demand
Amazon ML, Amazon EMR (Spark ML)
Analytics Types & Frameworks PROCESS / ANALYZE
Amazon
Machine Learning
MLMessage
Amazon SQS apps
Amazon EC2
Streaming
Amazon Kinesis
Analytics
KCL
apps
AWS Lambda
Stream
Amazon EC2
Amazon EMR
Fast
Amazon Redshift
+ Redshift Spectrum
Presto
Amazon
EMR
FastSlow
Amazon Athena
BatchInteractive

What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
ETLSTORE PROCESS / ANALYZE
Data Integration Partners
Reduce the effort to move, cleanse, synchronize,
manage, and automatize data related processes.
AWS Glue
AWS Glue is a fully managed ETL service that makes
it easy to understand your data sources, prepare the
data, and move it reliably between data stores
Preview

Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
Applications & API
Analysis and visualization
Notebooks
IDE
Business
users
Data scientist,
developers
Data Consumption

Design Patterns and
Customer Examples

Primitive: Decoupled Data Bus
Storage decoupled from processing
Multiple stages
Store Process Store ProcessData Answers
process
store

Primitive: Pub/Sub
Amazon
Kinesis
AWS
Lambda
Data
Amazon Kinesis
Connector
Library
process
store
Apache
Spark
Parallel stream consumption/processing

process
store
Primitive: Materialized Views
Amazon EMR
Amazon
Kinesis
AWS
Lambda
Amazon S3
Data
Amazon
DynamoDB
AnswersSpark
Streaming
Amazon Kinesis
Connector
Library
Spark
SQL
Analysis framework reads from or writes to multiple data stores

Amazon EMR
Real-time Analytics
Amazon
Kinesis
AWS Lambda
Spark
Streaming
Amazon
SNS
Amazon
ML
Notifications
Amazon
ElastiCache
(Redis)
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Alerts
App state
Real-time prediction
KPI
process
store
Stream
KCL app
Amazon
S3
Raw
Append-
Only
Log
Amazon
Kinesis
Fan out
Amazon Kinesis
Analytics

Ingest/
Collect
Consume/
visualize
Store
Process/
analyze
Data
1 4
0 9
5
Amazon S3
Data Lake
Amazon EMR
Amazon
Kinesis
Amazon Redshift
Answers &
insights
Hot HomesUsers
Properties
Agents
User Profile
Recommendation
Hot Homes
Similar Homes
Agent Follow-up
Agent Scorecard
Marketing
A/B Testing
Real Time Data
…
Amazon
DynamoDB
BI / Reporting

Interactive &
Batch
Analytics
Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
ML
process
store
Consume
Amazon Redshift
Amazon EMR
Presto
Spark
Batch
Interactive
Batch prediction
Real-time prediction
Stream Amazon
Kinesis
Firehose
Amazon Athena
Files
Amazon Kinesis
Analytics

Data Marts
(Amazon
Redshift)
Query Cluster
(EMR)
Query Cluster
(EMR)
Auto Scaling
EC2
Analytics
App
Normalization
ETL Clusters
(EMR)
Batch Analytic
Clusters
(EMR)
Ad Hoc Query
Cluster (EMR)
Auto Scaling
EC2
Analytics
App
Users Data
Providers
Auto Scaling
EC2
Data
Ingestion
Services
Optimization
ETL Clusters
(EMR)
Shared Metastore
(RDS)
Query Optimized
(S3)
Auto Scaling EC2
Data
Catalog
& Lineage
Services
Reference Data
(RDS)
Shared Data Services
Auto Scaling
EC2
Cluster Mgt
& Workflow
Services
Source of
Truth (S3)
>5 PB, up to 75 billion events per day

Data Lake
https://aws.amazon.com/answers/big-data/data-lake-solution/

Amazon SQS apps
Streaming
KCL
apps
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon
Machine Learning
Presto
Amazon
EMR
Service
Apache Kafka
Amazon SQS
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
Amazon DynamoDB
Streams
HotHotWarm
FastSlowFast
SearchSQLNoSQLCacheFileMessageStream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
ETL
BatchMessageInteractiveStreamML
Amazon EMR
AWS Lambda
Amazon Kinesis
Analytics
Amazon Athena

Thank you!
Antoine Généreux, AWS Solutions Architect

Database and Analytics on the AWS Cloud - AWS Innovate Toronto

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Database and Analytics on the AWS Cloud - AWS Innovate Toronto

Similar a Database and Analytics on the AWS Cloud - AWS Innovate Toronto (20)

Más de Amazon Web Services

Más de Amazon Web Services (20)

Último

Último (20)

Database and Analytics on the AWS Cloud - AWS Innovate Toronto