3. AnalyzeStore
Amazon
Glacier
Amazon S3
Amazon
DynamoDB
Amazon RDS,
Amazon Aurora
AWS big data portfolio
AWS Data Pipeline
Amazon
CloudSearch
Amazon EMR Amazon EC2
Amazon
Redshift
Amazon
Machine
Learning
Amazon
Elasticsearch
Service
AWS Database
Migration Service
Amazon
QuickSight
Amazon
Kinesis
Firehose
AWS Import/Export
AWS Direct
Connect
Collect
Amazon Kinesis
Streams
4. Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
5. The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools,
Hadoop, machine learning,
streaming
Analysis inline with process
flows
Pay as you go, grow as you
need
Managed availability and
disaster recovery
Enterprise Big data SaaS
6. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ Enterprise Data Warehouse Q4 ’15
10. Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
11. Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of autopatched improvements
12. Benefit #1: Amazon Redshift is fast
New Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan ’16)
Reduce amount of time to commit data
Performance improvement: 35%
13. Benefit #2: Amazon Redshift is inexpensive
Ds2 (HDD)
Price per hour for
DW1.XL single node
Effective annual
price per TB compressed
On-demand $ 0.850 $ 3,725
1 year reservation $ 0.500 $ 2,190
3 year reservation $ 0.228 $ 999
Dc1 (SSD)
Price per hour for
DW2.L single node
Effective annual
price per TB compressed
On-demand $ 0.250 $ 13,690
1 year reservation $ 0.161 $ 8,795
3 year reservation $ 0.100 $ 5,500
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
14. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to Amazon S3
Continuous and incremental backups
across regions
Streaming restore
Amazon S3
Amazon S3
Region 1
Region 2
15. Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
16. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward security
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks and in S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
17. Benefit #5: We innovate quickly
Well over 100 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress
(8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy,
Distributed Tables, Audit
Logging/CloudTrail, Concurrency, Resize
Perf., Approximate Count Distinct, SNS
Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor
Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and
diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch
size support for single node clusters, new
system tables with commit stats,
row_number(), strotol() and query
termination (2/13)
Resize progress indicator & Cluster
Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE
ciphers (4/22)
3 new regex features, Unload to single
file, FedRAMP(5/6)
Rename Cluster (6/2)
Copy from multiple regions,
percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
18. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine learning
• Data science
19. Benefit #7: Amazon Redshift has a large ecosystem
Data integration Systems integratorsBusiness intelligence
20. Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon
Redshift
Amazon Kinesis
Amazon ML
Data Pipeline
CloudSearch
Amazon
Mobile
Analytics
22. NTT Docomo: Japan’s largest mobile service
provider
68 million customers
Tens of TBs per day of data across a
mobile network
6 PB of total data (uncompressed)
Data science for marketing
operations, logistics, and so on
Greenplum on-premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
23. Nasdaq: powering 100 marketplaces in 50
countries
Orders, quotes, trade executions,
market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity,
surveillance, billing, and so on
Microsoft SQL Server on-premises
Expensive legacy DW
($1.16 M/yr.)
Limited capacity (1 yr. of data
online)
Needed lower TCO
Must satisfy multiple security
and regulatory requirements
Similar performance
30. Resize
• Resize while remaining online
• Provision a new cluster in the
background
• Copy data in parallel from node to
node
• Only charged for source cluster
34. Single column
• Table is sorted by 1 column
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
[ SORTKEY ( date ) ]
• Best for:
• Queries that use first column (that is, date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
35. Compound
• Table is sorted by first column, then second column, and so on
Date Region Country
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Korea
2-JUN-2015 Asia Singapore
2-JUN-2015 Europe Germany
3-JUN-2015 Asia Hong Kong
3-JUN-2015 Asia Korea
[ SORTKEY COMPOUND ( date, region, country) ]
• Best for:
• Queries that use first column as primary filter, then other columns
• Can speed up joins and group bys
• Slower to VACUUM
36. Interleaved
• Equal weight is given to each column
Date Region Country
2-JUN-2015 Africa Zaire
3-JUN-2015 Asia Singapore
2-JUN-2015 Asia Korea
2-JUN-2015 Europe Germany
3-JUN-2015 Asia Hong Kong
2-JUN-2015 Asia Korea
[ SORTKEY INTERLEAVED ( date, region, country) ]
• Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter
• Slowest to VACUUM
38. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
2
3
4
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
Round
Robin
DISTSTYLE EVEN
39. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
DISTSTYLE KEY
40. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
139 M Peter Black
446 M Pat Partridge
164 M Brian Snail
209 M James White
ID Gender Name
292 F Jane Jones
658 F Sarah Cyan
306 F Lisa Green
DISTSTYLE KEY
41. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
All
DISTSTYLE ALL
42. • KEY
• Large fact tables
• Large dimension tables
• ALL
• Medium dimension tables (1K–2M)
• EVEN
• Tables with no joins or group bys
• Small dimension tables (<1000)
64. In production
Critical
Yle Analytics Pipeline – First idea
Elastic
Beanstalk
Analytics
Collector
Web events
~50 mill./day
RedShift
Recommendation
service
68. Pilot
In production
Critical
Responsible: JO
Testing
Yle Analytics Pipeline - Current situation
Elastic
Beanstalk
Kinesis Streams
Analytics
Collector
Web events
~50 mill./day
Kinesis Firehose
RedShift
Lambda S3 EMR
Daily
archiving
Areena
Recommendation
Delay: 10-15 minutes
Note: Altogether tens of source application types with in total 120 different labels (most of them common). Labels may change.
Small batch sizes
Compression and
larger file sizes.
~60 Gb per day
CloudWatch
Dashboards
Small sized files
Delay: minute?
69. Yle Analytics Pipeline - Future scenario
In production
Critical
Responsible: JO
Elastic
Beanstalk Kinesis Streams
Analytics
Collector
RedShift
Lambda S3
Delay:
15 minutes to hours
Kinesis Firehose
Aggregated
analysis
(hours)
Ad hoc
analysis
Validation
Parsing
Aggregating?
CloudWatch
Web events
~50 mill./day
Delay: minutes?
Delay: seconds DynamoDBEMR
Specific
real time
analysis
(seconds)
Selected analysis
EMR
Ad hoc
analysis w/
random
tools
Batch
analysis
Raw data
queries
APIs,buffering
Additional data
Daily/weekly
division into
paths?
ElasticSearch?
Lambda
EC2 Container
Service?
What kind of architecture?
Data vault?
Schema versioning
Remove old information