This document provides an overview of Amazon Redshift data warehousing capabilities. It discusses how Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. It describes how to get started with Redshift, provision clusters, model data, load and query data, and monitor performance. It also provides an example of how MakerBot uses Redshift as part of its "Dream Stack" along with other AWS services for analytics.
3. Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon Redshift
a lot faster
a lot simpler
a lot cheaper
4. The Amazon Redshift view of data warehousing
Enterprise: 10x cheaper; easy to provision; higher DBA productivity
Big data: 10x faster; no programming; easily leverage BI tools, Hadoop, machine learning, streaming
SaaS: analysis inline with process flows; pay as you go, grow as you need; managed availability and disaster recovery
6. Amazon Redshift architecture
Leader node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion/Backup; Backup; Restore]
7. Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
8. Benefit #1: Amazon Redshift is fast
Dense Storage DS2 (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation DS1!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan ’16)
Reduce amount of time to commit data
Performance improvement: 35%
9. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), DW1.XL single node
                      Price per hour    Effective annual price per TB (compressed)
On-demand             $0.850            $3,725
1 year reservation    $0.500            $2,190
3 year reservation    $0.228            $999

DC1 (SSD), DW2.L single node
                      Price per hour    Effective annual price per TB (compressed)
On-demand             $0.250            $13,690
1 year reservation    $0.161            $8,795
3 year reservation    $0.100            $5,500

Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
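
The pricing rule above (number of nodes x price per hour, with no charge for the leader node) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 8,760-hour year and the 2 TB of compressed storage per DS2 node are assumptions, and the rates are the ones listed on this slide, which may not match current pricing.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, cluster runs continuously


def cluster_cost_per_hour(compute_nodes: int, price_per_node_hour: float) -> float:
    """Hourly cost: compute nodes x price/hour (the leader node is not charged)."""
    return compute_nodes * price_per_node_hour


def annual_price_per_tb(price_per_node_hour: float, tb_per_node: float) -> float:
    """Effective annual price per TB for a node that runs all year."""
    return price_per_node_hour * HOURS_PER_YEAR / tb_per_node


if __name__ == "__main__":
    # 4 compute nodes at the 1-year reserved DS2 rate from the slide ($0.500/hour)
    print(cluster_cost_per_hour(4, 0.500))          # 2.0
    # Same rate, assuming 2 TB of compressed storage per node
    print(round(annual_price_per_tb(0.500, 2.0)))   # 2190
```

Under these assumptions the 1-year reserved figure works out to the $2,190 per TB shown in the table; the other rows on the slide appear to be rounded.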
10. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to Amazon S3
Continuous and incremental backups across regions
Streaming restore
[Diagram labels: Amazon S3 (x2); Region 1; Region 2]
11. Benefit #3: Amazon Redshift is fully managed
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
12. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
All blocks on disks and in S3 encrypted
Block key, cluster key, master key (AES-256)
On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
[Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; 10 GigE (HPC); Ingestion, Backup & Restore]
13. Benefit #5: We innovate quickly
Well over 100 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
14. Benefit #6: Amazon Redshift has a large ecosystem
Data integration; Systems integrators; Business intelligence
20. Resize
Resize while remaining online
Provision a new cluster in the background
Copy data in parallel from node to node
You are only charged for the source cluster
22. 3 Important Details…
Column Encoding
- Applied automatically on first data load
- Ensure correct encoding is used
- Periodically revisit encodings in case of change
Data Distribution
- Even, Key Based, or Replicated distribution of data is available
- Focus on colocation of data to limit network transfer
- View network transfer information in Explain Plan
Data Sorting
- Compound (default) Sort Keys for predictable query patterns
- Interleaved Sort Keys for tables that can be queried in any way
Unsorted table
MIN: 01-JUNE-2013, MAX: 20-JUNE-2013
MIN: 08-JUNE-2013, MAX: 30-JUNE-2013
MIN: 12-JUNE-2013, MAX: 20-JUNE-2013
MIN: 02-JUNE-2013, MAX: 25-JUNE-2013
Sorted by date
MIN: 01-JUNE-2013, MAX: 06-JUNE-2013
MIN: 07-JUNE-2013, MAX: 12-JUNE-2013
MIN: 13-JUNE-2013, MAX: 18-JUNE-2013
MIN: 19-JUNE-2013, MAX: 24-JUNE-2013
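
The MIN/MAX ranges listed above suggest why sorting helps: if each block records the minimum and maximum of the sort column, a range query can skip every block whose range cannot match. The following Python sketch illustrates that pruning idea using the ranges from the figure. The query range (9 to 11 June 2013) and the function name are illustrative assumptions, and this is not Redshift's actual implementation.

```python
from datetime import date

# (min, max) of the date column per block, taken from the figure
UNSORTED = [
    (date(2013, 6, 1), date(2013, 6, 20)),
    (date(2013, 6, 8), date(2013, 6, 30)),
    (date(2013, 6, 12), date(2013, 6, 20)),
    (date(2013, 6, 2), date(2013, 6, 25)),
]
SORTED = [
    (date(2013, 6, 1), date(2013, 6, 6)),
    (date(2013, 6, 7), date(2013, 6, 12)),
    (date(2013, 6, 13), date(2013, 6, 18)),
    (date(2013, 6, 19), date(2013, 6, 24)),
]


def blocks_to_scan(blocks, lo, hi):
    """Return indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(blocks) if bmin <= hi and bmax >= lo]


if __name__ == "__main__":
    lo, hi = date(2013, 6, 9), date(2013, 6, 11)  # hypothetical query range
    print(blocks_to_scan(UNSORTED, lo, hi))  # [0, 1, 3]
    print(blocks_to_scan(SORTED, lo, hi))    # [1]
```

For this example range, three of the four unsorted blocks have to be read, while the sorted layout needs only one.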
24. Even: Data is distributed evenly amongst all Compute Nodes on the basis of the …
Key Based: Data is distributed to Compute Nodes on the basis of the provided distribution key column from a given record
All: Data is replicated onto each Compute Node
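
The three distribution styles can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not how Redshift is implemented: `distribute` is a hypothetical helper, EVEN is modelled as simple round-robin, CRC32 stands in for whatever hash the service uses internally, and rows are placed on nodes rather than on slices.

```python
import zlib


def distribute(rows, style, num_nodes, key=None):
    """Assign rows (dicts) to nodes under a given distribution style."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin: rows are spread regardless of their values
            nodes[i % num_nodes].append(row)
        elif style == "KEY":
            # Stand-in hash: equal key values always land on the same node
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            nodes[h % num_nodes].append(row)
        elif style == "ALL":
            # Every node receives a full copy of the table
            for node in nodes:
                node.append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return nodes


if __name__ == "__main__":
    rows = [{"customer_id": i % 5, "amount": i} for i in range(20)]
    print([len(n) for n in distribute(rows, "EVEN", 4)])  # [5, 5, 5, 5]
    print([len(n) for n in distribute(rows, "KEY", 4, key="customer_id")])
    print([len(n) for n in distribute(rows, "ALL", 4)])   # [20, 20, 20, 20]
```

With KEY distribution, two tables distributed on the same join column keep matching rows on the same node, which is what limits network transfer during joins.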
25. When to use which type of distribution?
Key Based
- Large fact tables
- Large dimension tables
All
- Medium dimension tables (1K–2M)
Even
- Tables with no joins or group bys
- Small dimension tables (<1000)
26. Choosing a good distribution key
• High cardinality
  • Number of unique values in the distribution key is significantly larger than the number of slices in the cluster
• Low skew (uniform distribution)
  • Each unique value in the distribution key is associated with the same number of records in the table
• High entropy
  • The unique values in the distribution key vary from each other greatly
  • Think GUIDs, not sequential IDs
• Frequently joined to other tables
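
The cardinality and skew criteria above can be checked on a sample of a candidate column before committing to a distribution key. The Python sketch below is illustrative only: `key_profile` and its output fields are hypothetical names, and the skew ratio (largest group divided by the average group size) is one simple measure among several possible ones.

```python
from collections import Counter


def key_profile(values, num_slices):
    """Summarize how suitable a column sample looks as a distribution key."""
    if not values:
        raise ValueError("need at least one value")
    counts = Counter(values)
    cardinality = len(counts)
    average_rows_per_value = len(values) / cardinality
    return {
        "cardinality": cardinality,
        # Should be well above 1: many more distinct values than slices
        "values_per_slice": cardinality / num_slices,
        # 1.0 means perfectly uniform; larger means more skew
        "skew_ratio": max(counts.values()) / average_rows_per_value,
    }


if __name__ == "__main__":
    uniform = list(range(100))              # e.g. unique order IDs
    skewed = ["a"] * 8 + ["b"] + ["c"]      # one value dominates
    print(key_profile(uniform, num_slices=4))
    print(key_profile(skewed, num_slices=4))
```

In this example the uniform sample has 25 distinct values per slice and no skew, while the skewed sample has fewer distinct values than slices and one value holding most of the rows, so it would be a poor candidate.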
28. Types of Sort Keys
• Compound (default)
  • Good for known query patterns
  • Contains up to 400 columns
• Interleaved
  • Good for unknown query patterns
  • Can contain up to 8 columns
  • Must be maintained during Vacuum phase
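
As a rough illustration of the two sort key types, the Python sketch below builds the sort key clause of a CREATE TABLE statement and enforces the column limits listed above. The helper name and the column names in the example are hypothetical; the clause text follows the `COMPOUND SORTKEY (...)` / `INTERLEAVED SORTKEY (...)` form of Redshift DDL.

```python
# Column limits as stated on the slide
MAX_COLUMNS = {"COMPOUND": 400, "INTERLEAVED": 8}


def sortkey_clause(columns, style="COMPOUND"):
    """Build a sort key clause, rejecting unknown styles and too many columns."""
    style = style.upper()
    if style not in MAX_COLUMNS:
        raise ValueError(f"unknown sort key style: {style}")
    if not columns:
        raise ValueError("at least one column is required")
    if len(columns) > MAX_COLUMNS[style]:
        raise ValueError(
            f"{style} sort keys allow at most {MAX_COLUMNS[style]} columns"
        )
    return f"{style} SORTKEY ({', '.join(columns)})"


if __name__ == "__main__":
    # Known query pattern: filter by date first, then by customer
    print(sortkey_clause(["event_date", "customer_id"]))
    # Unknown query pattern: give each column equal weight
    print(sortkey_clause(["event_date", "customer_id", "region"], "interleaved"))
```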
41. MakerBot.com
• MakerBot, a subsidiary of Stratasys Ltd. (Nasdaq: SSYS), is leading the next industrial revolution by setting the standards in reliable and affordable desktop 3D-Printing
• Founded in 2009, MakerBot sells desktop 3D-Printers to innovative and industry-leading customers worldwide, including engineers, architects, designers, educators and consumers
• Has the largest installed base and market share in the desktop 3D-Printing industry
• Runs Thingiverse.com, the largest 3D-Printing community
• Makes 3D-Printing easy and accessible for everyone
Thingiverse.com
The 50 Most Influential Gadgets of All Time
42. Richard L Williams
~20 years in Data Warehousing in HK & USA
• discovered unknown author Ralph Kimball
• used Cognos (shipped with VB 4.0) & RedBrick
• eCommerce, Retail, Insurance, Pharma
• Email/Lifecycle Marketing, Campaign Mgt, Actuarial
• Using AWS: 1800-Flowers, BMS, Janssen (J&J), MakerBot
43. Ecosystem – where’s the data?
Largest table ~130m rows
But most in 100k – 1m range
Tables slowest to load:
- Salesforce
- 100-200 columns “wide”
SQL tools:
- DBVisualizer
- SQLWorkBench/J
- Aginity (Windows)
MS SQL Server on EC2
MySQL on RDS
Cloud apps
Internal websites
Desktop software
Firmware (on printer) software
44. Dream Stack
Redshift: Addresses all the issues in DW
- can even do unstructured data..!
Matillion: Works with Redshift, and fast
- Informatica, Snaplogic, Talend do not work with MPP
- Hadoop/EMR not necessary
Tableau: Power to the users
Python: Intuitive, data-types, Boto3, libraries, widely used
45. So what..?
Personally: Career Transformative
- accurately predict effort and time
Manager: very happy
- Quickly build
- Quickly iterate
- “No Limits” –> Roadmap to the Vision
Company: becoming strategic
- Competitive Advantage
47. Demo - Master Class
Deep Copy
Deep Insert “Waves”
S3 “Trigger” files
Grants on Schemas to Groups
Groups are “roles”, add Users
Revoke on Schema [Public]
Matillion working Schema
Deltas
Lookups
…
Scripts
Python + Boto(3)
ETL Matillion
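
The schema-permission items in the list above (grants on schemas to groups, groups used as "roles" with users added, revoke on schema for public) could be scripted roughly as follows. This Python sketch only generates the SQL text; it does not connect to a cluster. The function name, the schema, group and user names, and the read-only choice of privileges are all assumptions, and "Revoke on Schema [Public]" is interpreted here as revoking the schema's privileges from PUBLIC.

```python
import re

# Only allow simple lowercase identifiers, since names are interpolated into SQL
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def _check(name):
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def schema_grant_statements(schema, group, users):
    """Return SQL statements that give a group read access to one schema."""
    schema, group = _check(schema), _check(group)
    users = [_check(u) for u in users]
    statements = [f"CREATE GROUP {group};"]
    if users:
        # The group acts as a "role": membership decides who gets access
        statements.append(f"ALTER GROUP {group} ADD USER {', '.join(users)};")
    statements += [
        f"REVOKE ALL ON SCHEMA {schema} FROM PUBLIC;",
        f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO GROUP {group};",
    ]
    return statements


if __name__ == "__main__":
    for stmt in schema_grant_statements("reporting", "analysts", ["alice", "bob"]):
        print(stmt)
```

Generating the statements as plain strings keeps the sketch independent of any particular database driver; the output can be reviewed and then run through whichever SQL tool is in use.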
48. Future
I wish I could describe these in more detail but they are the company’s Competitive Advantage
richard.williams@makerbot.com