© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Sara Mitchell
Solutions Architect, Amazon Web Services
Mikael Hedberg
CTO Apsis International and formerly CTO Innometrics
Build a Data Warehouse on AWS
Amazon Redshift: Modern Data Warehouse
Fast, scalable, fully managed data warehouse at 1/10th the cost of traditional data warehouses
Query data across your Amazon Redshift data warehouse and your Amazon S3
data lake with the Redshift Spectrum feature
Massively parallel, scales from gigabytes to exabytes
Fast
Delivers fast results for short queries, complex queries, and mixed workloads.
Cost effective
Start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year.
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance.
Data lake integration
Query open file formats in Amazon S3 and optimized data formats on direct-attached disks.
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing architecture
Streaming backup & restore from Amazon S3
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, back up, restore
• 2, 16, or 32 slices
Amazon Redshift Cluster
JDBC/ODBC
Leader node
Compute nodes
Efficient data loads
Streaming, backup, & restore
Amazon S3
Fast – Result Caching: Subsecond Repeat Queries
Analytics and BI / dashboard tools query Amazon Redshift:
1. Queries go to the leader node.
2. If the cache contains the query result, it is returned with no processing.
3. If the query is not in the cache, it is executed and the result is cached.
In-memory leader node cache, resulting in subsecond response.
Transparent – it just works. Skip WLM, skip processing, & skip optimization.
Cache persists across sessions.
Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries.
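The three-step flow above can be sketched in a few lines. This is an illustrative model of the idea, not Redshift's actual implementation; all names below are hypothetical.

```python
# Illustrative model of leader-node result caching (not Redshift internals):
# an identical repeat query is answered from cache with no processing.
class ResultCache:
    def __init__(self):
        self._cache = {}  # query text -> cached result
        self.hits = 0
        self.misses = 0

    def run(self, query, execute):
        """Return a cached result if present; otherwise execute and cache."""
        if query in self._cache:          # step 2: hit, no processing
            self.hits += 1
            return self._cache[query]
        self.misses += 1                  # step 3: execute, then cache
        result = execute(query)
        self._cache[query] = result
        return result

executions = []

def execute(query):
    executions.append(query)  # stands in for real query processing
    return "result-for:" + query

cache = ResultCache()
first = cache.run("select count(*) from sales", execute)
repeat = cache.run("select count(*) from sales", execute)
print(first == repeat, cache.hits, cache.misses, len(executions))  # True 1 1 1
```

Note the query text is the cache key, which is why the slide calls repeat queries "non-repetitive work avoided": the second identical query never reaches the compute nodes.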
Fast – Short Query Acceleration
Express Lane for Short Queries
A machine learning classifier predicts the runtime of queries.
Short queries are routed to an express queue for faster processing.
Higher throughput, less variability.
Customized for your workload.
Transparent – It just works!
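The routing idea can be sketched as below. Redshift uses a trained ML model to predict runtime; the heuristic predictor and threshold here are invented stand-ins for illustration only.

```python
# Toy sketch of Short Query Acceleration routing: predicted-short queries
# go to an express queue. The predictor and threshold are hypothetical.
EXPRESS_THRESHOLD_SECONDS = 5.0

def predict_runtime_seconds(query):
    # Stand-in predictor: join-heavy, longer queries predicted slower.
    joins = query.lower().count(" join ")
    return 0.5 + 2.0 * joins + len(query) / 1000.0

def route(query):
    """Send predicted-short queries to the express queue."""
    if predict_runtime_seconds(query) < EXPRESS_THRESHOLD_SECONDS:
        return "express"
    return "default"

short_query = "select count(*) from events"
long_query = ("select * from a join b on a.id = b.id "
              "join c on b.id = c.id join d on c.id = d.id")
print(route(short_query), route(long_query))  # express default
```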
Fast – Dense Compute Node – DC2
2x performance
@ same price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end
reporting time with Amazon Redshift
DC2 nodes compared with DC1.”
- Bradley Todd,
Technical Architect, Liberty Mutual
NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell)
Apsis Profile Cloud
Segmentation
Originally designed for pure real-time
• Users: Desktop Browser, Mobile Apps
• ProductView
• API
• NoSQL Storage
• Real-time data pipeline
• Applications
Data Warehousing Requirements
Use-case:
• Ability to query 'raw', unaggregated data sets
• Ad-hoc queries with user-defined complex dimensions, not known ahead of time, and with
timing logic that doesn’t allow pre-aggregation
• Up to ~30 billion datapoints per customer
Redshift Solution
Amazon Redshift
Real-time data
pipeline
BI API
BI User
ETL from JSON documents -> relational Redshift model
SQL interface
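The ETL step in the solution above (JSON documents into a relational Redshift model) can be sketched like this. The document shape and field names (profile_id, event, ts, props) are hypothetical.

```python
# Minimal sketch of JSON -> relational flattening ahead of a Redshift load.
# Field names are invented for illustration.
import json

def flatten(doc):
    """Turn one JSON event document into one flat row of scalar columns."""
    row = {
        "profile_id": doc["profile_id"],
        "event": doc["event"],
        "ts": doc["ts"],
    }
    for key, value in doc.get("props", {}).items():
        row["prop_" + key] = value  # promote nested props to columns
    return row

raw = json.loads(
    '{"profile_id": 42, "event": "ProductView",'
    ' "ts": "2018-05-01T12:00:00Z", "props": {"sku": "A-100", "price": 19.9}}'
)
row = flatten(raw)
print(row["profile_id"], row["prop_sku"])  # 42 A-100
```

Rows like this map one-to-one onto columns of a relational table, which is what makes the SQL interface for BI users possible.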
Programmatically generated queries
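A minimal sketch of what programmatically generated queries can look like: user-defined segment dimensions become a parameterized SQL statement at request time. Table and column names are hypothetical, and a real generator must also whitelist identifiers before interpolating them.

```python
# Sketch of a segmentation query generator. Identifiers are interpolated
# for brevity only; values are kept as bind parameters (%s placeholders).
def build_segment_query(table, filters,
                        metric="count(distinct profile_id)"):
    """filters: list of (column, operator, value) tuples."""
    where = " and ".join(f"{col} {op} %s" for col, op, _ in filters)
    params = [value for _, _, value in filters]
    return f"select {metric} from {table} where {where}", params

sql, params = build_segment_query(
    "events",
    [("event", "=", "ProductView"), ("ts", ">", "2018-01-01")],
)
print(sql)
print(params)
```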
Thank you
Ambition vs. reality
Can you actually query all your data?
New insights and use cases
• Real-time decisions, personalization, fraud detection, risk analysis
• Event-driven automation, process robotics
• AI and ML capabilities
Many other data sources
• Twitter, clickstream data
• IoT devices and sensors
• Video, speech, audio data
Data Lakes Extend the Traditional Approach
Relational and non-relational data
Terabytes to exabytes scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP, ERP, CRM, and LOB sources feed the data warehouse and Business Intelligence; Devices, Web, Sensors, and Social sources land in the data lake, which serves the Catalog, Machine Learning, DW Queries, Big data processing, Interactive, and Real-time engines.
Why Amazon S3 for the Data Lake?
Durable: designed for 11 9s of durability
Available: designed for 99.99% availability
High performance: multipart upload, Range GET
Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
Data Lake Integration – Amazon Redshift Spectrum
Query across your Amazon Redshift data warehouse and your Amazon S3 data lake
Run Amazon Redshift SQL queries against Amazon S3
Scale compute and storage separately
Fast query performance
Unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
On demand, pay-per-query based on data scanned
Amazon S3
data lake
Amazon
Redshift data
Redshift Spectrum
query engine
Redshift Spectrum Architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, back up, restore
• 2, 16, or 32 slices
Redshift Spectrum
• In-place queries of data on Amazon S3
• Ultra high scale, unlimited concurrency
• CSV, Grok, Avro, Parquet, and more
Amazon Redshift Cluster
JDBC/ODBC
...
1 2 3 4 N
Leader node
Compute nodes
Spectrum Fleet
Amazon S3
But why not use Amazon Athena?
• No Infrastructure or administration
• Zero spin up time
• Transparent upgrades
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3; take advantage
of Amazon S3 durability and availability
Amazon Redshift Spectrum or Amazon Athena?
Amazon Athena
• Interactive, ad-hoc queries using
SQL and S3
• Serverless architecture
• Structured and unstructured data
• Reduce the amount of data scanned
to reduce cost and increase
performance (use compression,
partitioning, or convert to columnar
format)
• Charged on S3 data scanned
• Fast, simple queries on S3
• Integrates with BI, SQL Clients and
JDBC tools
Amazon Redshift Spectrum
• Large sets of structured data
• Combine data in S3 and Amazon
Redshift
• Limitless concurrency
• No contention on Redshift Cluster
• Amazon manages cluster scaling to
thousands of instances
• S3 cost effective storage
Amazon Redshift
• Multiple and complex joins
• Low IO queries
• Lower variability in latency for use cases
with strict SLAs
Demo
Creating tables in Redshift and Spectrum
-- create a table in Redshift
create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);
-- Load data into Redshift
copy event from 's3://bucket/allevents_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';
-- create external schema in Spectrum
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;
create external table spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://bucket/spectrum/sales/'
table properties ('numRows'='172000');
Query combining data in Redshift and Spectrum
-- query the spectrum table
select count(*) from spectrum.sales;
-- Join the external table SPECTRUM.SALES with the local table EVENT to find the total sales for the top ten events.
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
-- View the query plan for the previous query.
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
Output from query plan
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
QUERY PLAN
-----------------------------------------------------------------------------
XN Limit (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
-> XN Merge (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Merge Key: sum(sales.derived_col2)
-> XN Network (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Send to leader
-> XN Sort (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Sort Key: sum(sales.derived_col2)
-> XN HashAggregate (cost=1055770620.49..1055770620.99 rows=200 width=31)
-> XN Hash Join DS_BCAST_INNER (cost=3119.97..1055769620.49 rows=200000 width=31)
Hash Cond: ("outer".derived_col1 = "inner".eventid)
-> XN S3 Query Scan sales (cost=3010.00..5010.50 rows=200000 width=31)
-> S3 HashAggregate (cost=3010.00..3010.50 rows=200000 width=16)
-> S3 Seq Scan spectrum.sales location:"s3://bucket/spectrum/sales" format:TEXT
(cost=0.00..2150.00 rows=172000 width=16)
Filter: (pricepaid > 30.00)
-> XN Hash (cost=87.98..87.98 rows=8798 width=4)
-> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=4)
Performance Tuning (1/2)
Load data in parallel
• Use a single COPY command per table
• Use at least as many input files as you have slices
• Use large (100MB-1GB after compression), equally-sized files
Use appropriate column encodings
• Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of
disk I/O and therefore improves query performance
• COPY automatically analyzes and compresses data on first load into an empty table
• Use the Amazon Redshift Column Encoding Utility from the GitHub Redshift Utils repository to apply encodings to existing tables
Distribute data to avoid data transfer during joins
• Use KEY distribution to distribute large fact and dimension tables on the column participating in the most expensive joins
• Use ALL to distribute small (< 5m rows) dimension tables to each compute node
• If there is no good distribution key, choose EVEN distribution
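The file-sizing guidance under "Load data in parallel" can be made concrete with a small worked example: use at least one equally sized input file per slice, keeping each file in roughly the 100 MB–1 GB (compressed) range. The planner below is an illustrative sketch, not an AWS tool.

```python
# Worked example of COPY input-file planning (illustrative only):
# never fewer files than slices, and each file under the size cap.
import math

MIN_FILE_MB, MAX_FILE_MB = 100, 1024

def plan_copy_files(total_mb, slices):
    """Return (file_count, file_size_mb) for a parallel COPY."""
    count = max(slices, math.ceil(total_mb / MAX_FILE_MB))
    return count, total_mb / count

# A 64 GB load on a cluster with 32 slices:
count, size_mb = plan_copy_files(64 * 1024, 32)
print(count, size_mb)  # 64 1024.0
```

With 64 equally sized files, every one of the 32 slices reads exactly two files and the load finishes in parallel rather than bottlenecking on one slice.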
Performance Tuning (2/2)
Use sort keys to reduce disk I/O
• Create compound sort keys on the columns most commonly used in WHERE clauses
• Don't encode the first column of a sort key – leave it RAW
• VACUUM newly ingested data if inserts are not in sort key order
Use Workload Management (WLM) to allocate capacity for different
workloads
• Use different queues for moderate and expensive queries
• Good practice is to assign no more than 15 slots across all queues
• Use Queue Hopping to automatically move long-running queries to a queue with more memory
• Turn on Short Query Acceleration (SQA) if you have many short-running, interactive queries
• With Resultset Caching, many hundreds of queries can be run per second
• Use Query Monitoring Rules to log, hop or abort long-running, expensive, runaway queries
Scaling options
• Use multiple business line clusters each sized to meet the requirements of separate groups of end users
• Store data in S3 in a columnar compressed format such as Parquet or ORC and query via Redshift Spectrum
• Connect Aurora Postgres to Redshift to cache aggregated data, scaling connections and improving performance
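Why sort keys reduce disk I/O: Redshift keeps min/max metadata ("zone maps") per block and skips blocks whose value range cannot match the predicate. The toy simulation below illustrates the effect; it is not Redshift code.

```python
# Toy illustration (not Redshift internals): per-block min/max zone maps
# let a scan skip blocks that cannot match a selective filter. With data
# in sort key order, far fewer blocks overlap the filter range.
import random

def blocks_scanned(values, block_size, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter lo..hi."""
    scanned = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if min(block) <= hi and max(block) >= lo:  # zone-map overlap test
            scanned += 1
    return scanned

n, block = 10_000, 1_000
sorted_col = list(range(n))                # ingested in sort key order
random.seed(0)
unsorted_col = random.sample(range(n), n)  # same values, no sort order

# Selective filter: values 5000..5099 (1% of the rows).
print(blocks_scanned(sorted_col, block, 5000, 5099))    # sorted: 1 block
print(blocks_scanned(unsorted_col, block, 5000, 5099))  # unsorted: ~all blocks
```

This is also why VACUUM matters: rows appended out of sort key order widen each block's min/max range and erode the pruning benefit until the table is re-sorted.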
Kinesis Firehose | Athena query service | Glue | Amazon AI
Data Ingestion: get your data into S3 quickly and securely
Storage & Catalog: secure, cost-effective storage in Amazon S3; robust metadata in AWS Catalog
Processing & Analytics: use predictive and prescriptive analytics to gain better understanding
Data Access & Authorisation: give your users easy and secure access
Protect and Secure: use entitlements to ensure data is secure and users' identities are verified
Machine Learning: predictive analytics
Useful Links
• Details on tuning steps
• https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
• https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/best-practices-for-micro-batch-loading-on-amazon-redshift/
• Repo of useful scripts:
• https://github.com/awslabs/amazon-redshift-utils
Thank you!

Más contenido relacionado

Más de Amazon Web Services

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSAmazon Web Services
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAmazon Web Services
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightAmazon Web Services
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotAmazon Web Services
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Amazon Web Services
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?Amazon Web Services
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksAmazon Web Services
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Amazon Web Services
 

Más de Amazon Web Services (20)

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei server
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSight
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced Attacks
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
 

Build a Data Warehouse on AWS

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Sara Mitchell Solutions Architect, Amazon Web Services Mikael Hedberg CTO Apsis International and formerly CTO Innometrics Build a Data Warehouse on AWS
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Redshift: Modern Data Warehouse Fast, scalable, fully managed data warehouse at 1/10th the cost Query data across your Amazon Redshift data warehouse and your Amazon S3 data lake with Redshift Spectrum feature Massively parallel, scales from gigabytes to exabytes Fast Delivers fast results for short queries, complex queries, and mixed workloads. Cost effective Start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year. Data lake integration Secure Audit everything; encrypt data end-to-end; extensive certification and compliance. Query open file formats in Amazon S3 and optimized data formats on direct- attached disks. $ Data lake 1001100001001010 1110010101011100 1010110101100101 010100001
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Redshift Cluster Architecture Massively parallel, shared nothing architecture Streaming backup & restore from Amazon S3 Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, back up, restore • 2, 16, or 32 slices Amazon Redshift Cluster JDBC/ODBC Leader node Compute nodes Efficient data loads Streaming, backup, & restore Amazon S3
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. BI / Dashboard tools Analytics and Amazon Redshift Queries go to leader node. 1 If cache contains query result, it’s returned with no processing. 2 If query is not in cache, it’s executed and result is cached. 3 In-memory leader node cache, resulting in subsecond response. Transparent – It just works. Skip WLM, skip processing, & skip optimization. Cache persists across sessions. Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries. RESULTS CACHE QUERY_ID RESULT QUERY_ID RESULT Fast – Result Caching: Subsecond Repeat Queries
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fast – Short Query Acceleration Express Lane for Short Queries Analytics and BI / dashboard tools Amazon Redshift Machine learning predicts the runtime of queries. Short queries are routed to an express queue for faster processing. Higher throughput, less variability. Customized for your workload. Transparent – It just works! Machine Learning Classifier Machine learning
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fast – Dense Compute Node – DC2 2x performance @ same price as DC1 3x more I/O with 30% better storage utilization than DC1 “We saw a 9x reduction in month-end reporting time with Amazon Redshift DC2 nodes compared with DC1.” - Bradley Todd, Technical Architect, Liberty Mutual NVMe SSD DDR4 memory Intel E5-2686 v4 (Broadwell)
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apsis Profile Cloud
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Segmentation
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time Desktop Browser Mobile Apps Users
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. ProductView Originally designed for pure real-time Desktop Browser Mobile Apps Users
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time Desktop Browser Mobile Apps Users API ProductView
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users API ProductView
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users Real-time data pipeline API ProductView
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users Real-time data pipeline Applications API ProductView ProductView
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Warehousing Requirements Use-case: • Ability to query ’raw,’ unaggregated data sets • Ad-hoc queries with user-defined complex dimensions, not known ahead of time, and with timing logic that doesn’t allow pre-aggregation • Up to ~30 billion datapoints per customer
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Redshift Solution Amazon Redshift Real-time data pipeline BI API BI User ETL from JSON documents –> relational Redshift model SQL interface
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Programmatically generated queries
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reality Ambition Can you actually query all your data? New insights and use cases • Real time decisions, personalization, fraud detection, risk analysis • Event driven automation, Process robotics • AI and ML capabilities Many other data sources • Twitter, clickstream data • IoT devices and sensors • Video, speech, audio data
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lakes Extend the Traditional Approach Relational and non-relational data Terabytes to exabytes scale Schema defined during analysis Diverse analytical engines to gain insights Designed for low-cost storage and analytics OLTP ERP CRM LOB Data warehouse Business Intelligence Data lake 100110000100101011100 101010111001010100001 011111011010 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big data processing Interactive Real-time
  • 22. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance § Multiple upload § Range GET § Store as much as you need § Scale storage and compute independently § No minimum usage commitments Scalable § Amazon Redshift / Spectrum § Amazon EMR § Amazon Athena § Amazon DynamoDB Integrated § Simple REST API § AWS SDKs § Read-after-create consistency § Event notification § Lifecycle policies Easy to use Why Amazon S3 for the Data Lake?
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake Integration – Amazon Redshift Spectrum Query across your Amazon Redshift data warehouse and your Amazon S3 data lake Run Amazon Redshift SQL queries against Amazon S3 Scale compute and storage separately Fast query performance Unlimited concurrency CSV, ORC, Grok, Avro, & Parquet data formats On demand, pay-per-query based on data scanned Amazon S3 data lake Amazon Redshift data Redshift Spectrum query engine
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Redshift Spectrum Architecture Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, unload, back up, restore • 2, 16, or 32 slices Redshift Spectrum • In-place queries of data on Amazon S3 • Ultra high scale, unlimited concurrency • CSV, Grok, Avro, Parquet, and more Amazon Redshift Cluster JDBC/ODBC ... 1 2 3 4 N Leader node Compute nodes Spectrum Fleet Amazon S3
• 25. But why not use Amazon Athena?
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
• Query data in its raw format
• Avro, text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3 and take advantage of Amazon S3 durability and availability
• 26. Amazon Redshift Spectrum or Amazon Athena?
Amazon Athena
• Interactive, ad hoc queries using SQL on S3
• Serverless architecture
• Structured and unstructured data
• Reduce the amount of data scanned to cut cost and increase performance (use compression, partitioning, or convert to a columnar format)
• Charged on S3 data scanned
• Fast, simple queries on S3
• Integrates with BI, SQL clients, and JDBC tools
Amazon Redshift Spectrum
• Large sets of structured data
• Combine data in S3 and Amazon Redshift
• Limitless concurrency
• No contention on the Redshift cluster
• Amazon manages scaling the Spectrum fleet to thousands of instances
• Cost-effective storage in S3
Amazon Redshift
• Multiple and complex joins
• Low-I/O queries
• Lower variability in latency for use cases with strict SLAs
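Since both Athena and Redshift Spectrum bill per byte of S3 data scanned, the "reduce the amount of data scanned" bullet translates directly into dollars. A minimal sketch of the arithmetic, assuming an illustrative rate of $5 per TB scanned and assumed ratios (roughly 4x compression, a query touching 3 of 10 columns in a columnar file):

```python
# Sketch: why compression, partitioning, and columnar formats cut per-query cost.
# The $5/TB rate and the compression/column ratios are assumptions for illustration.
PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query, billed on bytes of S3 data scanned."""
    return bytes_scanned / TB * price_per_tb

raw_csv = 1 * TB                      # full scan of 1 TB of raw CSV
columnar = raw_csv * 0.25 * (3 / 10)  # ~4x compression, read 3 of 10 columns

print(round(query_cost(raw_csv), 2))   # 5.0
print(round(query_cost(columnar), 4))  # 0.375
```

Under these assumptions the same logical query costs about 13x less against compressed Parquet/ORC than against raw CSV, and it also runs faster because far less data moves.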
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo
• 28. Creating tables in Redshift and Spectrum

-- Create a table in Redshift
create table event(
  eventid integer not null distkey,
  venueid smallint not null,
  catid smallint not null,
  dateid smallint not null sortkey,
  eventname varchar(200),
  starttime timestamp);

-- Load data into Redshift
copy event from 's3://bucket/allevents_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS'
region 'us-west-2';

-- Create an external schema in Spectrum
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

-- Create an external table in Spectrum
create external table spectrum.sales(
  salesid integer,
  listid integer,
  sellerid integer,
  buyerid integer,
  eventid integer,
  dateid smallint,
  qtysold smallint,
  pricepaid decimal(8,2),
  commission decimal(8,2),
  saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://bucket/spectrum/sales/'
table properties ('numRows'='172000');
• 29. Query combining data in Redshift and Spectrum

-- Query the Spectrum table
select count(*) from spectrum.sales;

-- Join the external table SPECTRUM.SALES with the local table EVENT
-- to find the total sales for the top ten events.
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

-- View the query plan for the previous query.
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
• 30. Output from query plan
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.

QUERY PLAN
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
        Merge Key: sum(sales.derived_col2)
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
              Send to leader
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
                    Sort Key: sum(sales.derived_col2)
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)
                                            ->  S3 Seq Scan spectrum.sales location:"s3://bucket/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
• 31. Performance Tuning (1/2)
Load data in parallel
• Use a single COPY command per table
• Use at least as many input files as you have slices
• Use large (100 MB-1 GB after compression), equally sized files
Use appropriate column encodings
• Compression conserves storage space and reduces the amount of data read from storage, which reduces disk I/O and therefore improves query performance
• COPY automatically analyzes and compresses data on first load into an empty table
• Use the Amazon Redshift Column Encoding Utility from the GitHub Redshift Utils repository to apply encodings to existing tables
Distribute data to avoid data transfer during joins
• Use KEY distribution to distribute large fact and dimension tables on the column participating in the most expensive joins
• Use ALL distribution to copy small (< 5M rows) dimension tables to each compute node
• If there is no good distribution key, choose EVEN distribution
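The parallel-load guidance above can be sketched as a small calculation: split the input into files of roughly the target size, then round the file count up to a multiple of the slice count so every slice loads the same number of equally sized files. The helper `copy_file_count` and the 250 MB target are illustrative assumptions, chosen inside the 100 MB-1 GB range the slide recommends.

```python
# Sketch: choose a file count for a parallel COPY.
# Assumption: aim for ~250 MB compressed files (inside the 100 MB - 1 GB guidance)
# and make the count a multiple of the cluster's slice count.
MB = 1024 * 1024

def copy_file_count(total_bytes, slices, target_file_size=250 * MB):
    """Number of input files: at least one per slice, rounded up to a
    multiple of the slice count so work is spread evenly."""
    files = max(slices, round(total_bytes / target_file_size))
    return ((files + slices - 1) // slices) * slices

# 10 GB of compressed data on a 16-slice cluster -> 48 files (3 per slice).
print(copy_file_count(10 * 1024 * MB, 16))
```

With a layout like this, a single COPY command reads all 48 files concurrently, keeping every slice busy instead of leaving some idle.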
• 32. Performance Tuning (2/2)
Use sort keys to reduce disk I/O
• Create compound sort keys on the columns most commonly used in WHERE clauses
• Don't encode the first column of a sort key - leave it RAW
• VACUUM newly ingested data if inserts are not in sort key order
Use Workload Management (WLM) to allocate capacity for different workloads
• Use different queues for moderate and expensive queries
• Good practice is to assign no more than 15 slots across all queues
• Use Queue Hopping to automatically move long-running queries to a queue with more memory
• Turn on Short Query Acceleration (SQA) if you have many short-running, interactive queries
• With Result Set Caching, many hundreds of queries can be run per second
• Use Query Monitoring Rules to log, hop, or abort long-running, expensive, runaway queries
Scaling options
• Use multiple business-line clusters, each sized to meet the requirements of a separate group of end users
• Store data in S3 in a compressed columnar format such as Parquet or ORC and query it via Redshift Spectrum
• Put Aurora PostgreSQL in front of Redshift to cache aggregated data, scaling connections and improving performance
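The WLM slot guidance above follows from how Redshift divides memory: each queue receives a percentage of the cluster's working memory, split evenly among that queue's slots, so every extra slot shrinks the memory each concurrent query gets (and spilling to disk becomes more likely). A minimal sketch of that arithmetic, with an illustrative cluster size:

```python
# Sketch of WLM memory arithmetic: a queue's memory share is divided
# evenly among its query slots. Cluster size here is illustrative.
def memory_per_slot(cluster_memory_gb, queue_percent, slots):
    """Memory available to one query slot in a WLM queue."""
    return cluster_memory_gb * queue_percent / 100 / slots

# A queue with 60% of a 100 GB cluster and 5 slots -> 12 GB per query slot;
# doubling the slots would halve that to 6 GB.
print(memory_per_slot(100, 60, 5))
```

This is why the slide caps total slots at about 15: beyond that, per-query memory gets small enough that hash joins and sorts spill to disk, which costs more than the extra concurrency gains.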
• 33. Data Ingestion – Get your data into S3 quickly and securely (Kinesis Firehose)
Storage & Catalog – Secure, cost-effective storage in Amazon S3; robust metadata in the AWS Glue Data Catalog
Data Access & Authorisation – Give your users easy and secure access
Processing & Analytics – Use predictive and prescriptive analytics to gain better understanding (Athena query service)
Machine Learning – Predictive analytics (Amazon AI)
Protect and Secure – Use entitlements to ensure data is secure and users' identities are verified
• 34. Useful Links
• Details on tuning steps:
• https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
• https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/best-practices-for-micro-batch-loading-on-amazon-redshift/
• Repo of useful scripts:
• https://github.com/awslabs/amazon-redshift-utils
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you!