© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Sara Mitchell
Solutions Architect, Amazon Web Services
Mikael Hedberg
CTO Apsis International and formerly CTO Innometrics
Build a Data Warehouse on AWS
Amazon Redshift: Modern Data Warehouse
Fast, scalable, fully managed data warehouse at 1/10th the cost of traditional data warehouses
Query data across your Amazon Redshift data warehouse and your Amazon S3
data lake with the Redshift Spectrum feature
Massively parallel, scales from gigabytes to exabytes
Fast
Delivers fast results for short queries, complex queries, and mixed workloads.
Cost effective
Start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year.
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance.
Data lake integration
Query open file formats in Amazon S3 and optimized data formats on direct-attached disks.
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing architecture
Streaming backup & restore from Amazon S3
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, back up, restore
• 2, 16, or 32 slices
Amazon Redshift Cluster
JDBC/ODBC
Leader node
Compute nodes
Efficient data loads
Streaming, backup, & restore
Amazon S3
Fast – Result Caching: Subsecond Repeat Queries
Analytics and BI / dashboard tools query Amazon Redshift:
1. Queries go to the leader node.
2. If the cache contains the query result, it is returned with no processing.
3. If the query is not in the cache, it is executed and the result is cached.
In-memory leader node cache, resulting in subsecond response.
Transparent – it just works. Skip WLM, skip processing, & skip optimization.
Cache persists across sessions.
Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries.
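The three-step flow above can be sketched in a few lines. This is an illustrative model of the idea, not Redshift's actual implementation; all names below are hypothetical.

```python
# Illustrative model of leader-node result caching (not Redshift internals):
# an identical repeat query is answered from cache with no processing.
class ResultCache:
    def __init__(self):
        self._cache = {}  # query text -> cached result
        self.hits = 0
        self.misses = 0

    def run(self, query, execute):
        """Return a cached result if present; otherwise execute and cache."""
        if query in self._cache:          # step 2: hit, no processing
            self.hits += 1
            return self._cache[query]
        self.misses += 1                  # step 3: execute, then cache
        result = execute(query)
        self._cache[query] = result
        return result

executions = []

def execute(query):
    executions.append(query)  # stands in for real query processing
    return "result-for:" + query

cache = ResultCache()
first = cache.run("select count(*) from sales", execute)
repeat = cache.run("select count(*) from sales", execute)
print(first == repeat, cache.hits, cache.misses, len(executions))  # True 1 1 1
```

Note the query text is the cache key, which is why the slide calls repeat queries "non-repetitive work avoided": the second identical query never reaches the compute nodes.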
Fast – Short Query Acceleration
Express Lane for Short Queries
A machine learning classifier predicts the runtime of queries.
Short queries are routed to an express queue for faster processing.
Higher throughput, less variability.
Customized for your workload.
Transparent – It just works!
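The routing idea can be sketched as below. Redshift uses a trained ML model to predict runtime; the heuristic predictor and threshold here are invented stand-ins for illustration only.

```python
# Toy sketch of Short Query Acceleration routing: predicted-short queries
# go to an express queue. The predictor and threshold are hypothetical.
EXPRESS_THRESHOLD_SECONDS = 5.0

def predict_runtime_seconds(query):
    # Stand-in predictor: join-heavy, longer queries predicted slower.
    joins = query.lower().count(" join ")
    return 0.5 + 2.0 * joins + len(query) / 1000.0

def route(query):
    """Send predicted-short queries to the express queue."""
    if predict_runtime_seconds(query) < EXPRESS_THRESHOLD_SECONDS:
        return "express"
    return "default"

short_query = "select count(*) from events"
long_query = ("select * from a join b on a.id = b.id "
              "join c on b.id = c.id join d on c.id = d.id")
print(route(short_query), route(long_query))  # express default
```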
Fast – Dense Compute Node – DC2
2x performance
@ same price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end
reporting time with Amazon Redshift
DC2 nodes compared with DC1.”
- Bradley Todd,
Technical Architect, Liberty Mutual
NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell)
Apsis Profile Cloud
Segmentation
Originally designed for pure real-time
• Users: Desktop Browser, Mobile Apps
• ProductView
• API
• NoSQL Storage
• Real-time data pipeline
• Applications
Data Warehousing Requirements
Use-case:
• Ability to query 'raw', unaggregated data sets
• Ad-hoc queries with user-defined complex dimensions, not known ahead of time, and with
timing logic that doesn’t allow pre-aggregation
• Up to ~30 billion datapoints per customer
Redshift Solution
Amazon Redshift
Real-time data
pipeline
BI API
BI User
ETL from JSON documents -> relational Redshift model
SQL interface
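The ETL step in the solution above (JSON documents into a relational Redshift model) can be sketched like this. The document shape and field names (profile_id, event, ts, props) are hypothetical.

```python
# Minimal sketch of JSON -> relational flattening ahead of a Redshift load.
# Field names are invented for illustration.
import json

def flatten(doc):
    """Turn one JSON event document into one flat row of scalar columns."""
    row = {
        "profile_id": doc["profile_id"],
        "event": doc["event"],
        "ts": doc["ts"],
    }
    for key, value in doc.get("props", {}).items():
        row["prop_" + key] = value  # promote nested props to columns
    return row

raw = json.loads(
    '{"profile_id": 42, "event": "ProductView",'
    ' "ts": "2018-05-01T12:00:00Z", "props": {"sku": "A-100", "price": 19.9}}'
)
row = flatten(raw)
print(row["profile_id"], row["prop_sku"])  # 42 A-100
```

Rows like this map one-to-one onto columns of a relational table, which is what makes the SQL interface for BI users possible.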
Programmatically generated queries
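A minimal sketch of what programmatically generated queries can look like: user-defined segment dimensions become a parameterized SQL statement at request time. Table and column names are hypothetical, and a real generator must also whitelist identifiers before interpolating them.

```python
# Sketch of a segmentation query generator. Identifiers are interpolated
# for brevity only; values are kept as bind parameters (%s placeholders).
def build_segment_query(table, filters,
                        metric="count(distinct profile_id)"):
    """filters: list of (column, operator, value) tuples."""
    where = " and ".join(f"{col} {op} %s" for col, op, _ in filters)
    params = [value for _, _, value in filters]
    return f"select {metric} from {table} where {where}", params

sql, params = build_segment_query(
    "events",
    [("event", "=", "ProductView"), ("ts", ">", "2018-01-01")],
)
print(sql)
print(params)
```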
Thank you
Ambition vs. reality
Can you actually query all your data?
New insights and use cases
• Real-time decisions, personalization, fraud detection, risk analysis
• Event-driven automation, process robotics
• AI and ML capabilities
Many other data sources
• Twitter, clickstream data
• IoT devices and sensors
• Video, speech, audio data
Data Lakes Extend the Traditional Approach
Relational and non-relational data
Terabytes to exabytes scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP, ERP, CRM, and LOB sources feed the data warehouse and Business Intelligence; Devices, Web, Sensors, and Social sources land in the data lake, which serves the Catalog, Machine Learning, DW Queries, Big data processing, Interactive, and Real-time engines.
Why Amazon S3 for the Data Lake?
Durable: designed for 11 9s of durability
Available: designed for 99.99% availability
High performance: multipart upload, Range GET
Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
Data Lake Integration – Amazon Redshift Spectrum
Query across your Amazon Redshift data warehouse and your Amazon S3 data lake
Run Amazon Redshift SQL queries against Amazon S3
Scale compute and storage separately
Fast query performance
Unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
On demand, pay-per-query based on data scanned
Amazon S3
data lake
Amazon
Redshift data
Redshift Spectrum
query engine
Redshift Spectrum Architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, back up, restore
• 2, 16, or 32 slices
Redshift Spectrum
• In-place queries of data on Amazon S3
• Ultra high scale, unlimited concurrency
• CSV, Grok, Avro, Parquet, and more
Amazon Redshift Cluster
JDBC/ODBC
...
1 2 3 4 N
Leader node
Compute nodes
Spectrum Fleet
Amazon S3
But why not use Amazon Athena?
• No Infrastructure or administration
• Zero spin up time
• Transparent upgrades
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3; take advantage
of Amazon S3 durability and availability
Amazon Redshift Spectrum or Amazon Athena?
Amazon Athena
• Interactive, ad-hoc queries using
SQL and S3
• Serverless architecture
• Structured and unstructured data
• Reduce the amount of data scanned
to reduce cost and increase
performance (use compression,
partitioning, or convert to columnar
format)
• Charged on S3 data scanned
• Fast, simple queries on S3
• Integrates with BI, SQL Clients and
JDBC tools
Amazon Redshift Spectrum
• Large sets of structured data
• Combine data in S3 and Amazon
Redshift
• Limitless concurrency
• No contention on Redshift Cluster
• Amazon manages cluster scaling to
thousands of instances
• S3 cost effective storage
Amazon Redshift
• Multiple and complex joins
• Low IO queries
• Lower variability in latency for use cases
with strict SLAs
Demo
Creating tables in Redshift and Spectrum
-- create a table in Redshift
create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);
-- Load data into Redshift
copy event from 's3://bucket/allevents_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';
-- create external schema in Spectrum
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;
create external table spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://bucket/spectrum/sales/'
table properties ('numRows'='172000');
Query combining data in Redshift and Spectrum
-- query the spectrum table
select count(*) from spectrum.sales;
-- Join the external table SPECTRUM.SALES with the local table EVENT to find the total sales for the top ten events.
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
-- View the query plan for the previous query.
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
Output from query plan
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
QUERY PLAN
-----------------------------------------------------------------------------
XN Limit (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
-> XN Merge (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Merge Key: sum(sales.derived_col2)
-> XN Network (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Send to leader
-> XN Sort (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
Sort Key: sum(sales.derived_col2)
-> XN HashAggregate (cost=1055770620.49..1055770620.99 rows=200 width=31)
-> XN Hash Join DS_BCAST_INNER (cost=3119.97..1055769620.49 rows=200000 width=31)
Hash Cond: ("outer".derived_col1 = "inner".eventid)
-> XN S3 Query Scan sales (cost=3010.00..5010.50 rows=200000 width=31)
-> S3 HashAggregate (cost=3010.00..3010.50 rows=200000 width=16)
-> S3 Seq Scan spectrum.sales location:"s3://bucket/spectrum/sales" format:TEXT
(cost=0.00..2150.00 rows=172000 width=16)
Filter: (pricepaid > 30.00)
-> XN Hash (cost=87.98..87.98 rows=8798 width=4)
-> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=4)
Performance Tuning (1/2)
Load data in parallel
• Use a single COPY command per table
• Use at least as many input files as you have slices
• Use large (100MB-1GB after compression), equally-sized files
Use appropriate column encodings
• Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of
disk I/O and therefore improves query performance
• COPY automatically analyzes and compresses data on first load into an empty table
• Use the Amazon Redshift Column Encoding Utility from the GitHub Redshift Utils repository to apply encodings to existing tables
Distribute data to avoid data transfer during joins
• Use KEY distribution to distribute large fact and dimension tables on the column participating in the most expensive joins
• Use ALL to distribute small (< 5m rows) dimension tables to each compute node
• If there is no good distribution key, choose EVEN distribution
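The file-sizing guidance under "Load data in parallel" can be made concrete with a small worked example: use at least one equally sized input file per slice, keeping each file in roughly the 100 MB–1 GB (compressed) range. The planner below is an illustrative sketch, not an AWS tool.

```python
# Worked example of COPY input-file planning (illustrative only):
# never fewer files than slices, and each file under the size cap.
import math

MIN_FILE_MB, MAX_FILE_MB = 100, 1024

def plan_copy_files(total_mb, slices):
    """Return (file_count, file_size_mb) for a parallel COPY."""
    count = max(slices, math.ceil(total_mb / MAX_FILE_MB))
    return count, total_mb / count

# A 64 GB load on a cluster with 32 slices:
count, size_mb = plan_copy_files(64 * 1024, 32)
print(count, size_mb)  # 64 1024.0
```

With 64 equally sized files, every one of the 32 slices reads exactly two files and the load finishes in parallel rather than bottlenecking on one slice.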
Performance Tuning (2/2)
Use sort keys to reduce disk I/O
• Create compound sort keys on the columns most commonly used in WHERE clauses
• Don't encode the first column of a sort key – leave it RAW
• VACUUM newly ingested data if inserts are not in sort key order
Use Workload Management (WLM) to allocate capacity for different
workloads
• Use different queues for moderate and expensive queries
• Good practice is to assign no more than 15 slots across all queues
• Use Queue Hopping to automatically move long-running queries to a queue with more memory
• Turn on Short Query Acceleration (SQA) if you have many short-running, interactive queries
• With Resultset Caching, many hundreds of queries can be run per second
• Use Query Monitoring Rules to log, hop or abort long-running, expensive, runaway queries
Scaling options
• Use multiple business line clusters each sized to meet the requirements of separate groups of end users
• Store data in S3 in a columnar compressed format such as Parquet or ORC and query via Redshift Spectrum
• Connect Aurora Postgres to Redshift to cache aggregated data, scaling connections and improving performance
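Why sort keys reduce disk I/O: Redshift keeps min/max metadata ("zone maps") per block and skips blocks whose value range cannot match the predicate. The toy simulation below illustrates the effect; it is not Redshift code.

```python
# Toy illustration (not Redshift internals): per-block min/max zone maps
# let a scan skip blocks that cannot match a selective filter. With data
# in sort key order, far fewer blocks overlap the filter range.
import random

def blocks_scanned(values, block_size, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter lo..hi."""
    scanned = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if min(block) <= hi and max(block) >= lo:  # zone-map overlap test
            scanned += 1
    return scanned

n, block = 10_000, 1_000
sorted_col = list(range(n))                # ingested in sort key order
random.seed(0)
unsorted_col = random.sample(range(n), n)  # same values, no sort order

# Selective filter: values 5000..5099 (1% of the rows).
print(blocks_scanned(sorted_col, block, 5000, 5099))    # sorted: 1 block
print(blocks_scanned(unsorted_col, block, 5000, 5099))  # unsorted: ~all blocks
```

This is also why VACUUM matters: rows appended out of sort key order widen each block's min/max range and erode the pruning benefit until the table is re-sorted.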
Kinesis Firehose | Athena query service | Glue | Amazon AI
Data Ingestion: get your data into S3 quickly and securely
Storage & Catalog: secure, cost-effective storage in Amazon S3; robust metadata in AWS Catalog
Processing & Analytics: use predictive and prescriptive analytics to gain better understanding
Data Access & Authorisation: give your users easy and secure access
Protect and Secure: use entitlements to ensure data is secure and users' identities are verified
Machine Learning: predictive analytics
Useful Links
• Details on tuning steps
• https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
• https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/best-practices-for-micro-batch-loading-on-amazon-redshift/
• Repo of useful scripts:
• https://github.com/awslabs/amazon-redshift-utils
Thank you!

Más contenido relacionado

Más de Amazon Web Services

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSAmazon Web Services
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAmazon Web Services
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightAmazon Web Services
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotAmazon Web Services
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Amazon Web Services
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?Amazon Web Services
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksAmazon Web Services
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Amazon Web Services
 

Más de Amazon Web Services (20)

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei server
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSight
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced Attacks
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
 

Build a Data Warehouse on AWS

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Sara Mitchell Solutions Architect, Amazon Web Services Mikael Hedberg CTO Apsis International and formerly CTO Innometrics Build a Data Warehouse on AWS
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Redshift: Modern Data Warehouse Fast, scalable, fully managed data warehouse at 1/10th the cost Query data across your Amazon Redshift data warehouse and your Amazon S3 data lake with Redshift Spectrum feature Massively parallel, scales from gigabytes to exabytes Fast Delivers fast results for short queries, complex queries, and mixed workloads. Cost effective Start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year. Data lake integration Secure Audit everything; encrypt data end-to-end; extensive certification and compliance. Query open file formats in Amazon S3 and optimized data formats on direct- attached disks. $ Data lake 1001100001001010 1110010101011100 1010110101100101 010100001
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Redshift Cluster Architecture Massively parallel, shared nothing architecture Streaming backup & restore from Amazon S3 Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, back up, restore • 2, 16, or 32 slices Amazon Redshift Cluster JDBC/ODBC Leader node Compute nodes Efficient data loads Streaming, backup, & restore Amazon S3
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. BI / Dashboard tools Analytics and Amazon Redshift Queries go to leader node. 1 If cache contains query result, it’s returned with no processing. 2 If query is not in cache, it’s executed and result is cached. 3 In-memory leader node cache, resulting in subsecond response. Transparent – It just works. Skip WLM, skip processing, & skip optimization. Cache persists across sessions. Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries. RESULTS CACHE QUERY_ID RESULT QUERY_ID RESULT Fast – Result Caching: Subsecond Repeat Queries
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fast – Short Query Acceleration Express Lane for Short Queries Analytics and BI / dashboard tools Amazon Redshift Machine learning predicts the runtime of queries. Short queries are routed to an express queue for faster processing. Higher throughput, less variability. Customized for your workload. Transparent – It just works! Machine Learning Classifier Machine learning
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fast – Dense Compute Node – DC2 2x performance @ same price as DC1 3x more I/O with 30% better storage utilization than DC1 “We saw a 9x reduction in month-end reporting time with Amazon Redshift DC2 nodes compared with DC1.” - Bradley Todd, Technical Architect, Liberty Mutual NVMe SSD DDR4 memory Intel E5-2686 v4 (Broadwell)
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apsis Profile Cloud
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Segmentation
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time Desktop Browser Mobile Apps Users
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. ProductView Originally designed for pure real-time Desktop Browser Mobile Apps Users
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time Desktop Browser Mobile Apps Users API ProductView
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users API ProductView
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users Real-time data pipeline API ProductView
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Originally designed for pure real-time NoSQL Storage Desktop Browser Mobile Apps Users Real-time data pipeline Applications API ProductView ProductView
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Warehousing Requirements Use-case: • Ability to query ’raw,’ unaggregated data sets • Ad-hoc queries with user-defined complex dimensions, not known ahead of time, and with timing logic that doesn’t allow pre-aggregation • Up to ~30 billion datapoints per customer
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Redshift Solution Amazon Redshift Real-time data pipeline BI API BI User ETL from JSON documents –> relational Redshift model SQL interface
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Programmatically generated queries
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reality Ambition Can you actually query all your data? New insights and use cases • Real time decisions, personalization, fraud detection, risk analysis • Event driven automation, Process robotics • AI and ML capabilities Many other data sources • Twitter, clickstream data • IoT devices and sensors • Video, speech, audio data
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lakes Extend the Traditional Approach Relational and non-relational data Terabytes to exabytes scale Schema defined during analysis Diverse analytical engines to gain insights Designed for low-cost storage and analytics OLTP ERP CRM LOB Data warehouse Business Intelligence Data lake 100110000100101011100 101010111001010100001 011111011010 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big data processing Interactive Real-time
  • 22. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance § Multiple upload § Range GET § Store as much as you need § Scale storage and compute independently § No minimum usage commitments Scalable § Amazon Redshift / Spectrum § Amazon EMR § Amazon Athena § Amazon DynamoDB Integrated § Simple REST API § AWS SDKs § Read-after-create consistency § Event notification § Lifecycle policies Easy to use Why Amazon S3 for the Data Lake?
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake Integration – Amazon Redshift Spectrum Query across your Amazon Redshift data warehouse and your Amazon S3 data lake Run Amazon Redshift SQL queries against Amazon S3 Scale compute and storage separately Fast query performance Unlimited concurrency CSV, ORC, Grok, Avro, & Parquet data formats On demand, pay-per-query based on data scanned Amazon S3 data lake Amazon Redshift data Redshift Spectrum query engine
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Redshift Spectrum Architecture Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, unload, back up, restore • 2, 16, or 32 slices Redshift Spectrum • In-place queries of data on Amazon S3 • Ultra high scale, unlimited concurrency • CSV, Grok, Avro, Parquet, and more Amazon Redshift Cluster JDBC/ODBC ... 1 2 3 4 N Leader node Compute nodes Spectrum Fleet Amazon S3
• 25. But why not use Amazon Athena?
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
• Query data in its raw format
• Avro, text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3 and take advantage of Amazon S3 durability and availability
• 26. Amazon Redshift Spectrum or Amazon Athena?
Amazon Athena
• Interactive, ad hoc queries using SQL on S3
• Serverless architecture
• Structured and unstructured data
• Reduce the amount of data scanned to cut cost and increase performance (use compression, partitioning, or convert to a columnar format)
• Charged on S3 data scanned
• Fast, simple queries on S3
• Integrates with BI, SQL clients, and JDBC tools
Amazon Redshift Spectrum
• Large sets of structured data
• Combine data in S3 and Amazon Redshift
• Limitless concurrency
• No contention on the Redshift cluster
• Amazon manages scaling the Spectrum fleet to thousands of instances
• Cost-effective storage in S3
Amazon Redshift
• Multiple and complex joins
• Low-I/O queries
• Lower variability in latency for use cases with strict SLAs
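Since both Athena and Redshift Spectrum bill per byte of S3 data scanned, the "reduce the amount of data scanned" bullet translates directly into dollars. A minimal sketch of the arithmetic, assuming an illustrative rate of $5 per TB scanned and assumed ratios (roughly 4x compression, a query touching 3 of 10 columns in a columnar file):

```python
# Sketch: why compression, partitioning, and columnar formats cut per-query cost.
# The $5/TB rate and the compression/column ratios are assumptions for illustration.
PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query, billed on bytes of S3 data scanned."""
    return bytes_scanned / TB * price_per_tb

raw_csv = 1 * TB                      # full scan of 1 TB of raw CSV
columnar = raw_csv * 0.25 * (3 / 10)  # ~4x compression, read 3 of 10 columns

print(round(query_cost(raw_csv), 2))   # 5.0
print(round(query_cost(columnar), 4))  # 0.375
```

Under these assumptions the same logical query costs about 13x less against compressed Parquet/ORC than against raw CSV, and it also runs faster because far less data moves.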
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demo
• 28. Creating tables in Redshift and Spectrum

-- Create a table in Redshift
create table event(
  eventid integer not null distkey,
  venueid smallint not null,
  catid smallint not null,
  dateid smallint not null sortkey,
  eventname varchar(200),
  starttime timestamp);

-- Load data into Redshift
copy event from 's3://bucket/allevents_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS'
region 'us-west-2';

-- Create an external schema in Spectrum
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

-- Create an external table in Spectrum
create external table spectrum.sales(
  salesid integer,
  listid integer,
  sellerid integer,
  buyerid integer,
  eventid integer,
  dateid smallint,
  qtysold smallint,
  pricepaid decimal(8,2),
  commission decimal(8,2),
  saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://bucket/spectrum/sales/'
table properties ('numRows'='172000');
• 29. Query combining data in Redshift and Spectrum

-- Query the Spectrum table
select count(*) from spectrum.sales;

-- Join the external table SPECTRUM.SALES with the local table EVENT
-- to find the total sales for the top ten events.
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

-- View the query plan for the previous query.
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.
explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
• 30. Output from query plan
-- Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan
-- steps that were executed against the data on Amazon S3.

QUERY PLAN
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
        Merge Key: sum(sales.derived_col2)
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
              Send to leader
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
                    Sort Key: sum(sales.derived_col2)
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)
                                            ->  S3 Seq Scan spectrum.sales location:"s3://bucket/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
• 31. Performance Tuning (1/2)
Load data in parallel
• Use a single COPY command per table
• Use at least as many input files as you have slices
• Use large (100 MB-1 GB after compression), equally sized files
Use appropriate column encodings
• Compression conserves storage space and reduces the amount of data read from storage, which reduces disk I/O and therefore improves query performance
• COPY automatically analyzes and compresses data on first load into an empty table
• Use the Amazon Redshift Column Encoding Utility from the GitHub Redshift Utils repository to apply encodings to existing tables
Distribute data to avoid data transfer during joins
• Use KEY distribution to distribute large fact and dimension tables on the column participating in the most expensive joins
• Use ALL distribution to copy small (< 5M rows) dimension tables to each compute node
• If there is no good distribution key, choose EVEN distribution
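The parallel-load guidance above can be sketched as a small calculation: split the input into files of roughly the target size, then round the file count up to a multiple of the slice count so every slice loads the same number of equally sized files. The helper `copy_file_count` and the 250 MB target are illustrative assumptions, chosen inside the 100 MB-1 GB range the slide recommends.

```python
# Sketch: choose a file count for a parallel COPY.
# Assumption: aim for ~250 MB compressed files (inside the 100 MB - 1 GB guidance)
# and make the count a multiple of the cluster's slice count.
MB = 1024 * 1024

def copy_file_count(total_bytes, slices, target_file_size=250 * MB):
    """Number of input files: at least one per slice, rounded up to a
    multiple of the slice count so work is spread evenly."""
    files = max(slices, round(total_bytes / target_file_size))
    return ((files + slices - 1) // slices) * slices

# 10 GB of compressed data on a 16-slice cluster -> 48 files (3 per slice).
print(copy_file_count(10 * 1024 * MB, 16))
```

With a layout like this, a single COPY command reads all 48 files concurrently, keeping every slice busy instead of leaving some idle.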
• 32. Performance Tuning (2/2)
Use sort keys to reduce disk I/O
• Create compound sort keys on the columns most commonly used in WHERE clauses
• Don't encode the first column of a sort key - leave it RAW
• VACUUM newly ingested data if inserts are not in sort key order
Use Workload Management (WLM) to allocate capacity for different workloads
• Use different queues for moderate and expensive queries
• Good practice is to assign no more than 15 slots across all queues
• Use Queue Hopping to automatically move long-running queries to a queue with more memory
• Turn on Short Query Acceleration (SQA) if you have many short-running, interactive queries
• With Result Set Caching, many hundreds of queries can be run per second
• Use Query Monitoring Rules to log, hop, or abort long-running, expensive, runaway queries
Scaling options
• Use multiple business-line clusters, each sized to meet the requirements of a separate group of end users
• Store data in S3 in a compressed columnar format such as Parquet or ORC and query it via Redshift Spectrum
• Put Aurora PostgreSQL in front of Redshift to cache aggregated data, scaling connections and improving performance
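The WLM slot guidance above follows from how Redshift divides memory: each queue receives a percentage of the cluster's working memory, split evenly among that queue's slots, so every extra slot shrinks the memory each concurrent query gets (and spilling to disk becomes more likely). A minimal sketch of that arithmetic, with an illustrative cluster size:

```python
# Sketch of WLM memory arithmetic: a queue's memory share is divided
# evenly among its query slots. Cluster size here is illustrative.
def memory_per_slot(cluster_memory_gb, queue_percent, slots):
    """Memory available to one query slot in a WLM queue."""
    return cluster_memory_gb * queue_percent / 100 / slots

# A queue with 60% of a 100 GB cluster and 5 slots -> 12 GB per query slot;
# doubling the slots would halve that to 6 GB.
print(memory_per_slot(100, 60, 5))
```

This is why the slide caps total slots at about 15: beyond that, per-query memory gets small enough that hash joins and sorts spill to disk, which costs more than the extra concurrency gains.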
• 33. Data Ingestion – Get your data into S3 quickly and securely (Kinesis Firehose)
Storage & Catalog – Secure, cost-effective storage in Amazon S3; robust metadata in the AWS Glue Data Catalog
Data Access & Authorisation – Give your users easy and secure access
Processing & Analytics – Use predictive and prescriptive analytics to gain better understanding (Athena query service)
Machine Learning – Predictive analytics (Amazon AI)
Protect and Secure – Use entitlements to ensure data is secure and users' identities are verified
• 34. Useful Links
• Details on tuning steps:
• https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
• https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
• https://aws.amazon.com/blogs/big-data/best-practices-for-micro-batch-loading-on-amazon-redshift/
• Repo of useful scripts:
• https://github.com/awslabs/amazon-redshift-utils
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you!