SlideShare una empresa de Scribd logo
1 de 46
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zach Christopherson, Database Engineer, AWS
June 28, 2016
Amazon Redshift for Big Data Analytics
Agenda
• Service Overview
• Migration Considerations
• Optimization:
• Schema / Table Design
• Data Ingestion
• Maintenance and Database Tuning
• Demonstration
Service Overview
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
Selected Amazon Redshift customers
Amazon Redshift system architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
A deeper look at compute node architecture
Each node contains multiple slices
• DS2 – 2 slices on XL, 16 on 8XL
• DC1 – 2 slices on L, 32 on 8XL
Each slice is allocated CPU, and
table data
Each slice processes a piece of
the workload in parallel
Leader Node
Amazon Redshift dramatically reduces I/O
Data compression
Zone maps
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Calculating SUM(Amount) with
row storage:
– Need to read everything
– Unnecessary I/O
ID Age State Amount
Amazon Redshift dramatically reduces I/O
Data compression
Zone maps
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Calculating SUM(Amount) with
column storage:
– Only scan the necessary
blocks
ID Age State Amount
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
• Columnar compression
– Effective due to like data
– Reduces storage
requirements
– Reduces I/O
ID Age State Amount
analyze compression orders;
Table | Column | Encoding
--------+-------------+----------
orders | id | mostly32
orders | age | mostly32
orders | state | lzo
orders | amount | mostly32
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which don’t
contain data for a given query
• Minimize unnecessary I/O
ID Age State Amount
Migration Considerations
Forklift = BAD
On Redshift
• Optimal throughput at 8-12 concurrency, 50 possible
• Upfront table design is critical, not able to add indexes
• DISTKEY and SORTKEY significantly influence performance
• Primary Key, Foreign Key, Unique constraints will help
• CAUTION: These are not enforced
• Compression is for speed as well
On Redshift
• Commits are expensive, batch loads
• Blocks are immutable, a small write (~1-10 rows) and a
relatively larger write (~100k rows) have a similar cost
• Partition temporal data with time series tables and
UNION ALL VIEWs
Optimization: Schema Design
Goals of Distribution
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located Joins
• Localized Aggregations
Distribution Key All
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
Full table data on first
slice of every node
Same key to same location
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
Even
Round robin
distribution
Choosing a Distribution Style
KEY
• Large FACT tables
• Large or rapidly changing
tables used in joins
• Localize columns used within
aggregations
ALL
• Have slowly changing data
• Reasonable size (i.e., few
millions but not 100’s of
millions of rows)
• No common distribution key for
frequent joins
• Typical use case – joined
dimension table without a
common distribution key
EVEN
• Tables not frequently joined or
aggregated
• Large tables without acceptable
candidate keys
Goals of Sorting
• Physically sort data within blocks and throughout table
• Enable rrscans to prune blocks by leveraging zone maps
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
COMPOUND
• Most Common
• Well defined filter criteria
• Time-series data
Choosing a SORTKEY
INTERLEAVED
• Edge Cases
• Large tables (>Billion Rows)
• No common filter criteria
• Non time-series data
• Primarily as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose same as distribution key column for most efficient
joins (merge join)
Compressing Data
• COPY automatically analyzes and compresses data
when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and
proposes optimal compression algorithms for each
column
• Changing column encoding requires a table rebuild
Automatic compression is a good thing (mostly)
“The definition of insanity is doing the same thing over and
over again, but expecting different results”.
- Albert Einstein
If you have a regular ETL process and you use temp tables
or staging tables, turn off automatic compression
• Use analyze compression to determine the right encodings
• Bake those encodings into your DML
• Use CREATE TABLE … LIKE
Automatic compression is a good thing (mostly)
• From the zone maps we know:
• Which block(s) contain the
range
• Which row offsets to scan
• Highly compressed sort keys:
• Many rows per block
• Large row offset
Skip compression on just the
leading column of the compound
sortkey
Optimization: Ingest Performance
Amazon Redshift Loading Data Overview
AWS CloudCorporate Data center
Amazon
DynamoDB
Amazon S3
Data
Volume
Amazon Elastic
MapReduce
Amazon
RDS
Amazon
Redshift
Amazon
Glacier
logs / files
Source DBs
VPN
Connection
AWS Direct
Connect
S3 Multipart
Upload
AWS Import/
Export
EC2 or On-
Prem (using
SSH)
Parallelism is a function of load files
Each slice’s query processors are able to load one file at a time
• Streaming Decompression
• Parse
• Distribute
• Write
A single input file means
only one slice is ingesting data
Realizing only partial cluster usage as 6.25% of slices are active
2 4 6 8 10 12 141 3 5 7 9 11 13 15
Maximize Throughput with Multiple Files
Use at least as many input files as
there are slices in cluster
With 16 input files, all slices are
working so you maximize
throughput
COPY continues to scale linearly
as you add additional nodes
2 4 6 8 10 12 141 3 5 7 9 11 13 15
Optimization: Performance Tuning
Optimizing a database for querying
• Periodically check your table status
• Vacuum and Analyze regularly
• SVV_TABLE_INFO
• Missing statistics
• Table skew
• Uncompressed Columns
• Unsorted Data
• Check your cluster status
• WLM queuing
• Commit queuing
• Database Locks
Missing Statistics
• Amazon Redshift’s query
optimizer relies on up-to-date
statistics
• Statistics are only necessary for
data which you are accessing
• Updated stats important on:
• SORTKEY
• DISTKEY
• Columns in query predicates
Table Skew
• Unbalanced workload
• Query completes as fast as the
slowest slice completes
• Can cause skew inflight:
• Temp data fills a single
node resulting in query
failure
Table Maintenance and Status
Unsorted Table
• Sortkey is just a guide, but data
needs to actually be sorted
• VACUUM or DEEP COPY to
sort
• Scans against unsorted tables
continue to benefit from zone
maps:
• Load sequential blocks
WLM Queue
Identify short/long-running queries
and prioritize them
Define multiple queues to route
queries appropriately.
Default concurrency of 5
Leverage wlm_apex_hourly to tune
WLM based on peak concurrency
requirements
Cluster Status: Commits and WLM
Commit Queue
How long is your commit queue?
• Identify needless transactions
• Group dependent statements
within a single transaction
• Offload operational workloads
• STL_COMMIT_STATS
Cluster Status: Database Locks
• Database Locks
• Read locks, Write locks, Exclusive locks
• Reads block exclusive
• Writes block writes and exclusive
• Exclusives block everything
• Ungranted locks block subsequent lock requests
• Exposed through SVV_TRANSACTIONS
Demonstration
https://s3.amazonaws.com/chriz-webinar/webinar.zip
Use SORTKEYs to effectively prune blocks
Use SORTKEYs to effectively prune blocks
Use SORTKEYs to effectively prune blocks
Don’t compress initial SORTKEY column
Use compression encoding to reduce I/O
Choose a DISTKEY which avoids data skew
Ingest: Disable predictable compression analysis
Ingest: Load multiple files to match cluster slices
VACUUM to physically removed deleted rows
VACUUM to keep your tables sorted
Gather statistics to assist the query planner
Thank you!
Questions?
https://s3.amazonaws.com/chriz-webinar/webinar.zip

Más contenido relacionado

La actualidad más candente

Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
joeharris76
 

La actualidad más candente (20)

AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Scalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query SpeedScalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query Speed
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performance
 
Amazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and Optimization
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftUses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Amazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech TalksAmazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech Talks
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
 
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
What's New in Amazon Aurora
What's New in Amazon AuroraWhat's New in Amazon Aurora
What's New in Amazon Aurora
 

Destacado

AWS Customer Presentation : PBS - AWS Summit 2012 - NYC
AWS Customer Presentation : PBS - AWS Summit 2012 - NYCAWS Customer Presentation : PBS - AWS Summit 2012 - NYC
AWS Customer Presentation : PBS - AWS Summit 2012 - NYC
Amazon Web Services
 
Security in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve RileySecurity in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve Riley
Amazon Web Services
 

Destacado (20)

Data Warehousing in the Era of Big Data: Intro to Amazon Redshift
Data Warehousing in the Era of Big Data: Intro to Amazon RedshiftData Warehousing in the Era of Big Data: Intro to Amazon Redshift
Data Warehousing in the Era of Big Data: Intro to Amazon Redshift
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Building a Real Time Dashboard with Amazon Kinesis, Amazon Lambda and Amazon ...
Building a Real Time Dashboard with Amazon Kinesis, Amazon Lambda and Amazon ...Building a Real Time Dashboard with Amazon Kinesis, Amazon Lambda and Amazon ...
Building a Real Time Dashboard with Amazon Kinesis, Amazon Lambda and Amazon ...
 
Accelerate Go-To-Market Speed in a CI/CD Environment
Accelerate Go-To-Market Speed in a CI/CD EnvironmentAccelerate Go-To-Market Speed in a CI/CD Environment
Accelerate Go-To-Market Speed in a CI/CD Environment
 
Stg205 amazon s3
Stg205 amazon s3Stg205 amazon s3
Stg205 amazon s3
 
AWS Customer Presentation : PBS - AWS Summit 2012 - NYC
AWS Customer Presentation : PBS - AWS Summit 2012 - NYCAWS Customer Presentation : PBS - AWS Summit 2012 - NYC
AWS Customer Presentation : PBS - AWS Summit 2012 - NYC
 
CPN203 Saving with EC2 Spot Instances - AWS re: Invent 2012
CPN203 Saving with EC2 Spot Instances - AWS re: Invent 2012CPN203 Saving with EC2 Spot Instances - AWS re: Invent 2012
CPN203 Saving with EC2 Spot Instances - AWS re: Invent 2012
 
DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012
 
Security in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve RileySecurity in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve Riley
 
AWS Compute Services State of the Union (CPN202) | AWS re:Invent 2013
AWS Compute Services State of the Union (CPN202) | AWS re:Invent 2013AWS Compute Services State of the Union (CPN202) | AWS re:Invent 2013
AWS Compute Services State of the Union (CPN202) | AWS re:Invent 2013
 
AWS Paris Summit 2014 - T2 - Amazon Workspaces, postes de travail sur le cloud
AWS Paris Summit 2014 - T2 - Amazon Workspaces, postes de travail sur le cloudAWS Paris Summit 2014 - T2 - Amazon Workspaces, postes de travail sur le cloud
AWS Paris Summit 2014 - T2 - Amazon Workspaces, postes de travail sur le cloud
 
6 rules for innovation
6 rules for innovation6 rules for innovation
6 rules for innovation
 
Relational Databases Redefined with AWS
Relational Databases Redefined with AWSRelational Databases Redefined with AWS
Relational Databases Redefined with AWS
 
Modern Security and Compliance Through Automation
Modern Security and Compliance Through AutomationModern Security and Compliance Through Automation
Modern Security and Compliance Through Automation
 
AWS Summit 2013 | India - 0 to Production in 40 minutes, Pieter Kemps
AWS Summit 2013 | India - 0 to Production in 40 minutes, Pieter KempsAWS Summit 2013 | India - 0 to Production in 40 minutes, Pieter Kemps
AWS Summit 2013 | India - 0 to Production in 40 minutes, Pieter Kemps
 
The 2014 AWS Enterprise Summit Keynote
The 2014 AWS Enterprise Summit Keynote The 2014 AWS Enterprise Summit Keynote
The 2014 AWS Enterprise Summit Keynote
 
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
 
RMG204 Optimizing Costs with AWS - AWS re: Invent 2012
RMG204 Optimizing Costs with AWS - AWS re: Invent 2012RMG204 Optimizing Costs with AWS - AWS re: Invent 2012
RMG204 Optimizing Costs with AWS - AWS re: Invent 2012
 

Similar a AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics

Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum
 

Similar a AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics (20)

Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Best Practices – Extreme Performance with Data Warehousing on Oracle Databa...
Best Practices –  Extreme Performance with Data Warehousing  on Oracle Databa...Best Practices –  Extreme Performance with Data Warehousing  on Oracle Databa...
Best Practices – Extreme Performance with Data Warehousing on Oracle Databa...
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
Deep Dive into DynamoDB
Deep Dive into DynamoDBDeep Dive into DynamoDB
Deep Dive into DynamoDB
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
 
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
 
L6.sp17.pptx
L6.sp17.pptxL6.sp17.pptx
L6.sp17.pptx
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 

Más de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Último (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Zach Christopherson, Database Engineer, AWS June 28, 2016 Amazon Redshift for Big Data Analytics
  • 2. Agenda • Service Overview • Migration Considerations • Optimization: • Schema / Table Design • Data Ingestion • Maintenance and Database Tuning • Demonstration
  • 4. Relational data warehouse Massively parallel; petabyte scale Fully managed HDD and SSD platforms $1,000/TB/year; starts at $0.25/hour Amazon Redshift a lot faster a lot simpler a lot cheaper
  • 6. Amazon Redshift system architecture Leader node • SQL endpoint • Stores metadata • Coordinates query execution Compute nodes • Local, columnar storage • Execute queries in parallel • Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH Two hardware platforms • Optimized for data processing • DS2: HDD; scale from 2TB to 2PB • DC1: SSD; scale from 160GB to 326TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  • 7. A deeper look at compute node architecture Each node contains multiple slices • DS2 – 2 slices on XL, 16 on 8XL • DC1 – 2 slices on L, 32 on 8XL Each slice is allocated CPU, and table data Each slice processes a piece of the workload in parallel Leader Node
  • 8. Amazon Redshift dramatically reduces I/O Data compression Zone maps ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375 • Calculating SUM(Amount) with row storage: – Need to read everything – Unnecessary I/O ID Age State Amount
  • 9. Amazon Redshift dramatically reduces I/O Data compression Zone maps ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375 • Calculating SUM(Amount) with column storage: – Only scan the necessary blocks ID Age State Amount
  • 10. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps • Columnar compression – Effective due to like data – Reduces storage requirements – Reduces I/O ID Age State Amount analyze compression orders; Table | Column | Encoding --------+-------------+---------- orders | id | mostly32 orders | age | mostly32 orders | state | lzo orders | amount | mostly32
  • 11. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps • In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which don’t contain data for a given query • Minimize unnecessary I/O ID Age State Amount
  • 14. On Redshift • Optimal throughput at 8-12 concurrency, 50 possible • Upfront table design is critical, not able to add indexes • DISTKEY and SORTKEY significantly influence performance • Primary Key, Foreign Key, Unique constraints will help • CAUTION: These are not enforced • Compression is for speed as well
  • 15. On Redshift • Commits are expensive, batch loads • Blocks are immutable, a small write (~1-10 rows) and a relatively larger write (~100k rows) have a similar cost • Partition temporal data with time series tables and UNION ALL VIEWs
  • 17. Goals of Distribution • Distribute data evenly for parallel processing • Minimize data movement • Co-located Joins • Localized Aggregations Distribution Key All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Full table data on first slice of every node Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round robin distribution
  • 18. Choosing a Distribution Style KEY • Large FACT tables • Large or rapidly changing tables used in joins • Localize columns used within aggregations ALL • Have slowly changing data • Reasonable size (i.e., few millions but not 100’s of millions of rows) • No common distribution key for frequent joins • Typical use case – joined dimension table without a common distribution key EVEN • Tables not frequently joined or aggregated • Large tables without acceptable candidate keys
  • 19. Goals of Sorting • Physically sort data within blocks and throughout table • Enable rrscans to prune blocks by leveraging zone maps • Optimal SORTKEY is dependent on: • Query patterns • Data profile • Business requirements
  • 20. COMPOUND • Most Common • Well defined filter criteria • Time-series data Choosing a SORTKEY INTERLEAVED • Edge Cases • Large tables (>Billion Rows) • No common filter criteria • Non time-series data • Primarily as a query predicate (date, identifier, …) • Optionally choose a column frequently used for aggregates • Optionally choose same as distribution key column for most efficient joins (merge join)
  • 21. Compressing Data • COPY automatically analyzes and compresses data when loading into empty tables • ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column • Changing column encoding requires a table rebuild
  • 22. Automatic compression is a good thing (mostly) “The definition of insanity is doing the same thing over and over again, but expecting different results”. - Albert Einstein If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression • Use analyze compression to determine the right encodings • Bake those encodings into your DML • Use CREATE TABLE … LIKE
  • 23. Automatic compression is a good thing (mostly) • From the zone maps we know: • Which block(s) contain the range • Which row offsets to scan • Highly compressed sort keys: • Many rows per block • Large row offset Skip compression on just the leading column of the compound sortkey
  • 25. Amazon Redshift Loading Data Overview AWS CloudCorporate Data center Amazon DynamoDB Amazon S3 Data Volume Amazon Elastic MapReduce Amazon RDS Amazon Redshift Amazon Glacier logs / files Source DBs VPN Connection AWS Direct Connect S3 Multipart Upload AWS Import/ Export EC2 or On- Prem (using SSH)
  • 26. Parallelism is a function of load files Each slice’s query processors are able to load one file at a time • Streaming Decompression • Parse • Distribute • Write A single input file means only one slice is ingesting data Realizing only partial cluster usage as 6.25% of slices are active 2 4 6 8 10 12 141 3 5 7 9 11 13 15
  • 27. Maximize Throughput with Multiple Files Use at least as many input files as there are slices in cluster With 16 input files, all slices are working so you maximize throughput COPY continues to scale linearly as you add additional nodes 2 4 6 8 10 12 141 3 5 7 9 11 13 15
  • 29. Optimizing a database for querying • Periodically check your table status • Vacuum and Analyze regularly • SVV_TABLE_INFO • Missing statistics • Table skew • Uncompressed Columns • Unsorted Data • Check your cluster status • WLM queuing • Commit queuing • Database Locks
  • 30. Missing Statistics • Amazon Redshift’s query optimizer relies on up-to-date statistics • Statistics are only necessary for data which you are accessing • Updated stats important on: • SORTKEY • DISTKEY • Columns in query predicates
  • 31. Table Skew • Unbalanced workload • Query completes as fast as the slowest slice completes • Can cause skew inflight: • Temp data fills a single node resulting in query failure Table Maintenance and Status Unsorted Table • Sortkey is just a guide, but data needs to actually be sorted • VACUUM or DEEP COPY to sort • Scans against unsorted tables continue to benefit from zone maps: • Load sequential blocks
  • 32. WLM Queue Identify short/long-running queries and prioritize them Define multiple queues to route queries appropriately. Default concurrency of 5 Leverage wlm_apex_hourly to tune WLM based on peak concurrency requirements Cluster Status: Commits and WLM Commit Queue How long is your commit queue? • Identify needless transactions • Group dependent statements within a single transaction • Offload operational workloads • STL_COMMIT_STATS
  • 33. Cluster Status: Database Locks • Database Locks • Read locks, Write locks, Exclusive locks • Reads block exclusive • Writes block writes and exclusive • Exclusives block everything • Ungranted locks block subsequent lock requests • Exposed through SVV_TRANSACTIONS
  • 35. Use SORTKEYs to effectively prune blocks
  • 36. Use SORTKEYs to effectively prune blocks
  • 37. Use SORTKEYs to effectively prune blocks
  • 38. Don’t compress initial SORTKEY column
  • 39. Use compression encoding to reduce I/O
  • 40. Choose a DISTKEY which avoids data skew
  • 41. Ingest: Disable predictable compression analysis
  • 42. Ingest: Load multiple files to match cluster slices
  • 43. VACUUM to physically removed deleted rows
  • 44. VACUUM to keep your tables sorted
  • 45. Gather statistics to assist the query planner