Amazon Redshift: Good practices for performance optimization
What is Amazon Redshift?
• Petabyte-scale columnar database
• Fast response time – roughly 10 times faster than a typical relational database
• Cost-effective (around $1,000 per TB per year)
When do customers use Amazon Redshift?

Replacing a traditional DWH
• Reduce cost by extending the DW instead of adding HW
• Migrate from an existing DWH system
• Respond faster to the business: provision in minutes

Analyzing Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

Providing services and SaaS
• Add analytics functionality to applications
• Scale DW capacity as demand grows
• Reduce SW and HW costs by an order of magnitude
Amazon Redshift architecture
Redshift reduces IO

Column storage
• With raw storage you do unnecessary IO
• With columnar storage you only read the data you need

Data compression
• COPY compresses automatically
• You can analyze and override
• More performance, less cost

Zone maps
• Track the minimum and maximum value for each block
• Skip over blocks that do not contain relevant data

Direct-attached storage
• > 2 GB/sec scan rate
• Optimized for data processing
• High disk density
Redshift Recap
Main possible issues with Redshift performance
• Incorrect column encoding
• Skewed table data
• Queries not benefiting from sort keys – excessive scanning
• Tables without statistics or which need VACUUM
• Tables with very large VARCHAR columns
• Queries waiting on queue slots
• Queries that are disk-based – incorrect sizing, GROUP BY on distinct values, UNION, hash joins with DISTINCT values
• Commit queue waits
• Inefficient data loads
• Inefficient use of temporary tables
• Large nested loop JOINs
• Inappropriate join cardinality
Incorrect column encoding
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large
performance gains when they ensure that column encoding is optimally applied. To determine if you are deviating from
this best practice, you can use the v_extended_table_info view from the Amazon Redshift Utils GitHub repository.
If you find that you have tables without optimal column encoding, then use
the Amazon Redshift Column Encoding Utility from the Utils repository to
apply encoding. This command line utility uses the ANALYZE COMPRESSION
command on each table. If encoding is required, it generates a SQL script
which creates a new table with the correct encoding, copies all the data
into the new table, and then transactionally renames the new table to the
old name while retaining the original data.
Redshift supports the following column encodings:
• Raw Encoding
• Byte-Dictionary Encoding
• Delta Encoding
• LZO Encoding
• Mostly Encoding
• Runlength Encoding
• Text255 and Text32k Encodings
• Zstandard Encoding
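As a quick manual check you can also run ANALYZE COMPRESSION directly; a minimal sketch, where the table name is a placeholder:

-- Report the recommended encoding for each column, based on a sample of
-- the table's data. The command only reports; it changes nothing.
ANALYZE COMPRESSION my_schema.my_table;

The output lists, per column, the suggested encoding and the estimated percentage reduction versus raw storage.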
Skewed table data
If skew is a problem, you typically see that node performance is uneven on the cluster. Use table_inspector.sql to see how data blocks in a distribution key map to the slices and nodes in the cluster.

Consider changing the distribution key to a column that exhibits high cardinality and uniform distribution. Evaluate a candidate column as a distribution key by creating a new table using CTAS:

CREATE TABLE my_test_table DISTKEY (<column name>) AS SELECT <column name> FROM <table name>;

You can use the AWS Schema Conversion Tool (SCT) to set the proper distribution key and style during the migration process.
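A quick way to quantify skew without the admin script is the skew_rows column of SVV_TABLE_INFO; the threshold of 4 below is just an illustrative cut-off:

-- Tables whose most-populated slice holds more than 4x the rows of the
-- least-populated slice; candidates for a different DISTKEY.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE skew_rows > 4
ORDER BY skew_rows DESC;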
Deal with tables with very large VARCHAR columns
During processing of complex queries, intermediate query results might need to be stored in temporary blocks. These temporary tables are not compressed, so unnecessarily wide columns consume excessive memory and temporary disk space, which can affect query performance.

Use the following query to generate a list of tables that should have their maximum column widths reviewed:

SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;

Identify which table columns have wide VARCHAR columns and then determine the true maximum width for each wide column:

SELECT max(len(rtrim(column_name))) FROM table_name;

If you query the top running queries for the database using the top_queries.sql admin script, pay special attention to SELECT * queries which include the JSON fragment column. If end users query these large columns but don't actually execute JSON functions against them, consider moving them into another table that only contains the primary key column of the original table and the JSON column.
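A minimal sketch of that split, with hypothetical table and column names:

-- Move the rarely-used wide JSON column into a side table keyed by the
-- original table's primary key, then drop it from the main table.
CREATE TABLE orders_json
DISTKEY (order_id)
AS SELECT order_id, json_payload FROM orders;

ALTER TABLE orders DROP COLUMN json_payload;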
Queries not benefiting from sort keys
Amazon Redshift tables can have a sort key column identified, which acts like an index in other databases, but which
does not incur a storage cost as with other platforms (for more information, see Choosing Sort Keys). A sort key should
be created on those columns which are most commonly used in WHERE clauses. If you have a known query pattern,
then COMPOUND sort keys give the best performance; if end users query different columns equally, then use an
INTERLEAVED sort key. If using compound sort keys, review your queries to ensure that their WHERE clauses specify the
sort columns in the same order they were defined in the compound key.
To determine which tables don’t have sort keys, run the following query against the v_extended_table_info view from
the Amazon Redshift Utils repository:
SELECT * FROM admin.v_extended_table_info WHERE sortkey IS null;
If you do not have a sort key, this can potentially lead to an excessive scanning issue.
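A sketch of a compound sort key on the columns most often used in WHERE clauses; the table and columns are hypothetical:

-- Queries filtering on event_date (or on event_date plus customer_id) can
-- use the sort order; queries filtering only on customer_id cannot.
CREATE TABLE events (
    event_date  DATE    NOT NULL,
    customer_id INTEGER NOT NULL,
    payload     VARCHAR(256)
)
COMPOUND SORTKEY (event_date, customer_id);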
Queries waiting on queue slots (ICT)
Amazon Redshift runs queries using a queuing system known as workload management (WLM). You can define up to 8 queues to separate workloads from each other, and set the concurrency on each queue to meet your overall throughput requirements.

In some cases, the queue to which a user or query has been assigned is completely busy and a user's query must wait for a slot to open. During this time, the system is not executing the query at all, which is a sign that you may need to increase concurrency.

First, determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that while increasing concurrency allows more queries to run, each query gets a smaller share of the memory allocated to its queue.
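You can also look at the live queue state directly; a sketch against STV_WLM_QUERY_STATE:

-- Queries currently waiting for a WLM slot, grouped by service class
-- (queue). queue_time is reported in microseconds.
SELECT service_class,
       COUNT(*) AS queued_queries,
       MAX(queue_time) / 1000000 AS max_queue_seconds
FROM stv_wlm_query_state
WHERE state = 'QueuedWaiting'
GROUP BY service_class;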
Queries that are disk-based (ICT)
If a query isn't able to completely execute in memory, it may need to use disk-based temporary storage for parts of the explain plan. The additional disk I/O slows down the query; this can be addressed by increasing the amount of memory allocated to a session (for more information, see WLM Dynamic Memory Allocation).

To determine if any queries have been writing to disk, use the following query:

SELECT q.query, trim(q.cat_text)
FROM (
  SELECT query,
         replace(listagg(text, ' ') WITHIN GROUP (ORDER BY sequence), '\n', ' ') AS cat_text
  FROM stl_querytext
  WHERE userid > 1
  GROUP BY query) q
JOIN (
  SELECT DISTINCT query
  FROM svl_query_summary
  WHERE is_diskbased = 't'
    AND (label LIKE 'hash%' OR label LIKE 'sort%' OR label LIKE 'aggr%')
    AND userid > 1) qs
  ON qs.query = q.query;

Based on the user or the queue assignment rules, you can increase the amount of memory given to the selected queue to prevent queries needing to spill to disk to complete.
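Besides editing the WLM configuration, a session can temporarily claim extra slots in its queue, and with them more memory, for a single heavy statement; a sketch with a hypothetical table:

-- Borrow three slots' worth of the queue's memory, run the heavy
-- statement, then return to the default single slot.
SET wlm_query_slot_count TO 3;
VACUUM big_table;
SET wlm_query_slot_count TO 1;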
Commit queue waits (ICT)
Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively
high, and excessive use of COMMIT can result in queries waiting for access to a commit queue.
If you are committing too often on your database, you will start to see waits on the commit queue increase, which can
be viewed with the commit_stats.sql admin script. This script shows the largest queue length and queue time for
queries run in the past two days. If you have queries that are waiting on the commit queue, then look for sessions that
are committing multiple times per session, such as ETL jobs that are logging progress or inefficient data loads.
One of the worst practices is to insert data into Amazon Redshift row by row. Use the COPY command or an ETL tool compatible with Amazon Redshift instead.
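A minimal COPY sketch; the table, bucket prefix, and IAM role are placeholders:

-- Bulk-load gzipped CSV files from S3 in one parallel, single-commit
-- operation instead of thousands of row-by-row INSERTs.
COPY sales
FROM 's3://my-bucket/sales/2019-06-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;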
Inefficient use of Temporary Tables
Amazon Redshift provides temporary tables, which are like normal tables except that they are only visible within a single
session. When the user disconnects the session, the tables are automatically deleted. Temporary tables can be created
using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table, while the SELECT … INTO and CTAS commands use the input data to determine column names, sizes and data types, and use default storage properties.
These default storage properties may cause issues if not carefully considered. Amazon Redshift’s default table structure is
to use EVEN distribution with no column encoding. This is a sub-optimal data structure for many types of queries, and if you
are using the SELECT…INTO syntax you cannot set the column encoding or distribution and sort keys. If you use the CREATE
TABLE AS (CTAS) syntax, you can specify a distribution style and sort keys, and Redshift will automatically apply LZO
encoding for everything other than sort keys, booleans, reals and doubles. If you consider this automatic encoding sub-optimal, and require further control, use the CREATE TABLE syntax rather than CTAS.
If you are creating temporary tables, it is highly recommended that you convert all SELECT…INTO syntax to use the CREATE
statement. This ensures that your temporary tables have column encodings and are distributed in a fashion that is
sympathetic to the other entities that are part of the workflow.
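A sketch of that conversion with hypothetical names; the explicit CREATE controls encoding, distribution, and sort order where SELECT … INTO cannot:

-- Implicit version (EVEN distribution, no encoding):
--   SELECT customer_id, order_total INTO #staging FROM orders;
-- Explicit version: same data, controlled storage properties.
CREATE TEMPORARY TABLE staging (
    customer_id INTEGER       ENCODE zstd,
    order_total DECIMAL(12,2) ENCODE zstd
)
DISTKEY (customer_id)
SORTKEY (customer_id);

INSERT INTO staging
SELECT customer_id, order_total FROM orders;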
Tables without statistics
• Amazon Redshift, like other databases, requires statistics about tables and the composition of data blocks being stored in order to make good decisions when planning a query (for more information, see Analyzing Tables). Without good statistics, the optimizer may make suboptimal choices about the order in which to access tables, or how to join datasets together.
• The ANALYZE Command History topic in the Amazon Redshift Developer Guide supplies queries to help you address missing or stale statistics. You can also simply run the missing_table_stats.sql admin script to determine which tables are missing stats, or the statement below to determine tables that have stale statistics:
SELECT database, schema || '.' || "table" AS "table", stats_off FROM svv_table_info
WHERE stats_off > 5 ORDER BY 2;
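Refreshing statistics is a single command; restricting it to predicate columns (those used in filters, joins, and group-bys) keeps the run cheap. The table name is a placeholder:

-- Recompute planner statistics only for columns that appear in predicates.
ANALYZE orders PREDICATE COLUMNS;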
Tables which need VACUUM
In Amazon Redshift, data blocks are immutable. When rows are DELETED or UPDATED, they are simply logically deleted (flagged for deletion) but not physically removed from disk. Updates result in a new block being written with new data appended. Both of these operations cause the previous version of the row to continue consuming disk space and continue being scanned when a query scans the table. As a result, table storage space is increased and performance degraded due to otherwise avoidable disk I/O during scans. A VACUUM command recovers the space from deleted rows and restores the sort order.

You can use the perf_alert.sql admin script to identify tables that have had alerts about scanning a large number of deleted rows raised in the last seven days.
To address issues with tables with missing or stale statistics or where vacuum is required, run another AWS Labs utility,
Analyze & Vacuum Schema. This ensures that you always keep up-to-date statistics, and only vacuum tables that actually
need reorganization.
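When a table does need manual maintenance, the command itself is simple; the table name and threshold below are illustrative:

-- Reclaim space from deleted rows and re-sort; the threshold skips the
-- sort phase if the table is already at least 99 percent sorted.
VACUUM FULL orders TO 99 PERCENT;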
Analyze the performance of the queries and address issues
• Your best friends in a Redshift world are:
• ANALYZE – for identifying out-of-date statistics
• the SVL_QUERY_SUMMARY view – for a summary of all the useful data around the behavior of your queries and a high-level overview of the cluster
• Query Alerts – for receiving important information about deviations in the behavior of your queries
• the SVL_QUERY_REPORT view – for collecting details about your queries' health and performance
Set up and fine-tune monitoring and maintenance procedures
• In addition to your preferred monitoring methods and the various partner solutions, there are some helpful tools:
• The Amazon Redshift Column Encoding Utility
• Use table_inspector.sql to see how data blocks in a distribution key map to the slices and nodes in the cluster.
• Run the missing_table_stats.sql admin script to determine which tables are missing stats.
• Use the perf_alert.sql admin script to identify tables that have had alerts about scanning a large number of deleted rows raised in the last seven days.
• Use top_queries.sql to determine the top running queries.
• Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql.
• View the commit stats with the commit_stats.sql admin script.
Eliminate Nested Loops
• Caused by a SQL query specifying a join condition that requires a "brute force" comparison between two large tables
• Quite easy to spot: look for "Nested Loop" in query plans or in STL_ALERT_EVENT_LOG
• Typical culprits:
ON a.date > b.date
ON a.text LIKE b.text
ON a.x = b.x OR a.y = b.y
• Rewrite an inequality JOIN condition as a window function (see the sketch below)
• Keep nested loops to small tables instead of two large ones
• Alternatively, run the nested loop join once and persist the results in a separate relational table
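A sketch of the window-function rewrite, using a hypothetical readings table; the inequality self-join in the comment is the pattern to avoid:

-- Nested-loop pattern: for each reading, find the previous one per sensor.
--   SELECT a.id, MAX(b.ts)
--   FROM readings a
--   JOIN readings b ON a.sensor_id = b.sensor_id AND a.ts > b.ts
--   GROUP BY a.id;
-- Window-function rewrite: a single scan, no join at all.
SELECT id,
       sensor_id,
       ts,
       LAG(ts) OVER (PARTITION BY sensor_id ORDER BY ts) AS prev_ts
FROM readings;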
Inappropriate Join Cardinality
• Query "fan-out": joins generate a high number of rows, higher than the sum of rows of all scanned tables
• Look for such row counts in the query plan or execution metrics; use SVL_QUERY_SUMMARY to detect it
• Carefully review the join logic, for example:
FROM house
JOIN rooms ON rooms.house_id = house.id
JOIN residents ON residents.house_id = house.id
• If possible, break large fan-out queries into several smaller queries
• Use derived tables to transform one-to-many joins into one-to-one joins (see the sketch below)
• Try out some advanced techniques like 1 and 2
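A minimal sketch of the derived-table fix, reusing the house/rooms/residents example above; joining both child tables directly would produce rooms × residents rows per house, while pre-aggregating makes each join one-to-one:

-- Pre-aggregate each one-to-many side so the final joins cannot fan out.
SELECT h.id,
       r.room_count,
       p.resident_count
FROM house h
JOIN (SELECT house_id, COUNT(*) AS room_count
      FROM rooms GROUP BY house_id) r ON r.house_id = h.id
JOIN (SELECT house_id, COUNT(*) AS resident_count
      FROM residents GROUP BY house_id) p ON p.house_id = h.id;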
Suboptimal WHERE clause
If your WHERE clause causes excessive table scans, you might see a SCAN
step in the segment with the highest maxtime value in
SVL_QUERY_SUMMARY. For more information, see Using the
SVL_QUERY_SUMMARY View.
To fix this issue, add a WHERE clause to the query based on the primary sort
column of the largest table. This approach will help minimize scanning time.
For more information, see Amazon Redshift Best Practices for Designing
Tables.
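A minimal sketch, assuming a hypothetical fact table whose primary sort column is event_time:

-- Restricting on the leading sort column lets zone maps skip most blocks.
SELECT COUNT(*)
FROM big_fact
WHERE event_time >= '2019-01-01' AND event_time < '2019-02-01';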
Insufficiently Restrictive Predicate
If your query has an insufficiently restrictive predicate, you might see a
SCAN step in the segment with the highest maxtime value in
SVL_QUERY_SUMMARY that has a very high rows value compared to the
rows value in the final RETURN step in the query. For more information, see
Using the SVL_QUERY_SUMMARY View.
To fix this issue, try adding a predicate to the query or making the existing
predicate more restrictive to narrow the output.
Very Large Result Set
If your query returns a very large result set, consider rewriting the query to
use UNLOAD to write the results to Amazon S3.
This approach will improve the performance of the RETURN step by taking
advantage of parallel processing. For more information on checking for a
very large result set, see Using the SVL_QUERY_SUMMARY View.
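A minimal UNLOAD sketch; the table, bucket, and IAM role are placeholders:

-- Write the result set to S3 as multiple gzipped files (one or more per
-- slice), so the export runs in parallel across the compute nodes instead
-- of being funneled through the leader node as a single RETURN step.
UNLOAD ('SELECT * FROM big_fact')
TO 's3://my-bucket/exports/big_fact_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
GZIP
PARALLEL ON;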
Large SELECT List
If your query has an unusually large SELECT list, you might see a bytes value
that is high relative to the rows value for any step (in comparison to other
steps) in SVL_QUERY_SUMMARY. This high bytes value can be an indicator
that you are selecting a lot of columns. For more information, see Using the
SVL_QUERY_SUMMARY View.
To fix this issue, review the columns you are selecting and see if any can be
removed.
Working with SVL_QUERY_SUMMARY
Select your query ID:
SELECT query, elapsed, substring FROM svl_qlog ORDER BY query DESC LIMIT 5;
Collect query data:
SELECT * FROM svl_query_summary WHERE query = MyQueryID ORDER BY stm, seg, step;
For analysis please refer to: https://docs.aws.amazon.com/redshift/latest/dg/using-SVL-Query-Summary.html
Some additional materials can be found here
• Best practices for designing queries
• Best practices for designing tables
• Top 10 performance tuning techniques
• Troubleshooting queries
Amazon Redshift Developer Guide: https://docs.aws.amazon.com/redshift/latest/dg/redshift-dg.pdf