Loading very large data sets can take a long time and consume a lot of computing resources. How data is loaded can also affect query performance. We will discuss best practices for loading data efficiently using COPY commands, bulk inserts, and staging tables. We will also cover the key table design decisions, such as the choice of sort key and distribution key, that heavily influence overall query performance. These design choices also have a significant effect on storage requirements, which in turn affect query performance: storing less data reduces the number of I/O operations and the memory required to process queries.
3. AWS Database Services
Scalable High Performance Application Storage in the Cloud
• Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service
• Amazon DynamoDB: Fast, Predictable, Highly-Scalable NoSQL Data Store
• Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
• Amazon ElastiCache: In-Memory Caching Service
[Diagram: AWS stack layers: Deployment & Administration; Application Services; Compute, Storage, Database; Networking; AWS Global Infrastructure]
4. Objectives
Design and build a petabyte-scale data warehouse service: Amazon Redshift
• A Lot Faster
• A Lot Cheaper
• A Whole Lot Simpler
5. Redshift Dramatically Reduces I/O
• Direct-attached storage
• Large data block sizes
• Columnar storage
• Data compression
• Zone maps
Example table (illustrated as row storage vs. column storage):
 Id  | Age | State
-----+-----+-------
 123 | 20  | CA
 345 | 25  | WA
 678 | 40  | FL
6. Redshift Runs on Optimized Hardware
Click to grow… to 1.6PB
HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
• Optimized for I/O intensive workloads
• HS1.8XL available on Amazon EC2
• Runs in HPC - fast network
• High disk density
7. Data generated vs. data available for analysis
[Chart: data volume, showing the gap between data generated and data available for analysis; labeled "cost + effort"]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
8. Redshift is Priced to Analyze All Your Data
$0.85 per hour for on-demand (2TB)
$999 per TB per year (3-yr reservation)
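To put the two prices above on the same footing, the on-demand rate can be converted to a cost per TB per year. The following is a rough sketch only: it assumes the node runs 24 hours a day for a 365-day year and that "(2TB)" is the storage of the node being billed. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def on_demand_cost_per_tb_year(hourly_rate: float, node_tb: float) -> float:
    """Yearly cost per TB for a node that runs around the clock (assumption)."""
    return hourly_rate * HOURS_PER_YEAR / node_tb


on_demand = on_demand_cost_per_tb_year(hourly_rate=0.85, node_tb=2)
reserved = 999.0  # per TB per year, 3-yr reservation (from the slide)

print(f"on-demand:     ${on_demand:,.0f} per TB per year")
print(f"3-yr reserved: ${reserved:,.0f} per TB per year")
```

Under these assumptions the on-demand figure comes to roughly $3,723 per TB per year.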
10. Ingestion – Best Practices
• Goal
Leverage all the compute nodes (1 leader node and n compute nodes) and minimize overhead
• Best Practices
Preferred method: COPY from S3
Loads data in sorted order through the compute nodes
Use a single COPY command and split the data into multiple files
Strongly recommend that you gzip large datasets
copy time from 's3://mybucket/data/timerows.gz' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' gzip delimiter '|';
• If you must ingest through SQL
Use multi-row inserts
Avoid large numbers of singleton insert/update/delete operations
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
• To copy from another table
CREATE TABLE AS or INSERT INTO SELECT
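The advice to split the data into multiple files and gzip them can be prepared on the client side before uploading to S3. Below is a minimal sketch using only the Python standard library. The function name, the part-file naming scheme, and the round-robin assignment of lines are assumptions made for illustration, and the upload to S3 is not shown.

```python
import gzip
from pathlib import Path
from typing import List


def split_and_gzip(src: Path, out_dir: Path, parts: int) -> List[Path]:
    """Split a delimited text file line by line into `parts` gzip files.

    Lines are assigned round robin, so each part keeps the relative
    order of the lines it receives.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <stem>.part00.gz, <stem>.part01.gz, ...
    paths = [out_dir / f"{src.stem}.part{i:02d}.gz" for i in range(parts)]
    writers = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        with src.open("r", encoding="utf-8") as f:
            for n, line in enumerate(f):
                writers[n % parts].write(line)
    finally:
        for w in writers:
            w.close()
    return paths
```

A reasonable starting point is to choose a number of parts that lets every compute node work on a file at the same time, in line with the goal stated on this slide.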
11. Ingestion – Best Practices (Cont’d)
• Verifying load data files
For US east, S3 provides eventual consistency
• Verify files are in S3
• Listing Object Keys
• Query Redshift after load. This query returns entries for loading the tables in the TICKIT database…
select query, trim(filename), curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;
 query |           btrim           |          curtime           | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
 22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 |      1
 22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 |      1
 22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 |      1
 22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 |      1
 22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 |      1
12. Ingestion – Best Practices (Cont’d)
• Redshift does not currently support an upsert statement. To perform an upsert, load the new rows into a staging table and join it with the target table: Update, then Insert
• Redshift does not currently enforce primary key constraints; if you COPY the same data twice, it will be duplicated
• Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count
set wlm_query_slot_count to 3;
• Run the ANALYZE command whenever you’ve made a non-trivial number of changes to your data to ensure your table statistics are current
• The Amazon Redshift system table STL_LOAD_ERRORS can be helpful in troubleshooting data load issues: it records the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed.
• Check the character set: UTF-8 characters up to 3 bytes long are supported
• View the console for errors
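The "Update then Insert" staging-table pattern can be sketched as follows. To keep the example runnable, it uses Python's built-in sqlite3 module as a stand-in for Redshift, so the SQL dialect is SQLite's and the exact statements may need adapting. The table columns (catid, catname) and the function name are assumptions chosen for illustration.

```python
import sqlite3


def upsert_from_stage(conn: sqlite3.Connection) -> None:
    """Update rows already present in the target, then insert the rest."""
    with conn:  # one transaction: both steps commit or roll back together
        # Step 1: update target rows that have a match in the staging table.
        conn.execute(
            """
            UPDATE category
               SET catname = (SELECT s.catname FROM category_stage s
                               WHERE s.catid = category.catid)
             WHERE catid IN (SELECT catid FROM category_stage)
            """
        )
        # Step 2: insert staging rows that have no match in the target.
        conn.execute(
            """
            INSERT INTO category (catid, catname)
            SELECT s.catid, s.catname
              FROM category_stage s
             WHERE s.catid NOT IN (SELECT catid FROM category)
            """
        )


conn = sqlite3.connect(":memory:")  # SQLite stands in for Redshift here
conn.executescript(
    """
    CREATE TABLE category (catid INTEGER, catname TEXT);
    CREATE TABLE category_stage (catid INTEGER, catname TEXT);
    INSERT INTO category VALUES (20, 'Country'), (21, 'Rock');
    INSERT INTO category_stage VALUES (21, 'Concerts'), (22, 'Jazz');
    """
)
upsert_from_stage(conn)
rows = conn.execute("SELECT catid, catname FROM category ORDER BY catid").fetchall()
print(rows)
```

Because no primary key is enforced, the insert step filters out keys that already exist in the target; without that filter, rerunning the load would duplicate rows, as the slide warns.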
14. Choose a Sort key
• Goal
Skip over data blocks to minimize IO
• Best Practice
Sort based on range or equality predicate (WHERE clause)
If you access recent data frequently, sort based on TIMESTAMP
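The idea of skipping data blocks can be illustrated with a toy model of zone maps: each block records the minimum and maximum value it holds, and a range predicate only reads blocks whose range overlaps it. This is a simplified sketch, not Redshift's actual implementation; the block size, the integer "day" values, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]  # sort-key values stored in this block
    min_value: int     # zone map entry: smallest value in the block
    max_value: int     # zone map entry: largest value in the block


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Cut the values, in their stored order, into fixed-size blocks."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks actually read)."""
    matches: List[int] = []
    blocks_read = 0
    for b in blocks:
        if b.max_value < lo or b.min_value > hi:
            continue  # zone map proves no row can match: skip the block
        blocks_read += 1
        matches.extend(v for v in b.values if lo <= v <= hi)
    return matches, blocks_read


# Hypothetical data: day numbers 1..100, stored sorted vs. interleaved.
sorted_days = list(range(1, 101))
interleaved_days = [d for r in range(10) for d in range(1 + r, 101, 10)]

hits_sorted, read_sorted = scan_range(build_blocks(sorted_days, 10), 91, 100)
hits_mixed, read_mixed = scan_range(build_blocks(interleaved_days, 10), 91, 100)
print(len(hits_sorted), read_sorted)
print(len(hits_mixed), read_mixed)
```

In this toy setup, a query for the most recent ten days reads one block of ten when the data is sorted on the day column, but all ten blocks when the same values are interleaved.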
15. Choose a Distribution Key
• Goal
Distribute data evenly across nodes
Minimize data movement among nodes : Co-located Joins and Co-located Aggregates
• Best Practice
Consider using Join key as distribution key (JOIN clause)
If multiple joins, use the foreign key of the largest dimension as distribution key
Consider using Group By column as distribution key (GROUP BY clause)
• Avoid
Using a key that appears in equality filters as your distribution key
• If tables are de-normalized and there are no aggregates, do not specify a distribution key; Redshift will use round robin
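The effect of the distribution key on data placement can be simulated with a simple hash-and-modulo scheme. This is only a sketch: Redshift's actual hash function is not described in this deck, so CRC32 is used as a deterministic stand-in, and the key values and slice count are made up.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def distribute(keys: Iterable[str], num_slices: int) -> Dict[int, int]:
    """Count how many rows land on each slice when hashing on the key."""
    counts = Counter({s: 0 for s in range(num_slices)})
    for key in keys:
        # CRC32 is a stand-in hash, not Redshift's real one.
        counts[zlib.crc32(key.encode("utf-8")) % num_slices] += 1
    return dict(counts)


# High-cardinality key (e.g. a join key): many distinct values.
user_ids = [f"user-{i}" for i in range(10_000)]
# Low-cardinality key often used in equality filters (e.g. a state code).
states = ["WA"] * 9_000 + ["CA"] * 1_000

by_user = distribute(user_ids, num_slices=4)
by_state = distribute(states, num_slices=4)
print(by_user)
print(by_state)
```

With the state column as the key, all 9,000 'WA' rows hash to a single slice, so a query filtering on State = 'WA' would be served by that slice alone while the others sit idle. This is one way to read the "Avoid" advice above.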
16. Distribution Key – Verify Data Skew
Check the data distribution
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage where name='users' and col =0
order by slice, col;
slice| col | num_values | minvalue | maxvalue
-----+-----+------------+----------+----------
0 | 0 | 12496 | 4 | 49987
1 | 0 | 12498 | 1 | 49988
2 | 0 | 12497 | 2 | 49989
3 | 0 | 12499 | 3 | 49990
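One way to turn the per-slice counts above into a single number is a skew ratio: the largest slice divided by the average slice. The metric and the function name are illustrative choices rather than an official Redshift measure; a value close to 1.0 suggests an even distribution.

```python
from typing import Dict


def skew_ratio(rows_per_slice: Dict[int, int]) -> float:
    """Largest slice divided by the mean slice size (1.0 = perfectly even)."""
    counts = list(rows_per_slice.values())
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-empty slice")
    return max(counts) / (sum(counts) / len(counts))


# num_values per slice, taken from the query output on this slide.
users = {0: 12496, 1: 12498, 2: 12497, 3: 12499}
print(f"{skew_ratio(users):.4f}")
```

For the users table shown here the ratio is very close to 1.0, so the rows are spread almost evenly across the four slices.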
17. Example
-- Total Produce sold in Washington in January 2013
SELECT sum(S.Price * S.Quantity)
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';
Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
18. Query Performance – Best Practices
• Encode date and time using the TIMESTAMP data type instead of CHAR
• Specify Constraints
Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
Load processes and/or applications need to be aware of this and ensure the constraints hold
• Specify redundant predicate on the sort column
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';
• WLM settings
19. Workload Manager
• Allows you to manage and adjust query concurrency
• WLM allows you to
Increase query concurrency up to 15
Define user groups and query groups
Segregate short and long running queries
Help improve performance of individual queries
• Be aware: query workload is distributed to every compute node
Increasing concurrency may not always help due to contention for CPU, memory and I/O
Total throughput may increase by letting one query complete first while other queries wait
20. Workload Manager
• Default : 1 queue with a concurrency of 5
• Define up to 8 queues with a total concurrency of 15
• Redshift has a super user queue internally
21. Summary
• Avoid large numbers of singleton DML statements if possible
• Use COPY for uploading large datasets
• Choose Sort and Distribution keys with care
• Encode date and time with the TIMESTAMP data type
• Experiment with WLM settings
22. More Information
Best Practices for Designing Tables
http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
Best Practices for Data Loading
http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
View the Redshift Developer Guide at:
http://aws.amazon.com/documentation/redshift/