Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to speed up your queries and improve overall database performance. This session explains how to create an optimized schema, use workload management, and tune your queries.
AWS Speaker: Ian Robinson, Specialist Solutions Architect, Big Data and Analytics, EMEA - Amazon Web Services
6. Choose Best Table Distribution Style
[Diagram: the three distribution styles illustrated on a two-node cluster (Node 1: slices 1 and 2; Node 2: slices 3 and 4). ALL: all data on every node. KEY: same key to same location. EVEN: round-robin distribution.]
15. Mukuru
• 1 million+ registered customers
• 6,000+ pay-in locations within South Africa
• 1,000+ roaming consultants
• 130 information centers within South Africa
• 28 branches across South Africa
• 425,000+ likes on Facebook
• 1 transfer every 8 seconds
Largest International Money Transfer
Organisation in the SADC region
16. Creation of Business Intelligence Department
[Architecture diagram: Amazon RDS real-time read replica → S3 bucket → Redshift data warehouse → QuickSight business intelligence reporting tool. A cron job runs a git pull and a bash script that copies CSVs to S3, copies the CSVs into Redshift, transforms the data in Redshift, and runs integrity scripts. This feeds an ETL dashboard and machine learning.]
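The "copy CSV to Redshift" step typically uses Redshift's COPY command, which loads files from S3 in parallel across slices. A minimal sketch; the table, bucket, and IAM role names here are hypothetical:

    COPY staging_transactions
    FROM 's3://example-bucket/exports/transactions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
    GZIP;

Splitting the export into multiple gzipped files (one or more per slice) lets COPY parallelise the load.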
17. Learnings
• Quick to set up Redshift environment
• No DBA needed - table recovery in 5 minutes
• COPY command - loads multiple files in parallel
• ETL process - let Redshift do the transforming
• ANALYZE & VACUUM large tables regularly (see the sketch after this list)
• Awaiting AWS Glue
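A sketch of the regular ANALYZE & VACUUM maintenance mentioned above (the table name is hypothetical):

    -- Reclaim space and restore sort order after heavy deletes and updates:
    VACUUM FULL fact_sales;

    -- Refresh the statistics the query planner relies on:
    ANALYZE fact_sales;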
Goal: on a table-by-table basis, distribute data evenly across every slice in the cluster
More importantly, ensure that each slice does an equal amount of work per query
Another consideration when distributing data: we want to avoid having to redistribute or broadcast data at query execution time
KEY
For large fact tables and the largest dimension tables, you will likely want to distribute on a distribution KEY
Each row will be assigned to a slice based on a hash of that row’s distribution key value
Choose the column involved in the most expensive join, or a column that frequently occurs in GROUP BY clauses
Ensure it is a high-cardinality column (relative to the number of slices)
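A minimal sketch of KEY distribution (the table and column names are hypothetical); rows are hashed on customer_id, so rows with the same customer land on the same slice:

    CREATE TABLE fact_sales (
      sale_id     BIGINT,
      customer_id BIGINT,        -- high-cardinality column used in the most expensive join
      sale_date   DATE,
      amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id);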
ALL
For small dimension tables (up to roughly 5M rows), choose ALL
The table is copied to each compute node in the cluster
This ensures that data on both sides of a join is co-located
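A sketch of ALL distribution on a hypothetical small dimension table:

    CREATE TABLE dim_store (
      store_id   INT,
      store_name VARCHAR(100),
      region     VARCHAR(50)
    )
    DISTSTYLE ALL;               -- a full copy of the table lives on every compute node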
EVEN
If neither KEY nor ALL is appropriate, choose EVEN
This will assign rows to slices on a round-robin basis
It’s the default distribution style
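A sketch of EVEN distribution, e.g. for a hypothetical staging table with no obvious join key:

    CREATE TABLE staging_transactions (
      transaction_id BIGINT,
      payload        VARCHAR(4096)
    )
    DISTSTYLE EVEN;              -- rows are assigned to slices round-robin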
With the previous strategy we ensured that each slice does an equal amount of work
Our goal now is to ensure we do the minimum amount of work on each slice
This comes down to doing the minimum amount of IO necessary to process the data relevant to the query
If data is sorted on disk in ways that align with the predicates in our most important queries, we’ll be able to identify the minimum number of blocks we have to read from disk
If the rows, however, are scattered all over the place, we’ll have to materialize many more blocks into memory, and then filter against all that data we’ve brought into memory in order to identify the relevant rows.
This is unnecessarily expensive, both in terms of IO and memory
We’re doing an equal amount of work on each slice, and we’re doing the absolute minimum amount of work necessary per slice to service the query
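Sorting is declared with a sort key. Extending the earlier hypothetical fact table, sorting on the date column means a date-range predicate touches only the blocks whose min/max values (tracked in Redshift's zone maps) overlap the range:

    CREATE TABLE fact_sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      sale_date   DATE,
      amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date);

    -- Only blocks covering January 2017 need to be read from disk:
    SELECT SUM(amount)
    FROM fact_sales
    WHERE sale_date BETWEEN '2017-01-01' AND '2017-01-31';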
Now we need to ensure we dedicate just enough system resources to servicing each query
We do this by controlling the number of concurrent queries, and the memory assigned to each query
Too little memory, and intermediate results will spill to disk, slowing down the query by an order of magnitude, and holding up other queries waiting to be executed
Too much memory, and we’ll inhibit our ability to process more queries concurrently – it’s just wasted resource
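Queue definitions (concurrency and memory per queue) live in the cluster's WLM configuration. As a sketch of one per-session knob, a query that would otherwise spill to disk can temporarily claim several of its queue's slots, and with them a multiple of the per-slot memory:

    -- Claim 3 slots (3x the per-slot memory) for this session's next queries,
    -- e.g. for a memory-hungry hash join or a large VACUUM:
    SET wlm_query_slot_count TO 3;

    -- ... run the expensive query here ...

    -- Return to the default of one slot per query:
    SET wlm_query_slot_count TO 1;

Note the trade-off described above: while this session holds 3 slots, the queue can run fewer queries concurrently.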