Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to speed up your queries and improve overall database performance. This session explains how to create an optimized schema, use workload management, and tune your queries.
AWS Speaker: Ian Robinson, Specialist Solutions Architect, Big Data and Analytics, EMEA - Amazon Web Services
6. Choose Best Table Distribution Style
[Diagram: the three distribution styles illustrated on a two-node cluster (Node 1: slices 1 and 2; Node 2: slices 3 and 4). ALL: all data on every node. KEY: same key to same location. EVEN: round-robin distribution.]
15. Mukuru
• 1 million+ registered customers
• 6,000+ pay-in locations within South Africa
• 1,000+ roaming consultants
• 130 information centers within South Africa
• 28 branches across South Africa
• 425,000+ likes on Facebook
• 1 transfer every 8 seconds
Largest International Money Transfer
Organisation in the SADC region
16. Creation of Business Intelligence Department
[Architecture diagram: Amazon RDS real-time read replica → S3 bucket → Redshift data warehouse → QuickSight business intelligence reporting tool. A cron job runs a git pull and a bash script that copies CSVs to S3, copies the CSVs into Redshift, transforms the data in Redshift, and runs integrity scripts. This feeds an ETL dashboard and machine learning.]
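The "copy CSV to Redshift" step typically uses Redshift's COPY command, which loads files from S3 in parallel across slices. A minimal sketch; the table, bucket, and IAM role names here are hypothetical:

    COPY staging_transactions
    FROM 's3://example-bucket/exports/transactions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
    GZIP;

Splitting the export into multiple gzipped files (one or more per slice) lets COPY parallelise the load.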
17. Learnings
• Quick to set up Redshift environment
• No DBA needed - table recovery in 5 minutes
• COPY command - loads multiple files in parallel
• ETL process - let Redshift do the transforming
• ANALYZE & VACUUM large tables regularly (see the sketch after this list)
• Awaiting AWS Glue
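A sketch of the regular ANALYZE & VACUUM maintenance mentioned above (the table name is hypothetical):

    -- Reclaim space and restore sort order after heavy deletes and updates:
    VACUUM FULL fact_sales;

    -- Refresh the statistics the query planner relies on:
    ANALYZE fact_sales;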
Goal: on a table-by-table basis, distribute data evenly across every slice in the cluster
More importantly, ensure that each slice does an equal amount of work per query
Another consideration when distributing data: we want to avoid having to redistribute or broadcast data at query execution time
KEY
For large fact tables and the largest dimension tables, you will likely want to distribute on a distribution KEY
Each row will be assigned to a slice based on a hash of that row’s distribution key value
Choose the column involved in the most expensive join, or a column that frequently occurs in GROUP BY clauses
Ensure it is a high-cardinality column (relative to the number of slices)
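A minimal sketch of KEY distribution (the table and column names are hypothetical); rows are hashed on customer_id, so rows with the same customer land on the same slice:

    CREATE TABLE fact_sales (
      sale_id     BIGINT,
      customer_id BIGINT,        -- high-cardinality column used in the most expensive join
      sale_date   DATE,
      amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id);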
ALL
For small dimension tables (up to roughly 5M rows), choose ALL
The table is copied to each compute node in the cluster
This ensures that data on both sides of a join is co-located
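A sketch of ALL distribution on a hypothetical small dimension table:

    CREATE TABLE dim_store (
      store_id   INT,
      store_name VARCHAR(100),
      region     VARCHAR(50)
    )
    DISTSTYLE ALL;               -- a full copy of the table lives on every compute node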
EVEN
If neither KEY nor ALL is appropriate, choose EVEN
This will assign rows to slices on a round-robin basis
It’s the default distribution style
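A sketch of EVEN distribution, e.g. for a hypothetical staging table with no obvious join key:

    CREATE TABLE staging_transactions (
      transaction_id BIGINT,
      payload        VARCHAR(4096)
    )
    DISTSTYLE EVEN;              -- rows are assigned to slices round-robin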
With the previous strategy we ensured that each slice does an equal amount of work
Our goal now is to ensure we do the minimum amount of work on each slice
This comes down to doing the minimum amount of IO necessary to process the data relevant to the query
If data is sorted on disk in ways that align with the predicates in our most important queries, we’ll be able to identify the minimum number of blocks we have to read from disk
If the rows, however, are scattered all over the place, we’ll have to materialize many more blocks into memory, and then filter against all that data we’ve brought into memory in order to identify the relevant rows.
This is unnecessarily expensive, both in terms of IO and memory
We’re doing an equal amount of work on each slice, and we’re doing the absolute minimum amount of work necessary per slice to service the query
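Sorting is declared with a sort key. Extending the earlier hypothetical fact table, sorting on the date column means a date-range predicate touches only the blocks whose min/max values (tracked in Redshift's zone maps) overlap the range:

    CREATE TABLE fact_sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      sale_date   DATE,
      amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date);

    -- Only blocks covering January 2017 need to be read from disk:
    SELECT SUM(amount)
    FROM fact_sales
    WHERE sale_date BETWEEN '2017-01-01' AND '2017-01-31';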
Now we need to ensure we dedicate just enough system resources to servicing each query
We do this by controlling the number of concurrent queries, and the memory assigned to each query
Too little memory, and intermediate results will spill to disk, slowing down the query by an order of magnitude, and holding up other queries waiting to be executed
Too much memory, and we’ll inhibit our ability to process more queries concurrently – it’s just wasted resource
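Queue definitions (concurrency and memory per queue) live in the cluster's WLM configuration. As a sketch of one per-session knob, a query that would otherwise spill to disk can temporarily claim several of its queue's slots, and with them a multiple of the per-slot memory:

    -- Claim 3 slots (3x the per-slot memory) for this session's next queries,
    -- e.g. for a memory-hungry hash join or a large VACUUM:
    SET wlm_query_slot_count TO 3;

    -- ... run the expensive query here ...

    -- Return to the default of one slot per query:
    SET wlm_query_slot_count TO 1;

Note the trade-off described above: while this session holds 3 slots, the queue can run fewer queries concurrently.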