Building a data warehouse with AWS Redshift, Matillion and Yellowfin


Screencast of the process for setting up a data warehouse with AWS Redshift, together with Matillion ETL and Yellowfin for data visualization.


Building a Data Warehouse on AWS
Services: Amazon S3, Amazon Redshift
Stages: Collect, Process, Analyze, Store, Visualize
Data -> Answers
@Lynn Langit

AWS Marketplace
Enterprise software store for business users who need simplified procurement
• 2000+ product listings to browse, test and buy software
• 1-click deployment to launch in multiple regions around the world
• Pay-as-you-go pricing to use on demand
Categories: Advanced Analytics, Data Enablement, Business Intelligence

Building a Data Warehouse on AWS
Move data into Redshift from S3 for analysis
Services: Amazon S3, Amazon Redshift
AWS Marketplace Partners: Matillion, Yellowfin (Visualize)
Stages: Collect, Process, Analyze, Store
Data -> Answers

Our Scenario and Source Files
File Types
-- Text - .csv
-- Compressed - .gz
File Categories
Details / Events
-- Flights
-- Weather
Metadata
-- Airports
-- Carriers
“In this scenario we will use Matillion ETL for Redshift to prepare two separate data sources ready for analysis. The sample data is US airport flight information from 1995 to 2008. Every flight to or from a US airport (and whether it left on time or not) is included. The second data set is weather data, taken from NOAA, including the daily weather readings for each US airport.”
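
As a rough illustration of how the .csv and .gz source files could be loaded from S3, the sketch below builds a Redshift COPY statement as a string. COPY with the CSV, GZIP and IGNOREHEADER options is standard Redshift SQL, but the table name, bucket path and IAM role ARN here are placeholders invented for the example and do not come from the slides. The function only builds text; it does not connect to a cluster.

```python
def build_copy_statement(table, s3_path, iam_role, gzip=False, header_rows=0):
    """Return a Redshift COPY statement for CSV files stored in S3."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role}'",
        "CSV",
    ]
    if gzip:
        # Source files are gzip-compressed (.gz)
        parts.append("GZIP")
    if header_rows:
        # Skip header lines at the top of each file
        parts.append(f"IGNOREHEADER {header_rows}")
    return "\n".join(parts) + ";"


# All names below are hypothetical placeholders, not values from the deck.
flights_copy = build_copy_statement(
    table="flights",
    s3_path="s3://example-bucket/flights/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    gzip=True,
)
print(flights_copy)
```

The same helper can be reused for the weather, airports and carriers files by changing the table name and path, with gzip=False for plain .csv files.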

Using Matillion ETL for Redshift
• Create an instance (AMI/EC2) of Matillion from AWS Marketplace
• Connect Matillion to Redshift

Table distribution styles
• Distribution Key - same key to same location (key1, key2, key3, key4 each map to a fixed slice)
• All - all data on every node
• Even - round robin distribution
[Diagram: each style shown on a two-node cluster, Node 1 with Slices 1 and 2, Node 2 with Slices 3 and 4]
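
The three distribution styles can be sketched with a small simulation of the two-node, four-slice layout from the diagram. This is a toy model only: the CRC32 hash is a stand-in chosen for determinism (Redshift's real hashing is internal), and the sample rows and column names are made up for the example.

```python
import zlib
from collections import defaultdict

# Layout from the diagram: two nodes, two slices each.
SLICES = {1: "Node 1", 2: "Node 1", 3: "Node 2", 4: "Node 2"}


def distribute_even(rows):
    """EVEN: hand rows to slices in round-robin order."""
    placement = defaultdict(list)
    slice_ids = sorted(SLICES)
    for i, row in enumerate(rows):
        placement[slice_ids[i % len(slice_ids)]].append(row)
    return dict(placement)


def distribute_key(rows, key):
    """KEY: rows with the same key value always land on the same slice."""
    placement = defaultdict(list)
    slice_ids = sorted(SLICES)
    for row in rows:
        # CRC32 is an assumption used for illustration, not Redshift's hash.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        placement[slice_ids[h % len(slice_ids)]].append(row)
    return dict(placement)


def distribute_all(rows):
    """ALL: every node holds a full copy of the table."""
    return {node: list(rows) for node in sorted(set(SLICES.values()))}


# Hypothetical sample rows.
carriers = ["AA", "DL", "UA", "AA", "DL", "AA", "WN", "UA"]
rows = [{"carrier": c, "flight": n} for n, c in enumerate(carriers)]

even = distribute_even(rows)
by_key = distribute_key(rows, "carrier")
everywhere = distribute_all(rows)

for slice_id, placed in sorted(by_key.items()):
    print(slice_id, SLICES[slice_id], sorted({r["carrier"] for r in placed}))
```

With KEY distribution a skewed key (here "AA" appears three times) puts more rows on one slice, which is the trade-off against the perfectly balanced EVEN style.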

Sort Keys
• Single Column - [ SORTKEY ( date ) ]
  • Queries that use the 1st column (i.e. date) as primary filter
• Compound - [ COMPOUND SORTKEY ( date, region, country ) ]
  • Queries that use the 1st column as primary filter, then other columns
• Interleaved - [ INTERLEAVED SORTKEY ( date, region, country ) ]
  • Queries that use different columns in filter
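
To see why a sort key helps queries that filter on it, the sketch below models per-block minimum and maximum values (Redshift keeps such metadata, known as zone maps, for each storage block) and counts how many blocks a range filter has to read. The block size of four rows and the day numbers are arbitrary assumptions made to keep the example small; real blocks are far larger.

```python
import random

BLOCK_SIZE = 4  # rows per block; tiny on purpose, purely illustrative


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Split values into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


days = list(range(1, 33))           # 32 day numbers stored in sort key order
shuffled = days[:]
random.Random(0).shuffle(shuffled)  # same values stored in arbitrary order

# Filter: day BETWEEN 5 AND 8
sorted_scan = blocks_to_scan(build_zone_map(days), 5, 8)
unsorted_scan = blocks_to_scan(build_zone_map(shuffled), 5, 8)

print("blocks read when sorted:  ", sorted_scan)
print("blocks read when unsorted:", unsorted_scan)
```

When the data is stored in sort key order the filter touches a single block; in arbitrary order the matching values are typically scattered, so more blocks overlap the filter range.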

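Time Series Data – Vacuum Operation
[Diagram: a table made up of sorted regions followed by an unsorted region]
• Append in sort key order: new rows extend the sorted data
• Vacuum: sort the unsorted region, then merge it into the sorted regions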
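
The sort-then-merge behaviour of vacuum can be sketched with a toy table that keeps a sorted region and an unsorted region. This is a simplified model built for illustration, not Redshift's actual storage logic; the Table class and the integer day values are hypothetical.

```python
import heapq


class Table:
    """Toy model of a table with a sorted region and an unsorted region."""

    def __init__(self):
        self.sorted_region = []
        self.unsorted_region = []

    def append(self, rows):
        """Add a batch of sort key values.

        A batch that is itself in order and starts at or after the current
        end of the sorted region simply extends it. Anything else goes to
        the unsorted region.
        """
        rows = list(rows)
        in_order = all(a <= b for a, b in zip(rows, rows[1:]))
        after_tail = (
            not self.sorted_region
            or not rows
            or rows[0] >= self.sorted_region[-1]
        )
        if in_order and after_tail and not self.unsorted_region:
            self.sorted_region.extend(rows)
        else:
            self.unsorted_region.extend(rows)

    def needs_vacuum(self):
        return bool(self.unsorted_region)

    def vacuum(self):
        """Sort the unsorted region, then merge it into the sorted region."""
        self.unsorted_region.sort()
        self.sorted_region = list(
            heapq.merge(self.sorted_region, self.unsorted_region)
        )
        self.unsorted_region = []


table = Table()
table.append([1, 2, 3])   # day numbers arriving in sort key order
table.append([4, 5])
in_order_needs_vacuum = table.needs_vacuum()

table.append([2, 9, 0])   # late-arriving data, out of order
late_needs_vacuum = table.needs_vacuum()
table.vacuum()
print(table.sorted_region)
```

Time series loads that always arrive in date order stay on the first path, which is why appending in sort key order avoids the cost of the merge step.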

Automate – https://github.com/lynnlangit/AWSDataWarehouse

Slide transcript

1. Building a Data Warehouse on AWS
2. AWS Marketplace
3. Building a Data Warehouse on AWS: move data into Redshift from S3 for analysis
4. Setup
5. Our Scenario and Source Files
6. Loading data from S3 into Redshift
7. Using Matillion ETL for Redshift
8. Loading Data in Redshift
9. Table distribution styles
10. Sort Keys
11. Time Series Data – Vacuum Operation
12. Visualizing with Yellowfin
13. Automate – https://github.com/lynnlangit/AWSDataWarehouse

Editor's notes

• Collect logs in an Amazon Kinesis stream
• Launch Amazon EMR and Amazon Redshift clusters
• Use Hive on Amazon EMR to access data in an Amazon Kinesis stream
• Use Hive on Amazon EMR to transform, partition and output data to Amazon S3
• Load data in parallel into Amazon Redshift from Amazon S3
• Bonus: use Hive and Amazon DynamoDB to enable Amazon Kinesis “checkpointing”

Big Data software on AWS Marketplace: http://amzn.to/1va4KQ6

Public data from: s3://demo-data-sets-west/airline/data/

http://docs.aws.amazon.com/general/latest/gr/rande.html
http://docs.aws.amazon.com/redshift/latest/dg/r_STV_SLICES.html

Redshift is a distributed system:
• A cluster contains a leader node and compute nodes
• A compute node contains slices (one per core) that contain data
• Data is distributed among slices in 3 ways:
  -- Even: rows distributed in round robin fashion (default)
  -- Key: rows distributed based on a distribution key (hash of a defined column)
  -- All: a copy of all rows is placed on every node
• Queries run on all slices in parallel
• Optimal query throughput can be achieved when data is evenly spread across slices

When you append data, it is appended to the unsorted region in sorted order. When you vacuum, the unsorted region is sorted first, then merged into the sorted regions. This can be really expensive. If you append data only in the order of your sort keys, you will never have to vacuum. Mycroft does this automatically.
