Analytics and Processing
Agenda overview
10:00 AM Registration
10:30 AM Introduction to Big Data @ AWS
12:00 PM Lunch + Registration for Technical Sessions
12:30 PM Data Collection and Storage
1:45 PM Real-time Event Processing
3:00 PM Analytics (incl. Machine Learning)
4:30 PM Open Q&A Roundtable
Collect → Store → Process → Analyze
Stages: Data Collection and Storage, Data Processing, Event Processing, Data Analysis
Primitive patterns: Process and Analyze
EMR, Redshift, Machine Learning
• Hadoop
  – Ad-hoc exploration of unstructured datasets
  – Batch processing on large datasets
• Data warehouses
  – Analysis via visualization tools
  – Interactive querying of structured data
• Machine learning
  – Predictions for what will happen
  – Smart applications
Hadoop and Data Warehouses
[Diagram: databases, files, media, and cloud sources flow through ETL into a data warehouse, data marts, and reports; Hadoop sits alongside for ad-hoc exploration]
Amazon EMR
Elastic MapReduce
Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: manage firewalls
• Flexible: control the cluster
Choose your instance types
Try different configurations to find your optimal architecture:
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Match the family to the workload: batch processing, machine learning, Spark and interactive, or large HDFS.
Resizable clusters
Easy to add and remove compute capacity on your cluster; match compute demands with cluster sizing.
Easy to use Spot Instances
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet your SLA at predictable cost, or exceed it at lower cost.
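The pricing mix above is easy to sanity-check with arithmetic. A minimal sketch, assuming a hypothetical $0.266/hour on-demand node rate and the full 90% spot discount (illustrative numbers, not current AWS prices):

```python
def hourly_cluster_cost(core_nodes, task_nodes,
                        on_demand_rate=0.266,   # hypothetical $/hour per node
                        spot_discount=0.90):    # "up to 90% off" on-demand
    """Blended cost: core nodes at on-demand, task nodes on spot."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    return core_nodes * on_demand_rate + task_nodes * spot_rate

# 4 on-demand core nodes plus 16 spot task nodes vs. 20 on-demand nodes:
mixed = hourly_cluster_cost(4, 16)          # 1.4896 $/hour
all_on_demand = hourly_cluster_cost(20, 0)  # 5.3200 $/hour
```

With these assumed rates, the 20-node cluster with spot task nodes costs under a third of the all-on-demand version, which is the whole argument for putting interruptible task nodes on spot.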
Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
EMRFS makes it easier to leverage S3
• Better performance and error-handling options
• Transparent to applications: use “s3://”
• Consistent view: consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
EMRFS: S3 client-side encryption
[Diagram: EMRFS enabled for Amazon S3 client-side encryption; the Amazon S3 encryption clients fetch keys from a key vendor (AWS KMS or your custom key vendor) and store client-side encrypted objects in Amazon S3]
Amazon S3 EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Fast listing of S3 objects using EMRFS metadata:

Number of objects | Without Consistent Views | With Consistent Views
1,000,000         | 147.72                   | 29.70
100,000           | 12.70                    | 3.69

*Tested using a single-node cluster with an m3.xlarge instance.
Optimize to leverage HDFS
• Iterative workloads: if you’re processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy it to HDFS for processing.
Pattern #1: Batch processing
GBs of logs pushed to Amazon S3 hourly → daily Amazon EMR cluster using Hive to process the data → input and output stored in Amazon S3 → subset loaded into a Redshift data warehouse
Pattern #2: Online data store
Data pushed to Amazon S3 → daily Amazon EMR cluster extracts, transforms, and loads (ETL) the data into a database → a 24/7 Amazon EMR cluster running HBase holds the last 2 years’ worth of data → a front-end service uses the HBase cluster to power a dashboard with high concurrency
Pattern #3: Interactive query
TBs of logs sent daily → logs stored in S3 → transient EMR clusters backed by a shared Hive metastore
File formats
• Row-oriented
  – Text files
  – Sequence files (writable objects)
  – Avro data files (described by a schema)
• Columnar
  – Optimized Row Columnar (ORC)
  – Parquet
[Diagram: the same logical table laid out row-oriented vs. column-oriented]
Choosing the right file format
• Processing and query tools: Hive, Impala, and Presto
• Schema evolution: Avro supports evolving schemas
• File-format “splittability”: avoid large JSON/XML files, which can’t be split; treat each file as a single record
Choosing the right compression
• Time-sensitive workloads: faster codecs are a better choice
• Large amounts of data: use space-efficient codecs

Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
Gzip (DEFLATE) | No          | High              | Medium
bzip2          | Yes         | Very high         | Slow
LZO            | Yes         | Low               | Fast
Snappy         | No          | Low               | Very fast
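The trade-offs in the table can be felt directly with Python's standard-library codecs. A sketch: LZO and Snappy are not in the stdlib, so lzma stands in as a third codec, and the sample log line is made up:

```python
import bz2
import gzip
import lzma
import time

# Hypothetical repetitive payload standing in for log data.
data = (b"2015-07-01T10:00:00Z GET /index.html 200 "
        b"user=alice region=us-east-1\n") * 5000

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    blob = compress(data)
    elapsed = time.perf_counter() - start
    # Ratio and wall time make the table's speed/ratio trade-off concrete.
    print(f"{name:6s} ratio={len(data) / len(blob):6.1f}x "
          f"time={elapsed * 1000:.1f} ms")
```

On repetitive data like logs, bzip2 typically compresses hardest but slowest, which is exactly the trade-off the table summarizes.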
Dealing with small files
• Reduce the HDFS block size, e.g. to 1 MB (default is 128 MB):
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args “-m,dfs.block.size=1048576”
• Better: use S3DistCp to combine smaller files together
  – S3DistCp takes a pattern and target path to combine smaller input files into larger ones
  – Supply a target size and compression codec
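What S3DistCp does on S3/HDFS can be sketched locally: concatenate many small inputs into a few larger compressed parts of roughly a target size. The function name and the byte target below are illustrative, not S3DistCp's actual interface:

```python
import gzip
import os

def combine_small_files(paths, target_bytes, out_dir):
    """Concatenate many small input files into fewer, larger
    gzip-compressed parts of roughly target_bytes each
    (a local stand-in for what S3DistCp does)."""
    outputs, buf, size = [], [], 0

    def flush():
        nonlocal buf, size
        if not buf:
            return
        out_path = os.path.join(out_dir, f"part-{len(outputs):05d}.gz")
        with gzip.open(out_path, "wb") as f:
            f.write(b"".join(buf))
        outputs.append(out_path)
        buf, size = [], 0

    for p in sorted(paths):
        with open(p, "rb") as f:
            data = f.read()
        if size + len(data) > target_bytes:
            flush()          # start a new output part
        buf.append(data)
        size += len(data)
    flush()
    return outputs
```

Fewer, larger files means fewer mapper tasks and fewer S3 list/get requests, which is why the slide calls this the better option.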
DEMO: Log Processing using Amazon EMR
• Aggregating small files using S3DistCp
• Defining Hive tables with data on Amazon S3
• Transforming the dataset using batch processing
• Interactive querying using Presto and Spark SQL
Amazon S3 log bucket → Amazon EMR → processed and structured log data
Amazon Redshift
Amazon Redshift Architecture
• Leader node
  – SQL endpoint (JDBC/ODBC)
  – Stores metadata
  – Coordinates query execution
• Compute nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Interconnected over 10 GigE (HPC)
  – Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms, optimized for data processing
  – DW1: HDD; scales from 2 TB to 1.6 PB
  – DW2: SSD; scales from 160 GB to 256 TB
Amazon Redshift Node Types
DW1:
• Optimized for I/O-intensive workloads; high disk density
• On-demand at $0.85/hour; as low as $1,000/TB/year
• Scales from 2 TB to 1.6 PB
• DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
• DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed, 2 GB/sec scan rate
DW2:
• High performance at smaller storage sizes; high compute and memory density
• On-demand at $0.25/hour; as low as $5,500/TB/year
• Scales from 160 GB to 256 TB
• DW2.L (new): 16 GB RAM, 2 cores, 160 GB compressed SSD storage
• DW2.8XL (new): 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
Amazon Redshift dramatically reduces I/O
Four techniques work together: column storage, data compression, zone maps, and direct-attached storage.
Column storage
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
Column storage (continued)
With column storage, you read only the data you need: totaling Amount touches only the Amount column.

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
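The row-versus-column I/O difference can be sketched in a few lines of Python over the sample table (pure illustration; Redshift's storage engine is of course far more involved):

```python
# Row-oriented: each record stores all four fields together, so the
# Amount total drags every column through memory.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
total_row = sum(r[3] for r in rows)     # touches ID, Age, State too

# Column-oriented: each column is stored contiguously, so the same
# aggregate reads only the "amount" values.
columns = {
    "id":     [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "state":  [r[2] for r in rows],
    "amount": [r[3] for r in rows],
}
total_col = sum(columns["amount"])      # touches one column only

assert total_row == total_col == 1250
```

Same answer, but the columnar scan reads a quarter of the data here, and the gap widens with wider tables.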
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Data compression
• COPY compresses automatically (as in the ANALYZE COMPRESSION listing above)
• You can analyze and override the encodings
• More performance, less cost
Zone maps
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data

Block 1: 10, 13, 14, 26, …, 100, 245, 324 (min 10, max 324)
Block 2: 375, 393, 417, …, 512, 549, 623 (min 375, max 623)
Block 3: 637, 712, 809, …, 834, 921, 959 (min 637, max 959)
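A zone map is easy to model: keep a (min, max) pair per block and consult it before reading. A sketch over the three blocks illustrated above:

```python
# Per-block values as in the illustration (elided values omitted).
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]
# The zone map: (min, max) tracked for each block.
zone_map = [(min(b), max(b)) for b in BLOCKS]

def scan_equals(value):
    """Return matching values and how many blocks were actually read;
    blocks whose [min, max] range can't contain the value are skipped."""
    blocks_read, hits = 0, []
    for block, (lo, hi) in zip(BLOCKS, zone_map):
        if lo <= value <= hi:            # zone-map check
            blocks_read += 1             # only now touch the block
            hits += [v for v in block if v == value]
    return hits, blocks_read

hits, blocks_read = scan_equals(549)
# Only Block 2's range [375, 623] covers 549, so one block is read.
```

A predicate like `WHERE value = 549` reads one block instead of three; a value outside every range reads none.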
Direct-attached storage
• Use local storage for performance; maximize scan rates
• Automatic replication and continuous backup
• HDD and SSD platforms
Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize.
Load
• Load in parallel from Amazon S3, DynamoDB, or any SSH connection
• Data automatically distributed and sorted according to the DDL
• Scales linearly with the number of nodes in the cluster
Backup/Restore
• Backups to Amazon S3 are automatic, continuous, and incremental
• Configurable system snapshot retention period; take user snapshots on demand
• Cross-region backups for disaster recovery
• Streaming restores let you resume querying faster
Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for the source cluster
Resize (continued)
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API
Amazon Redshift works with your existing analysis tools
• JDBC/ODBC: connect using drivers from PostgreSQL.org
Custom ODBC and JDBC Drivers
• Up to 35% higher performance than the open-source drivers
• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, and Tableau
• PostgreSQL open-source drivers will continue to be supported
• Download the drivers from the console
User Defined Functions
• We’re enabling User Defined Functions (UDFs) so you can add your own
  – Scalar and aggregate functions supported
• You’ll be able to write UDFs using Python 2.7
  – Syntax is largely identical to PostgreSQL UDF syntax
  – System and network calls within UDFs are prohibited
• Comes with Pandas, NumPy, and SciPy pre-installed
  – You’ll also be able to import your own libraries for even more flexibility
Scalar UDF example: URL parsing
Rather than writing complex regular expressions, you can import standard Python URL-parsing libraries and use them in your SQL.
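A sketch of what such a UDF might look like. The function name f_hostname and the CREATE FUNCTION wrapper are illustrative (shown as a comment), not the exact demo from the talk; the body is plain URL parsing, runnable locally on Python 3:

```python
# In Redshift you would register the logic as a scalar Python UDF,
# roughly (illustrative DDL, not from the deck):
#   CREATE FUNCTION f_hostname(url VARCHAR)
#   RETURNS VARCHAR IMMUTABLE AS $$
#       from urlparse import urlparse     # Python 2.7 inside Redshift
#       return urlparse(url).hostname
#   $$ LANGUAGE plpythonu;
#
# The same logic on Python 3's stdlib:
from urllib.parse import urlparse

def f_hostname(url):
    """Extract the hostname from a URL, as the UDF body would."""
    return urlparse(url).hostname

print(f_hostname("https://aws.amazon.com/redshift/?ref=nav"))
# → aws.amazon.com
```

One stdlib call replaces a fragile regex and handles schemes, ports, and query strings for free.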
Interleaved Multi-Column Sort
• Redshift currently supports compound sort keys
  – Optimized for applications that filter data by one leading column
• Adding support for interleaved sort keys
  – Optimized for filtering data by up to eight columns
  – No storage overhead, unlike an index
  – Lower maintenance penalty compared to indexes
Compound Sort Keys Illustrated
• Records in Redshift are stored in blocks
• For this illustration, assume that four records fill a block
• Records with a given cust_id are all in one block
• However, records with a given prod_id are spread across four blocks

With a compound sort key on (cust_id, prod_id), the sixteen [cust_id, prod_id] records pack into blocks as:
Block 1: [1,1] [1,2] [1,3] [1,4]
Block 2: [2,1] [2,2] [2,3] [2,4]
Block 3: [3,1] [3,2] [3,3] [3,4]
Block 4: [4,1] [4,2] [4,3] [4,4]
Interleaved Sort Keys Illustrated
• Records with a given cust_id are spread across two blocks
• Records with a given prod_id are also spread across two blocks
• Data is sorted in equal measures for both keys

With an interleaved sort key on (cust_id, prod_id), the same sixteen records pack as:
Block 1: [1,1] [1,2] [2,1] [2,2]
Block 2: [1,3] [1,4] [2,3] [2,4]
Block 3: [3,1] [3,2] [4,1] [4,2]
Block 4: [3,3] [3,4] [4,3] [4,4]
How to use the feature
• New keyword INTERLEAVED when defining sort keys
  – Existing syntax will still work and behavior is unchanged
  – You can choose up to 8 columns to include and can query with any or all of them
• No change needed to queries
• Benefits are significant

[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]
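The even treatment of both keys in the illustrations comes from interleaving the bits of the sort columns (a Z-order curve). A minimal sketch with 2-bit keys matching the 4×4 example; this models the idea, not Redshift's actual implementation:

```python
def interleave_bits(cust_id, prod_id, bits=2):
    """Z-order value: alternate the bits of the two sort columns so
    neither column dominates the sort order."""
    z = 0
    for i in reversed(range(bits)):          # most significant bit first
        z = (z << 1) | ((cust_id >> i) & 1)
        z = (z << 1) | ((prod_id >> i) & 1)
    return z

# The sixteen [cust_id, prod_id] records from the illustration;
# values 1..4 are shifted to 0..3 for the bit math.
records = [(c, p) for c in range(1, 5) for p in range(1, 5)]
ordered = sorted(records, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))
blocks = [ordered[i:i + 4] for i in range(0, 16, 4)]
# blocks[0] == [(1, 1), (1, 2), (2, 1), (2, 2)]: each cust_id and each
# prod_id now spans exactly two blocks, as in the slide.
```

Sorting by the interleaved value reproduces the two-blocks-per-key layout, which is why a filter on either column alone can still skip half the blocks.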
Redshift Use Case: Operational Reporting with Redshift
[Diagram: export from SQL Server with bcp or SELECT INTO OUTFILE, upload to Amazon S3 with s3cmd, COPY into a staging table, then promote to prod]
Amazon S3 log bucket → Amazon EMR → processed and structured log data → Amazon Redshift → operational reports
Thank you
Questions?
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

AWS Analytics

  • 2. agenda overview 10:00 AM Registration 10:30 AM Introduction to Big Data @ AWS 12:00 PM Lunch + Registration for Technical Sessions 12:30 PM Data Collection and Storage 1:45 PM Real-time Event Processing 3:00 PM Analytics (incl. Machine Learning) 4:30 PM Open Q&A Roundtable
  • 3. Collect Process Analyze Store Data Collection and Storage Data Processing Event Processing Data Analysis primitive patterns EMR Redshift Machine Learning
  • 4. Process and Analyze • Hadoop  Ad-hoc exploration of unstructured datasets  Batch processing on large datasets • Data Warehouses  Analysis via visualization tools  Interactive querying of structured data • Machine learning  Predictions for what will happen  Smart applications
  • 5. Hadoop and Data Warehouses Databases Files Data warehouse Data Marts Reports Hadoop Ad-hoc Exploration Media Cloud ETL
  • 7. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Control the cluster
  • 8. Try different configurations to find your optimal architecture. Choose your instance types: CPU (c3 family, cc1.4xlarge, cc2.8xlarge); Memory (m2 family, r3 family); Disk/IO (d2 family, i2 family); General (m1 family, m3 family). Workload fits: batch process, machine learning, Spark and interactive, large HDFS
  • 9. Easy to add and remove compute capacity on your cluster Match compute demands with cluster sizing. Resizable clusters
  • 10. Spot Instances for task nodes Up to 90% off Amazon EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost
  • 11. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at same data in Amazon S3 EMR EMR Amazon S3
  • 12. EMRFS makes it easier to leverage S3 • Better performance and error handling options • Transparent to applications – Use “s3://” • Consistent view  For consistent list and read-after-write for new puts • Support for Amazon S3 server-side and client-side encryption • Faster listing using EMRFS metadata
  • 13. EMRFS - S3 client-side encryption Amazon S3 Amazon S3 encryption clients EMRFS enabled for Amazon S3 client-side encryption Key vendor (AWS KMS or your custom key vendor) (client-side encrypted objects)
  • 14. Amazon S3 EMRFS metadata in Amazon DynamoDB • List and read-after-write consistency • Faster list operations Fast listing of S3 objects using EMRFS metadata

Number of objects | Without consistent views | With consistent views
1,000,000         | 147.72                   | 29.70
100,000           | 12.70                    | 3.69

*Tested using a single-node cluster with an m3.xlarge instance.
  • 15. Optimize to leverage HDFS • Iterative workloads  If you’re processing the same dataset more than once • Disk I/O intensive workloads Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing
  • 16. Pattern #1: Batch processing GBs of logs pushed to Amazon S3 hourly Daily Amazon EMR cluster using Hive to process data Input and output stored in Amazon S3 Load subset into Redshift DW
  • 17. Pattern #2: Online data-store Data pushed to Amazon S3 Daily Amazon EMR cluster Extract, Transform, and Load (ETL) data into database 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data Front-end service uses HBase cluster to power dashboard with high concurrency
  • 18. Pattern #3: Interactive query TBs of logs sent daily Logs stored in S3 Transient EMR clusters Hive Metastore
  • 19. File formats • Row oriented  Text files  Sequence files (Writable objects)  Avro data files (described by schema) • Columnar format  Optimized Row Columnar (ORC)  Parquet [diagram: a logical table stored row oriented vs. column oriented]
  • 20. Choosing the right file format • Processing and query tools  Hive, Impala, and Presto • Evolution of schema  Avro for schema and Presto for storage • File format "splittability"  Avoid JSON/XML files; use them as whole records
  • 21. Choosing the right compression • Time sensitive: faster compressions are a better choice • Large amount of data: use space-efficient compressions

Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
Gzip (DEFLATE) | No          | High              | Medium
bzip2          | Yes         | Very high         | Slow
LZO            | Yes         | Low               | Fast
Snappy         | No          | Low               | Very fast
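The ratio-versus-speed trade-off in the table can be seen with Python's standard-library codecs. A minimal sketch using gzip and bzip2 only (LZO and Snappy are not in the stdlib); the sample data and compression levels are our own choices, not from the deck:

```python
import bz2
import gzip
import time

# Repetitive sample data, similar in spirit to log lines: compresses well.
data = b"2015-06-01 10:00:00 GET /index.html 200 1234\n" * 20_000

def measure(compress):
    """Return (compressed_size_bytes, elapsed_seconds) for one codec."""
    start = time.perf_counter()
    out = compress(data)
    return len(out), time.perf_counter() - start

gzip_size, gzip_time = measure(lambda d: gzip.compress(d, compresslevel=6))
bz2_size, bz2_time = measure(lambda d: bz2.compress(d, compresslevel=9))

# bzip2 typically trades slower compression for a higher ratio.
print(f"gzip : {len(data)} -> {gzip_size} bytes in {gzip_time:.3f}s")
print(f"bzip2: {len(data)} -> {bz2_size} bytes in {bz2_time:.3f}s")
```

On real cluster workloads you would benchmark with your own data, since ratios and speeds vary widely with content.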
  • 22. Dealing with small files • Reduce HDFS block size (e.g., 1 MB [default is 128 MB])  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576" • Better: use S3DistCp to combine smaller files together  S3DistCp takes a pattern and target path to combine smaller input files into larger ones  Supply a target size and compression codec
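The grouping S3DistCp performs toward a target size can be sketched as a simple batching loop. This toy helper (`combine_small_files` is our name, not an AWS API) only groups file names; the real tool also reads, concatenates, and recompresses the objects:

```python
def combine_small_files(files, target_size):
    """Group (name, size) pairs into batches of roughly target_size bytes,
    mimicking what S3DistCp does with a target size (sketch only)."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10 x 20 MB log files combined toward a 64 MB target -> 4 output files
small = [(f"log-{i}.gz", 20 * 1024**2) for i in range(10)]
batches = combine_small_files(small, 64 * 1024**2)
print([len(b) for b in batches])  # → [3, 3, 3, 1]
```

Fewer, larger objects mean fewer mapper tasks and fewer S3 list/get round trips, which is the point of the slide.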
  • 23. DEMO: Log Processing using Amazon EMR • Aggregating small files using S3DistCp • Defining Hive tables with data on Amazon S3 • Transforming the dataset using batch processing • Interactive querying using Presto and Spark SQL Amazon S3 Log Bucket Amazon EMR Processed and structured log data
  • 25. Amazon Redshift Architecture • Leader Node  SQL endpoint  Stores metadata  Coordinates query execution • Compute Nodes  Local, columnar storage  Execute queries in parallel  Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH • Two hardware platforms  Optimized for data processing  DW1: HDD; scale from 2TB to 1.6PB  DW2: SSD; scale from 160GB to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  • 26. Amazon Redshift Node Types • Optimized for I/O intensive workloads • High disk density • On demand at $0.85/hour • As low as $1,000/TB/Year • Scale from 2TB to 1.6PB DW1.XL: 16 GB RAM, 2 Cores 3 Spindles, 2 TB compressed storage DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles 16 TB compressed, 2 GB/sec scan rate • High performance at smaller storage size • High compute and memory density • On demand at $0.25/hour • As low as $5,500/TB/Year • Scale from 160GB to 256TB DW2.L *New*: 16 GB RAM, 2 Cores, 160 GB compressed SSD storage DW2.8XL *New*: 256 GB RAM, 32 Cores, 2.56 TB of compressed SSD storage
  • 27. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage • With row storage you do unnecessary I/O • To get total amount, you have to read everything ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
  • 28. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage With column storage, you only read the data you need ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
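Using the table from the slide, a toy model shows why columnar layout reduces I/O: summing the Amount column touches 4 values instead of all 16. This is an illustration of the principle, not Redshift's storage engine:

```python
# The table from the slide, stored both row-wise and column-wise.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
columns = {
    "id":     [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "state":  [r[2] for r in rows],
    "amount": [r[3] for r in rows],
}

# SELECT SUM(amount): row storage reads every value in every row,
# column storage reads only the amount column.
row_values_read = sum(len(r) for r in rows)   # 16 values touched
col_values_read = len(columns["amount"])      # 4 values touched
total = sum(columns["amount"])
print(total, row_values_read, col_values_read)  # → 1250 16 4
```

With wide tables the gap grows with the number of columns not referenced by the query.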
  • 29. analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage • COPY compresses automatically • You can analyze and override • More performance, less cost
  • 30. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage • Track the minimum and maximum value for each block • Skip over blocks that don't contain relevant data [diagram: three sorted blocks with min/max zone maps 10–324, 375–623, 637–959]
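The block-skipping idea can be sketched in a few lines, using the min/max ranges from the illustration; a point lookup for 417 scans one of the three blocks and skips the other two. A toy model only, with our own function name:

```python
# Each block stores its values plus a zone map (min, max), as on the slide.
blocks = [
    {"min": 10,  "max": 324, "values": [10, 13, 14, 26, 100, 245, 324]},
    {"min": 375, "max": 623, "values": [375, 393, 417, 512, 549, 623]},
    {"min": 637, "max": 959, "values": [637, 712, 809, 834, 921, 959]},
]

def scan_with_zone_maps(blocks, target):
    """Read only blocks whose [min, max] range could contain target."""
    scanned = 0
    hits = []
    for block in blocks:
        if block["min"] <= target <= block["max"]:
            scanned += 1
            hits += [v for v in block["values"] if v == target]
    return hits, scanned

hits, scanned = scan_with_zone_maps(blocks, 417)
print(hits, scanned)  # → [417] 1  (two of the three blocks are skipped)
```

The pruning only works because the column is sorted, which is why sort keys and zone maps are discussed together.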
  • 31. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage • Use local storage for performance • Maximize scan rates • Automatic replication and continuous backup • HDD & SSD platforms
  • 32. Amazon Redshift parallelizes and distributes everything Query Load Backup/Restore Resize
  • 33. Amazon Redshift parallelizes and distributes everything Query Load Backup/Restore Resize • Load in parallel from Amazon S3 or DynamoDB or any SSH connection • Data automatically distributed and sorted according to DDL • Scales linearly with the number of nodes in the cluster
  • 34. Amazon Redshift parallelizes and distributes everything Query Load Backup/Restore Resize • Backups to Amazon S3 are automatic, continuous and incremental • Configurable system snapshot retention period. Take user snapshots on-demand • Cross region backups for disaster recovery • Streaming restores enable you to resume querying faster
  • 35. Amazon Redshift parallelizes and distributes everything Query Load Backup/Restore Resize • Resize while remaining online • Provision a new cluster in the background • Copy data in parallel from node to node • Only charged for source cluster
  • 36. Amazon Redshift parallelizes and distributes everything Query Load Backup/Restore Resize • Automatic SQL endpoint switchover via DNS • Decommission the source cluster • Simple operation via Console or API
  • 37. Amazon Redshift works with your existing analysis tools JDBC/ODBC Connect using drivers from PostgreSQL.org Amazon Redshift
  • 38. Custom ODBC and JDBC Drivers • Up to 35% higher performance than open source drivers • Supported by Informatica, Microstrategy, Pentaho, Qlik, SAS, Tableau • Will continue to support PostgreSQL open source drivers • Download drivers from console
  • 39. User Defined Functions • We're enabling User Defined Functions (UDFs) so you can add your own  Scalar and aggregate functions supported • You'll be able to write UDFs using Python 2.7  Syntax is largely identical to PostgreSQL UDF syntax  System and network calls within UDFs are prohibited • Comes with Pandas, NumPy, and SciPy pre-installed  You'll also be able to import your own libraries for even more flexibility
  • 40. Scalar UDF example – URL parsing Rather than using complex REGEX expressions, you can import standard Python URL parsing libraries and use them in your SQL
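As an illustration of the slide's point (not Redshift's actual UDF definition syntax), here is the kind of URL-parsing logic such a scalar UDF might wrap. The function names are ours; we use Python 3's `urllib.parse` for a runnable sketch, whereas a Python 2.7 UDF body would import the `urlparse` module instead:

```python
from urllib.parse import parse_qs, urlparse

def extract_host(url):
    """Pull the hostname out of a URL with the standard library
    instead of a complex regex."""
    return urlparse(url).hostname

def extract_query_param(url, name):
    """Return the first value of a query-string parameter, or None."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else None

url = "https://www.example.com/products?category=books&page=2"
print(extract_host(url))                 # → www.example.com
print(extract_query_param(url, "page"))  # → 2
```

Wrapping logic like this in a scalar function lets analysts call it directly from SQL instead of repeating fragile regular expressions in every query.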
  • 41. Interleaved Multi Column Sort • Currently support Compound Sort Keys  Optimized for applications that filter data by one leading column • Adding support for Interleaved Sort Keys  Optimized for filtering data by up to eight columns  No storage overhead unlike an index  Lower maintenance penalty compared to indexes
  • 42. Compound Sort Keys Illustrated • Records in Redshift are stored in blocks • For this illustration, let's assume that four records fill a block • Records with a given cust_id are all in one block • However, records with a given prod_id are spread across four blocks [diagram: a 4x4 grid of (cust_id, prod_id) pairs stored in cust_id-major order]
  • 43. Interleaved Sort Keys Illustrated • Records with a given cust_id are spread across two blocks • Records with a given prod_id are also spread across two blocks • Data is sorted in equal measures for both keys [diagram: the same 4x4 grid of (cust_id, prod_id) pairs stored in an interleaved order]
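The two layouts can be simulated in a few lines. This sketch uses a Z-order (bit-interleaved) key as a stand-in for Redshift's interleaved sort, whose real implementation is internal; with 16 (cust_id, prod_id) rows and four records per block, it reproduces the block counts from the two illustrations:

```python
def interleave_bits(x, y, bits=2):
    """Z-order value for (x, y): bits of the two keys alternate, so
    sorting by it balances locality between both columns."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> position 2i+1
    return z

records = [(c, p) for c in range(1, 5) for p in range(1, 5)]  # 16 rows
BLOCK = 4  # four records per block, as in the illustration

def blocks_touched(sort_key, predicate):
    """Sort, split into blocks, count blocks containing a matching row."""
    ordered = sorted(records, key=sort_key)
    blocks = [ordered[i:i + BLOCK] for i in range(0, len(ordered), BLOCK)]
    return sum(1 for b in blocks if any(predicate(r) for r in b))

compound = lambda r: (r[0], r[1])                         # (cust_id, prod_id)
interleaved = lambda r: interleave_bits(r[0] - 1, r[1] - 1)

# Compound: leading column reads 1 block, trailing column reads 4.
# Interleaved: either column reads 2 blocks.
print(blocks_touched(compound, lambda r: r[0] == 1),      # → 1
      blocks_touched(compound, lambda r: r[1] == 1),      # → 4
      blocks_touched(interleaved, lambda r: r[0] == 1),   # → 2
      blocks_touched(interleaved, lambda r: r[1] == 1))   # → 2
```

This is why interleaved keys help when queries filter on several different columns, at the cost of the very cheap leading-column scans a compound key gives.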
  • 44. How to use the feature • New keyword ‘INTERLEAVED’ when defining sort keys  Existing syntax will still work and behavior is unchanged  You can choose up to 8 columns to include and can query with any or all of them • No change needed to queries • Benefits are significant [ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]
  • 46. Operational Reporting with Redshift Amazon S3 Log Bucket Amazon EMR Processed and structured log data Amazon Redshift Operational Reports

Editor's notes

  1. Six main reasons why Amazon EMR
  2. In the next few slides, we'll talk about data persistence models with Amazon EMR. The first pattern is Amazon S3 as HDFS. With this data persistence model, data gets stored on Amazon S3; HDFS does not play any role in storing data. As a matter of fact, HDFS is only there for temporary storage. Another common claim is that storing data on Amazon S3 instead of HDFS slows jobs down because data has to be copied to HDFS/disk before processing starts. That's incorrect: if you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to mappers without touching the disk. To be completely accurate, data does touch HDFS when it has to shuffle from mappers to reducers, but HDFS acts as the temp space and nothing more. EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
  3. And every other feature that comes with Amazon S3. Features such as SSE, LifeCycle, etc. And again keep in mind that Amazon S3 as the storage is the main reason why we can’t build elastic clusters where nodes get added and removed dynamically without any data loss.
  6. EMR example #3: EMR for ETL and query engine for investigations which require all raw data
  7. Give guidance
  8. CloudFront logs arrive out of order.
  9. Read only the data you need
  10. Read only the data you need
  11. Read only the data you need
  12. Read only the data you need
  13. Read only the data you need
  14. Redshift works with customer’s BI tool of choice through Postgres drivers and a JDBC, ODBC connection. A number of partners shown here have certified integration with Redshift, meaning they have done testing to validate/build Redshift integration and make using Redshift easy from a UI perspective. If there are tools customer’s use not shown we can work with Redshift on getting them integrated.
  15. So, we started with our MySQL server, but this time we ran SQL statements directly on the server itself to dump the data out to local flat files, then used s3cmd to copy those files into our S3 bucket (alternatively, use BCP to export data onto an EC2 instance, which generates and copies flat files to S3). Then, instead of using EMR, we copied the data into a staging schema in Redshift, where SQL statements transformed it into the final table structure and loaded it into the production schema. Finally, business users need a good way to look at the data, and that's where standard tools like MicroStrategy and Tableau come into play.
  16. CloudFront logs arrive out of order.