Big Data on AWS 
Structured, Unstructured & Streaming 
Russell Nash
[Chart: data stores plotted by Structure (High to Low) and Size (Small to Large): Traditional Database, MPP Database, NoSQL, Hadoop]
Structured        Unstructured    Streaming
MPP Databases     Hadoop          Real-time Analysis
Amazon Redshift   Amazon EMR      Amazon Kinesis
• Standard SQL
• Optimized for fast analysis
• Very scalable
Amazon Redshift
Q1. What is it?
MPP SQL Database
Optimized for Analytics
Gigabytes to Petabytes
Fully relational
Fully managed
Amazon Redshift
Q2. How does it work?
JDBC/ODBC
ID  Name
1   John Smith
2   Jane Jones
3   Peter Black
4   Pat Partridge
5   Sarah Cyan
6   Brian Snail

1   John Smith
4   Pat Partridge
2   Jane Jones
5   Sarah Cyan
3   Peter Black
6   Brian Snail
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

• With column storage, you only read the data you need
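The row-versus-column point can be made concrete with the four example rows from the slide. The sketch below is a toy in-memory model, not how Redshift lays data out on disk; the function names and the "values read" counter are illustrative assumptions.

```python
from collections import defaultdict

# The four example rows from the slide: (ID, Age, State, Amount).
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Column layout: one list per column, values kept in row order.
columns = {
    "ID": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "State": [r[2] for r in rows],
    "Amount": [r[3] for r in rows],
}


def avg_amount_by_state_rows(rows):
    """Row storage: every field of every row is read."""
    values_read = 0
    amounts = defaultdict(list)
    for row in rows:
        values_read += len(row)  # the whole row is fetched
        amounts[row[2]].append(row[3])
    return {s: sum(v) / len(v) for s, v in amounts.items()}, values_read


def avg_amount_by_state_columns(columns):
    """Column storage: only the State and Amount columns are read."""
    states, amts = columns["State"], columns["Amount"]
    values_read = len(states) + len(amts)
    amounts = defaultdict(list)
    for state, amount in zip(states, amts):
        amounts[state].append(amount)
    return {s: sum(v) / len(v) for s, v in amounts.items()}, values_read


row_result, row_reads = avg_amount_by_state_rows(rows)
col_result, col_reads = avg_amount_by_state_columns(columns)
print(row_result, row_reads)
print(col_result, col_reads)
```

Both layouts give the same averages, but the column version touches half as many values here; the gap widens as the table gains columns the query does not use.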
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

analyze compression listing;

  Table  |     Column     | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

• COPY compresses automatically
• You can analyze and override
• More performance, less cost
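To give a feel for why an encoding such as `delta` suits a column like `listid`, here is a minimal sketch of the general delta-encoding idea: store the first value, then only the differences. It is a conceptual toy, not Redshift's actual on-disk format, and the sample IDs are made up.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical, mostly sequential IDs.
list_ids = [100001, 100002, 100003, 100005, 100008]
encoded = delta_encode(list_ids)
decoded = delta_decode(encoded)
print(encoded)
print(decoded)
```

After the first entry the encoded values are small integers, which need far fewer bytes than the full IDs.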
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

Block contents                            Min   Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324   10    324
375 | 393 | 417 | … | 512 | 549 | 623     375   623
637 | 712 | 809 | … | 834 | 921 | 959     637   959

• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
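The block-skipping idea can be sketched in a few lines. The blocks below hold only the values visible on the slide (the elided "…" values are left out), and `scan_range` is a hypothetical helper rather than anything Redshift exposes.

```python
# Visible values from the three blocks on the slide; elided values are omitted.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# Zone map: the (min, max) pair tracked for each block.
zone_map = [(min(b), max(b)) for b in blocks]


def scan_range(lo, hi):
    """Return values in [lo, hi], reading only blocks whose min/max range overlaps it."""
    hits = []
    blocks_read = 0
    for (bmin, bmax), block in zip(zone_map, blocks):
        if bmax < lo or bmin > hi:
            continue  # the zone map proves this block holds nothing relevant
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read


hits, blocks_read = scan_range(400, 600)
print(zone_map)
print(hits, blocks_read)
```

A query for values between 400 and 600 reads one block out of three; the other two are skipped on their min/max alone.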
Q3. What’s good about it?
Performance, Scalability, Ease of Use, Cost
Performance Evaluation on 2B Rows
Traditional SQL Database | Amazon Redshift
Aggregate by month 02:08:35 00:35:46 00:00:12
[Diagram: cluster sizes from DW2.L (160 GB) up to 100 × 8XL nodes (2 PB)]
Q4. How do I integrate with Redshift?
Works with your existing analysis tools
JDBC/ODBC
Amazon Redshift
Loading data
S3, DynamoDB, EMR, Linux → Redshift
Source Systems → ETL → Amazon Redshift
Structured        Unstructured    Streaming
MPP Databases     Hadoop          Real-time Analysis
Amazon Redshift   Amazon EMR      Amazon Kinesis
Hadoop cluster: Input File → Functions → Output
1. Very Flexible
2. Very Scalable
3. Often Transient
Amazon Elastic MapReduce (EMR)
Q1. What is it?
Managed Hadoop
EMR cluster (EC2 instances): Input File → Functions → Output
Q2. How does it work?
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
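As a rough sketch of steps 2 and 3 via the SDK route, the choices above can be gathered into the parameters that boto3's EMR `run_job_flow` call accepts. The bucket name, cluster name, release label, instance types and role names below are assumptions for illustration only; the code just builds the request dictionary and does not contact AWS. Job steps are omitted for brevity.

```python
def build_emr_request(bucket, core_nodes=2, spot_task_nodes=0):
    """Assemble run_job_flow parameters; all concrete values here are illustrative."""
    instance_groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
    ]
    if spot_task_nodes:
        # Optional extra capacity bought on the Spot market.
        instance_groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": spot_task_nodes})
    return {
        "Name": "example-cluster",                      # hypothetical name
        "ReleaseLabel": "emr-6.15.0",                   # assumed release
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "LogUri": f"s3://{bucket}/logs/",
        "Instances": {
            "InstanceGroups": instance_groups,
            # False makes the cluster transient: it shuts down once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",           # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("my-example-bucket", core_nodes=2, spot_task_nodes=5)
print(request["Instances"]["InstanceGroups"])

# With credentials configured, the launch itself would look like:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Input and output stay in S3 (steps 1 and 4), so the cluster itself can be thrown away after the job.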
You can easily resize the cluster, and launch parallel clusters using the same data in S3.
Use Spot nodes to save time and money.
When processing is complete, you can terminate the cluster (and stop paying).
Or just store everything in HDFS (local disk).
Q3. What’s good about it? 
Scalability, Cost & Ease of Use
EMR with spot instances

Scenario #1: cost without Spot (duration: 14 hours)
  4 instances * 14 hrs * $0.50 = $28

Scenario #2: cost with Spot (duration: 7 hours)
  4 instances * 7 hrs * $0.50 = $14
  + 5 instances * 7 hrs * $0.25 = $8.75
  Total = $22.75

Time Savings: 50%
Cost Savings: ~19%
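The arithmetic for the two scenarios can be checked with a short script. The prices and instance counts are the slide's; the helper name `cluster_cost` is an assumption. With these figures the cost saving works out to (28 - 22.75) / 28 = 18.75%.

```python
def cluster_cost(groups, hours):
    """Total cost for a list of (instance_count, hourly_price) groups run for `hours`."""
    return sum(count * hours * price for count, price in groups)


# Scenario #1: 4 on-demand instances at $0.50/hr for 14 hours.
cost_without_spot = cluster_cost([(4, 0.50)], hours=14)

# Scenario #2: the same 4 instances plus 5 Spot instances at $0.25/hr, for 7 hours.
cost_with_spot = cluster_cost([(4, 0.50), (5, 0.25)], hours=7)

time_savings = 1 - 7 / 14
cost_savings = 1 - cost_with_spot / cost_without_spot

print(f"without Spot: ${cost_without_spot:.2f}")
print(f"with Spot:    ${cost_with_spot:.2f}")
print(f"time savings: {time_savings:.0%}, cost savings: {cost_savings:.2%}")
```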
EMR cluster
• Master instance group
• Core instance group (HDFS)
• Task instance group (great for Spot Instances)
Amazon S3
The Hadoop Ecosystem
Q4. How are customers using it?
Big Data Verticals
• Media/Advertising: Targeted Advertising; Image and Video Processing
• Oil & Gas: Seismic Analysis
• Retail: Recommendations; Transactions Analysis
• Life Sciences: Genome Analysis
• Financial Services: Monte Carlo Simulations; Risk Analysis
• Security: Anti-virus; Fraud Detection; Image Recognition
• Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
Structured        Unstructured    Streaming
MPP Databases     Hadoop          Real-time Analysis
Amazon Redshift   Amazon EMR      Amazon Kinesis
Real-time Scenarios in Industry Segments

Software/Technology
• Log Ingest: IT server logs ingestion
• Continual Metrics: IT operational metrics dashboards
• Real Time Data Analytics: Devices / Sensor Operational Intelligence

Digital Ad Tech./Marketing
• Log Ingest: Advertising Data aggregation
• Continual Metrics: Advertising metrics like coverage, yield, conversion
• Real Time Data Analytics: Analytics on User engagement with Ads
• Complex Stream Processing: Optimized bid/buy engines

Financial Services
• Log Ingest: Market/Financial Transaction order data collection
• Continual Metrics: Financial market data metrics
• Real Time Data Analytics: Fraud monitoring, and Value-at-Risk assessment
• Complex Stream Processing: Auditing of market order data

E-Commerce
• Log Ingest: Online customer engagement data aggregation
• Continual Metrics: Consumer engagement metrics like page views, CTR
• Real Time Data Analytics: Customer clickstream analytics
• Complex Stream Processing: Recommendation engines
Q1. What is it?
Kinesis
Movement or activity in response to a stimulus.
A fully managed service for real-time processing of high-volume, streaming data.
Q2. How does it work?
[Diagram: data sources feed a Kinesis stream that spans three Availability Zones; applications for logging, metrics, analysis and machine learning read from the stream; results go to S3, DynamoDB, Redshift and EMR]
Putting data into Kinesis
• Each shard:
    • 1000 Tx Per Second
    • 1 MB Per Second
    • 50 KB Payload Per Tx
• Messages kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A Partition Key is used to distribute the PUTs across Shards
• A unique Sequence # is created
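The two ideas above, per-shard limits and partition-key routing, can be sketched as follows. Kinesis maps a partition key to a 128-bit value with an MD5 hash, and each shard owns a range of that hash space; the sketch assumes the shards split the space evenly, which real streams need not do. The per-shard limits are the slide's figures, 1 MB is taken as 1024 * 1024 bytes, and both function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128          # MD5 yields a 128-bit value
RECORDS_PER_SHARD = 1000       # per-second limit quoted on the slide
BYTES_PER_SHARD = 1024 * 1024  # 1 MB per second, assumed to mean 1 MiB


def shard_for_key(partition_key, shard_count):
    """Pick a shard index, assuming shards divide the hash space evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def shards_needed(records_per_sec, bytes_per_record):
    """Smallest shard count that satisfies both the record and byte limits."""
    by_records = -(-records_per_sec // RECORDS_PER_SHARD)  # ceiling division
    by_bytes = -(-(records_per_sec * bytes_per_record) // BYTES_PER_SHARD)
    return max(by_records, by_bytes, 1)


for key in ["device-1", "device-2", "device-3"]:
    print(key, "->", shard_for_key(key, shard_count=4))

print(shards_needed(5000, 100))        # record-count bound
print(shards_needed(100, 50 * 1024))   # byte-rate bound
```

Records with the same partition key always land on the same shard, so a skewed key distribution can overload one shard even when the total stays within limits.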
Getting data out of Kinesis 
Kinesis Client Library (KCL): 
• Abstracts code from individual shards 
• Starts a Kinesis Worker for each shard 
• Increases and decreases workers 
• Tracks a Worker’s location in the stream
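The bullets above can be illustrated with a toy model: one worker per shard, each recording how far it has read so that a replacement worker can resume from that point. This is not the KCL API; the class, the in-memory `stream` and the `checkpoints` dictionary are all hypothetical stand-ins.

```python
class ShardWorker:
    """Toy per-shard worker that checkpoints its position; not the real KCL API."""

    def __init__(self, shard_id, checkpoints):
        self.shard_id = shard_id
        self.checkpoints = checkpoints  # shared map: shard_id -> next index to read
        self.processed = []

    def poll(self, shard_records, batch_size):
        """Read up to batch_size records after the last checkpoint, then advance it."""
        start = self.checkpoints.get(self.shard_id, 0)
        batch = shard_records[start:start + batch_size]
        self.processed.extend(batch)
        self.checkpoints[self.shard_id] = start + len(batch)
        return batch


# Hypothetical stream with two shards.
stream = {"shard-0": ["a", "b", "c"], "shard-1": ["x", "y"]}
checkpoints = {}

# One worker per shard.
workers = {shard_id: ShardWorker(shard_id, checkpoints) for shard_id in stream}
for shard_id, worker in workers.items():
    worker.poll(stream[shard_id], batch_size=2)

# A replacement worker for shard-0 picks up where the first one stopped.
replacement = ShardWorker("shard-0", checkpoints)
resumed = replacement.poll(stream["shard-0"], batch_size=2)
print(resumed, checkpoints)
```

Because the position lives outside the worker, workers can be added, removed or replaced without reprocessing records that were already checkpointed.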
Q3. What’s good about it?
• Easy Administration
• Real-time Performance
• High Throughput, Elastic
• Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch
• Build Real-time Applications
• Low Cost
aws.amazon.com/big-data
Online Labs | Training
Gain confidence and hands-on experience with AWS. Watch free Instructional Videos and explore Self-Paced Labs.
Instructor Led Classes
Learn how to design, deploy and operate highly available, cost-effective and secure applications on AWS in courses led by qualified AWS instructors.
AWS Certification
Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification.
http://aws.amazon.com/training
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things