aws.amazon.com/webinars/apac/webinar-week | #AWSWebinarWeek
Launching Your First Big Data Project on AWS
Russell Nash – AWS Solutions Architect
[Diagram: the Collect → Process → Analyze pipeline, with Store underpinning each stage]
• Collect / store (data collection and storage): Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora)
• Process (event processing and data processing): AWS Lambda, KCL Apps, Amazon EMR
• Analyze (data analysis): Amazon Redshift, Amazon Machine Learning
Data in, answers out
Your first big data application on AWS
Collect → Process → Analyze, with Store linking each stage: data in, answers out. The analyze step uses SQL.
Set up the AWS CLI
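A minimal setup sketch (the region and output format shown are assumptions; supply your own IAM keys):
aws configure
# AWS Access Key ID [None]: YOUR-IAM-ACCESS-KEY
# AWS Secret Access Key [None]: YOUR-SECRET-KEY
# Default region name [None]: us-east-1
# Default output format [None]: json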
Amazon Kinesis
Create a single-shard Amazon Kinesis stream for incoming log data:
aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 1
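The stream is ready once its status is ACTIVE; one way to check (a sketch using the standard CLI):
aws kinesis describe-stream \
  --stream-name AccessLogStream \
  --query 'StreamDescription.StreamStatus'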
Amazon S3
Create an Amazon S3 bucket to hold the application's data (the bucket name is a placeholder):
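The command itself is elided on this slide; creating the bucket with the standard CLI would look like:
aws s3 mb s3://YOUR-BUCKET-NAME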
Amazon EMR
Launch a 3-node Amazon EMR cluster with Spark and Hive, using m3.xlarge instances and your SSH key (YOUR-AWS-SSH-KEY):
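The launch command is elided; a sketch with the standard CLI (the cluster name and release label are assumptions):
aws emr create-cluster \
  --name "big-data-demo" \
  --release-label emr-4.5.0 \
  --applications Name=Hive Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles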
Amazon Redshift
Launch an Amazon Redshift data warehouse, choosing a master password (CHOOSE-A-REDSHIFT-PASSWORD):
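The launch command is elided; a single-node sketch with the standard CLI (identifier, database name, node type, and username are assumptions):
aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dc1.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD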
Your first big data application on AWS
1. COLLECT: Stream data into Amazon Kinesis with Log4J
2. PROCESS: Process data with Amazon EMR using Spark & Hive
3. ANALYZE: Analyze data in Amazon Redshift using SQL
STORE: durable storage (Amazon S3) links each stage
1. Collect
Amazon Kinesis Log4J Appender
In a separate terminal window on your local machine, download the Amazon Kinesis Log4J Appender:
Then download and save the sample Apache log file:
Amazon Kinesis Log4J Appender
Create a file called AwsCredentials.properties
with credentials for an IAM user with permission to write
to Amazon Kinesis:
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-SECRET-KEY
Then start the Amazon Kinesis Log4J Appender:
Log file format
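The format example itself is elided; the sample file uses the Apache combined log format, which looks like this illustrative line (not taken from the actual dataset):
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"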
Spark
• Fast, general-purpose engine for large-scale data processing
• Write applications quickly in Java, Scala, or Python
• Combine SQL, streaming, and complex analytics
Amazon Kinesis and Spark Streaming
[Diagram: Log4J Appender → Amazon Kinesis → Amazon EMR → Amazon S3. Spark Streaming on Amazon EMR reads the stream via the Kinesis Client Library, which checkpoints its position in Amazon DynamoDB.]
Using Spark Streaming on Amazon EMR
Log in to the master node, keeping the SSH session alive (the command is partially elided; reconstructed here):
ssh -o TCPKeepAlive=yes -o ServerAliveInterval=30 \
  -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME
On your cluster, download the Amazon Kinesis client for Spark:
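The download command is elided; the client library is published to Maven Central, so one way to fetch it (URL derived from the Maven coordinates com.amazonaws:amazon-kinesis-client:1.6.0):
wget https://repo1.maven.org/maven2/com/amazonaws/amazon-kinesis-client/1.6.0/amazon-kinesis-client-1.6.0.jar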
Using Spark Streaming on Amazon EMR
Cut down on console noise:
Start the Spark shell:
spark-shell --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar,amazon-kinesis-client-1.6.0.jar \
  --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
Using Spark Streaming on Amazon EMR
/* import required libraries */
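The import list is elided; these are the classes used by the snippets that follow (a sketch matching that code):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream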
Using Spark Streaming on Amazon EMR
/* Set up the variables as needed */
YOUR-REGION
YOUR-S3-BUCKET
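The assignments are elided; a sketch consistent with the code below (the stream name matches the one created earlier; outputDir is a hypothetical name for the S3 destination used later):
val streamName = "AccessLogStream"
val endpointUrl = "https://kinesis.YOUR-REGION.amazonaws.com"
val outputDir = "s3://YOUR-S3-BUCKET/access-log-raw"
val outputInterval = Seconds(60)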
/* Reconfigure the spark-shell */
Reading Amazon Kinesis with Spark Streaming
/* Set up the Kinesis client */
val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
kinesisClient.setEndpoint(endpointUrl)

/* Determine the number of shards from the stream */
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()

/* Create one worker per Kinesis shard */
val ssc = new StreamingContext(sc, outputInterval)
val kinesisStreams = (0 until numShards).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl, outputInterval,
    InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY)
}
Writing to Amazon S3 with Spark Streaming
/* Merge the worker DStreams and translate the byte array to a string */
/* Write each RDD to Amazon S3 */
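The code under these comments is elided; a sketch (outputDir as defined above):
val unionStreams = ssc.union(kinesisStreams)
val accessLogs = unionStreams.map(byteArray => new String(byteArray))
accessLogs.saveAsTextFiles(outputDir)
ssc.start()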
View the output files in Amazon S3
YOUR-S3-BUCKET
YOUR-S3-BUCKET
yyyy mm dd HH
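The listing command is elided; with the standard CLI, browsing the output (whose prefixes carry yyyy mm dd HH timestamps) would look like:
aws s3 ls s3://YOUR-S3-BUCKET/access-log-raw/ --recursive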
2. Process
Spark SQL
• Spark's module for working with structured data using SQL
• Run unmodified Hive queries on existing data
Using Spark SQL on Amazon EMR
Log in to the master node (the ssh command is partially elided; reconstructed as before):
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME
Start the Spark SQL shell:
spark-sql --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
Create a table that points to your Amazon S3 bucket
CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING,
  user STRING, request_time STRING,
  request STRING, status STRING,
  size STRING, referrer STRING,
  agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

Load the partition metadata:
MSCK REPAIR TABLE access_log_raw;
Query the data with Spark SQL
-- return the first row in the stream
-- count all items in the stream
-- find the top 10 hosts
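The queries themselves are elided; sketches matching the comments above:
SELECT * FROM access_log_raw LIMIT 1;
SELECT COUNT(*) FROM access_log_raw;
SELECT host, COUNT(*) AS total FROM access_log_raw GROUP BY host ORDER BY total DESC LIMIT 10;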
Preparing the data for Amazon Redshift import
• We will transform the data returned by the query before writing it to our Amazon S3-stored external Hive table
• Hive user-defined functions (UDFs) handle the text transformations: from_unixtime, unix_timestamp, and hour
• The “hour” value is important: it is used to split and organize the output files before writing to Amazon S3. These splits will let us load the data into Amazon Redshift more efficiently later in the lab, using the parallel “COPY” command.
Create an external table in Amazon S3
YOUR-S3-BUCKET
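The DDL is elided; a sketch consistent with the INSERT below and with the tab-delimited, Gzip-compressed files the COPY step expects (column types are assumptions):
CREATE EXTERNAL TABLE access_log_processed (
  request_time STRING,
  host STRING,
  request STRING,
  status STRING,
  referrer STRING,
  agent STRING
)
PARTITIONED BY (hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';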
Configure partition and compression
-- setup Hive's "dynamic partitioning"
-- this will split output files when writing to Amazon S3
-- compress output files on Amazon S3 using Gzip
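The SET statements are elided; the standard Hive/Hadoop settings matching these comments would be:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;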
Write output to Amazon S3
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
FROM access_log_raw;
View the output files in Amazon S3
YOUR-S3-BUCKET
YOUR-S3-BUCKET
3. Analyze
Connect to Amazon Redshift
# using the PostgreSQL CLI
YOUR-REDSHIFT-ENDPOINT
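The connection command is elided; a sketch with psql (5439 is the default Amazon Redshift port; the database and user names follow the cluster-creation sketch earlier):
psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo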
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x
drivers or native Amazon Redshift support
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
Create an Amazon Redshift table to hold your data
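The DDL is elided; a sketch matching the columns written by the Hive job (types, lengths, and the distribution/sort keys are assumptions):
CREATE TABLE accesslogs (
  request_time TIMESTAMP,
  host VARCHAR(50),
  request VARCHAR(1024),
  status INT,
  referrer VARCHAR(1024),
  agent VARCHAR(1024)
)
DISTKEY (host)
SORTKEY (request_time);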
Loading data into Amazon Redshift
• The “COPY” command loads files in parallel:
COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' IGNOREHEADER 0
MAXERROR 0 GZIP;
Amazon Redshift test queries
-- find the distribution of status codes over days
-- count the 404 status codes
-- show which requests returned PAGE NOT FOUND (404)
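The queries are elided; sketches matching the comments (against the accesslogs table above):
SELECT TRUNC(request_time) AS day, status, COUNT(*) AS total
FROM accesslogs GROUP BY day, status ORDER BY day, status;

SELECT COUNT(*) FROM accesslogs WHERE status = 404;

SELECT request, COUNT(*) AS total FROM accesslogs
WHERE status = 404 GROUP BY request ORDER BY total DESC;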
Your first big data application on AWS
A favicon would fix 398 of the 977 total PAGE NOT FOUND (404) errors
Visualize the results
• Client-side JavaScript example using Plottable, a library built on D3
• Hosted on Amazon S3 for pennies a month
• AWS Lambda function used to query Amazon Redshift
Try it yourself on the AWS cloud…

Service                  Est. Cost*
Amazon Kinesis           $1.00
Amazon S3 (free tier)    $0
Amazon EMR               $0.44
Amazon Redshift          $1.00
Est. Total               $2.44

*Estimated costs assume use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.

…around the same cost as a cup of coffee ($3.50)
Learn from AWS big data experts
blogs.aws.amazon.com/bigdata
Online Labs & Training
Gain confidence and hands-on experience with AWS. Watch free Instructional Videos and explore Self-Paced Labs.

Instructor-Led Classes
Learn how to design, deploy, and operate highly available, cost-effective, and secure applications on AWS in courses led by qualified AWS instructors.

AWS Certification
Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification.

More info at http://aws.amazon.com/training

More Related Content

What's hot

What's hot (20)

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Self-Service Supercomputing
Self-Service SupercomputingSelf-Service Supercomputing
Self-Service Supercomputing
 
ENT302 Deep Dive on AWS Management Tools
ENT302 Deep Dive on AWS Management Tools ENT302 Deep Dive on AWS Management Tools
ENT302 Deep Dive on AWS Management Tools
 
Amazon EC2:Masterclass
Amazon EC2:MasterclassAmazon EC2:Masterclass
Amazon EC2:Masterclass
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
Escalando para sus primeros 10 millones de usuarios
Escalando para sus primeros 10 millones de usuariosEscalando para sus primeros 10 millones de usuarios
Escalando para sus primeros 10 millones de usuarios
 
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
 
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field ExperienceAWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
 
WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
 
Deep Dive on Amazon RDS (May 2016)
Deep Dive on Amazon RDS (May 2016)Deep Dive on Amazon RDS (May 2016)
Deep Dive on Amazon RDS (May 2016)
 
Amazon EC2 Masterclass
Amazon EC2 MasterclassAmazon EC2 Masterclass
Amazon EC2 Masterclass
 
ENT306 Migrating Large Scale Data Sets to the Cloud
ENT306 Migrating Large Scale Data Sets to the CloudENT306 Migrating Large Scale Data Sets to the Cloud
ENT306 Migrating Large Scale Data Sets to the Cloud
 
HSBC and AWS Day - Database Options on AWS
HSBC and AWS Day - Database Options on AWSHSBC and AWS Day - Database Options on AWS
HSBC and AWS Day - Database Options on AWS
 
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
 
(DVO401) Deep Dive into Blue/Green Deployments on AWS
(DVO401) Deep Dive into Blue/Green Deployments on AWS(DVO401) Deep Dive into Blue/Green Deployments on AWS
(DVO401) Deep Dive into Blue/Green Deployments on AWS
 
AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC201)
AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC201)AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC201)
AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC201)
 
Automated Compliance and Governance with AWS Config and AWS CloudTrail - June...
Automated Compliance and Governance with AWS Config and AWS CloudTrail - June...Automated Compliance and Governance with AWS Config and AWS CloudTrail - June...
Automated Compliance and Governance with AWS Config and AWS CloudTrail - June...
 
SMC302 Building Serverless Web Applications
SMC302 Building Serverless Web ApplicationsSMC302 Building Serverless Web Applications
SMC302 Building Serverless Web Applications
 
AWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloudAWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloud
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 

Viewers also liked

Viewers also liked (10)

TurboCharge Your Continuous Delivery Pipeline with Containers - Pop-up Loft
TurboCharge Your Continuous Delivery Pipeline with Containers - Pop-up LoftTurboCharge Your Continuous Delivery Pipeline with Containers - Pop-up Loft
TurboCharge Your Continuous Delivery Pipeline with Containers - Pop-up Loft
 
3 Secrets to Becoming a Cloud Security Superhero
3 Secrets to Becoming a Cloud Security Superhero 3 Secrets to Becoming a Cloud Security Superhero
3 Secrets to Becoming a Cloud Security Superhero
 
Real-time Analytics with Open-Source
Real-time Analytics with Open-SourceReal-time Analytics with Open-Source
Real-time Analytics with Open-Source
 
Amazon QuickSight
Amazon QuickSightAmazon QuickSight
Amazon QuickSight
 
AWS Customer Presentation: Coca Cola Turkey migrates SAP ERP to AWS-SAPPHIRE ...
AWS Customer Presentation: Coca Cola Turkey migrates SAP ERP to AWS-SAPPHIRE ...AWS Customer Presentation: Coca Cola Turkey migrates SAP ERP to AWS-SAPPHIRE ...
AWS Customer Presentation: Coca Cola Turkey migrates SAP ERP to AWS-SAPPHIRE ...
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
TCS: Leveraging AWS for SAP on Oracle implementations
TCS: Leveraging AWS for SAP on Oracle implementationsTCS: Leveraging AWS for SAP on Oracle implementations
TCS: Leveraging AWS for SAP on Oracle implementations
 
(SEC326) Security Science Using Big Data
(SEC326) Security Science Using Big Data(SEC326) Security Science Using Big Data
(SEC326) Security Science Using Big Data
 
AWS May 2016 Webinar Series - AWS Services Overview
AWS May 2016 Webinar Series - AWS Services OverviewAWS May 2016 Webinar Series - AWS Services Overview
AWS May 2016 Webinar Series - AWS Services Overview
 
1. 利用微服務架構建立雲端影音平台 (Building Media Platform by Microservices Architecture)
1.	利用微服務架構建立雲端影音平台 (Building Media Platform by Microservices Architecture)1.	利用微服務架構建立雲端影音平台 (Building Media Platform by Microservices Architecture)
1. 利用微服務架構建立雲端影音平台 (Building Media Platform by Microservices Architecture)
 

Similar to AWS APAC Webinar Week - Launching Your First Big Data Project on AWS

Similar to AWS APAC Webinar Week - Launching Your First Big Data Project on AWS (20)

(BDT205) Your First Big Data Application On AWS
(BDT205) Your First Big Data Application On AWS(BDT205) Your First Big Data Application On AWS
(BDT205) Your First Big Data Application On AWS
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWS
 
AWS September Webinar Series - Building Your First Big Data Application on AWS
AWS September Webinar Series - Building Your First Big Data Application on AWS AWS September Webinar Series - Building Your First Big Data Application on AWS
AWS September Webinar Series - Building Your First Big Data Application on AWS
 
Amazed by AWS Series #4
Amazed by AWS Series #4Amazed by AWS Series #4
Amazed by AWS Series #4
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 
My First Big Data Application
My First Big Data ApplicationMy First Big Data Application
My First Big Data Application
 
(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 
Intro to AWS Lambda
Intro to AWS LambdaIntro to AWS Lambda
Intro to AWS Lambda
 
Serverless computing with AWS Lambda
Serverless computing with AWS Lambda Serverless computing with AWS Lambda
Serverless computing with AWS Lambda
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
DevOps for the Enterprise: Virtual Office Hours
DevOps for the Enterprise: Virtual Office HoursDevOps for the Enterprise: Virtual Office Hours
DevOps for the Enterprise: Virtual Office Hours
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Picking the right AWS backend for your Java application (May 2017)
Picking the right AWS backend for your Java application (May 2017)Picking the right AWS backend for your Java application (May 2017)
Picking the right AWS backend for your Java application (May 2017)
 
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

AWS APAC Webinar Week - Launching Your First Big Data Project on AWS

  • 1.
  • 3. Launching Your First Big Data Project on AWS Russell Nash – AWS Solutions Architect
  • 4. Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon RDS (Aurora) AWS Lambda KCL Apps Amazon EMR Amazon Redshift Amazon Machine Learning Collect Process Analyze Store Data Collection and Storage Data Processing Event Processing Data Analysis Data Answers
  • 5. Your first big data application on AWS
  • 9. Set up the AWS CLI
  • 10. Amazon Kinesis Create a single-shard Amazon Kinesis stream for incoming log data: aws kinesis create-stream --stream-name AccessLogStream --shard-count 1
  • 12. Amazon EMR Launch a 3-node Amazon EMR cluster with Spark and Hive: m3.xlarge YOUR-AWS-SSH-KEY
  • 14. Your first big data application on AWS 2. PROCESS: Process data with Amazon EMR using Spark & Hive STORE 3. ANALYZE: Analyze data in Amazon Redshift using SQLSQL 1. COLLECT: Stream data into Amazon Kinesis with Log4J
  • 16. Amazon Kinesis Log4J Appender In a separate terminal window on your local machine, download Log4J Appender: Then download and save the sample Apache log file:
  • 17. Amazon Kinesis Log4J Appender Create a file called AwsCredentials.properties with credentials for an IAM user with permission to write to Amazon Kinesis: accessKey=YOUR-IAM-ACCESS-KEY secretKey=YOUR-SECRET-KEY Then start the Amazon Kinesis Log4J Appender:
  • 19. v Spark • Fast, general purpose engine for large-scale data processing • Write applications quickly in Java, Scala, or Python • Combine SQL, streaming, and complex analytics
  • 20. v Amazon Kinesis and Spark Streaming Log4J Appender Amazon Kinesis Amazon S3 Amazon DynamoDB Spark-Streaming uses Kinesis Client Library Amazon EMR
  • 21. Using Spark Streaming on Amazon EMR -o TCPKeepAlive=yes -o ServerAliveInterval=30 YOUR-AWS-SSH-KEY YOUR-EMR-HOSTNAME On your cluster, download the Amazon Kinesis client for Spark:
  • 22. Using Spark Streaming on Amazon EMR Cut down on console noise: Start the Spark shell: spark-shell --jars /usr/lib/spark/extras/lib/spark-streaming- kinesis-asl.jar,amazon-kinesis-client-1.6.0.jar --driver-java- options "- Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
  • 23. Using Spark Streaming on Amazon EMR /* import required libraries */
  • 24. Using Spark Streaming on Amazon EMR /* Set up the variables as needed */ YOUR-REGION YOUR-S3-BUCKET /* Reconfigure the spark-shell */
  • 25. Reading Amazon Kinesis with Spark Streaming /* Setup the KinesisClient */ val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain()) kinesisClient.setEndpoint(endpointUrl) /* Determine the number of shards from the stream */ val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size () /* Create one worker per Kinesis shard */ val ssc = new StreamingContext(sc, outputInterval) val kinesisStreams = (0 until numShards).map { i => KinesisUtils.createStream(ssc, streamName, endpointUrl,outputInterval,InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY) }
  • 26. Writing to Amazon S3 with Spark Streaming /* Merge the worker Dstreams and translate the byteArray to string */ /* Write each RDD to Amazon S3 */
  • 27. View the output files in Amazon S3 YOUR-S3-BUCKET YOUR-S3-BUCKET yyyy mm dd HH
  • 29. v Spark SQL • Spark's module for working with structured data using SQL • Run unmodified Hive queries on existing data
  • 30. Using Spark SQL on Amazon EMR YOUR-AWS-SSH-KEY YOUR-EMR-HOSTNAME Start the Spark SQL shell: spark-sql --driver-java-options "- Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
  • 31. Create a table that points to your Amazon S3 bucket CREATE EXTERNAL TABLE access_log_raw( host STRING, identity STRING, user STRING, request_time STRING, request STRING, status STRING, size STRING, referrer STRING, agent STRING ) PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|[[^]]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?" ) LOCATION 's3://YOUR-S3-BUCKET/access-log-raw'; msck repair table access_log_raw;
  • 32. Query the data with Spark SQL -- return the first row in the stream -- return count all items in the Stream -- find the top 10 hosts
  • 33. v Preparing the data for Amazon Redshift import • We will transform the data that is returned by the query before writing it to our Amazon S3-stored external Hive table • Hive user-defined functions (UDF) in use for the text transformations: from_unixtime, unix_timestamp and hour • The “hour” value is important: this is what’s used to split and organize the output files before writing to Amazon S3. These splits will allow us to more efficiently load the data into Amazon Redshift later in the lab using the parallel “COPY” command.
  • 34. Create an external table in Amazon S3 YOUR-S3-BUCKET
  • 35. Configure partition and compression -- setup Hive's "dynamic partitioning" -- this will split output files when writing to Amazon S3 -- compress output files on Amazon S3 using Gzip
  • 36. Write output to Amazon S3 -- convert the Apache log timestamp to a UNIX timestamp -- split files in Amazon S3 by the hour in the log lines INSERT OVERWRITE TABLE access_log_processed PARTITION (hour) SELECT from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')), host, request, status, referrer, agent, hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) as hour FROM access_log_raw;
  • 37. View the output files in Amazon S3 YOUR-S3-BUCKET YOUR-S3-BUCKET
  • 39. Connect to Amazon Redshift # using the PostgreSQL CLI YOUR-REDSHIFT-ENDPOINT Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support • Aginity Workbench for Amazon Redshift • SQL Workbench/J
  • 40. Create an Amazon Redshift table to hold your data
  • 41. v Loading data into Amazon Redshift • “COPY” command loads files in parallel COPY accesslogs FROM 's3://YOUR-S3-BUCKET/access-log-processed' CREDENTIALS 'aws_access_key_id=YOUR-IAM- ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY' DELIMITER 't' IGNOREHEADER 0 MAXERROR 0 GZIP;
  • 42. Amazon Redshift test queries -- find distribution of status codes over days -- find the 404 status codes -- show all requests for status as PAGE NOT FOUND
  • 43. Your first big data application on AWS A favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors
  • 44. v Visualize the results • Client-side JavaScript example using Plottable, a library built on D3 • Hosted on Amazon S3 for pennies a month • AWS Lambda function used to query Amazon Redshift
  • 45. …around the same cost as a cup of coffee Try it yourself on the AWS cloud… Service Est. Cost* Amazon Kinesis $1.00 Amazon S3 (free tier) $0 Amazon EMR $0.44 Amazon Redshift $1.00 Est. Total $2.44 *Estimated costs assumes: use of free tier where available, lower cost instances, dataset no bigger than 10MB and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage. $3.50
  • 46. v Learn from AWS big data experts blogs.aws.amazon.com/bigdata
  • 47.
  • 48. Online Labs & Training Gain confidence and hands-on experience with AWS. Watch free Instructional Videos and explore Self-Paced Labs Instructor Led Classes Learn how to design, deploy and operate highly available, cost-effective and secure applications on AWS in courses led by qualified AWS instructors Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification AWS Certification More info at http://aws.amazon.com/training