SlideShare una empresa de Scribd logo
1 de 68
What is Cloud Computing?
Using the Cloud to Crunch Your Data Adrian Cockcroft – acockcroft@netflix.com
What is Capacity Planning We care about CPU, Memory, Network and Disk resources, and Application response times We need to know how much of each resource we are using now, and will use in the future We need to know how much headroom we have to handle higher loads We want to understand how headroom varies, and how it relates to application response times and throughput
Capacity Planning Norms Capacity is expensive Capacity takes time to buy and provision Capacity only increases, can’t be shrunk easily Capacity comes in big chunks, paid up front Planning errors can cause big problems Systems are clearly defined assets Systems can be instrumented in detail
Capacity Planning in Clouds Capacity is expensive Capacity takes time to buy and provision Capacity only increases, can’t be shrunk easily Capacity comes in big chunks, paid up front Planning errors can cause big problems Systems are clearly defined assets Systems can be instrumented in detail
Capacity is expensivehttp://aws.amazon.com/s3/ & http://aws.amazon.com/ec2/ Storage (Amazon S3)  $0.150 per GB – first 50 TB / month of storage used $0.120 per GB – storage used / month over 500 TB Data Transfer (Amazon S3)  $0.100 per GB – all data transfer in $0.170 per GB – first 10 TB / month data transfer out $0.100 per GB – data transfer out / month over 150 TB Requests (Amazon S3 Storage access is via http) $0.01 per 1,000 PUT, COPY, POST, or LIST requests $0.01 per 10,000 GET and all other requests $0 per DELETE CPU (Amazon EC2) Small (Default) $0.085 per hour, Extra Large $0.68 per hour Network (Amazon EC2) Inbound/Outbound around $0.10 per GB
Capacity comes in big chunks, paid up front Capacity takes time to buy and provision No minimum price, monthly billing “Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously” Capacity only increases, can’t be shrunk easily Pay for what is actually used Planning errors can cause big problems Size only for what you need now
Systems are clearly defined assets You are running in a “stateless” multi-tenanted virtual image that can die or be taken away and replaced at any time You don’t know exactly where it is You can choose to locate “USA” or “Europe” You can specify zones that will not share components to avoid common mode failures
Systems can be instrumented in detail Need to use stateless monitoring tools Monitored nodes come and go by the hour Need to write role-name to hostname e.g. wwwprod002, not the EC2 default all monitoring by role-name Ganglia – automatic configuration Multicast replicated monitoring state No need to pre-define metrics and nodes
Acquisition requires management buy-in Anyone with a credit card and $10 is in business Data governance issues… Remember 1980’s when PC’s first turned up? Departmental budgets funded PC’s No standards, management or backups Central IT departments could lose control Decentralized use of clouds will be driven by teams seeking competitive advantage and business agility – ultimately unstoppable…
December 9, 2009 OK, so what should we do?
The Umbrella Strategy Monitor network traffic to cloud vendor API’s Catch unauthorized clouds as they form Measure, trend and predict cloud activity Pick two cloud standards, setup corporate accounts Better sharing of lessons as they are learned Create path of least resistance for users Avoid vendor lock-in Aggregate traffic to get bulk discounts Pressure the vendors to develop common standards, but don’t wait for them to arrive. The best APIs will be cloned eventually Sponsor a pathfinder project for each vendor Navigate a route through the clouds Don’t get caught unawares in the rain
Predicting the Weather Billing is variable, monthly in arrears Try to predict how much you will be billed… Not an issue to start with, but can’t be ignored Amazon has a cost estimator tool that may help Central analysis and collection of cloud metrics Cloud vendor usage metrics via API Your system and application metrics via Ganglia Intra-cloud bandwidth is free, analyze in the cloud! Based on standard analysis framework “Hadoop” Validation of vendor metrics and billing Charge-back for individual user applications
Use it to learn it… Focus on how you can use the cloud yourself to do large scale log processing You can upload huge datasets to a cloud and crunch them with a large cluster of computers using Amazon Elastic Map Reduce (EMR) Do it all from your web browser for a handful of dollars charged to a credit card. Here’s how
Cloud Crunch The Recipe You will need: A computer connected to the Internet The Firefox browser A Firefox specific browser extension A credit card and less than $1 to spend Big log files as ingredients Some very processor intensive queries
Recipe First we will warm up our computer by setting up Firefox and connecting to the cloud. Then we will upload our ingredients to be crunched, at about 20 minutes per gigabyte You should pick between one and twenty processors to crunch with, they are charged by the hour and the cloud takes about 10 minutes to warm up. The query itself starts by mapping the ingredients so that the mixture separates, then the excess is boiled off to make a nice informative reduction.
Costs Firefox and Extension – free Upload ingredients – 10 cents/Gigabyte Small Processors – 11.5 cents/hour each Download results – 17 cents/Gigabyte Storage – 15 cents/Gigabyte/month Service updates – 1 cent/1000 calls Service requests – 1 cent/10,000 calls Actual cost to run two example programs as described in this presentation was 26 cents
Faster Results at Same Cost! You may have trouble finding enough data and a complex enough query to keep the processors busy for an hour. In that case you can use fewer processors Conversely if you want quicker results you can use more and/or larger processors. Up to 20 systems with 8 CPU cores = 160 cores is immediately available. Oversize on request.
Step by Step Walkthrough to get you started Run Amazon Elastic MapReduceexamples Get up and running before this presentation is over….
Step 1 – Get Firefox http://www.mozilla.com/en-US/firefox/firefox.html
Step 2 – Get S3Fox Extension Next select the Add-ons option from the Tools menu, select “Get Add-ons” and search for S3Fox.
Step 3 – Learn AboutAWS Bring up http://aws.amazon.com/ to read about the services.  Amazon S3 is short for Amazon Simple Storage Service, which is part of the Amazon Web Services product  We will be using Amazon S3 to store data, and Amazon Elastic Map Reduce (EMR) to process it. Underneath EMR there is an Amazon Elastic Compute Cloud (EC2), which is created automatically for you each time you use EMR.
What is S3, EC2, EMR? Amazon Simple Storage Service lets you put data into the cloud that is addressed using a URL. Access to it can be private or public. Amazon Elastic Compute Cloud lets you pick the size and number of computers in your cloud. Amazon Elastic Map Reduce automatically builds a Hadoop cluster on top of EC2, feeds it data from S3, saves the results in S3, then removes the cluster and frees up the EC2 systems.
Step 4 – Sign Up For AWS Go to the top-right of the page and sign up at http://aws.amazon.com/ You can login using the same account you use to buy books!
Step 5 – Sign Up For Amazon S3 Follow the link to http://aws.amazon.com/s3/
Check The S3 Rate Card
Step 6 – Sign Up For Amazon EMR Go to http://aws.amazon.com/elasticmapreduce/
EC2 Signup The EMR signup combines the other needed  services such as EC2 in the sign up process and the rates for all are displayed. The EMR costs are in addition to the EC2 costs, so 1.5 cents/hour for EMR is added to the 10 cents/hour for EC2 making 11.5 cents/hour for each small instance running Linux with Hadoop.
EMR And EC2 Rates(old data – it is cheaper now with even bigger nodes)
Step 7 - How Big are EC2 Instances? See http://aws.amazon.com/ec2/instance-types/ The compute power is specified in a standard unit called an EC2 Compute Unit (ECU). Amazon states “One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.”
Standard Instances Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform. Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform. Extra Large Instance 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.
Compute Intensive Instances High-CPU Medium Instance 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform. High-CPU Extra Large Instance 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.
Step 8 – Getting Started http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/ There is a Getting Started guide and Developer Documentation including sample applications We will be working through two of those applications. There is also a very helpful FAQ page that is worth reading through.
Step 9 – What is EMR?
Step 10 – How ToMapReduce? Code directly in Java Submit “streaming” command line scripts EMR bundles Perl, Python, Ruby and R Code MR sequences in Java using Cascading Process log files using Cascading Multitool Write dataflow scripts with Pig Write SQL queries using Hive
Step 11 – Setup Access Keys Getting Started Guide http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsFirstSteps.html The URL to visit to get your key is: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key
Enter Access Keys in S3Fox Open S3 Firefox Organizer and select the Manage Accounts button at the top left Enter your access keys into the popup.
Step 12 – Create An Output Folder Click Here
Setup Folder Permissions
Step 13 – Run Your First Job See the Getting Started Guide at: http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsConsoleRunJob.html login to the EMR console at https://console.aws.amazon.com/elasticmapreduce/home create a new job flow called “Word crunch”, and select the sample application word count.
Set the Output S3 Bucket
Pick the Instance Count One small instance (i.e. computer) will do… Cost will be 11.5 cents for 1 hour (minimum)
Start the Job Flow The Job Flow takes a few minutes to get started, then completes in about 5 minutes run time
Step 14 – View The Results In S3Fox click on the refresh icon, then doubleclick on the crunchie folder Keep clicking until the output file is visible
Save To Your PC S3Fox – click on left arrow Save to PC Open in TextEdit See that the word “a” occurred 14716 times boring…. so try a more interesting demo! Click Here
Step 15 – Crunch Some Log Files Create a new output bucket with a new name Start a new job flow using the CloudFront Demo This uses the Cascading Multitool
Step 16 – How Much Did That Cost?
Wrap Up Easy, cheap, anyone can use it Now let’s look at how to write code…
Log & Data Analysis using Hadoop Based on slides by ShashiMadappa smadappa@netflix.com
Agenda 1. Hadoop 	- Background 	- Ecosystem  	- HDFS & Map/Reduce 	- Example 2. Log & Data Analysis @ Netflix 	- Problem / Data Volume 	- Current & Future Projects  3. Hadoop on Amazon EC2 	- Storage options 	- Deployment options
Hadoop Apache open-source software for reliable, scalable, distributed computing. Originally a sub-project of the Lucene search engine. Used to analyze and transform large data sets. Supports partial failure, Recoverability, Consistency & Scalability
Hadoop Sub-Projects Core HDFS: A distributed file system that provides high throughput access to data.  Map/Reduce : A framework for processing large data sets. HBase : A distributed database that supports structured data storage for large tables Hive : An infrastructure for ad hoc querying (sql like) Pig : A data-flow language and execution framework Avro : A data serialization system that provides dynamic integration with scripting languages. Similar to Thrift & Protocol Buffers  Cascading : Executing data processing workflow Chukwa, Mahout, ZooKeeper, and many more
Hadoop Distributed File System(HDFS) Features Cannot be mounted as a “file system” Access via command line or Java API Prefers large files (multiple terabytes) to many small files Files are write once, read many (append coming soon) Users, Groups and Permissions Name and Space Quotas  Blocksize and Replication factor are configurable per file Commands: hadoop dfs -ls, -du, -cp, -mv, -rm, -rmr Uploading files ,[object Object]
cat ReallyBigFile | hadoop dfs -put - mydata/ReallyBigFileDownloading files ,[object Object]
 hadoop dfs –tail [–f] mydata/foo,[object Object]
Map/Reduce Simple Dataflow
Detailed Dataflow
Input & Output Formats The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application. InputFormat Splits the input to determine the input to each map task. Defines a RecordReader that reads key, value pairs that are passed to the map task OutputFormat Given the key, value pairs and a filename, writes the reduce task output to persistent store.
Map/Reduce Processes Launching Application User application code Submits a specific kind of Map/Reduce job JobTracker Handles all jobs Makes all scheduling decisions TaskTracker Manager for all tasks on a given node Task Runs an individual map or reduce fragment for a given job Forks from the TaskTracker
Word Count Example Mapper Input: value: lines of text of input Output: key: word, value: 1 Reducer Input: key: word, value: set of counts Output: key: word, value: sum Launching program Defines the job Submits job to cluster
Map Phase public static class WordCountMapperextends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 	private final static IntWritable one = new IntWritable(1); 	private Text word = new Text(); 	public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 		String line = value.toString(); StringTokenizeritr = new StringTokenizer(line, “,”); 		while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); 		}   } }
Reduce Phase public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 	public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,  Reporter reporter) throws IOException { int sum = 0; 		while (values.hasNext()) { 			sum += values.next().get(); 		} output.collect(key, new IntWritable(sum)); 	} }
Running the Job public class WordCount { 	public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); 		// the keys are words (strings) conf.setOutputKeyClass(Text.class); 		// the values are counts (ints) conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(WordCountMapper.class); conf.setReducerClass(WordCountReducer.class); conf.setInputPath(new Path(args[0]); conf.setOutputPath(new Path(args[1]); JobClient.runJob(conf);	 	} }
Running the Example	 Input Welcome to Netflix This is a great place to work Ouput Netflix			1 This			1 Welcome			1 a				1 great			1 is				1 place			1 to				2 work			1
WWW Access Log Structure
Analysis using access log Top URLs Users IP Size Time based analysis Session Duration Visits per session Identify attacks (DOS, Invalid Plug-in, etc) Study Visit patterns for  Resource Planning Preload relevant data Impact of WWW call on middle tier services
Why run Hadoop in the cloud? “Infinite” resources Hadoop scales linearly Elasticity Run a large cluster for a short time Grow or shrink a cluster on demand
Common sense and safety Amazon security is good But you are a few clicks away from making a data set public by mistake Common sense precautions Before you try it yourself… Get permission to move data to the cloud Scrub data to remove sensitive information System performance monitoring logs are a good choice for analysis

Más contenido relacionado

La actualidad más candente

DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsDevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsAmazon Web Services
 
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016Managing WorkSpaces at Scale | AWS Public Sector Summit 2016
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016Amazon Web Services
 
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...Amazon Web Services
 
AWS 2016 re:Invent Launch Summary
AWS 2016 re:Invent Launch SummaryAWS 2016 re:Invent Launch Summary
AWS 2016 re:Invent Launch SummaryAmazon Web Services
 
Deep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECSDeep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECSAmazon Web Services
 
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...Amazon Web Services
 
Announcing Amazon Lightsail - January 2017 AWS Online Tech Talks
Announcing Amazon Lightsail - January 2017 AWS Online Tech TalksAnnouncing Amazon Lightsail - January 2017 AWS Online Tech Talks
Announcing Amazon Lightsail - January 2017 AWS Online Tech TalksAmazon Web Services
 
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...Amazon Web Services
 
Leveraging elastic web scale computing with AWS
 Leveraging elastic web scale computing with AWS Leveraging elastic web scale computing with AWS
Leveraging elastic web scale computing with AWSShiva Narayanaswamy
 
Amazon CloudWatch Logs and AWS Lambda
Amazon CloudWatch Logs and AWS LambdaAmazon CloudWatch Logs and AWS Lambda
Amazon CloudWatch Logs and AWS LambdaAmazon Web Services
 
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...Amazon Web Services
 
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...Amazon Web Services
 
Deep Dive: Amazon Lumberyard & Amazon GameLift
Deep Dive: Amazon Lumberyard & Amazon GameLiftDeep Dive: Amazon Lumberyard & Amazon GameLift
Deep Dive: Amazon Lumberyard & Amazon GameLiftAmazon Web Services
 
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)Amazon Web Services
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesAmazon Web Services
 
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...Amazon Web Services
 
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...Amazon Web Services
 
Ciso executive summit 2012
Ciso executive summit 2012Ciso executive summit 2012
Ciso executive summit 2012Bill Burns
 
Serverless in production, an experience report (JeffConf)
Serverless in production, an experience report (JeffConf)Serverless in production, an experience report (JeffConf)
Serverless in production, an experience report (JeffConf)Yan Cui
 
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016Amazon Web Services
 

La actualidad más candente (20)

DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsDevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
 
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016Managing WorkSpaces at Scale | AWS Public Sector Summit 2016
Managing WorkSpaces at Scale | AWS Public Sector Summit 2016
 
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
 
AWS 2016 re:Invent Launch Summary
AWS 2016 re:Invent Launch SummaryAWS 2016 re:Invent Launch Summary
AWS 2016 re:Invent Launch Summary
 
Deep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECSDeep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECS
 
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...
Get the Most Out of Amazon EC2: A Deep Dive on Reserved, On-Demand, and Spot ...
 
Announcing Amazon Lightsail - January 2017 AWS Online Tech Talks
Announcing Amazon Lightsail - January 2017 AWS Online Tech TalksAnnouncing Amazon Lightsail - January 2017 AWS Online Tech Talks
Announcing Amazon Lightsail - January 2017 AWS Online Tech Talks
 
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...
AWS re:Invent 2016: Effective Application Data Analytics for Modern Applicati...
 
Leveraging elastic web scale computing with AWS
 Leveraging elastic web scale computing with AWS Leveraging elastic web scale computing with AWS
Leveraging elastic web scale computing with AWS
 
Amazon CloudWatch Logs and AWS Lambda
Amazon CloudWatch Logs and AWS LambdaAmazon CloudWatch Logs and AWS Lambda
Amazon CloudWatch Logs and AWS Lambda
 
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...
AWS re:Invent 2016: Save up to 90% and Run Production Workloads on Spot - Fea...
 
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
 
Deep Dive: Amazon Lumberyard & Amazon GameLift
Deep Dive: Amazon Lumberyard & Amazon GameLiftDeep Dive: Amazon Lumberyard & Amazon GameLift
Deep Dive: Amazon Lumberyard & Amazon GameLift
 
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)
AWS re:Invent 2016: Taking DevOps to the AWS Edge (CTD302)
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless Architectures
 
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
 
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...
AWS re:Invent 2016: Application Lifecycle Management in a Serverless World (S...
 
Ciso executive summit 2012
Ciso executive summit 2012Ciso executive summit 2012
Ciso executive summit 2012
 
Serverless in production, an experience report (JeffConf)
Serverless in production, an experience report (JeffConf)Serverless in production, an experience report (JeffConf)
Serverless in production, an experience report (JeffConf)
 
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016
Enterprise DevOps at Scale with AWS | AWS Public Sector Summit 2016
 

Destacado

What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to YouAmazon Web Services
 
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center Consolidation
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center ConsolidationOracle Exalogic Elastic Cloud - Revolutionizing Data Center Consolidation
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center ConsolidationRex Wang
 
Pets vs. Cattle: The Elastic Cloud Story
Pets vs. Cattle: The Elastic Cloud StoryPets vs. Cattle: The Elastic Cloud Story
Pets vs. Cattle: The Elastic Cloud StoryRandy Bias
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)IMC Institute
 
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...troyangrignon
 
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...Amazon Web Services
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop MapR Technologies
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
 
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...Amazon Web Services
 
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017Amazon Web Services
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?Amazon Web Services
 
Why AWS in Education: Transforming Education in the Cloud
Why AWS in Education: Transforming Education in the CloudWhy AWS in Education: Transforming Education in the Cloud
Why AWS in Education: Transforming Education in the CloudAmazon Web Services
 
2015 Future of Cloud Computing Study
2015 Future of Cloud Computing Study2015 Future of Cloud Computing Study
2015 Future of Cloud Computing StudyNorth Bridge
 
Amazon Elastic MapReduce (EMR): Hadoop as a Service
Amazon Elastic MapReduce (EMR): Hadoop as a ServiceAmazon Elastic MapReduce (EMR): Hadoop as a Service
Amazon Elastic MapReduce (EMR): Hadoop as a ServiceVille Seppänen
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 

Destacado (20)

What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center Consolidation
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center ConsolidationOracle Exalogic Elastic Cloud - Revolutionizing Data Center Consolidation
Oracle Exalogic Elastic Cloud - Revolutionizing Data Center Consolidation
 
Pets vs. Cattle: The Elastic Cloud Story
Pets vs. Cattle: The Elastic Cloud StoryPets vs. Cattle: The Elastic Cloud Story
Pets vs. Cattle: The Elastic Cloud Story
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...
NOV 8 2012: Cloud Expo SF "Elastic Cloud Infrastructure: Why the Enterprise W...
 
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...
(EDU202) Enterprise Cloud Adoption Strategies in Higher Education | AWS re:In...
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...
AWS re:Invent 2016: How Harvard University Improves Scalable Cloud Network Se...
 
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?
 
Keynote & Introduction
Keynote & IntroductionKeynote & Introduction
Keynote & Introduction
 
Why AWS in Education: Transforming Education in the Cloud
Why AWS in Education: Transforming Education in the CloudWhy AWS in Education: Transforming Education in the Cloud
Why AWS in Education: Transforming Education in the Cloud
 
2015 Future of Cloud Computing Study
2015 Future of Cloud Computing Study2015 Future of Cloud Computing Study
2015 Future of Cloud Computing Study
 
Amazon Elastic MapReduce (EMR): Hadoop as a Service
Amazon Elastic MapReduce (EMR): Hadoop as a ServiceAmazon Elastic MapReduce (EMR): Hadoop as a Service
Amazon Elastic MapReduce (EMR): Hadoop as a Service
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Getting Started with AWS
Getting Started with AWSGetting Started with AWS
Getting Started with AWS
 

Similar a Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop

GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SCGIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SCJim Tochterman
 
AWS APAC Webinar Series: How to Reduce Your Spend on AWS
AWS APAC Webinar Series: How to Reduce Your Spend on AWSAWS APAC Webinar Series: How to Reduce Your Spend on AWS
AWS APAC Webinar Series: How to Reduce Your Spend on AWSAmazon Web Services
 
How to Reduce your Spend on AWS
How to Reduce your Spend on AWSHow to Reduce your Spend on AWS
How to Reduce your Spend on AWSJoseph K. Ziegler
 
Amazon web services : Layman Introduction
Amazon web services : Layman IntroductionAmazon web services : Layman Introduction
Amazon web services : Layman IntroductionParashar Borkotoky
 
Cloud Computing Primer: Using cloud computing tools in your museum
Cloud Computing Primer: Using cloud computing tools in your museumCloud Computing Primer: Using cloud computing tools in your museum
Cloud Computing Primer: Using cloud computing tools in your museumRobert J. Stein
 
Cloud Computing Workshop
Cloud Computing WorkshopCloud Computing Workshop
Cloud Computing WorkshopCharlie Moad
 
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)Amazon Web Services
 
The Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyRobert Dempsey
 
Amazon cloud intance launch3
Amazon cloud intance launch3Amazon cloud intance launch3
Amazon cloud intance launch3Zenita Smythe
 
Amazon cloud intance launch3
Amazon cloud intance launch3Amazon cloud intance launch3
Amazon cloud intance launch3Zenita Smythe
 
Amazon cloud intance launch
Amazon cloud intance launchAmazon cloud intance launch
Amazon cloud intance launchZenita Smythe
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
EC2 Masterclass from the AWS User Group Scotland Meetup
EC2 Masterclass from the AWS User Group Scotland MeetupEC2 Masterclass from the AWS User Group Scotland Meetup
EC2 Masterclass from the AWS User Group Scotland MeetupIan Massingham
 
Exploring The Cloud
Exploring The CloudExploring The Cloud
Exploring The Cloudawesomesos
 
Elastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS PresentationElastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS PresentationKnoldus Inc.
 
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applications
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applicationsAWS Summit Stockholm 2014 – B5 – The TCO of cloud applications
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applicationsAmazon Web Services
 
Introduction to EC2
Introduction to EC2Introduction to EC2
Introduction to EC2Mark Squires
 

Similar a Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop (20)

GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SCGIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
 
AWS APAC Webinar Series: How to Reduce Your Spend on AWS
AWS APAC Webinar Series: How to Reduce Your Spend on AWSAWS APAC Webinar Series: How to Reduce Your Spend on AWS
AWS APAC Webinar Series: How to Reduce Your Spend on AWS
 
How to Reduce your Spend on AWS
How to Reduce your Spend on AWSHow to Reduce your Spend on AWS
How to Reduce your Spend on AWS
 
Amazon web services : Layman Introduction
Amazon web services : Layman IntroductionAmazon web services : Layman Introduction
Amazon web services : Layman Introduction
 
Cloud Computing Primer: Using cloud computing tools in your museum
Cloud Computing Primer: Using cloud computing tools in your museumCloud Computing Primer: Using cloud computing tools in your museum
Cloud Computing Primer: Using cloud computing tools in your museum
 
Cloud Computing Workshop
Cloud Computing WorkshopCloud Computing Workshop
Cloud Computing Workshop
 
Optimize Cost Efficiency on AWS
Optimize Cost Efficiency on AWSOptimize Cost Efficiency on AWS
Optimize Cost Efficiency on AWS
 
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)
Masterclass Webinar - Amazon Elastic Compute Cloud (EC2)
 
Amazon S3 and EC2
Amazon S3 and EC2Amazon S3 and EC2
Amazon S3 and EC2
 
The Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with Ruby
 
Amazon cloud intance launch3
Amazon cloud intance launch3Amazon cloud intance launch3
Amazon cloud intance launch3
 
Amazon cloud intance launch3
Amazon cloud intance launch3Amazon cloud intance launch3
Amazon cloud intance launch3
 
Amazon cloud intance launch
Amazon cloud intance launchAmazon cloud intance launch
Amazon cloud intance launch
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
EC2 Masterclass from the AWS User Group Scotland Meetup
EC2 Masterclass from the AWS User Group Scotland MeetupEC2 Masterclass from the AWS User Group Scotland Meetup
EC2 Masterclass from the AWS User Group Scotland Meetup
 
Harnessing The Cloud
Harnessing The CloudHarnessing The Cloud
Harnessing The Cloud
 
Exploring The Cloud
Exploring The CloudExploring The Cloud
Exploring The Cloud
 
Elastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS PresentationElastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS Presentation
 
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applications
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applicationsAWS Summit Stockholm 2014 – B5 – The TCO of cloud applications
AWS Summit Stockholm 2014 – B5 – The TCO of cloud applications
 
Introduction to EC2
Introduction to EC2Introduction to EC2
Introduction to EC2
 

Más de Adrian Cockcroft

Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Adrian Cockcroft
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Adrian Cockcroft
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Adrian Cockcroft
 
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionAdrian Cockcroft
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSFAdrian Cockcroft
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud ArchitectureAdrian Cockcroft
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconAdrian Cockcroft
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumAdrian Cockcroft
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Adrian Cockcroft
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Adrian Cockcroft
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Adrian Cockcroft
 

Más de Adrian Cockcroft (20)

Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
 
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
 
Gluecon keynote
Gluecon keynoteGluecon keynote
Gluecon keynote
 
Dystopia as a Service
Dystopia as a ServiceDystopia as a Service
Dystopia as a Service
 
Netflix and Open Source
Netflix and Open SourceNetflix and Open Source
Netflix and Open Source
 
NetflixOSS Meetup
NetflixOSS MeetupNetflixOSS Meetup
NetflixOSS Meetup
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source Platform
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop

  • 1. What is Cloud Computing?
  • 2. Using the Cloud to Crunch Your Data Adrian Cockcroft – acockcroft@netflix.com
  • 3. What is Capacity Planning We care about CPU, Memory, Network and Disk resources, and Application response times We need to know how much of each resource we are using now, and will use in the future We need to know how much headroom we have to handle higher loads We want to understand how headroom varies, and how it relates to application response times and throughput
  • 4. Capacity Planning Norms Capacity is expensive Capacity takes time to buy and provision Capacity only increases, can’t be shrunk easily Capacity comes in big chunks, paid up front Planning errors can cause big problems Systems are clearly defined assets Systems can be instrumented in detail
  • 5. Capacity Planning in Clouds Capacity is expensive Capacity takes time to buy and provision Capacity only increases, can’t be shrunk easily Capacity comes in big chunks, paid up front Planning errors can cause big problems Systems are clearly defined assets Systems can be instrumented in detail
  • 6. Capacity is expensivehttp://aws.amazon.com/s3/ & http://aws.amazon.com/ec2/ Storage (Amazon S3) $0.150 per GB – first 50 TB / month of storage used $0.120 per GB – storage used / month over 500 TB Data Transfer (Amazon S3) $0.100 per GB – all data transfer in $0.170 per GB – first 10 TB / month data transfer out $0.100 per GB – data transfer out / month over 150 TB Requests (Amazon S3 Storage access is via http) $0.01 per 1,000 PUT, COPY, POST, or LIST requests $0.01 per 10,000 GET and all other requests $0 per DELETE CPU (Amazon EC2) Small (Default) $0.085 per hour, Extra Large $0.68 per hour Network (Amazon EC2) Inbound/Outbound around $0.10 per GB
  • 7. Capacity comes in big chunks, paid up front Capacity takes time to buy and provision No minimum price, monthly billing “Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously” Capacity only increases, can’t be shrunk easily Pay for what is actually used Planning errors can cause big problems Size only for what you need now
  • 8. Systems are clearly defined assets You are running in a “stateless” multi-tenanted virtual image that can die or be taken away and replaced at any time You don’t know exactly where it is You can choose to locate “USA” or “Europe” You can specify zones that will not share components to avoid common mode failures
  • 9. Systems can be instrumented in detail Need to use stateless monitoring tools Monitored nodes come and go by the hour Need to write role-name to hostname e.g. wwwprod002, not the EC2 default all monitoring by role-name Ganglia – automatic configuration Multicast replicated monitoring state No need to pre-define metrics and nodes
  • 10. Acquisition requires management buy-in Anyone with a credit card and $10 is in business Data governance issues… Remember 1980’s when PC’s first turned up? Departmental budgets funded PC’s No standards, management or backups Central IT departments could lose control Decentralized use of clouds will be driven by teams seeking competitive advantage and business agility – ultimately unstoppable…
  • 11. December 9, 2009 OK, so what should we do?
  • 12. The Umbrella Strategy Monitor network traffic to cloud vendor API’s Catch unauthorized clouds as they form Measure, trend and predict cloud activity Pick two cloud standards, setup corporate accounts Better sharing of lessons as they are learned Create path of least resistance for users Avoid vendor lock-in Aggregate traffic to get bulk discounts Pressure the vendors to develop common standards, but don’t wait for them to arrive. The best APIs will be cloned eventually Sponsor a pathfinder project for each vendor Navigate a route through the clouds Don’t get caught unawares in the rain
  • 13. Predicting the Weather Billing is variable, monthly in arrears Try to predict how much you will be billed… Not an issue to start with, but can’t be ignored Amazon has a cost estimator tool that may help Central analysis and collection of cloud metrics Cloud vendor usage metrics via API Your system and application metrics via Ganglia Intra-cloud bandwidth is free, analyze in the cloud! Based on standard analysis framework “Hadoop” Validation of vendor metrics and billing Charge-back for individual user applications
  • 14. Use it to learn it… Focus on how you can use the cloud yourself to do large scale log processing You can upload huge datasets to a cloud and crunch them with a large cluster of computers using Amazon Elastic Map Reduce (EMR) Do it all from your web browser for a handful of dollars charged to a credit card. Here’s how
  • 15. Cloud Crunch The Recipe You will need: A computer connected to the Internet The Firefox browser A Firefox specific browser extension A credit card and less than $1 to spend Big log files as ingredients Some very processor intensive queries
  • 16. Recipe First we will warm up our computer by setting up Firefox and connecting to the cloud. Then we will upload our ingredients to be crunched, at about 20 minutes per gigabyte You should pick between one and twenty processors to crunch with, they are charged by the hour and the cloud takes about 10 minutes to warm up. The query itself starts by mapping the ingredients so that the mixture separates, then the excess is boiled off to make a nice informative reduction.
  • 17. Costs Firefox and Extension – free Upload ingredients – 10 cents/Gigabyte Small Processors – 11.5 cents/hour each Download results – 17 cents/Gigabyte Storage – 15 cents/Gigabyte/month Service updates – 1 cent/1000 calls Service requests – 1 cent/10,000 calls Actual cost to run two example programs as described in this presentation was 26 cents
  • 18. Faster Results at Same Cost! You may have trouble finding enough data and a complex enough query to keep the processors busy for an hour. In that case you can use fewer processors Conversely if you want quicker results you can use more and/or larger processors. Up to 20 systems with 8 CPU cores = 160 cores is immediately available. Oversize on request.
  • 19. Step by Step Walkthrough to get you started Run Amazon Elastic MapReduceexamples Get up and running before this presentation is over….
  • 20. Step 1 – Get Firefox http://www.mozilla.com/en-US/firefox/firefox.html
  • 21. Step 2 – Get S3Fox Extension Next select the Add-ons option from the Tools menu, select “Get Add-ons” and search for S3Fox.
  • 22. Step 3 – Learn AboutAWS Bring up http://aws.amazon.com/ to read about the services. Amazon S3 is short for Amazon Simple Storage Service, which is part of the Amazon Web Services product We will be using Amazon S3 to store data, and Amazon Elastic Map Reduce (EMR) to process it. Underneath EMR there is an Amazon Elastic Compute Cloud (EC2), which is created automatically for you each time you use EMR.
  • 23. What is S3, EC2, EMR? Amazon Simple Storage Service lets you put data into the cloud that is addressed using a URL. Access to it can be private or public. Amazon Elastic Compute Cloud lets you pick the size and number of computers in your cloud. Amazon Elastic Map Reduce automatically builds a Hadoop cluster on top of EC2, feeds it data from S3, saves the results in S3, then removes the cluster and frees up the EC2 systems.
  • 24. Step 4 – Sign Up For AWS Go to the top-right of the page and sign up at http://aws.amazon.com/ You can login using the same account you use to buy books!
  • 25. Step 5 – Sign Up For Amazon S3 Follow the link to http://aws.amazon.com/s3/
  • 26. Check The S3 Rate Card
  • 27. Step 6 – Sign Up For Amazon EMR Go to http://aws.amazon.com/elasticmapreduce/
  • 28. EC2 Signup The EMR signup combines the other needed services such as EC2 in the sign up process and the rates for all are displayed. The EMR costs are in addition to the EC2 costs, so 1.5 cents/hour for EMR is added to the 10 cents/hour for EC2 making 11.5 cents/hour for each small instance running Linux with Hadoop.
  • 29. EMR And EC2 Rates(old data – it is cheaper now with even bigger nodes)
  • 30. Step 7 - How Big are EC2 Instances? See http://aws.amazon.com/ec2/instance-types/ The compute power is specified in a standard unit called an EC2 Compute Unit (ECU). Amazon states “One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.”
  • 31. Standard Instances Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform. Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform. Extra Large Instance 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.
  • 32. Compute Intensive Instances High-CPU Medium Instance 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform. High-CPU Extra Large Instance 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.
  • 33. Step 8 – Getting Started http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/ There is a Getting Started guide and Developer Documentation including sample applications We will be working through two of those applications. There is also a very helpful FAQ page that is worth reading through.
  • 34. Step 9 – What is EMR?
  • 35. Step 10 – How ToMapReduce? Code directly in Java Submit “streaming” command line scripts EMR bundles Perl, Python, Ruby and R Code MR sequences in Java using Cascading Process log files using Cascading Multitool Write dataflow scripts with Pig Write SQL queries using Hive
  • 36. Step 11 – Setup Access Keys Getting Started Guide http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsFirstSteps.html The URL to visit to get your key is: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key
  • 37. Enter Access Keys in S3Fox Open S3 Firefox Organizer and select the Manage Accounts button at the top left Enter your access keys into the popup.
  • 38. Step 12 – Create An Output Folder Click Here
  • 40. Step 13 – Run Your First Job See the Getting Started Guide at: http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsConsoleRunJob.html login to the EMR console at https://console.aws.amazon.com/elasticmapreduce/home create a new job flow called “Word crunch”, and select the sample application word count.
  • 41. Set the Output S3 Bucket
  • 42. Pick the Instance Count One small instance (i.e. computer) will do… Cost will be 11.5 cents for 1 hour (minimum)
  • 43. Start the Job Flow The Job Flow takes a few minutes to get started, then completes in about 5 minutes run time
  • 44. Step 14 – View The Results In S3Fox click on the refresh icon, then doubleclick on the crunchie folder Keep clicking until the output file is visible
  • 45. Save To Your PC S3Fox – click on left arrow Save to PC Open in TextEdit See that the word “a” occurred 14716 times boring…. so try a more interesting demo! Click Here
  • 46. Step 15 – Crunch Some Log Files Create a new output bucket with a new name Start a new job flow using the CloudFront Demo This uses the Cascading Multitool
  • 47. Step 16 – How Much Did That Cost?
  • 48. Wrap Up Easy, cheap, anyone can use it Now let’s look at how to write code…
  • 49. Log & Data Analysis using Hadoop Based on slides by ShashiMadappa smadappa@netflix.com
  • 50. Agenda 1. Hadoop - Background - Ecosystem - HDFS & Map/Reduce - Example 2. Log & Data Analysis @ Netflix - Problem / Data Volume - Current & Future Projects 3. Hadoop on Amazon EC2 - Storage options - Deployment options
  • 51. Hadoop Apache open-source software for reliable, scalable, distributed computing. Originally a sub-project of the Lucene search engine. Used to analyze and transform large data sets. Supports partial failure, Recoverability, Consistency & Scalability
  • 52. Hadoop Sub-Projects Core HDFS: A distributed file system that provides high throughput access to data. Map/Reduce : A framework for processing large data sets. HBase : A distributed database that supports structured data storage for large tables Hive : An infrastructure for ad hoc querying (sql like) Pig : A data-flow language and execution framework Avro : A data serialization system that provides dynamic integration with scripting languages. Similar to Thrift & Protocol Buffers Cascading : Executing data processing workflow Chukwa, Mahout, ZooKeeper, and many more
  • 53.
  • 54.
  • 55.
  • 58. Input & Output Formats The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application. InputFormat Splits the input to determine the input to each map task. Defines a RecordReader that reads key, value pairs that are passed to the map task OutputFormat Given the key, value pairs and a filename, writes the reduce task output to persistent store.
  • 59. Map/Reduce Processes Launching Application User application code Submits a specific kind of Map/Reduce job JobTracker Handles all jobs Makes all scheduling decisions TaskTracker Manager for all tasks on a given node Task Runs an individual map or reduce fragment for a given job Forks from the TaskTracker
  • 60. Word Count Example Mapper Input: value: lines of text of input Output: key: word, value: 1 Reducer Input: key: word, value: set of counts Output: key: word, value: sum Launching program Defines the job Submits job to cluster
  • 61. Map Phase public static class WordCountMapperextends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizeritr = new StringTokenizer(line, “,”); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } }
  • 62. Reduce Phase public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 63. Running the Job public class WordCount { public static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); // the keys are words (strings) conf.setOutputKeyClass(Text.class); // the values are counts (ints) conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(WordCountMapper.class); conf.setReducerClass(WordCountReducer.class); conf.setInputPath(new Path(args[0]); conf.setOutputPath(new Path(args[1]); JobClient.runJob(conf); } }
  • 64. Running the Example Input Welcome to Netflix This is a great place to work Ouput Netflix 1 This 1 Welcome 1 a 1 great 1 is 1 place 1 to 2 work 1
  • 65. WWW Access Log Structure
  • 66. Analysis using access log Top URLs Users IP Size Time based analysis Session Duration Visits per session Identify attacks (DOS, Invalid Plug-in, etc) Study Visit patterns for Resource Planning Preload relevant data Impact of WWW call on middle tier services
  • 67. Why run Hadoop in the cloud? “Infinite” resources Hadoop scales linearly Elasticity Run a large cluster for a short time Grow or shrink a cluster on demand
  • 68. Common sense and safety Amazon security is good But you are a few clicks away from making a data set public by mistake Common sense precautions Before you try it yourself… Get permission to move data to the cloud Scrub data to remove sensitive information System performance monitoring logs are a good choice for analysis