SURVIVING HADOOP ON
 AWS IN PRODUCTION
DISCLAIMER:
I AM A BAD PERSON.
ABOUT ME
Chief Data Scientist at Yieldbot, Co-Founder at
                 StockTwits.
                  @sorenmacbeth
YIELDBOT
“Yieldbot's technology creates a marketplace where search
  advertisers buy real-time consumer intent on premium
                        publishers.”
WHERE WE ARE TODAY
MapR M3 on EMR
All data read from and written to S3
CLOJURE FOR DATA PROCESSING
  All of our MapReduce jobs are written in Cascalog.
     This gives us speed, flexibility and testability.
More importantly, Clojure and Cascalog are fun to write.
CASCALOG EXAMPLE
(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import
   org.apache.lucene.analysis.standard.StandardAnalyzer
   org.apache.lucene.analysis.TokenStream
   org.apache.lucene.util.Version
   org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att) (tokenizer-seq tokenizer term-att)))))
HADOOP IS COMPLEX
“Fact: There are more Hadoop configuration options than
               there are stars in our galaxy.”
EVEN IN THE BEST CASE SCENARIO, IT TAKES A LOT OF
 TUNING TO GET A HADOOP CLUSTER RUNNING WELL.
There are large companies that make money solely by
  configuring and supporting Hadoop clusters for
               enterprise customers.
RUNNING HADOOP ON AWS
SO WHY RUN ON AWS?
       $$$
HADOOP ON AWS:
 A PERSONAL HISTORY
PIG AND ELASTICMAPREDUCE
Slow development cycle; writing Java sucks.
CASCALOG AND ELASTICMAPREDUCE
Learning Emacs, Clojure, and Cascalog was hard, but
it was worth it.
The way our jobs were designed sucked and didn't
work well with ElasticMapReduce.
CASCALOG AND SELF-MANAGED HADOOP
             CLUSTER
We used a hacked-up version of a Cloudera Python
script to launch and bootstrap a cluster.
We ran on spot instances.
Cluster boot-up time SUCKED, and boot-up often failed. We
paid for instances during bootstrap and configuration.
Our jobs weren't designed to tolerate things like spot
instances going away in the middle of a job.
Drinking heavily dulled the pain a little.
CASCALOG AND ELASTICMAPREDUCE
             AGAIN
Rebuilt the data processing pipeline from scratch (only
took nine months!)
Data pipelines were broken out into a handful of
fault-tolerant jobflow steps; each step writes output to S3.
EMR supported spot instances at this point.
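An illustrative sketch of that shape, not the Yieldbot jobflow code: each step writes gzip-compressed output plus a completion marker, so a rerun after a lost spot instance picks up at the first unfinished step. Python is used for brevity, a local directory stands in for the S3 bucket, and run_step, run_jobflow and the _SUCCESS marker name are hypothetical names chosen for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def run_step(name, func, records, store):
    """Run one step unless its output already exists, then return its output.

    `store` is a local directory standing in for an S3 prefix (assumption).
    """
    out = store / f"{name}.json.gz"
    done = store / f"{name}._SUCCESS"
    if done.exists():
        # Step finished on an earlier run: reuse its output, skip the work.
        with gzip.open(out, "rt", encoding="utf-8") as fh:
            return [json.loads(line) for line in fh]
    result = [func(r) for r in records]
    with gzip.open(out, "wt", encoding="utf-8") as fh:
        for r in result:
            fh.write(json.dumps(r) + "\n")
    done.touch()  # written last, so a marker implies complete output
    return result


def run_jobflow(steps, records, store):
    """Feed each step's output into the next one, checkpointing in between."""
    for name, func in steps:
        records = run_step(name, func, records, store)
    return records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = [("double", lambda x: x * 2), ("inc", lambda x: x + 1)]
        print(run_jobflow(steps, [1, 2, 3], Path(tmp)))
```

Writing the marker after the data is the design choice that matters here: a step that dies halfway leaves no marker and is simply rerun.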
WEIRD BUGS THAT WE'VE HIT
Bootstrap script errors
Random cluster fuckedupedness
AMI version changes
Vendor issues
My personal favourite: Invisible S3 write failures.
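The slides do not say how the invisible write failures were caught. One possible guard, sketched under assumptions: write a small manifest (byte count and SHA-256 digest) next to each output, and have the next step refuse to start unless the data it reads back matches. A local directory stands in for S3, and the function names and the .manifest suffix are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_with_manifest(path, data):
    """Write `data` and a sidecar manifest holding its size and digest."""
    path.write_bytes(data)
    manifest = path.with_name(path.name + ".manifest")
    manifest.write_text(f"{len(data)} {hashlib.sha256(data).hexdigest()}\n")


def verify_output(path):
    """Return True only if the output exists and matches its manifest."""
    manifest = path.with_name(path.name + ".manifest")
    if not path.exists() or not manifest.exists():
        return False
    size, digest = manifest.read_text().split()
    data = path.read_bytes()
    return len(data) == int(size) and hashlib.sha256(data).hexdigest() == digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        part = Path(tmp) / "part-00000"
        write_with_manifest(part, b"some step output")
        print(verify_output(part))
```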
IF YOU MUST RUN ON AWS
Break your processing pipelines into stages; write out
to S3 after each stage.
Bake (a lot of) variability into your expected jobflow
run times.
Compress the data you are reading from and writing to
S3 as much as possible.
Drinking helps.
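One way to put a number on that run-time variability, as a sketch only: derive an alerting deadline from past run times, taking the median plus a generous multiple of the spread. The rule itself, the slack factor of 3 and the function names are assumptions for illustration, not something from the talk.

```python
import statistics


def runtime_budget(past_minutes, slack=3.0):
    """Deadline in minutes: median of past runs plus `slack` standard deviations.

    Hypothetical rule; needs at least one past run time.
    """
    median = statistics.median(past_minutes)
    spread = statistics.pstdev(past_minutes)
    return median + slack * spread


def is_overdue(elapsed_minutes, past_minutes, slack=3.0):
    """True when the current run has exceeded its budget."""
    return elapsed_minutes > runtime_budget(past_minutes, slack)


if __name__ == "__main__":
    history = [60, 75, 90, 240]  # made-up run times in minutes
    print(runtime_budget(history), is_overdue(120, history))
```

A budget built this way widens automatically when the history is noisy, which is the point: a jobflow that usually takes an hour but once took four should not page anyone at minute 61.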
QUESTIONS?
YIELDBOT IS HIRING!
    http://yieldbot.com/jobs
