This document discusses the challenges of running Hadoop on AWS and describes the author's personal experience with different approaches. It notes that while AWS can save money, running Hadoop is complex and requires significant tuning. The author details trying Pig/EMR, then Cascalog/EMR, and managing their own Hadoop cluster, before ultimately finding the most success with Cascalog/EMR after redesigning their data pipelines to be fault tolerant. The key lessons: break processing into stages, write to S3 after each stage, plan for variability in run times, compress data going to and from S3, and, in the author's words, "drinking helps."
5. WHERE WE ARE TODAY
MapR M3 on EMR
All data read from and written to S3
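Because Cascalog taps go through Hadoop's filesystem layer, pointing a job at S3 is just a matter of using an S3 URI in the tap. A minimal sketch of the idea (the bucket and paths here are hypothetical, not from the deck):

(use 'cascalog.api)

;; hfs-textline builds a Hadoop tap over any Hadoop-supported
;; filesystem, including S3 via s3n:// URIs. Bucket/paths are made up.
(let [in  (hfs-textline "s3n://our-bucket/raw-events")
      out (hfs-textline "s3n://our-bucket/passthrough")]
  ;; Trivial pass-through query: read every line, write it back out.
  (?- out
      (<- [?line]
          (in ?line))))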
6. CLOJURE FOR DATA PROCESSING
All of our MapReduce jobs are written in Cascalog.
This gives us speed, flexibility, and testability.
More importantly, Clojure and Cascalog are fun to write.
7. CASCALOG EXAMPLE
(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import
    org.apache.lucene.analysis.standard.StandardAnalyzer
    org.apache.lucene.analysis.TokenStream
    org.apache.lucene.util.Version
    org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  ;; TokenStream is stateful: incrementToken advances it, and the
  ;; TermAttribute then holds the current token's text.
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))
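To make the example concrete, here is one way tokenizer-seq might be wired into a query. The names tokenize-string, tokenise, and word-count are our own illustrative ones, not from the original deck; this assumes the ns above plus cascalog.ops, and Lucene 3.x, where TermAttribute still exists.

(require '[cascalog.ops :as c])

(defn tokenize-string
  "Analyze a string with Lucene's StandardAnalyzer, returning its tokens."
  [s]
  (let [analyzer  (StandardAnalyzer. Version/LUCENE_30)
        tokenizer (.tokenStream analyzer nil (java.io.StringReader. s))
        term-att  (.addAttribute tokenizer TermAttribute)]
    (.reset tokenizer)
    (tokenizer-seq tokenizer term-att)))

;; Expose tokenization to Cascalog: defmapcatop emits one tuple per token.
(defmapcatop tokenise [s]
  (tokenize-string s))

;; Classic word count over a tap of text lines.
(defn word-count [in-tap out-tap]
  (?- out-tap
      (<- [?word ?count]
          (in-tap ?line)
          (tokenise ?line :> ?word)
          (c/count ?count))))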
9. “Fact: There are more Hadoop configuration options than there are stars in our galaxy.”
10. EVEN IN THE BEST CASE SCENARIO, IT TAKES A LOT OF
TUNING TO GET A HADOOP CLUSTER RUNNING WELL.
There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.
15. CASCALOG AND ELASTICMAPREDUCE
Learning Emacs, Clojure, and Cascalog was hard, but worth it.
The way our jobs were designed sucked and didn't work well with ElasticMapReduce.
16. CASCALOG AND SELF-MANAGED HADOOP CLUSTER
We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster.
We ran on spot instances.
Cluster boot-up time SUCKED, and boot-up often failed. We paid for instances during bootstrap and configuration.
Our jobs weren't designed to tolerate things like spot
instances going away in the middle of a job.
Drinking heavily dulled the pain a little.
17. CASCALOG AND ELASTICMAPREDUCE AGAIN
Rebuilt data processing pipeline from scratch (only
took nine months!)
Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes its output to S3 (see the sketch after this list).
EMR supported spot instances at this point.
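In practice the staging looks roughly like this (a sketch with hypothetical bucket names and a stand-in clean-line op, not the deck's real pipeline): each jobflow step reads the previous step's S3 output and writes its own, so a dead cluster or a lost spot instance only costs you the current step.

(use 'cascalog.api)
(require '[clojure.string :as str])

;; Stand-in operation for whatever a real first stage would do.
(defmapop clean-line [line] (str/trim line))

;; Stage 1: raw -> cleaned, persisted to S3 before stage 2 ever runs.
(defn run-stage-1 []
  (?- (hfs-textline "s3n://our-bucket/stage-1-output")
      (<- [?cleaned]
          ((hfs-textline "s3n://our-bucket/raw-input") ?line)
          (clean-line ?line :> ?cleaned))))

;; Stage 2 starts from stage 1's S3 output, not from the raw data.
(defn run-stage-2 []
  (?- (hfs-textline "s3n://our-bucket/stage-2-output")
      (<- [?cleaned]
          ((hfs-textline "s3n://our-bucket/stage-1-output") ?cleaned))))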
18. WEIRD BUGS THAT WE'VE HIT
Bootstrap script errors
Random cluster fuckedupedness
AMI version changes
Vendor issues
My personal favourite: Invisible S3 write failures.
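One cheap defense against the invisible-write problem (a sketch of the idea, not what the deck actually ran): after each step, check that the output path exists and actually contains files before launching the next step.

(import '[org.apache.hadoop.conf Configuration]
        '[org.apache.hadoop.fs FileSystem Path]
        '[java.net URI])

(defn output-looks-sane?
  "True if the given URI (e.g. an s3n:// path) exists and contains files."
  [uri]
  (let [conf (Configuration.)
        fs   (FileSystem/get (URI. uri) conf)
        path (Path. uri)]
    (and (.exists fs path)
         (pos? (alength (.listStatus fs path))))))

;; e.g. (output-looks-sane? "s3n://our-bucket/stage-1-output")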
19. IF YOU MUST RUN ON AWS
Break your processing pipelines into stages; write out
to S3 after each stage.
Bake (a lot of) variability into your expected jobflow run times.
Compress the data you are reading from and writing to S3 as much as possible (see the sketch after this list).
Drinking helps.
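For the compression point above, one way to do it from Cascalog (a sketch: these are the Hadoop 1.x property names, and we assume with-job-conf as exposed by cascalog.api) is to wrap query execution so every step gzips what it writes out, S3 included.

(use 'cascalog.api)

;; Gzip all job output, including what lands on S3. The data and the
;; output path here are placeholders.
(let [data [["hello"] ["world"]]]
  (with-job-conf
    {"mapred.output.compress"          "true"
     "mapred.output.compression.codec" "org.apache.hadoop.io.compress.GzipCodec"}
    (?- (hfs-textline "s3n://our-bucket/compressed-output")
        (<- [?line] (data ?line)))))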