This document discusses the challenges of running Hadoop on AWS and describes the author's personal experience with different approaches. It notes that while AWS can save money, running Hadoop is complex and requires significant tuning. The author details trying Pig/EMR, then Cascalog/EMR, and managing their own Hadoop cluster, before ultimately finding the most success with Cascalog/EMR after redesigning their data pipelines to be fault tolerant. The key lessons: break processing into stages, write to S3 after each stage, plan for variability in run times, compress data going to and from S3, and, in the author's words, "drinking helps."
5. WHERE WE ARE TODAY
MapR M3 on EMR
All data read from and written to S3
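Because Cascalog taps go through Hadoop's filesystem layer, pointing a job at S3 is just a matter of using an S3 URI in the tap. A minimal sketch of the idea (the bucket and paths here are hypothetical, not from the deck):

(use 'cascalog.api)

;; hfs-textline builds a Hadoop tap over any Hadoop-supported
;; filesystem, including S3 via s3n:// URIs. Bucket/paths are made up.
(let [in  (hfs-textline "s3n://our-bucket/raw-events")
      out (hfs-textline "s3n://our-bucket/passthrough")]
  ;; Trivial pass-through query: read every line, write it back out.
  (?- out
      (<- [?line]
          (in ?line))))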
6. CLOJURE FOR DATA PROCESSING
All of our MapReduce jobs are written in Cascalog.
This gives us speed, flexibility, and testability.
More importantly, Clojure and Cascalog are fun to write.
7. CASCALOG EXAMPLE
(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import
    org.apache.lucene.analysis.standard.StandardAnalyzer
    org.apache.lucene.analysis.TokenStream
    org.apache.lucene.util.Version
    org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  ;; TokenStream is stateful: incrementToken advances it, and the
  ;; TermAttribute then holds the current token's text.
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))
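To make the example concrete, here is one way tokenizer-seq might be wired into a query. The names tokenize-string, tokenise, and word-count are our own illustrative ones, not from the original deck; this assumes the ns above plus cascalog.ops, and Lucene 3.x, where TermAttribute still exists.

(require '[cascalog.ops :as c])

(defn tokenize-string
  "Analyze a string with Lucene's StandardAnalyzer, returning its tokens."
  [s]
  (let [analyzer  (StandardAnalyzer. Version/LUCENE_30)
        tokenizer (.tokenStream analyzer nil (java.io.StringReader. s))
        term-att  (.addAttribute tokenizer TermAttribute)]
    (.reset tokenizer)
    (tokenizer-seq tokenizer term-att)))

;; Expose tokenization to Cascalog: defmapcatop emits one tuple per token.
(defmapcatop tokenise [s]
  (tokenize-string s))

;; Classic word count over a tap of text lines.
(defn word-count [in-tap out-tap]
  (?- out-tap
      (<- [?word ?count]
          (in-tap ?line)
          (tokenise ?line :> ?word)
          (c/count ?count))))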
9. “Fact: There are more Hadoop configuration options than there are stars in our galaxy.”
10. EVEN IN THE BEST CASE SCENARIO, IT TAKES A LOT OF
TUNING TO GET A HADOOP CLUSTER RUNNING WELL.
There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.
15. CASCALOG AND ELASTICMAPREDUCE
Learning Emacs, Clojure, and Cascalog was hard, but worth it.
The way our jobs were designed sucked and didn't work well with ElasticMapReduce.
16. CASCALOG AND SELF-MANAGED HADOOP CLUSTER
We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster.
We ran on spot instances.
Cluster boot-up time SUCKED, and boot-up often failed. We paid for instances during bootstrap and configuration.
Our jobs weren't designed to tolerate things like spot
instances going away in the middle of a job.
Drinking heavily dulled the pain a little.
17. CASCALOG AND ELASTICMAPREDUCE AGAIN
Rebuilt data processing pipeline from scratch (only
took nine months!)
Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes its output to S3 (see the sketch after this list).
EMR supported spot instances at this point.
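In practice the staging looks roughly like this (a sketch with hypothetical bucket names and a stand-in clean-line op, not the deck's real pipeline): each jobflow step reads the previous step's S3 output and writes its own, so a dead cluster or a lost spot instance only costs you the current step.

(use 'cascalog.api)
(require '[clojure.string :as str])

;; Stand-in operation for whatever a real first stage would do.
(defmapop clean-line [line] (str/trim line))

;; Stage 1: raw -> cleaned, persisted to S3 before stage 2 ever runs.
(defn run-stage-1 []
  (?- (hfs-textline "s3n://our-bucket/stage-1-output")
      (<- [?cleaned]
          ((hfs-textline "s3n://our-bucket/raw-input") ?line)
          (clean-line ?line :> ?cleaned))))

;; Stage 2 starts from stage 1's S3 output, not from the raw data.
(defn run-stage-2 []
  (?- (hfs-textline "s3n://our-bucket/stage-2-output")
      (<- [?cleaned]
          ((hfs-textline "s3n://our-bucket/stage-1-output") ?cleaned))))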
18. WEIRD BUGS THAT WE'VE HIT
Bootstrap script errors
Random cluster fuckedupedness
AMI version changes
Vendor issues
My personal favourite: Invisible S3 write failures.
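One cheap defense against the invisible-write problem (a sketch of the idea, not what the deck actually ran): after each step, check that the output path exists and actually contains files before launching the next step.

(import '[org.apache.hadoop.conf Configuration]
        '[org.apache.hadoop.fs FileSystem Path]
        '[java.net URI])

(defn output-looks-sane?
  "True if the given URI (e.g. an s3n:// path) exists and contains files."
  [uri]
  (let [conf (Configuration.)
        fs   (FileSystem/get (URI. uri) conf)
        path (Path. uri)]
    (and (.exists fs path)
         (pos? (alength (.listStatus fs path))))))

;; e.g. (output-looks-sane? "s3n://our-bucket/stage-1-output")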
19. IF YOU MUST RUN ON AWS
Break your processing pipelines into stages; write out
to S3 after each stage.
Bake (a lot of) variability into your expected jobflow run times.
Compress the data you are reading from and writing to S3 as much as possible (see the sketch after this list).
Drinking helps.
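For the compression point above, one way to do it from Cascalog (a sketch: these are the Hadoop 1.x property names, and we assume with-job-conf as exposed by cascalog.api) is to wrap query execution so every step gzips what it writes out, S3 included.

(use 'cascalog.api)

;; Gzip all job output, including what lands on S3. The data and the
;; output path here are placeholders.
(let [data [["hello"] ["world"]]]
  (with-job-conf
    {"mapred.output.compress"          "true"
     "mapred.output.compression.codec" "org.apache.hadoop.io.compress.GzipCodec"}
    (?- (hfs-textline "s3n://our-bucket/compressed-output")
        (<- [?line] (data ?line)))))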