This document contrasts two approaches to ETL jobs in Hadoop: a manual "special snowflake" approach and an automated one. The manual approach involves a team spending a year copying and pasting code for 15 jobs, which leads to spaghetti code and is not sustainable. The automated approach designs reusable templates and rules to generate the ETL code, freeing the developer, Brent, to focus on design rather than manual work. The result is code that is clean, consistent, easy to maintain, and passes the "10-minute test" for idempotent effort. The document demonstrates generating ETL code from metadata and deploying the automated jobs to Hadoop.
Hadoop 101 ETL Automation Smackdown
1. Hadoop 101 ETL
+ Automation Smackdown
Learning Big Data:
Which approach makes me the most valuable as a developer?
2. Bio - Pete Carapetyan
• Java dev for the last 15 years, developer for 20
• Grew up automating in a different industry
• Apparent obsession with systems & automation
• dataFundamentals since 2000, now a 2-man shop
3. Special Skills - Special Snowflakes
• Let me show you these Hadoop & Avro skills.
• Then, we code for the special snowflakes. (data)
• Thus we are more valuable, and can up our bill rates!
• This is Approach #1: Manual or Special Snowflake
4. My 2013 Manual Hadoop Story
• 15 ETL jobs [partial scope]
• Brilliant, ninja-level team
• 1 year of competitive NIH* copy-paste spaghetti coding - AKA the special snowflake approach
• Not a fun year
*NIH: Not Invented Here
17. Also inspired by
The Phoenix Project
• Results, not drama
• Focus only on bottleneck
• Brent as bottleneck
18. On Brent
• Brent is a team’s best asset! Brent is a ninja.
• Brent is my dark side only when treating every situation like a special snowflake.
• Brent enjoys the attention.
• Brent is not the drama queen; others bring the drama to him.
Brent?
19. Automation Basics
1. Brent spends time on clean design, not NIH*
• [Camel] - Integration Server
2. Brent automates the rule, codes the exception
• Apply metadata to templates
• Automated VM dev infrastructure
*NIH: Not Invented Here
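A minimal sketch of "apply metadata to templates": the table name and column types come from per-job metadata, while the DDL shape is the reusable template. Class and method names here (`EtlTemplate`, `createTableDdl`) are illustrative, not from the talk's codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical template-fill step: metadata in, generated DDL out.
public class EtlTemplate {

    // Fill a CREATE TABLE template from per-job metadata (column name -> type).
    public static String createTableDdl(String table, Map<String, String> columns) {
        StringBuilder cols = new StringBuilder();
        for (Map.Entry<String, String> c : columns.entrySet()) {
            if (cols.length() > 0) cols.append(", ");
            cols.append(c.getKey()).append(' ').append(c.getValue());
        }
        // The template is the rule; anything odd gets hand-coded afterwards.
        return String.format("CREATE TABLE %s (%s) PARTITIONED BY (dt STRING)", table, cols);
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("id", "BIGINT");
        cols.put("name", "STRING");
        System.out.println(createTableDdl("customers", cols));
    }
}
```

The same metadata map can feed every template (DDL, Camel route, deploy script), which is what keeps the 15 generated jobs consistent.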
21. Later Demo: Integration Server
• Raw Linux OS (CentOS)
• Java
• Maven
• Ruby
• Networking
• Maven repo - binaries
• [created with Vagrant]
https://www.youtube.com/watch?v=xgheERvulqw&index=3&list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
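The VM above is described as created with Vagrant. A minimal Vagrantfile sketch along those lines; the box name, IP, and recipe names are assumptions, not taken from the talk:

```ruby
# Hypothetical Vagrantfile for the CentOS integration-server VM.
Vagrant.configure("2") do |config|
  config.vm.box = "centos65-x86_64"              # assumed CentOS base box
  config.vm.network "private_network", ip: "192.168.33.10"
  config.vm.provision "chef_solo" do |chef|      # cookbook recipes, per slide 36
    chef.add_recipe "java"                       # JDK for Camel/Maven builds
    chef.add_recipe "maven"                      # build tool + local repo
    chef.add_recipe "ruby"                       # generator tooling
  end
end
```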
23. Demo Generated Code
• Camel ETL binary
• OSGi, versioned, modular jar
• Only 3 primary outputs!
• simple
• clean
• well designed (?)
• JUnit/integration tested
• Supporting scripting
• messy
24. Demo Server Deploy
• One line deploy/run command
• Compiles on server with Maven
• Also runnable as jar
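As a command sketch, the one-line deploy/run might look like the following; the artifact name and path are assumptions, and it only runs on a server with Maven installed:

```shell
# Build on the server with Maven, then run the resulting jar (names are hypothetical).
mvn clean install && java -jar target/camel-avro-etl-1.0.0.jar
```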
25. Does it work?
• Make custom file
• Drop into ETL folder
• Inspect
26. Demo - Review
• Schema created
• DDL run
• Avro binary (JSON) transform
• Data Migration
• FTP to server
• Into HDFS partition
• Alter Table: Date Partition
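The "Alter Table: Date Partition" step above can be sketched in Hive DDL; the table name and HDFS path are illustrative:

```sql
-- Hypothetical Hive DDL: register the day's HDFS folder as a date partition.
ALTER TABLE etl_events ADD IF NOT EXISTS PARTITION (dt='2014-06-01')
LOCATION '/data/etl_events/dt=2014-06-01';
```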
27. Transform to Avro
• Not detailed in this talk
• Demo’d here as a binary
• Code listed at end of talk
28. Modular Binaries
• Each ETL
• Own binary, OSGi
• Own codebase
• Fully versioned
• Fully customizable after generation
• Runs alone or as part of Camel container(s)
• Tests on build
• Contains own supporting scripts
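For the OSGi packaging, a generated bundle's manifest might carry entries like these; the symbolic name and version range are assumptions, not from the demo code:

```
Bundle-ManifestVersion: 2
Bundle-SymbolicName: com.datafundamentals.etl.customers
Bundle-Version: 1.0.0
Import-Package: org.apache.camel;version="[2.12,3)"
```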
29. Takeaways
• Brent codes the exception manually, the rule by template.
• Brent has time to focus on design.
• Brent may lose some amount of desired attention :(
• Resulting code is
• clean
• consistent, easy to maintain
• But is there a Home Run?
• defined as not possible via special snowflake approach
30. Home Run 1: Infrastructure As Code Demo
• [Jeff]
31. Home Run 2: Big Data, Beyond Hadoop!
1. Pick your provider
• Hadoop
• Cassandra
• Couchbase
• etc
2. Adapt your templates, VMs, etc.
32. Home Run 3: Idempotent Effort
• Idempotent effort? Each subsequent run has no harmful side effect.
• Walkup - The 10 minute test
• Walkaway - Requirements
• Features
• Testing, technical debt, already in place for code
• VMs and recipes for dev, test, prod
• OSGi etc modularity for binaries
• Does what we see here pass this test?
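Idempotence in this sense can be shown with a tiny sketch: re-running the "add partition" step leaves state unchanged. The class below is illustrative, not from the demo code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical idempotent step: adding the same date partition twice
// leaves the registry in exactly the state one run would produce.
public class PartitionRegistry {
    private final Set<String> partitions = new TreeSet<>();

    // Safe to rerun: a second call with the same date is a no-op.
    public boolean ensurePartition(String date) {
        return partitions.add(date); // false when already present
    }

    public int count() {
        return partitions.size();
    }

    public static void main(String[] args) {
        PartitionRegistry r = new PartitionRegistry();
        System.out.println(r.ensurePartition("2014-06-01")); // true: created
        System.out.println(r.ensurePartition("2014-06-01")); // false: already there
        System.out.println(r.count()); // 1
    }
}
```

The walkup "10 minute test" is the human version of the same property: anyone can rerun the job without first checking what state the last run left behind.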
33. What to leave with
• De-mystify: how to Avro/Hadoop a delimited file
• Review motives for automating this process
• Code automation basics
• Infrastructure automation basics
• Code for above
34. Further Hadoop Tutorial Resources
• Hortonworks
• best free stuff? Except networking vas
• Cloudera
• Lots, but they appear to prefer to get paid
• Apache Hadoop
• haven’t tried but it is Apache
35. Wish To See More?
• In office demos
• Your data
36. Code, Content, Contacts
• This Slide Deck: http://www.slideshare.net/datafundamentals/hadoop-big-data-35762308
• or just remember slideshare.net/datafundamentals; it may be the only one there
• Youtube - 11 minute version of code demo - https://www.youtube.com/playlist?list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
• Dev Code
• Carrie (ruby UI and generator) https://github.com/datafundamentals/df_ui_carrie
• Avro from delimited https://bitbucket.org/datafundamentals/avro_from_delimited
• Camel-Avro https://bitbucket.org/datafundamentals/camel-avro-etl
• Ops Code - cookbook recipes
• https://github.com/datafundamentals
• Contact
• pete@datafundamentals.com, jeff@datafundamentals.com
Be careful out there!