This document contrasts two approaches to ETL jobs in Hadoop: a manual "special snowflake" approach and an automated one. The manual approach involves a team spending a year copying and pasting code for 15 jobs, which leads to spaghetti code and is not sustainable. The automated approach designs reusable templates and rules to generate the ETL code, freeing the developer, Brent, to focus on design rather than manual work. The result is code that is clean, consistent, easy to maintain, and passes the "10-minute test" for idempotent effort. The document demonstrates generating ETL code from metadata and deploying the automated jobs to Hadoop.
Hadoop 101 ETL Automation Smackdown
1. Hadoop 101 ETL
+ Automation Smackdown
Learning Big Data:
Which approach makes me the most valuable as a developer?
2. Bio - Pete Carapetyan
• Java dev for the last 15 years, developer for 20
• Grew up automating in a different industry
• Apparent obsession with systems & automation
• dataFundamentals since 2000, now a 2-man shop
3. Special Skills - Special Snowflakes
• Let me show you these Hadoop & Avro skills.
• Then, we code for the special snowflakes. (data)
• Thus we are more valuable, and can up our bill rates!
• This is Approach #1: Manual or Special Snowflake
4. My 2013 Manual Hadoop Story
• 15 ETL jobs [partial scope]
• Brilliant, ninja-level team
• 1 year of competitive NIH* copy-paste spaghetti coding - AKA the special snowflake approach
• Not a fun year
*NIH: Not Invented Here
17. Also inspired by
The Phoenix Project
• Results, not drama
• Focus only on bottleneck
• Brent as bottleneck
18. On Brent
• Brent is a team’s best asset! Brent is a ninja.
• Brent is my dark side only when treating every situation like a special snowflake.
• Brent enjoys the attention.
• Brent is not the drama queen; others bring the drama to him.
Brent?
19. Automation Basics
1. Brent spends time on clean design, not NIH*
• [Camel] - Integration Server
2. Brent automates the rule, codes the exception
• Apply metadata to templates
• Automated VM dev infrastructure
*NIH: Not Invented Here
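A minimal sketch of "apply metadata to templates": the table name and column types come from per-job metadata, while the DDL shape is the reusable template. Class and method names here (`EtlTemplate`, `createTableDdl`) are illustrative, not from the talk's codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical template-fill step: metadata in, generated DDL out.
public class EtlTemplate {

    // Fill a CREATE TABLE template from per-job metadata (column name -> type).
    public static String createTableDdl(String table, Map<String, String> columns) {
        StringBuilder cols = new StringBuilder();
        for (Map.Entry<String, String> c : columns.entrySet()) {
            if (cols.length() > 0) cols.append(", ");
            cols.append(c.getKey()).append(' ').append(c.getValue());
        }
        // The template is the rule; anything odd gets hand-coded afterwards.
        return String.format("CREATE TABLE %s (%s) PARTITIONED BY (dt STRING)", table, cols);
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("id", "BIGINT");
        cols.put("name", "STRING");
        System.out.println(createTableDdl("customers", cols));
    }
}
```

The same metadata map can feed every template (DDL, Camel route, deploy script), which is what keeps the 15 generated jobs consistent.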
21. Later Demo: Integration Server
• Raw Linux OS (CentOS)
• Java
• Maven
• Ruby
• Networking
• Maven repo - binaries
• [created with Vagrant]
https://www.youtube.com/watch?v=xgheERvulqw&index=3&list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
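The VM above is described as created with Vagrant. A minimal Vagrantfile sketch along those lines; the box name, IP, and recipe names are assumptions, not taken from the talk:

```ruby
# Hypothetical Vagrantfile for the CentOS integration-server VM.
Vagrant.configure("2") do |config|
  config.vm.box = "centos65-x86_64"              # assumed CentOS base box
  config.vm.network "private_network", ip: "192.168.33.10"
  config.vm.provision "chef_solo" do |chef|      # cookbook recipes, per slide 36
    chef.add_recipe "java"                       # JDK for Camel/Maven builds
    chef.add_recipe "maven"                      # build tool + local repo
    chef.add_recipe "ruby"                       # generator tooling
  end
end
```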
23. Demo Generated Code
• Camel ETL binary
• OSGi, versioned, modular jar
• Only 3 primary outputs!
• simple
• clean
• well designed (?)
• JUnit/integration tested
• Supporting scripting
• messy
24. Demo Server Deploy
• One line deploy/run command
• Compiles on server with Maven
• Also runnable as jar
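As a command sketch, the one-line deploy/run might look like the following; the artifact name and path are assumptions, and it only runs on a server with Maven installed:

```shell
# Build on the server with Maven, then run the resulting jar (names are hypothetical).
mvn clean install && java -jar target/camel-avro-etl-1.0.0.jar
```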
25. Does it work?
• Make custom file
• Drop into ETL folder
• Inspect
26. Demo - Review
• Schema created
• DDL run
• Avro binary (JSON) transform
• Data Migration
• FTP to server
• Into HDFS partition
• Alter Table: Date Partition
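The "Alter Table: Date Partition" step above can be sketched in Hive DDL; the table name and HDFS path are illustrative:

```sql
-- Hypothetical Hive DDL: register the day's HDFS folder as a date partition.
ALTER TABLE etl_events ADD IF NOT EXISTS PARTITION (dt='2014-06-01')
LOCATION '/data/etl_events/dt=2014-06-01';
```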
27. Transform to Avro
• Not detailed in this talk
• Demo’d here as a binary
• Code listed at end of talk
28. Modular Binaries
• Each ETL
• Own binary, OSGi
• Own codebase
• Fully versioned
• Fully customizable after generation
• Runs alone or as part of Camel container(s)
• Tests on build
• Contains own supporting scripts
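For the OSGi packaging, a generated bundle's manifest might carry entries like these; the symbolic name and version range are assumptions, not from the demo code:

```
Bundle-ManifestVersion: 2
Bundle-SymbolicName: com.datafundamentals.etl.customers
Bundle-Version: 1.0.0
Import-Package: org.apache.camel;version="[2.12,3)"
```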
29. Takeaways
• Brent codes the exception manually, the rule by template.
• Brent has time to focus on design.
• Brent may lose some amount of desired attention :(
• Resulting code is
• clean
• consistent, easy to maintain
• But is there a Home Run?
• defined as not possible via special snowflake approach
30. Home Run 1: Infrastructure As Code Demo
• [Jeff]
31. Home Run 2: Big Data, Beyond Hadoop!
1. Pick your provider
• Hadoop
• Cassandra
• Couchbase
• etc
2. Adapt your templates, VMs, etc.
32. Home Run 3: Idempotent Effort
• Idempotent effort? Each subsequent run has no harmful side effect.
• Walkup - The 10 minute test
• Walkaway - Requirements
• Features
• Testing, technical debt, already in place for code
• VMs and recipes for dev, test, prod
• OSGi etc modularity for binaries
• Does what we see here pass this test?
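Idempotence in this sense can be shown with a tiny sketch: re-running the "add partition" step leaves state unchanged. The class below is illustrative, not from the demo code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical idempotent step: adding the same date partition twice
// leaves the registry in exactly the state one run would produce.
public class PartitionRegistry {
    private final Set<String> partitions = new TreeSet<>();

    // Safe to rerun: a second call with the same date is a no-op.
    public boolean ensurePartition(String date) {
        return partitions.add(date); // false when already present
    }

    public int count() {
        return partitions.size();
    }

    public static void main(String[] args) {
        PartitionRegistry r = new PartitionRegistry();
        System.out.println(r.ensurePartition("2014-06-01")); // true: created
        System.out.println(r.ensurePartition("2014-06-01")); // false: already there
        System.out.println(r.count()); // 1
    }
}
```

The walkup "10 minute test" is the human version of the same property: anyone can rerun the job without first checking what state the last run left behind.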
33. What to leave with
• De-mystify: how to Avro/Hadoop a delimited file
• Review motives for automating this process
• Code automation basics
• Infrastructure automation basics
• Code for above
34. Further Hadoop Tutorial Resources
• Hortonworks
• best free stuff? Except networking vas
• Cloudera
• Lots, but they appear to prefer to get paid
• Apache Hadoop
• haven’t tried but it is Apache
35. Wish To See More?
• In office demos
• Your data
36. Code, Content, Contacts
• This Slide Deck: http://www.slideshare.net/datafundamentals/hadoop-big-data-35762308
• or just remember slideshare.net/datafundamentals; it may be the only one there
• Youtube - 11 minute version of code demo - https://www.youtube.com/playlist?list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
• Dev Code
• Carrie (ruby UI and generator) https://github.com/datafundamentals/df_ui_carrie
• Avro from delimited https://bitbucket.org/datafundamentals/avro_from_delimited
• Camel-Avro https://bitbucket.org/datafundamentals/camel-avro-etl
• Ops Code - cookbook recipes
• https://github.com/datafundamentals
• Contact
• pete@datafundamentals.com, jeff@datafundamentals.com
Be careful out there!