Continuous Integration
on top of Hadoop
Wisely Chen and Neal Lee
Saturday, August 3, 13
Agenda
• Who I am
• Problem
• Solution
• Demo
• Q&A
Who I am
• Wisely Chen ( thegiive@gmail.com )
• Release manager of Yahoo![Taiwan] shopping and data team
• Loves to promote open source tech in Taiwan
• Hadoop Summit 2013 San Jose
• Ruby and Rails: Coscup 2006, Ubisunrise 2007, OSDC 2007
• Puppet: PHPConf 2012, RubyConf 2012
• Release Practice: Webconf 2013, Coscup 2012
Who I am
• Neal Lee (@neal_lee)
• Data Engineer at Yahoo![Taiwan] data team
• Aims to build an easy-to-use, self-service BI platform connected to Hadoop
EC Data Team
• Sources: auction / mall / shopping center sites (transactional data); traffic / click / user behavior tracking (tracking data)
• Data Highway
• Data Warehouse / Data Mart
• Data Infra and BI Platform
• Serve: Report, Recommendation, API, Machine Learning
Problem : Debug
Problem : Performance
Solution
Continuous Integration
Continuous Integration
• A software engineering practice
• Maintain code repos
• Automate the build
• Make the build self-testing
• Everyone commits to the baseline every day
• Every commit should be a build
• Test in a clone of the production environment
• Make it easy to get the latest deliverables
• Everyone can see the result of the latest build
• Automate deployment
We focus on
• A software engineering practice
• Maintain code repos
• Automate the build
• Make the build self-testing
• Everyone commits to the baseline every day
• Every commit should be a build
• Test in a clone of the production environment
• Make it easy to get the latest deliverables
• Everyone can see the result of the latest build
• Automate deployment
CI on Hadoop Flow
Code → Unit Test → Performance Test → Deploy → Doc → Execution
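The flow above can be read as a chain of gates, where a failed stage blocks everything after it. The following is a minimal sketch of that idea in Python, not the team's actual CI setup. The stage names follow the slide, while `run_pipeline`, `example_stages`, and the placeholder lambdas are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

# Stage names follow the slide; the callables are hypothetical placeholders.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    passed = []
    for name, action in stages:
        if not action():
            break  # a failed stage blocks everything after it
        passed.append(name)
    return passed


def example_stages(unit_tests_ok: bool) -> List[Stage]:
    # Each lambda stands in for real work (running tests, deploying, ...).
    return [
        ("unit test", lambda: unit_tests_ok),
        ("performance test", lambda: True),
        ("deploy", lambda: True),
        ("doc", lambda: True),
        ("execution", lambda: True),
    ]
```

Calling `run_pipeline(example_stages(False))` returns an empty list, because the failing unit test stage prevents the later stages from running.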
One Click Deploy
Commit → Unit Test → Performance Test → Deploy → Doc → Execution
Toolset
Commit → Unit Test → Performance Test → Deploy → Doc → Execution
Tools named on the slide: Vaidya, BASH
System diagram
• CI Master
• GitHub
• Alpha: CI Slave
• Beta Cluster: Hadoop JobTracker, CI Slave, Hadoop nodes
• Slave Node
• Prod Cluster, Gateway
Unit Test
Commit → Unit Test → Performance Test → Deploy → Doc → Execution
PigUnit
• A simple xUnit framework
• No cluster set up is required in local mode
• Unit testing, regression testing, and rapid
prototyping on the fly
Using PigUnit
• After
  • Coding
  • Write PigUnit test case
  • Run local PigUnit test
  • Push to cluster
  • Run Pig on cluster
  • Get the right result!
• Before
  • Coding
  • Manual local test
  • Push to cluster
  • Run Pig on cluster
  • Get the right result!
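PigUnit itself is a Java library, so the sketch below is only a Python analogue of the "write a test case, run it locally, then push" step: the job logic is checked against a tiny in-memory input before any cluster is involved. The function `top_queries`, its input rows, and `run_local_tests` are hypothetical and not taken from the talk.

```python
import unittest
from collections import Counter
from typing import Iterable, List, Tuple


def top_queries(rows: Iterable[str], limit: int) -> List[Tuple[str, int]]:
    """Hypothetical job logic: count queries and keep the most frequent ones."""
    counts = Counter(row.strip().lower() for row in rows if row.strip())
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


class TopQueriesTest(unittest.TestCase):
    def test_counts_and_orders_queries(self):
        # A handful of rows is enough to pin down the expected behaviour.
        rows = ["yahoo", "Yahoo", "hadoop", "", "pig", "hadoop", "yahoo"]
        self.assertEqual(top_queries(rows, 2), [("yahoo", 3), ("hadoop", 2)])


def run_local_tests() -> bool:
    """Run the test case in-process; True means every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TopQueriesTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A CI job could call `run_local_tests()` and refuse to push to the cluster when it returns `False`.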
Unit tests are live documentation
• A unit test is runnable, live documentation
• Passing the test cases shows the code still meets earlier requirements
Flexible
• Pig can use PigUnit
• MapReduce can use MRUnit
• Hive can use hive_test
Performance Test
Commit → Unit Test → Performance Test → Deploy → Doc → Execution
Vaidya
• Rule-based performance diagnosis of M/R jobs
• Extensible framework
• You can add your own rules
• Write complex rules using existing rules
Performance Test
1. Execute the Pig job with sampling data on the beta server
2. Vaidya reads the job history, job conf, and rules to check for performance problems
3. If OK, create the performance result
4. If the job has a performance issue, notify the user
5. Go to the next CI stage
Diagram labels: Sampling data, Pig Job, Pig Job History, Pig Job Conf, Vaidya Rule, Vaidya, Performance result, Notify User, Next CI Stage
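Steps 3 to 5 amount to a gate on the diagnosis results. The sketch below shows one way such a gate could look in Python. It assumes that any failing rule blocks the pipeline, which the slide does not state explicitly, and `RuleResult`, `GateDecision`, and `performance_gate` are hypothetical names rather than part of Vaidya.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleResult:
    """One diagnosis outcome; the field names are assumptions for this sketch."""
    title: str
    passed: bool
    advice: str = ""


@dataclass
class GateDecision:
    proceed: bool             # True means continue to the next CI stage
    notifications: List[str]  # messages to send to the job owner


def performance_gate(results: List[RuleResult]) -> GateDecision:
    """Pass the job on when every rule passed, else collect user messages."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return GateDecision(proceed=True, notifications=[])
    # Assumption: a failing rule stops the pipeline and triggers a notice.
    messages = [f"{r.title}: {r.advice}" for r in failures]
    return GateDecision(proceed=False, notifications=messages)
```

Keeping the decision separate from the notification channel means the same gate can feed email, chat, or a ticket comment.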
Vaidya Rule
<DiagnosticTest>
  <Title><![CDATA[Balanaced Reduce Partitioning]]></Title>
  <ClassName><![CDATA[org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning]]></ClassName>
  <Description><![CDATA[This rule tests as to how well the input to reduce tasks is balanced]]></Description>
  <Importance><![CDATA[High]]></Importance>
  <SuccessThreshold><![CDATA[0.20]]></SuccessThreshold>
  <Prescription><![CDATA[advice]]></Prescription>
</DiagnosticTest>
Slide callouts:
• ClassName: test Java class
• Description: see whether the reduce job is balanced or not
• Importance: rule importance
• SuccessThreshold: diagnosis success threshold
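To make the rule fields concrete, here is a small Python sketch that reads a rule like the one above and compares a measured impact against its threshold. It is an illustration only: Vaidya does this work in Java, `load_rule` and `rule_passes` are hypothetical helpers, and treating an impact below `SuccessThreshold` as a pass is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the rule from the slide, used as sample input.
RULE_XML = """
<DiagnosticTest>
  <Title><![CDATA[Balanaced Reduce Partitioning]]></Title>
  <ClassName><![CDATA[
    org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning
  ]]></ClassName>
  <Importance><![CDATA[High]]></Importance>
  <SuccessThreshold><![CDATA[0.20]]></SuccessThreshold>
</DiagnosticTest>
"""


def load_rule(xml_text: str) -> dict:
    """Read the rule fields into a dict, trimming CDATA whitespace."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("Title", default="").strip(),
        "class_name": root.findtext("ClassName", default="").strip(),
        "importance": root.findtext("Importance", default="").strip(),
        "success_threshold": float(root.findtext("SuccessThreshold", default="0")),
    }


def rule_passes(impact: float, rule: dict) -> bool:
    # Assumption: an impact below the threshold counts as a pass.
    return impact < rule["success_threshold"]
```

With this rule, `rule_passes(0.1, load_rule(RULE_XML))` is true, while an impact of 0.5 would fail the check.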
Deploy
Commit → Unit Test → Performance Test → Deploy → Doc → Execution
Deploy
• Deploy to production cluster
• Easy to roll back
• Create a git tag
• Automatic doc generation
• Each release should map to a ticket
• Auto comment in Bugzilla
Auto comment in Bugzilla
• Repo URL
• Release note
• Issue status change
Auto create git tag
Release Note
[Bug xxx] log....
Git Tag
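One plausible way to connect commit logs, release notes, and tickets is to parse a bug id out of each commit subject. The Python sketch below assumes subjects of the form `[Bug <id>] <message>`, modelled on the slide's `[Bug xxx] log....` example. `BUG_PATTERN`, `group_by_bug`, and `release_note` are hypothetical names, and the output format is not the one used in the talk.

```python
import re
from typing import Dict, List

# Assumed commit subject format: "[Bug <id>] <message>".
BUG_PATTERN = re.compile(r"^\[Bug (\d+)\]\s*(.+)$")


def group_by_bug(commit_subjects: List[str]) -> Dict[str, List[str]]:
    """Map each bug id to its commit messages, keeping first-seen order."""
    grouped: Dict[str, List[str]] = {}
    for subject in commit_subjects:
        match = BUG_PATTERN.match(subject.strip())
        if match is None:
            continue  # subjects without a bug id are left out of the note
        bug_id, message = match.groups()
        grouped.setdefault(bug_id, []).append(message)
    return grouped


def release_note(tag: str, commit_subjects: List[str]) -> str:
    """Render a plain-text release note for one tag."""
    lines = [f"Release {tag}"]
    for bug_id, messages in group_by_bug(commit_subjects).items():
        for message in messages:
            lines.append(f"[Bug {bug_id}] {message}")
    return "\n".join(lines)
```

The grouped result could also drive the ticket comments, since each key is a bug id that a release touches.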
DEMO
Demo
• Demo 1: Unit test fails
• Demo 2: Unit test succeeds
• Demo 3: Check performance test
• Demo 4: Auto-generate doc
• Demo 5: Notify user
Demo
Conclusion
• CI will revolutionize your workflow
• CI will boost your productivity
Logic Debug
• Map/Reduce jobs often take a lot of time to execute
• Repeated Map/Reduce execution costs a lot of time during the logic debugging phase
• Need a way to find logic problems before executing the production job
• Coding → Manual Test → Exec → Get Bug
Performance
• Map/Reduce performance is hard to estimate before execution
• Production Grid computing resources are shared by all Yahoos
• Bad performance will affect other Yahoos' Grid jobs
• Putting poorly performing code on the production Grid is not acceptable
• We manually investigate job performance before we actually execute the job on the production Grid
• Coding → Manual Test → Manual investigation → Get Bug
Coscup 2013 : Continuous Integration on top of hadoop
