Hadoop Summit 2010: Challenges and Uniqueness of QE and RE Processes in Hadoop
1. Challenges and Uniqueness of QE and RE processes in Hadoop
Jayant Mahajan
Grid Computing, Yahoo! Bangalore
Feb 2010
2. Agenda
• Quality Checks for a Patch at Hadoop
• Additional QE at Yahoo!
• Tools used for Hadoop QE and RE
• Challenges
3. Quality checks for a patch commit in Hadoop
• Static Quality Analysis – Patch attached to Jira
– Verify Findbugs warnings
– Verify Javadoc warnings
– Verify ReleaseAudit warnings
– Verify whether unit tests were added
• Committer Review
• Unit Tests
– JUnit
– Mini MR Tests
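The static checks above can be sketched as a simple gate. This is a minimal illustration, not Hadoop's actual patch-testing script: the names `CheckCounts`, `evaluate_patch`, and `overall_vote` are hypothetical, and the pass rule (a patch must not increase any warning count relative to trunk, and must include tests) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckCounts:
    """Warning counts reported by the static tools (hypothetical container)."""
    findbugs: int
    javadoc: int
    release_audit: int


def evaluate_patch(trunk, patched, tests_added):
    """Return (check name, passed) pairs for one patch.

    Assumed rule: a check passes when the patched tree has no more
    warnings than trunk; the test check passes when tests were added.
    """
    return [
        ("findbugs", patched.findbugs <= trunk.findbugs),
        ("javadoc", patched.javadoc <= trunk.javadoc),
        ("release audit", patched.release_audit <= trunk.release_audit),
        ("tests included", bool(tests_added)),
    ]


def overall_vote(results):
    """Collapse the individual checks into a single +1 / -1 vote."""
    return "+1" if all(passed for _, passed in results) else "-1"


if __name__ == "__main__":
    trunk = CheckCounts(findbugs=12, javadoc=3, release_audit=0)
    patched = CheckCounts(findbugs=13, javadoc=3, release_audit=0)
    results = evaluate_patch(trunk, patched, tests_added=True)
    for name, passed in results:
        print(f"{name}: {'ok' if passed else 'FAILED'}")
    print("overall:", overall_vote(results))
```

In this sketch a single new Findbugs warning is enough to turn the overall vote negative, which is the kind of signal a committer would see before starting the review.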
4. Quality checks for a patch commit in Hadoop (Contd ..)
[Flow diagram: COMMUNITY]
• Development
• Jira raised
• Patch attached to JIRA
• Set JIRA status to “Patch Available”
• Patch picked up for testing - HUDSON
– Static analysis – Findbugs
– ReleaseAudit warning
– Fast unit tests - TestNG
– Fast contrib unit tests (if patching contrib)
• Committer Review
• Commit – SVN
• Secondary Build
– Static analysis – Findbugs
– Jdiff
– All Core unit tests with code coverage
– All Contrib unit tests with code coverage
5. Additional QE @ Y! for Hadoop
• We are the largest test team for Hadoop
• More than 1000 nodes dedicated for QE
• Hadoop testing at Yahoo
– Patch testing
– Automated Testing
– Manual Testing
6. Additional QE @ Y! for Hadoop (Contd ..)
[Flow diagram: COMMUNITY and YAHOO!]
COMMUNITY
• Development
• Jira raised
• Patch attached to JIRA
• Set JIRA status to “Patch Available”
• Patch picked up for testing - HUDSON
• Commit – SVN
• GIT
YAHOO!
• Test Environment
– Manual Patch Testing
– Manual Functional Testing
– HUDSON - Benchmark and Automation
• HUDSON Release Build
• GIT Y!Hadoop
7. Tools used for Hadoop QE and RE
• Hudson – Build automation
• SVN and GIT – Source Code Mgmt (SCM)
• Ant & Ivy – Build and Dependency Mgmt
• Checkstyle – code standard checker
• Clover – code coverage
• Forrest – Documentation
• Jdiff – Track API changes
• Findbugs – Static analysis to find bugs
• JUnit – Unit tests
• Bugzilla & Jira – Issue Tracking
8. Hudson
• Hudson is a Continuous Integration server used to
execute and monitor jobs (Hudson jobs)
• Used for:
– Build
– Unit Tests
– Deployment
– Validation Jobs
– Automated tests
• http://hudson-ci.org/
9. Challenges in Hadoop QE and RE
• Reliability
– Loss of nodes
– Data corruption
– Loss of data blocks
• Scale
– Network issues
– Disk issues
• Performance
• Corner cases
• Repeatability
– Deployment
– Continuous Integration
10. Reliability
• MapReduce Reliability
– Fail Tasks
– Lost TTs (TaskTrackers)
• HDFS Reliability
– Bringing a rack down
– Corrupting data blocks
– Loss of data blocks
11. Scale
• Testing at scale when hardware resources are limited
• If we want more nodes for testing, what will we do?
– Use simulation
▪ DataNode simulation
▪ TaskTracker simulation
– For example
▪ We need a 3000-node cluster environment
▪ Run 3 instances of TTs (TaskTrackers) and DNs (DataNodes) per node on a 1000-node cluster
▪ This simulates an environment equivalent to a 3000-node cluster
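The arithmetic behind the example can be sketched as follows. The helper names and the base port value are hypothetical and chosen only for illustration. The sketch assumes that each extra daemon instance on a host needs its own port.

```python
def simulated_cluster_size(physical_nodes, instances_per_node):
    """Number of logical nodes seen by the masters."""
    return physical_nodes * instances_per_node


def instances_needed(target_nodes, physical_nodes):
    """Instances per physical node required to reach the target size.

    Uses ceiling division so a partial instance rounds up.
    """
    return -(-target_nodes // physical_nodes)


def port_plan(base_port, instances_per_node):
    """One distinct port per daemon instance on a host (illustrative)."""
    return [base_port + i for i in range(instances_per_node)]


if __name__ == "__main__":
    per_node = instances_needed(target_nodes=3000, physical_nodes=1000)
    print("instances per node:", per_node)
    print("simulated size:", simulated_cluster_size(1000, per_node))
    print("ports on each host:", port_plan(50010, per_node))
```

The same calculation shows how far the approach stretches: a 3500-node target on the same 1000 machines would need 4 instances per node, at which point memory and disk contention on each host start to distort the results.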
12. Performance
• Benchmark execution on 20 and 500 nodes
– E.g. Sort, Shuffle, DFSIO
• GridMix
– V1 - A standard mix of MR jobs of varying types and sizes
measuring throughput on a cluster
– V2 - Customized mix of MR jobs where the number of
small/large/medium jobs can be controlled
– V3
▪ Simulates user load patterns.
▪ Workload is generated from job history trace analysis
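The idea of a controllable job mix (as in V2) can be sketched in a few lines. This is not GridMix's configuration format or API. `build_job_mix` is a hypothetical helper that only shows how fixed counts of small, medium, and large jobs could be turned into a reproducible submission order.

```python
import random
from collections import Counter


def build_job_mix(small, medium, large, seed=0):
    """Return a shuffled list of job size labels with the requested counts.

    A fixed seed keeps the submission order repeatable between runs.
    """
    jobs = ["small"] * small + ["medium"] * medium + ["large"] * large
    random.Random(seed).shuffle(jobs)
    return jobs


if __name__ == "__main__":
    mix = build_job_mix(small=6, medium=3, large=1)
    print(mix)
    print(Counter(mix))
```

Seeding the shuffle matters for benchmarking: two runs with the same seed submit jobs in the same order, so throughput differences come from the cluster and not from the mix.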
13. Corner Cases
• Challenges in reproducing a problem related to
– Timing issues
– Race conditions
– Out of memory issues
– Reproducing in the exact environment where it occurred.
• AspectJ
– AspectJ taps into the source code and can run simulated scenarios
before, after, or during a method.
– It can reproduce timing issues by introducing sleep statements.
– It can reproduce out-of-memory issues by reducing the memory available at
run time.
– When the exact configuration cannot be replicated, the environment can be
approximated by changing job configs on the fly.
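The sleep-injection idea can be illustrated without AspectJ. The sketch below is a Python analogue, not AspectJ code: a hypothetical `with_delay` wrapper plays the role of before/after advice, and `Counter` is a made-up class with a deliberate check-then-act race. Delaying the return of `read` widens the race window so the lost update shows up reliably.

```python
import functools
import threading
import time


def with_delay(before=0.0, after=0.0):
    """Wrap a callable so it sleeps before and/or after the real call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(before)
            try:
                return func(*args, **kwargs)
            finally:
                time.sleep(after)
        return wrapper
    return decorator


class Counter:
    """Counter with an unsynchronized read-then-write increment."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def increment(self):
        current = self.read()      # check
        self.value = current + 1   # act (may overwrite another thread's write)


def run_two_threads(counter):
    """Increment the counter from two threads and return the final value."""
    threads = [threading.Thread(target=counter.increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    slow = Counter()
    # Inject a pause after the read so both threads see the same old value.
    slow.read = with_delay(after=0.2)(slow.read)
    print("final value with injected delay:", run_two_threads(slow))
```

With the delay in place both threads read 0 before either writes, so one increment is lost. Without it the window is so short that the bug rarely shows, which is exactly why such issues are hard to reproduce by rerunning the job.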
14. Repeatability - Deployment
• Deployment Challenges
– Deploying on a multi-node cluster
– Deciding on a JTNode and NameNode
– Building configurations for a variety of clusters
• Solution
– YUM repo for deployment
– Backup host for JTNode and NameNode
– Source code build & configuration build
15. Repeatability - CI
• Continuous Integration aka CI
– Software development practice where team members
integrate their work frequently, usually daily
– Every integration is verified by an automated build (including
tests) to detect integration errors as quickly as possible.
• CI @ Y!
– Commit build
– Secondary build
– Secondary smoke test build
– Automated deployment
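The chain of builds above can be sketched as a sequence of stages that stops at the first failure. The stage names come from the slide; `run_pipeline` and the stub stage functions are hypothetical, and running the stages strictly in this order is an assumption made for the example.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Returns the list of (name, passed) results for the stages that ran.
    """
    results = []
    for name, stage in stages:
        passed = bool(stage())
        results.append((name, passed))
        if not passed:
            break  # later stages are skipped once one fails
    return results


if __name__ == "__main__":
    # Stub stages: each returns True on success. Real stages would
    # trigger builds, test runs, or a deployment.
    stages = [
        ("commit build", lambda: True),
        ("secondary build", lambda: True),
        ("secondary smoke test build", lambda: False),
        ("automated deployment", lambda: True),
    ]
    for name, passed in run_pipeline(stages):
        print(f"{name}: {'ok' if passed else 'FAILED'}")
```

Stopping early keeps a broken build from reaching deployment, and the partial result list shows exactly which stage caught the problem.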