2. Agenda
1. Connection Before Content
2. Testing Fundamentals
3. Unit Tests
4. Integration Tests
5. Try it out
6. Performance
7. Diagnostics
3. Why Testing
1. Catch bugs early in the development cycle
2. Transparency of current project status
3. Easier development / refactoring: immediate feedback
4. Pushes developers to write better, more stable code
5. Shorter development cycles
5. Testing Fundamentals
1. Unit testing - functional verification of each 'unit' (method / class in Java)
2. Integration testing - verifies that the system works as a whole
3. Performance testing - tests the efficiency of the program.
Depends on code AND cluster architecture
4. Diagnostics - the way to find problems in production.
--> 1 + 2 should be done BEFORE production
6. Unit Tests
Key Features
1. Simple (up to 10 lines)
2. Isolation (no DB connection, no cluster dependency etc...)
3. Deterministic - PASS or FAIL
4. Automated (of course)
Why Unit Tests
1. Prevent regression
2. Fast - no need for a full MR environment
3. Help in refactoring and updates
7. Unit Tests - MR jobs
Best Practices
1. Extract the tested code into an isolated method/class
2. Test the pure Java logic, not the MR framework
3. Use the same package for tests
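The first two practices can be sketched in plain Java. `LineTokenizer` is a hypothetical name, not a class from the tutorial repository: it holds the per-line logic that a word-count style mapper would otherwise inline in map(), so a test can call it with nothing but the JDK on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the logic a word-count style Mapper would call from map().
// It depends only on the JDK, so a unit test needs no Hadoop classes and no cluster.
final class LineTokenizer {

    private LineTokenizer() {
    }

    // Splits a line on whitespace, lower-cases each token and drops empty ones.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        if (line == null) {
            return tokens;
        }
        for (String raw : line.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Small demonstration of the extracted logic.
        System.out.println(tokenize("Hello  hadoop HELLO"));
    }
}
```

With the logic extracted, map() shrinks to a loop that writes each token to the context. That thin wrapper is the part left for MRUnit or an integration test.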
MRUnit
1. Lib for MR unit tests
2. Apache project
3. Supports testing of mappers, reducers and the full job (without a full cluster)
4. Supports counters testing (nice!)
9. Integration Tests - background
1. Unit tests test each unit (Mapper/Reducer); integration tests test how the units work together
2. Test the integration with the framework
3. Not limited by data volumes
10. Integration Tests - tips and tricks
Tips and tricks
1. Use MiniMRCluster / MiniDFSCluster for tests
2. Use Linux
3. Make dev == production
4. Use data sampling:
a. Random sampling
b. Biased sampling
5. Apache BigTop (never try that)
6. Use Cloudera CDH
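The data sampling tip (item 4) can be sketched as follows. `InputSampler` and its method names are hypothetical, and "biased sampling" is read here as "always keep the records you care about, plus a random fraction of the rest", which is only one possible interpretation. A fixed seed keeps the sample reproducible, so tests built on it stay deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for cutting a small test input out of a large data set.
final class InputSampler {

    private InputSampler() {
    }

    // Random sampling: keeps each record independently with probability `fraction`.
    // A fixed seed makes the sample reproducible between test runs.
    static List<String> randomSample(List<String> records, double fraction, long seed) {
        Random random = new Random(seed);
        List<String> sample = new ArrayList<>();
        for (String record : records) {
            if (random.nextDouble() < fraction) {
                sample.add(record);
            }
        }
        return sample;
    }

    // Biased sampling (one interpretation): always keep records containing `marker`,
    // and keep each remaining record with probability `fraction`.
    static List<String> biasedSample(List<String> records, String marker,
                                     double fraction, long seed) {
        Random random = new Random(seed);
        List<String> sample = new ArrayList<>();
        for (String record : records) {
            if (record.contains(marker) || random.nextDouble() < fraction) {
                sample.add(record);
            }
        }
        return sample;
    }
}
```

Biasing the sample toward rare records (error lines, unusual keys) helps an integration test reach code paths that a purely random sample of the same size might miss.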
11. Let's play a bit
1. Check out the code:
git clone https://github.com/ophchu/mapreduce-tutorials.git
2. Make sure you manage to run the mapper test
3. Complete the MRUnit tests for the reducer and full job
4. Play with the MiniMRCluster/MiniDFSCluster test
12. Performance
Profiling (at a glance...)
1. Profile your code
2. Measure and tune what matters to you
3. Benchmarking: micro and macro
4. Hadoop has a built-in profiler (e.g. using hprof)
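As a rough illustration of the "micro" side of item 3, the sketch below times a small piece of code with System.nanoTime() after a warm-up phase. `MicroBenchmark` is a hypothetical name, and this is a naive approach: JIT compilation and garbage collection can distort the numbers, so treat the result as an indication rather than a measurement. A dedicated harness is a better choice for serious work.

```java
// Naive micro-benchmark sketch: warm up first, then time repeated calls.
final class MicroBenchmark {

    private MicroBenchmark() {
    }

    // Runs `task` warmupRuns times untimed, then measuredRuns times timed,
    // and returns the average time per measured call in nanoseconds.
    static double averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / measuredRuns;
    }

    public static void main(String[] args) {
        // Example workload: splitting a line, as a mapper might do per record.
        String line = "the quick brown fox jumps over the lazy dog";
        double nanos = averageNanos(() -> line.split("\\s+"), 10_000, 100_000);
        System.out.println("average ns per call: " + nanos);
    }
}
```

Macro benchmarking, by contrast, measures whole jobs on a real cluster, where I/O and scheduling usually dominate over per-record CPU cost.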
13. Cluster Performance
1. Terasort test
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.1.2.jar teragen 1000 /user/dataint/terasort/input
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.1.2.jar terasort /user/dataint/terasort/input /user/dataint/terasort/output
2. MRBench - MR benchmarking
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 2 -maps 10 -reduces 10 -inputLines 100 -inputType random
3. NNBench - NameNode benchmarking
4. TestDFSIO - write and read performance
14. Diagnostics
1. Check the web UI (http://your_server:50030/jobtracker.jsp):
a. Nodes: how many up, how many down, check slots
b. Jobs: logs, failures, exceptions
c. Counters: expected
2. Configuration:
a. Check job conf (job.xml)
b. Check env conf (http://your_server:50030/conf)
3. Jobs history (http://your_server:50030/jobhistory.jsp)
4. Log dirs:
a. Job tracker (http://your_server:50030/logs/)
b. Task trackers