3. Timmy @ Big Data Inc.
[Diagram: Hadoop workflow — new data processed into output]
12. Timmy @ Big Data Inc.
• Business requirement: 15-hour turnaround on processing new data
• Current turnaround is 10 hours
• Plenty of extra capacity!
15. Timmy @ Big Data Inc.
• Company increases data collection rate by 10%
• Surprise! Turnaround time explodes to 30 hours!
17. Timmy @ Big Data Inc.
Fix it ASAP! We’re losing customers!
18. Timmy @ Big Data Inc.
We need 2 times more machines!
19. Timmy @ Big Data Inc.
We don’t even have that much space in the datacenter!
20. Timmy @ Big Data Inc.
[Diagram: data center with Rack 1 and Rack 2]
21. Timmy @ Big Data Inc.
[Diagram: data center with Rack 1, Rack 2, and a new rack]
22. Timmy @ Big Data Inc.
• Turnaround drops to 6 hours!!
23. False Assumptions
• That processing 10% more data will take only 10% longer
• That 50% more machines will yield only 50% more performance
24. What is a batch processing system?
while (true) {
  processNewData()
}
25. “Hours of Data”
• Assume a constant rate of new data
• Measure the amount of data in terms of hours
26. Questions to answer
• How does a 10% increase in data cause my turnaround time to increase by 200%?
• Why doesn’t the speed of my workflow double when I double the number of machines?
• How many machines do I need for my workflow to perform well and be fault-tolerant?
31. Example
• Suppose you extend the workflow with a component that takes 2 hours on a 10-hour dataset
• Workflow runtime may increase by a lot more than 2 hours!
39. Example
• The runtime of a workflow that operates on 10 hours of data increased to 12 hours
• Next run, there will be 12 hours of data to process
• Because there is more data, it will take longer to run
43. Example
• Which means the next iteration will have even more data
• And so on...
Does the runtime ever stabilize? If so, when?
45. Math
Runtime = Overhead + (Hours of Data) x (Time to process one hour of data)
T = O + H x P
Runtime for a single run of a workflow
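As a quick sanity check, the formula can be evaluated with made-up numbers (the values of O, H, and P below are assumptions for illustration, not measurements from the talk):

```python
# Assumed, illustrative values: 1 hour of fixed overhead (O),
# a 10-hour backlog of data (H), and 0.9 hours of processing
# per hour of data (P).
O, H, P = 1.0, 10.0, 0.9

T = O + H * P   # runtime of a single run of the workflow
print(T)        # 10.0 hours
```

With these particular numbers, a run over 10 hours of data takes exactly 10 hours, which is why such a workflow can look stable right up until anything changes.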
47. Overhead (O)
• Fixed time in the workflow:
– Job startup time
– Time spent independent of the amount of data
48. Time to Process One Hour of Data (P)
• How long it takes to process one hour of data, minus overhead
• P = 1 -> each hour of data adds one hour to runtime
• P = 2 -> each hour adds two hours to runtime
• P = 0.5 -> each hour adds 30 minutes to runtime
52. Stable Runtime
T = O + H x P
Stabilizes when:
Runtime (T) = Hours of data processed (H)
T = O + T x P
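Solving T = O + T x P for T gives a closed form: T(1 - P) = O, so T = O / (1 - P), which only exists when P &lt; 1 (for P >= 1 the backlog grows without bound). A small sketch with assumed values for O and P shows the iteration converging to that fixed point:

```python
# Assumed values: O = 1 hour of overhead, P = 0.9 hours of
# processing per hour of data.
O, P = 1.0, 0.9

# Each run processes the data that accrued during the previous run,
# so the next runtime is T_next = O + T * P.
T = 12.0                 # start from an arbitrary initial runtime
for _ in range(200):
    T = O + T * P

stable = O / (1 - P)         # closed-form fixed point (requires P < 1)
print(round(T, 6), stable)   # both are 10.0
```

Note how close P = 0.9 already is to the P = 1 divergence point: small changes in P move the stable runtime a lot, which is the mechanism behind the "10% more data, 200% more runtime" surprise.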
71. Increase in Data
• Less “extra capacity” -> more dramatic deterioration in performance
• The same effect can also come from:
– Increased hardware/software failures
– Sharing the cluster
73. Real life example
• How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
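Using the stable-runtime formula T = O / (1 - P), here is one way the numbers can work out (the values of O and P below are assumed for illustration): when P is close to 1, shaving 30% off the per-hour processing cost shrinks the stable runtime by far more than 30%.

```python
# Assumed values for illustration only.
O = 1.0                    # fixed overhead, in hours
P_before = 0.93            # hours of processing per hour of data
P_after = P_before * 0.7   # 30% of the per-hour work optimized away

T_before = O / (1 - P_before)   # stable runtime before: ~14.3 hours
T_after = O / (1 - P_after)     # stable runtime after: ~2.9 hours
print(1 - T_after / T_before)   # ~0.80: roughly an 80% decrease
```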
80. Takeaways
• Measure the O and P values of your workflow to avoid disasters
• When P is high:
– Expand the cluster
– OR: optimize the code that touches data
• When P is low:
– Optimize overhead (e.g., reduce job startup time)
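Since T = O + H x P is linear in H, two runs over different amounts of data are enough to estimate both values; the measurements below are hypothetical:

```python
# Hypothetical measurements from two runs of the same workflow.
H1, T1 = 10.0, 10.0   # 10 hours of data took 10 hours
H2, T2 = 12.0, 11.8   # 12 hours of data took 11.8 hours

P = (T2 - T1) / (H2 - H1)   # slope: hours of runtime per hour of data
O = T1 - H1 * P             # intercept: fixed overhead

print(O, P)   # approximately 1.0 and 0.9
```

In practice, averaging over many runs (i.e., fitting a line through many (H, T) points) gives more trustworthy estimates than two samples.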