1. Million Monkeys
Jesse Anderson | Curriculum Developer and Instructor
November 2012
2. About Me
• Cloudera - Educational Services Team
• Twitter - @jessetanderson
• Blog and more info: http://www.jesse-anderson.com
• Screencasts on Pragmatic Programmers: Buy It Now on http://www.jesse-anderson.com
• President – Northern Nevada Software Developers Group
3. About Cloudera
• Cloudera is “The commercial Hadoop company”
• Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
• Provides consulting and training services for Hadoop users
• Staff includes committers to virtually all Hadoop projects
4. Introduction
• Infinite Monkey Theorem
• Hadoop
• Million Monkeys Algorithm
• Business Case
6. Exponential Growth (aka Big Data)
Odds of finding a group of characters is 1 in 26 raised to the power of the number of contiguous characters
Contiguous Characters    Combinations
8                        208,827,064,576
9                        5,429,503,678,976
10                       141,167,095,653,376
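The figures on this slide can be checked with a few lines of Python. This is only a sketch: it assumes a 26-letter alphabet with no spaces or punctuation, and the name `combinations` is illustrative rather than taken from the talk.

```python
ALPHABET_SIZE = 26  # assumption: letters a-z only, no spaces or punctuation


def combinations(length: int) -> int:
    """Number of distinct strings of `length` contiguous characters."""
    return ALPHABET_SIZE ** length


# Print the same rows as the slide's table.
for n in (8, 9, 10):
    print(f"{n:>2}  {combinations(n):,}")
```

Each extra character multiplies the search space by 26, which is why the slide calls this exponential growth.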
14. Business Value of Scalability
• Scaling does not require massive re-engineering and complete rewrites of code
• Adding more computers to cluster gets a predictable increase in computational power and storage
SAVE SAVE
15. Going Viral (and taking over the world)
• Covered internationally in BBC, Wall Street Journal, Wired and Slashdot
• 26,000 unique visits from 119 countries in one day
16. Next Steps
• Books
  • Hadoop: The Definitive Guide - Tom White
  • Hadoop Operations - Eric Sammer
• Cloudera Training
  • Developer, Admin, Hive and Pig, HBase, Essentials
• CDH
  • Cloudera's Distribution Including Apache Hadoop
  • Open Source
  • VM Image
17. Conclusion
• MapReduce breaks up problem efficiently
• No code changes to scale
• Incredible scalability
• Enables previously impossible tasks
Interesting statistical question. Thought about since Aristotle.
Randomness + Resources + Time = Anything Possible
No real monkeys – need virtual monkeys
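One way to picture a "virtual monkey" is a random string generator whose output is checked against a target text. The sketch below is a hypothetical illustration, not the talk's actual implementation: the function names, the lowercase-only alphabet, and the choice to strip spaces and punctuation from the target are all assumptions.

```python
import random
import string


def virtual_monkey(rng: random.Random, length: int) -> str:
    """One burst of typing: `length` random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def find_matches(text: str, length: int, attempts: int, seed: int = 0) -> set:
    """Return the random groups that appear somewhere in `text`."""
    # Assumption: only letters count, so drop spaces and punctuation.
    cleaned = "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # Every contiguous run of `length` letters in the target text.
    targets = {cleaned[i:i + length] for i in range(len(cleaned) - length + 1)}
    rng = random.Random(seed)
    found = set()
    for _ in range(attempts):
        guess = virtual_monkey(rng, length)
        if guess in targets:
            found.add(guess)
    return found


if __name__ == "__main__":
    matches = find_matches("To be, or not to be", length=2, attempts=5000)
    print(sorted(matches))
```

Groups of two letters are found almost immediately. At nine letters the same loop would need on the order of trillions of attempts, which is where spreading the work across many machines becomes attractive.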
Shakespeare lazy. Heavily influenced English Literature.
Big Data isn’t always a huge file. It can be high computation.
This is not a map of MT and ID
1 to 20 node testing
Keep efficiency up
RDBMS efficiency in gutter
Engineers not spending time coding to scale. Busy adding new features.
No code changes for scaling. Took 1.5 months on one computer and 3.5 days on 20 nodes
Spending on new computers gives a consistent, linear increase. Compare spending on RDBMS and Hadoop.
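The timing figures in these notes can be turned into a rough speedup estimate. This is back-of-the-envelope arithmetic only: it assumes a 30-day month, and the notes do not say whether the single computer matched the hardware of the cluster nodes, so the result is indicative at best.

```python
DAYS_PER_MONTH = 30  # assumption: the notes do not define "month"

single_node_days = 1.5 * DAYS_PER_MONTH  # 1.5 months on one computer
cluster_days = 3.5                       # 3.5 days on 20 nodes
nodes = 20

speedup = single_node_days / cluster_days  # how many times faster the cluster ran
efficiency = speedup / nodes               # share of ideal 20x speedup achieved

print(f"speedup: {speedup:.1f}x, per-node efficiency: {efficiency:.0%}")
# prints "speedup: 12.9x, per-node efficiency: 64%"
```

Under these assumptions the 20-node run finishes roughly 13 times sooner, with no code changes between the two runs.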