Hadoop started as an offline, batch-processing system. It made it practical to store and process much larger datasets than before. Subsequently, more interactive, online systems emerged, integrating with Hadoop. First among these was HBase, the key/value store. Now scalable interactive query engines are beginning to join the Hadoop ecosystem. Realtime is gradually becoming a viable peer to batch in big data.
1. Beyond Batch
Doug Cutting
October 2012
2. Hadoop Started As Batch
• Simple, powerful MapReduce
• Kills a lot of birds
• Efficient, scalable
• Compute at storage
• Shared platform
• Used by Pig, Hive, etc.
• Incredibly useful!
• But not sufficient
3. Big Data Is Not (Just) Batch
Its true themes are:
• Scalability
• Affordability
• Commodity hardware
• Open-source software
• Distributed & reliable
• Schema on read
• Data beats algorithms
4. HBase: First Non-Batch Component
Online key/value store
• Complement to batch
• Online put/get
• Batch load & analyze
• Best of both
• Popular combination
• A step towards the future…
5. Holy Grail Of Big Data
• Open source, commodity HW, etc.
• Linear scaling
• To scale, just buy more hardware
• On many axes
• Storage capacity
• Throughput & latency of batch & query
• Transactions, joins, indexes, and batch!
You've heard a lot, and you may be worried that it's a hype bubble. You might be hesitating. My belief: Hadoop has a great future. Over the next few minutes I'll tell you where Hadoop is today and where it's going, so you can be comfortable adopting it and profit over the long term from all of your data.
Proven incredibly useful. Enables folks to benefit from vastly more data. Not something we're ashamed of; rather, something we're proud of.
… Need to look forward
…Back to today…
Major new capability in Impala. Not a niche, but another step towards a grander future. We know where we're headed, so we shouldn't resist adoption: use Impala today and expect more tomorrow.