Brad Anderson of MapR gave a presentation on Hadoop and Storm. He explained that Hadoop is a distributed computing platform that ships the function to the nodes where the data is located. Storm is described as "Hadoop for real-time" processing: it provides guarantees that data is processed reliably at scale across a cluster. A Storm topology defines a network of spouts, which read data from sources, and bolts, which process the resulting data streams.
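To make the spout and bolt vocabulary concrete, here is a minimal sketch in plain Python. It does not use Storm's API; the function names (`sentence_spout`, `split_bolt`, `count_bolt`) and the sample sentences are hypothetical, and the whole "topology" runs in one process rather than across a cluster.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: reads from a source (here a fixed list) and emits items."""
    yield from [
        "storm is hadoop for real time",
        "hadoop ships the function to the data",
    ]


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes sentences and emits one word at a time."""
    for sentence in stream:
        yield from sentence.split()


def count_bolt(stream: Iterable[str]) -> Counter:
    """Bolt: consumes words and keeps a running count per word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
    return counts


# The "topology" is just the wiring: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
```

In a real Storm deployment the wiring is declared once and each spout and bolt runs as parallel tasks on different machines; the sketch only shows the shape of the data flow.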
2. whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com
3. Hadoop: A Paradigm Shift
Distributed computing platform
– Large clusters
– Commodity hardware
Pioneered at Google
– Google File System, MapReduce and BigTable
Commercially available as Hadoop
4. Ship the Function to the Data
Traditional Architecture
– Data held centrally on SAN/NAS
– A single function (RDBMS), separate from the data
Distributed Computing
– Every node pairs its own data with the function
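The contrast on this slide can be sketched in a few lines of Python. This is an illustration only: the `cluster` dictionary, the node names, and `run_on_each_node` are hypothetical stand-ins, and everything runs in one process instead of on separate machines.

```python
from typing import Callable, Dict, List

# Hypothetical cluster: node name -> the block of data stored on that node.
cluster: Dict[str, List[int]] = {
    "node-1": [3, 8, 1],
    "node-2": [7, 2],
    "node-3": [5, 9, 4],
}


def run_on_each_node(
    function: Callable[[List[int]], int],
    nodes: Dict[str, List[int]],
) -> Dict[str, int]:
    """Apply `function` to each block where it lives; only results come back."""
    return {name: function(block) for name, block in nodes.items()}


# Ship the function (sum) to the data, then combine the small partial results.
partial_sums = run_on_each_node(sum, cluster)
total = sum(partial_sums.values())
```

Only one number per node crosses the "network" here, instead of every record being copied to a central server first.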
13. One Platform for Big Data
…
99.999% HA
Data Protection
Disaster Recovery
Scalability & Performance
Enterprise Integration
Multi-tenancy
MapReduce
File-Based Applications
SQL
Database
Search
Stream Processing
Batch, Interactive, Realtime
Batch
– Log file Analysis
– Data Warehouse Offload
– Fraud Detection
– Clickstream Analytics
Realtime
– Sensor Analysis
– “Twitterscraping”
– Telematics
– Process Optimization
Interactive
– Forensic Analysis
– Analytic Modeling
– BI User Focus
29. Scaling Estimates
Twitter Firehose
Old School – 8+ separate clusters, 20-25 nodes
• >3 Kafka nodes
• >2 TweetLoggers
• 5-10 Hadoop
• >2 Catcher nodes
• >3 Storm
• 3 zookeepers
• NAS for web storage
• >2 web servers
MapR – One Platform
• 5-10 nodes total
• Any node does any job
• Full HA included
• Backups included
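As a rough check on the "Old School" estimate, the per-cluster figures can be added up. The sketch below makes two assumptions that are not stated on the slide: ">N" is read as "at least N", and the NAS is counted as a cluster but not as a node.

```python
# (low, high) node counts per cluster, taken from the "Old School" bullets.
# Assumption: ">N" is treated as "at least N", so low == high for those rows.
old_school = {
    "Kafka": (3, 3),
    "TweetLoggers": (2, 2),
    "Hadoop": (5, 10),
    "Catcher": (2, 2),
    "Storm": (3, 3),
    "ZooKeeper": (3, 3),
    "Web servers": (2, 2),
}

low = sum(lo for lo, _ in old_school.values())
high = sum(hi for _, hi in old_school.values())

# Assumption: the NAS counts as one more cluster, but not as a compute node.
clusters = len(old_school) + 1

print(f"{clusters} clusters, {low}-{high} nodes")  # prints "8 clusters, 20-25 nodes"
```

Under those assumptions the totals line up with the slide's "8+ separate clusters, 20-25 nodes", against 5-10 nodes for the single-platform layout.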