This document discusses humongous data and how MongoDB and Hadoop can be used together to process large datasets. It begins by defining humongous data and showing how the amount of data being created is growing exponentially. It then shows how MongoDB works well as an operational database and offers basic data processing, but has limits, while Hadoop is designed for large-scale data processing. The document concludes by discussing how technologies like MongoDB and Hadoop will continue to evolve to handle the growing sizes of data being created.
2. Talking about
What is Humongous Data
Why MongoDB & Hadoop
Getting Started (Demo)
Who’s using MongoDB & Hadoop
Future of Humongous Data
3. @spf13
AKA Steve Francia
15+ years building the internet
Father, husband, skateboarder
Chief Solutions Architect @
responsible for drivers, integrations, web & docs
5. 2000
Google Inc
Today announced it has released the largest search engine on the Internet.
Google’s new index, comprising more than 1 billion URLs
6. 2008
Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
13. Applications have complex needs
MongoDB is an ideal operational database
MongoDB is ideal for BIG data
Not a data processing engine, but it provides processing functionality
14. MongoDB Map Reduce
MongoDB Data → Map() → emit(k,v) → Group(k) → Sort(k) → Reduce(k,values) → k,v → Finalize(k,v) → k,v
Map() iterates on documents; inside map, the current document is $this
Group(k) / Sort(k): 1 at a time per shard
Reduce(k,values) emits k,v; its input matches its output, so it can run multiple times
Finalize(k,v) produces the final k,v
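The flow above can be sketched as a simplified pure-Python model of mapReduce semantics (this is an illustration of the map → group → sort → reduce → finalize phases, not pymongo's actual API; the `map_reduce` helper and sample tweets are hypothetical):

```python
from collections import defaultdict

def map_reduce(documents, map_fn, reduce_fn, finalize_fn=None):
    # Map phase: iterate over documents; each call may emit
    # zero or more (k, v) pairs, like emit(k, v) in MongoDB.
    emitted = []
    for doc in documents:
        emitted.extend(map_fn(doc))

    # Group + Sort: collect values by key, then process keys in order.
    groups = defaultdict(list)
    for k, v in emitted:
        groups[k].append(v)

    results = {}
    for k in sorted(groups):
        # Reduce: may run multiple times per key, so its output must
        # have the same shape as its input values (here: a plain count).
        v = reduce_fn(k, groups[k])
        # Finalize: runs on the fully reduced value for each key.
        results[k] = finalize_fn(k, v) if finalize_fn else v
    return results

# Count documents per user, mirroring a typical mapReduce job.
tweets = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
counts = map_reduce(
    tweets,
    map_fn=lambda doc: [(doc["user"], 1)],
    reduce_fn=lambda k, values: sum(values),
)
print(counts)  # {'a': 2, 'b': 1}
```

Because reduce's output feeds back in as input when it runs more than once, the sum-of-counts shape here is exactly the "input matches output" constraint from the slide.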
15. MongoDB Map Reduce
MongoDB map reduce is quite capable... but with limits
- JavaScript is not the best language for processing map reduce
- JavaScript is limited in external data processing libraries
- Adds load to the data store
- Sharded environments do parallel processing
16. MongoDB Aggregation
Most uses of MongoDB Map Reduce were for aggregation
The Aggregation Framework is optimized for aggregate queries
It fixes some of the limits of MongoDB MR
- Can do realtime aggregation similar to SQL GROUP BY
- Parallel processing on sharded clusters
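To make the SQL-GROUP-BY comparison concrete, here is what a $group stage computes, modeled in plain Python (the `pipeline` shows the shape you would pass to collection.aggregate() in pymongo; the `group_count` helper and sample documents are hypothetical stand-ins for a live collection):

```python
from collections import defaultdict

# Aggregation pipeline this models, as passed to collection.aggregate():
pipeline = [
    {"$group": {"_id": "$user", "count": {"$sum": 1}}},
]

def group_count(documents, key):
    # Equivalent of {"$group": {"_id": "$<key>", "count": {"$sum": 1}}}:
    # one output document per distinct key value, with a running count.
    counts = defaultdict(int)
    for doc in documents:
        counts[doc[key]] += 1
    return [{"_id": k, "count": n} for k, n in sorted(counts.items())]

tweets = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
print(group_count(tweets, "user"))
# [{'_id': 'a', 'count': 2}, {'_id': 'b', 'count': 1}]
```

The pipeline form runs inside the server in native code rather than JavaScript, which is where the realtime and sharded-parallelism advantages over map reduce come from.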
17. As your data processing needs increase, you will want to use a tool designed for the job
18. Hadoop Map Reduce
InputFormat → Map(k1,v1,ctx): many map operations, 1 at a time per input split
ctx.write(k2,v2): same as Mongo's emit
Combiner(k2,values2) → k2,v3: runs on the same thread as map; similar to Mongo's reducer
Partitioner(k2): similar to Mongo's group
Sort(keys2)
Reduce(k3,values4): reducer threads; runs once per key; similar to Mongo's Finalize
OutputFormat → kf,vf
19. MongoDB & Hadoop
MongoDB (single server or sharded cluster) → InputFormat: creates a list of Input Splits from MongoDB shard chunks (64mb)
RecordReader: reads each split
Map(k1,v1,ctx): many map operations, 1 at a time per input split
ctx.write(k2,v2): same as Mongo's emit
Combiner(k2,values2) → k2,v3: runs on the same thread as map
Partitioner(k2)
Sort(k2)
Reduce(k2,values3): reducer threads; runs once per key
OutputFormat → kf,vf → MongoDB
21. DEMO
Install Hadoop MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
Call myself a data scientist
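The demo's mapper and reducer follow the standard Hadoop streaming contract: read lines on stdin, write tab-separated key/value lines on stdout, with the reducer receiving its input already sorted by key. A minimal word-count sketch of both steps in one file (the filename `mapper_reducer.py` and the sample data are hypothetical, not from the demo):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit one "word\t1" line per word, as Hadoop streaming expects
    # tab-separated key/value pairs on stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so all lines for a given word arrive consecutively.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # map step:    cat tweets.txt | python mapper_reducer.py map
    # reduce step: ... | sort | python mapper_reducer.py reduce
    step = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in step(sys.stdin))
```

Piping map output through `sort` locally reproduces what Hadoop's shuffle/sort phase does between the two steps.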
22. Installing Mongo-hadoop
https://gist.github.com/1887726
hadoop_version="0.23"
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh