Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, loading data from and saving results to MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
2. Talking about
What is Humongous Data
Humongous Data & You
MongoDB & Data processing
Future of Humongous Data
3. @spf13
AKA
Steve Francia
15+ years building
the internet
Father, husband,
skateboarder
Chief Solutions Architect @
responsible for drivers,
integrations, web & docs
5. 2000
Google Inc
Today announced it has released
the largest search engine on the
Internet.
Google’s new index, comprising
more than 1 billion URLs
6. 2008
Our indexing system for processing
links indicates that
we now count 1 trillion unique URLs
(and the number of individual web
pages out there is growing by
several billion pages per day).
8. Data Growth
[Bar chart: millions of URLs in Google's index by year, 2000 through 2008: 1, 4, 10, 24, 55, 120, 250, 500, 1000]
9. Truly Exponential
Growth
Is hard for people to grasp
A BBC reporter put it recently: "Your current PC
is more powerful than the computer they
had on board the first flight to the moon."
10. Moore’s Law
Applies to more than just CPUs
Boiled down, it says that things double at
regular intervals
It's exponential growth... and it applies to
big data
25. Applications have
complex needs
MongoDB is an ideal operational
database
MongoDB is ideal for BIG data
It is not a data processing engine, but it
provides processing functionality
26. Many options for
Processing Data
•Process in MongoDB using
Map Reduce
•Process in MongoDB using
Aggregation Framework
•Process outside MongoDB (using Hadoop)
27. MongoDB Map Reduce
[Diagram: MongoDB map reduce data flow]
- map() iterates over the MongoDB data, one document at a time per shard; the current document is $this
- emit(k, v) outputs key/value pairs
- Group(k) and Sort(k) collect and order the emitted pairs
- Reduce(k, values) folds each key's values down to a single k,v; it can run multiple times, so its input must match its output
- Finalize(k, v) runs last over each resulting k,v
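The flow above can be sketched in pure Python. The real feature takes JavaScript functions (via db.collection.mapReduce); the sample documents, field names, and functions below are hypothetical stand-ins for illustration only:

```python
from collections import defaultdict

# Sample documents, standing in for a MongoDB collection.
docs = [
    {"user": "alice", "likes": 3},
    {"user": "bob", "likes": 5},
    {"user": "alice", "likes": 2},
]

def map_fn(doc):
    # In real MongoDB map reduce this is JavaScript and the document
    # is `this`; emit(k, v) may be called zero or more times per doc.
    yield doc["user"], doc["likes"]

def reduce_fn(key, values):
    # Must be associative and commutative: it can run multiple times
    # over partial results, so its output must match the value shape.
    return sum(values)

def finalize_fn(key, value):
    # Optional post-processing, run once per key at the end.
    return {"_id": key, "value": value}

# Group emitted values by key (MongoDB groups and sorts per shard).
grouped = defaultdict(list)
for doc in docs:
    for k, v in map_fn(doc):
        grouped[k].append(v)

results = [finalize_fn(k, reduce_fn(k, vs)) for k, vs in sorted(grouped.items())]
print(results)
# [{'_id': 'alice', 'value': 5}, {'_id': 'bob', 'value': 5}]
```

Because reduce can be re-run over its own partial output, its return value must have the same shape as the emitted values, which is what "input matches output" means in the diagram.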
28. MongoDB Map Reduce
MongoDB map reduce is quite capable... but it has
limits
- JavaScript is not the best language for writing
map reduce jobs
- JavaScript has limited external data processing
libraries
- It adds load to the data store
29. MongoDB
Aggregation
Most uses of MongoDB Map Reduce were for
aggregation
The Aggregation Framework is optimized for aggregate
queries
Real-time aggregation, similar to SQL GROUP BY
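As a sketch of the GROUP BY analogy, the hypothetical pipeline below (field names invented for illustration) sits next to a pure-Python rendering of what its two stages compute; with pymongo the pipeline would be passed to collection.aggregate():

```python
from collections import defaultdict

# Hypothetical pipeline, roughly:
#   SELECT user, SUM(likes) FROM docs WHERE likes > 0 GROUP BY user
pipeline = [
    {"$match": {"likes": {"$gt": 0}}},
    {"$group": {"_id": "$user", "total": {"$sum": "$likes"}}},
]

docs = [
    {"user": "alice", "likes": 3},
    {"user": "bob", "likes": 5},
    {"user": "alice", "likes": 2},
]

# Pure-Python rendering of the two stages above.
matched = [d for d in docs if d["likes"] > 0]   # $match
totals = defaultdict(int)                        # $group with $sum
for d in matched:
    totals[d["user"]] += d["likes"]
result = [{"_id": k, "total": v} for k, v in sorted(totals.items())]
print(result)
# [{'_id': 'alice', 'total': 5}, {'_id': 'bob', 'total': 5}]
```

The pipeline runs inside the server in native code rather than JavaScript, which is why it handles aggregate queries better than map reduce.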
30. MongoDB & Hadoop
[Diagram: a Hadoop MapReduce job reading from and writing to MongoDB]
- MongoDB (single server or sharded cluster) is the input; the InputFormat creates a list of input splits from MongoDB shard chunks (64MB), same as Mongo's
- A RecordReader reads each split, one at a time per input split
- Map(k1, v1, ctx) runs against each split (many map operations), emitting pairs via ctx.write(k2, v2)
- Combiner(k2, values2) runs on the same thread as map, pre-aggregating to (k2, v3)
- Partitioner(k2) assigns each key to a reducer; Sort(k2) orders the keys
- Reduce(k2, values3) runs on reducer threads, once per key
- The Output Format writes the final (kf, vf) pairs back to MongoDB
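The combine, partition, sort, and reduce stages above can be mimicked in a small pure-Python sketch; the word counts and split contents are invented for illustration:

```python
from collections import defaultdict

# Emitted (k2, v2) pairs from two map tasks, one per input split
# (in mongo-hadoop, one per 64MB MongoDB shard chunk).
map_outputs = [
    [("mongodb", 1), ("hadoop", 1), ("mongodb", 1)],   # split 0
    [("hadoop", 1), ("hadoop", 1)],                    # split 1
]

def combiner(pairs):
    # Runs on the same thread as the map task; pre-aggregates locally
    # so less data crosses the network to the reducers.
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

def partitioner(key, num_reducers):
    # Decides which reducer receives each key (hash partitioning here).
    return hash(key) % num_reducers

NUM_REDUCERS = 2
shuffle = defaultdict(lambda: defaultdict(list))
for split in map_outputs:
    for k, v in combiner(split):
        shuffle[partitioner(k, NUM_REDUCERS)][k].append(v)

# Each reducer sorts its keys and runs reduce once per key.
results = {}
for r in range(NUM_REDUCERS):
    for k in sorted(shuffle[r]):
        results[k] = sum(shuffle[r][k])
print(results)
```

Note that the combiner and reducer do the same summation; that is typical, since the combiner is just an optional local pre-reduce.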
32. DEMO
Install Hadoop MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
Call myself a data scientist
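A Hadoop-streaming-style mapper and reducer for a hashtag count might look like the sketch below. This is plain text streaming (tab-delimited key/value lines on stdin/stdout); mongo-hadoop's streaming support actually passes BSON documents through helper libraries, so treat the I/O format and the one-tweet-per-line layout here as assumptions:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: one tweet's text per input line (layout assumed);
    # emit "hashtag<TAB>1" for every hashtag found.
    for line in lines:
        for word in line.split():
            if word.startswith("#"):
                yield f"{word.lower()}\t1"

def reducer(lines):
    # Reduce phase: Hadoop streaming delivers lines sorted by key,
    # so runs of the same key can be summed with groupby.
    parsed = (line.split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked by the streaming jar as either `... map` or `... reduce`,
    # with Hadoop doing the sort/shuffle between the two stages.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

Sorting the mapper output by hand and piping it into the reducer reproduces what Hadoop's shuffle does, which makes the pair easy to test locally before submitting a job.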
33. Installing Mongo-hadoop
https://gist.github.com/1887726
hadoop_version="0.23"
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh