This document discusses humongous data and how MongoDB and Hadoop can be used together to process large datasets. It begins by defining humongous data and showing how the amount of data being created is growing exponentially. It then shows how MongoDB works well as an operational database and offers basic data processing, but has limits, while Hadoop is designed for large-scale data processing. The document concludes by discussing how technologies like MongoDB and Hadoop will continue to evolve to handle the growing sizes of data being created.
2. Talking about
What is Humongous Data
Why MongoDB & Hadoop
Getting Started (Demo)
Who’s using MongoDB & Hadoop
Future of Humongous Data
3. @spf13
AKA Steve Francia
15+ years building the internet
Father, husband, skateboarder
Chief Solutions Architect @
responsible for drivers, integrations, web & docs
5. 2000
Google Inc
Today announced it has released the largest search engine on the Internet.
Google’s new index, comprising more than 1 billion URLs
6. 2008
Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
13. Applications have complex needs
MongoDB is an ideal operational database
MongoDB is ideal for BIG data
Not a data processing engine, but it provides processing functionality
14. MongoDB Map Reduce
MongoDB Data → Map() → emit(k,v) → Group(k) → Sort(k) → Reduce(k,values) → k,v → Finalize(k,v) → k,v
Map() iterates on documents; inside map, the current document is $this
Group(k) / Sort(k): 1 at a time per shard
Reduce(k,values) emits k,v; its input matches its output, so it can run multiple times
Finalize(k,v) produces the final k,v
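The flow above can be sketched as a simplified pure-Python model of mapReduce semantics (this is an illustration of the map → group → sort → reduce → finalize phases, not pymongo's actual API; the `map_reduce` helper and sample tweets are hypothetical):

```python
from collections import defaultdict

def map_reduce(documents, map_fn, reduce_fn, finalize_fn=None):
    # Map phase: iterate over documents; each call may emit
    # zero or more (k, v) pairs, like emit(k, v) in MongoDB.
    emitted = []
    for doc in documents:
        emitted.extend(map_fn(doc))

    # Group + Sort: collect values by key, then process keys in order.
    groups = defaultdict(list)
    for k, v in emitted:
        groups[k].append(v)

    results = {}
    for k in sorted(groups):
        # Reduce: may run multiple times per key, so its output must
        # have the same shape as its input values (here: a plain count).
        v = reduce_fn(k, groups[k])
        # Finalize: runs on the fully reduced value for each key.
        results[k] = finalize_fn(k, v) if finalize_fn else v
    return results

# Count documents per user, mirroring a typical mapReduce job.
tweets = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
counts = map_reduce(
    tweets,
    map_fn=lambda doc: [(doc["user"], 1)],
    reduce_fn=lambda k, values: sum(values),
)
print(counts)  # {'a': 2, 'b': 1}
```

Because reduce's output feeds back in as input when it runs more than once, the sum-of-counts shape here is exactly the "input matches output" constraint from the slide.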
15. MongoDB Map Reduce
MongoDB map reduce is quite capable... but with limits
- JavaScript is not the best language for processing map reduce
- JavaScript is limited in external data processing libraries
- Adds load to the data store
- Sharded environments do parallel processing
16. MongoDB Aggregation
Most uses of MongoDB Map Reduce were for aggregation
The Aggregation Framework is optimized for aggregate queries
It fixes some of the limits of MongoDB MR
- Can do realtime aggregation similar to SQL GROUP BY
- Parallel processing on sharded clusters
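To make the SQL-GROUP-BY comparison concrete, here is what a $group stage computes, modeled in plain Python (the `pipeline` shows the shape you would pass to collection.aggregate() in pymongo; the `group_count` helper and sample documents are hypothetical stand-ins for a live collection):

```python
from collections import defaultdict

# Aggregation pipeline this models, as passed to collection.aggregate():
pipeline = [
    {"$group": {"_id": "$user", "count": {"$sum": 1}}},
]

def group_count(documents, key):
    # Equivalent of {"$group": {"_id": "$<key>", "count": {"$sum": 1}}}:
    # one output document per distinct key value, with a running count.
    counts = defaultdict(int)
    for doc in documents:
        counts[doc[key]] += 1
    return [{"_id": k, "count": n} for k, n in sorted(counts.items())]

tweets = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
print(group_count(tweets, "user"))
# [{'_id': 'a', 'count': 2}, {'_id': 'b', 'count': 1}]
```

The pipeline form runs inside the server in native code rather than JavaScript, which is where the realtime and sharded-parallelism advantages over map reduce come from.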
17. As your data processing needs increase, you will want to use a tool designed for the job
18. Hadoop Map Reduce
InputFormat → Map(k1,v1,ctx): many map operations, 1 at a time per input split
ctx.write(k2,v2): same as Mongo's emit
Combiner(k2,values2) → k2,v3: runs on the same thread as map; similar to Mongo's reducer
Partitioner(k2): similar to Mongo's group
Sort(keys2)
Reduce(k3,values4): reducer threads; runs once per key; similar to Mongo's Finalize
OutputFormat → kf,vf
19. MongoDB & Hadoop
MongoDB (single server or sharded cluster) → InputFormat: creates a list of Input Splits from MongoDB shard chunks (64mb)
RecordReader: reads each split
Map(k1,v1,ctx): many map operations, 1 at a time per input split
ctx.write(k2,v2): same as Mongo's emit
Combiner(k2,values2) → k2,v3: runs on the same thread as map
Partitioner(k2)
Sort(k2)
Reduce(k2,values3): reducer threads; runs once per key
OutputFormat → kf,vf → MongoDB
21. DEMO
Install Hadoop MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
Call myself a data scientist
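The demo's mapper and reducer follow the standard Hadoop streaming contract: read lines on stdin, write tab-separated key/value lines on stdout, with the reducer receiving its input already sorted by key. A minimal word-count sketch of both steps in one file (the filename `mapper_reducer.py` and the sample data are hypothetical, not from the demo):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit one "word\t1" line per word, as Hadoop streaming expects
    # tab-separated key/value pairs on stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so all lines for a given word arrive consecutively.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # map step:    cat tweets.txt | python mapper_reducer.py map
    # reduce step: ... | sort | python mapper_reducer.py reduce
    step = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in step(sys.stdin))
```

Piping map output through `sort` locally reproduces what Hadoop's shuffle/sort phase does between the two steps.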
22. Installing Mongo-hadoop
https://gist.github.com/1887726
hadoop_version="0.23"
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh