Mongo-Hadoop Integration
Mike O’Brien, Software Engineer @ 10gen

We will cover:
A quick briefing on what Mongo
and Hadoop are all about

The Mongo-Hadoop connector:
•what it is
•how it works
•a tour of what it can do
(Q+A at the end)
Choosing the Right Tool for the Task
Upcoming Webinar:
MongoDB and Hadoop - Essential Tools for
Your Big Data Playbook
August 21st, 2013
10am PDT, 1pm EDT, 6pm BST
Register at 10gen.com/events/biz-hadoop

document-oriented database with dynamic schema

stores data in JSON-like documents:

{
  _id : "mike",
  age : 21,
  location : {
    state : "NY",
    zip : "11222"
  },
  favorite_colors : ["red", "green"]
}
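
To make that concrete, here is a minimal sketch (not from the slides) of inserting and querying a document like the one above with pymongo; the server address and the database/collection names are placeholder assumptions.

# Minimal pymongo sketch (assumes a MongoDB instance on localhost:27017;
# database/collection names are placeholders, not from the slides).
from pymongo import MongoClient

people = MongoClient("localhost", 27017).test.people

# dynamic schema: no table/schema declaration needed before inserting
people.insert_one({
    "_id": "mike",
    "age": 21,
    "location": {"state": "NY", "zip": "11222"},
    "favorite_colors": ["red", "green"],
})

# query on a nested field with dot notation
print(people.find_one({"location.state": "NY"}))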
mongoDB scales horizontally with sharding to handle lots of data and load

(diagram: an app connected to a growing sharded mongoDB cluster)
Java-based framework for Map/Reduce
Excels at batch processing on large data sets
by taking advantage of parallelism

Mongo-Hadoop Connector - Why
Lots of people using Hadoop and Mongo
separately, but need integration
Need to process data across multiple sources
Custom code or slow, hacky import/export scripts are often used to get data in and out
Need scalability and flexibility as Hadoop or MongoDB configurations change
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem:
use as the input or output for Hadoop

New Feature: As of v1.1, also works with MongoDB
backup files (.bson)

(diagram: input data — from MongoDB or .BSON files — flows into a Hadoop cluster, and output results flow back out to MongoDB or .BSON files)
Mongo-Hadoop Connector
Benefits + Features

Takes advantage of full multi-core parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local filesystem, HDFS, or S3
Mongo-Hadoop Connector
Benefits + Features

Vanilla Java MapReduce
or, if you don't want to use Java, support for Hadoop Streaming:
write MapReduce code in Ruby or Python
Mongo-Hadoop Connector
Benefits + Features

Support for Pig
high-level scripting language for data analysis and building map/reduce workflows

Support for Hive
SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:

The adapter examines the MongoDB input collection and calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally
Hadoop merges the results and streams the output back to MongoDB or BSON
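
As a rough mental model of the split step (a conceptual sketch only, not the connector's actual splitter logic), imagine carving an _id range into fixed-size chunks that independent workers can scan in parallel:

# Conceptual sketch only -- not the connector's real split calculation.
def calculate_splits(min_id, max_id, split_size):
    """Yield (lower, upper) ranges covering [min_id, max_id)."""
    lower = min_id
    while lower < max_id:
        upper = min(lower + split_size, max_id)
        # a worker assigned this split would query documents with
        # {"_id": {"$gte": lower, "$lt": upper}}
        yield (lower, upper)
        lower = upper

print(list(calculate_splits(0, 1000, 250)))
# -> [(0, 250), (250, 500), (500, 750), (750, 1000)]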
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
Input Data: Enron e-mail corpus (501k records, 1.75 GB)

Each document is one email:
{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",           <- sender
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",                <- recipients
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair

(diagram: alice, bob, charlie, and eve as nodes, with edge weights 14, 9, 48, 99, and 20 for the message counts between pairs)

{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
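
Before reaching for Hadoop, the target output is easy to sanity-check in plain Python on a couple of hand-made documents (the sample docs below are hypothetical, shaped like the Enron records above):

# Plain-Python sketch of the target computation (hypothetical sample docs).
from collections import Counter

docs = [
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com, eve@enron.com"}},
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com"}},
]

counts = Counter()
for doc in docs:
    sender = doc["headers"]["From"]
    for recip in (r.strip() for r in doc["headers"]["To"].split(",")):
        counts[(sender, recip)] += 1

for (f, t), n in counts.items():
    print({"_id": {"t": t, "f": f}, "count": n})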
Example 1 - Java MapReduce
Map phase - each input doc gets passed through a Mapper function

@Override
public void map(NullWritable key, BSONObject val, final Context context){
    // "val" is the mongoDB document passed into Hadoop MapReduce
    BSONObject headers = (BSONObject) val.get("headers");
    if(headers.containsKey("From") && headers.containsKey("To")){
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for(int i = 0; i < recips.length; i++){
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Example 1 - Java MapReduce (cont)
Reduce phase - outputs of Map are grouped together by key and passed to Reducer

public void reduce( final MailPair pKey,                  // the {to, from} key
                    final Iterable<IntWritable> pValues,  // list of all the values collected under the key
                    final Context pContext ){
    int sum = 0;
    for ( final IntWritable value : pValues ){
        sum += value.get();
    }
    BSONObject outDoc = new BasicDBObjectBuilder().start()
                            .add( "f" , pKey.from )
                            .add( "t" , pKey.to )
                            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write( pkeyOut, new IntWritable(sum) );       // output written back to MongoDB
}
Example 1 - Java MapReduce (cont)
Read from MongoDB
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
(or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)
Write output to MongoDB
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
(or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
Results : Output Data
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming

Let’s do the same Enron Map/Reduce job
with Python instead of Java

$ pip install pymongo_hadoop

Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process via STDOUT/STDIN

(diagram: the Hadoop JVM streams input records over STDIN to a Python / Ruby / JS interpreter running your mapper — def mapper(documents): ... — and reads the results back over STDOUT)
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
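
The two functions above can also be exercised locally as a quick sanity check, without Hadoop at all (this assumes mapper and reducer are importable with the BSONMapper/BSONReducer wiring left out, and uses a hypothetical sample document):

# Local sanity check of the streaming mapper/reducer above -- no Hadoop needed.
sample = [{'headers': {'From': 'alice@enron.com',
                       'To': 'bob@enron.com, eve@enron.com'}}]

pairs = list(mapper(sample))
# -> one {'_id': {'f': ..., 't': ...}, 'count': 1} dict per recipient

key = {'f': 'alice@enron.com', 't': 'bob@enron.com'}
print(reducer(key, [p for p in pairs if p['_id'] == key]))
# -> {'_id': {'f': 'alice@enron.com', 't': 'bob@enron.com'}, 'count': 1}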
Surviving Hadoop:
making MapReduce easier

with Pig + Hive
Example 3 - Mongo-Hadoop and Pig
Let's do the same thing yet again, but this time using Pig
Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts
Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)
Pig directives for loading data:
BSONLoader and MongoLoader
data = LOAD 'mongodb://localhost:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;

Writing data out
BSONStorage and MongoInsertStorage
STORE records INTO 'file:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;

Example 3 - Mongo-Hadoop and Pig (cont)

Pig has its own special datatypes:
Bags, Maps, and Tuples
Mongo-Hadoop Connector intelligently
converts between Pig datatypes and
MongoDB datatypes

Example 3 - Mongo-Hadoop and Pig (cont)
raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
    from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
    group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop

Similar idea to Pig - process your data
without needing to write Map/Reduce
code from scratch

...but with SQL as the language of choice

Hive with Mongo-Hadoop
Sample Data: db.users
db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...

first, declare the collection to be accessible in Hive:
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
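
If you want to reproduce this locally, one way to load the sample users into test.users is a short pymongo script (a sketch; the connector itself does not require this step, and the localhost address is an assumption):

# Sketch: load the sample data above into test.users with pymongo
# (assumes MongoDB on localhost:27017).
from pymongo import MongoClient

users = MongoClient("localhost", 27017).test.users
users.insert_many([
    {"_id": 1,   "name": "Tom",   "age": 28},
    {"_id": 2,   "name": "Alice", "age": 18},
    {"_id": 3,   "name": "Bob",   "age": 29},
    {"_id": 101, "name": "Scott", "age": 10},
    {"_id": 104, "name": "Jesse", "age": 52},
    {"_id": 110, "name": "Mike",  "age": 32},
])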
Hive with Mongo-Hadoop
...then you can run SQL on it, like a table:
SELECT name, age FROM mongo_users WHERE id > 100;

you can use GROUP BY:
SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

or JOIN multiple tables/collections together:
SELECT * FROM mongo_users T1
JOIN user_emails T2
ON T1.id = T2.id;
Write the output of queries back into new tables:
INSERT OVERWRITE TABLE old_users SELECT id, name, age
FROM mongo_users WHERE age > 100;

Drop a table in Hive to delete the underlying collection in MongoDB:
DROP TABLE mongo_users;
Usage with Amazon Elastic MapReduce

Run mongo-hadoop jobs without
needing to set up or manage your
own Hadoop cluster.

Usage with Amazon Elastic MapReduce
First, make a “bootstrap” script that
fetches dependencies (mongo-hadoop
jar and java drivers)
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoopcode/mongo-hadoop-core_1.1.2-1.1.0.jar

this will get executed on each node in
the cluster that EMR builds for us.
Example 4 - Usage with Amazon Elastic MapReduce
Put the bootstrap script, and all your code,
into an S3 bucket where Amazon can see it.

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce
...then launch the job from the command line, pointing to your S3 locations
(--instance-type and --num-instances control the type and number of instances in the cluster)

$ elastic-mapreduce --create --jobflow ENRON000
  --instance-type m1.xlarge
  --num-instances 5
  --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
  --log-uri s3://$S3_BUCKET/enron_logs
  --jar s3://$S3_BUCKET/enron-example.jar
  --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
  --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
  --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
  --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
  # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
Turn up the "num-instances" knob to make jobs complete faster
Logs get captured into S3 files
(Pig, Hive, and streaming work on EMR, too!)
Example 5 - new feature: MongoUpdateWritable
In previous examples, we wrote job output data
by inserting into a new collection
... but we can also modify an existing output
collection
Works by applying mongoDB update modifiers:
$push, $pull, $addToSet, $inc, $set, etc.

Can be used to do incremental Map/Reduce or
“join” two collections
Example 5 - MongoUpdateWritable
Let's say we have two collections:

sensors:
{
  "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
  "name": "730LsRkX",
  "type": "pressure",
  "owner": "steve"
}

log events:
{
  "_id": ObjectId("51b792d381c3e67b0a18d678"),
  "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),    <- refers to which sensor logged the event
  "value": 3328.5895416489802,
  "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
  "loc": [-175.13, 51.658]
}

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
Plain English:
Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on the sensors collection
(flow: read the sensors collection from mongoDB → map/reduce → insert() new records into a Results collection in mongoDB)
map: for each sensor, emit {key: owner+type, value: _id}
reduce: group data from map() under each key, output {key: owner+type, val: [list of _ids]}
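
Stage 1 boils down to a group-by; here it is in miniature as plain Python (hypothetical sensor docs, no Hadoop) to show the shape of the output:

# Stage 1 in miniature: group sensor _ids under an "owner type" key
# (hypothetical sample docs; real _ids are ObjectIds).
from collections import defaultdict

sensors = [
    {"_id": "s1", "owner": "alice", "type": "pressure"},
    {"_id": "s2", "owner": "alice", "type": "pressure"},
    {"_id": "s3", "owner": "bob",   "type": "temp"},
]

groups = defaultdict(list)
for s in sensors:
    groups["%s %s" % (s["owner"], s["type"])].append(s["_id"])

for key, ids in groups.items():
    print({"_id": key, "sensors": ids})
# -> {'_id': 'alice pressure', 'sensors': ['s1', 's2']}
#    {'_id': 'bob temp', 'sensors': ['s3']}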
After stage one, the output docs look like:
{
  "_id": "alice pressure",                  <- the sensor's owner and type
  "sensors": [                              <- list of IDs of sensors with this owner and type
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    ...
  ]
}

Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on the log events collection
(flow: read the log events collection from mongoDB → map/reduce → update() existing records in the Results collection)
map: for each log event, emit {key: sensor_id, value: 1}
reduce: group data from map() under each key; for each value in that key:
    update({sensors: key}, {$inc : {logs_count: 1}})

context.write(null,
    new MongoUpdateWritable(
        query,    // which documents to modify
        update,   // how to modify ($inc)
        true,     // upsert
        false));  // multi
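
For comparison, each MongoUpdateWritable emitted above corresponds to an update you could issue yourself from a client; a pymongo sketch (collection and field names follow this example, and the localhost address is an assumption):

# Sketch of the equivalent client-side update with pymongo; it mirrors the
# MongoUpdateWritable arguments above: upsert=True, multi=False.
from pymongo import MongoClient

results = MongoClient("localhost", 27017).test.results

def count_log_event(sensor_id):
    # find a stage-1 result doc whose "sensors" array contains this id
    # and bump its counter ($inc creates the field if it is missing)
    results.update_one({"sensors": sensor_id},
                       {"$inc": {"logs_count": 1}},
                       upsert=True)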
Example 5 - MongoUpdateWritable
Result after stage 2:
{
  "_id": "1UoTcvnCTz temp",
  "sensors": [
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    ...
  ],
  "logs_count": 1050616                     <- now populated with the correct count
}
Upcoming Features (v1.2 and beyond)
Performance improvements - lazy BSON
Full-featured Hive support
Support for multi-collection input sources
API for adding custom splitter implementations
...and more
Recap
Mongo-Hadoop - use Hadoop to do massive computations
on big data sets stored in Mongo/BSON

MongoDB becomes a Hadoop-enabled filesystem

Tools and APIs make it easier:
Streaming, Pig, Hive, EMR, etc.
Questions?

Examples can be found on github:
https://github.com/mongodb/mongo-hadoop/tree/master/examples

Más contenido relacionado

La actualidad más candente

Elastic search integration with hadoop leveragebigdata
Elastic search integration with hadoop   leveragebigdataElastic search integration with hadoop   leveragebigdata
Elastic search integration with hadoop leveragebigdataPooja Gupta
 
Overview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationOverview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationDBOnto
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopIMC Institute
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big DataEdureka!
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsIMC Institute
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuMatthew Hayes
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)IMC Institute
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, LarusNeo4j
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchclintongormley
 

La actualidad más candente (20)

Elastic search integration with hadoop leveragebigdata
Elastic search integration with hadoop   leveragebigdataElastic search integration with hadoop   leveragebigdata
Elastic search integration with hadoop leveragebigdata
 
Overview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationOverview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentation
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big Data
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Insight_150115_Demo
Insight_150115_DemoInsight_150115_Demo
Insight_150115_Demo
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFu
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearch
 

Similar a Hadoop webinar-130808141030-phpapp01

Linked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolLinked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolGabriel Dragomir
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Tugdual Grall
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...xu liwei
 
Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Gabriel Dragomir
 
Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Nuxeo
 
One Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web AppOne Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web Apptechnicolorenvy
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselNuxeo
 
Presentationnosqlmah
PresentationnosqlmahPresentationnosqlmah
Presentationnosqlmahp3rnilla
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPaco Nathan
 
Mongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeMongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeSpyros Passas
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!NLJUG
 
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...New Relic
 
실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3NAVER D2
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Ashok Royal
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and OntarioBigData_Europe
 

Similar a Hadoop webinar-130808141030-phpapp01 (20)

Linked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolLinked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache Stanbol
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...
 
Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?
 
Couchbase
CouchbaseCouchbase
Couchbase
 
Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011
 
One Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web AppOne Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web App
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogrisel
 
Presentationnosqlmah
PresentationnosqlmahPresentationnosqlmah
Presentationnosqlmah
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and Hadoop
 
Mongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeMongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappe
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
 
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
 
실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 

Último

AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
• 25. Mongo-Hadoop Connector Benefits + Features: Vanilla Java MapReduce, or, if you don't want to use Java, support for Hadoop Streaming, so you can write MapReduce code in ruby or python.
• 28. Mongo-Hadoop Connector Benefits + Features: Support for Pig, a high-level scripting language for data analysis and building map/reduce workflows, and support for Hive, a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
• 33. Mongo-Hadoop Connector, how it works:
    - The adapter examines the MongoDB input collection and calculates a set of splits from the data
    - Each split gets assigned to a node in the Hadoop cluster
    - In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally
    - Hadoop merges the results and streams the output back to MongoDB or BSON
• 38. Tour of Mongo-Hadoop, by Example:
    - Using Java MapReduce with Mongo-Hadoop
    - Using Hadoop Streaming
    - Pig and Hive with Mongo-Hadoop
    - Elastic MapReduce + BSON
• 41. Input Data: the Enron e-mail corpus (501k records, 1.75 GB), where each document is one email. The From header is the sender and the To header lists the recipients:
    {
      "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
      "body" : "Here is our forecast\n\n ",
      "filename" : "1.",
      "headers" : {
        "From" : "phillip.allen@enron.com",
        "Subject" : "Forecast Info",
        "X-bcc" : "",
        "To" : "tim.belden@enron.com",
        "X-Origin" : "Allen-P",
        "X-From" : "Phillip K Allen",
        "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        "X-To" : "Tim Belden ",
        "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
      }
    }
• 45. Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair (the slide shows these counts as a small directed graph between alice, bob, charlie, and eve). The output documents look like:
    {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
    {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
    {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
    {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
    {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
• 47. Example 1 - Java MapReduce. Map phase: each input document (a MongoDB document passed into Hadoop MapReduce) gets passed through a Mapper function:
    @Override
    public void map(NullWritable key, BSONObject val, final Context context){
        BSONObject headers = (BSONObject)val.get("headers");
        if(headers.containsKey("From") && headers.containsKey("To")){
            String from = (String)headers.get("From");
            String to = (String)headers.get("To");
            String[] recips = to.split(",");
            for(int i = 0; i < recips.length; i++){
                String recip = recips[i].trim();
                context.write(new MailPair(from, recip), new IntWritable(1));
            }
        }
    }
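The map() above emits a custom MailPair composite key, which the deck references but never shows. A minimal sketch of what such a key class could look like, assuming it implements Hadoop's WritableComparable (the field and constructor shapes are inferred from how the slides use it, not taken from the talk):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key pairing a sender with a single recipient.
    public class MailPair implements WritableComparable<MailPair> {
        public String from;
        public String to;

        public MailPair() {}  // no-arg constructor required by Hadoop for deserialization

        public MailPair(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(from);
            out.writeUTF(to);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            from = in.readUTF();
            to = in.readUTF();
        }

        @Override
        public int compareTo(MailPair other) {
            // sort by sender, then recipient, so identical pairs group together in reduce()
            int cmp = from.compareTo(other.from);
            return cmp != 0 ? cmp : to.compareTo(other.to);
        }

        @Override
        public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MailPair)) return false;
            MailPair p = (MailPair) o;
            return from.equals(p.from) && to.equals(p.to);
        }
    }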
• 51. Example 1 - Java MapReduce (cont). Reduce phase: the outputs of Map are grouped together by key (the {to, from} key) and passed to the Reducer, along with the list of all the values collected under that key; the output is written back to MongoDB:
    public void reduce( final MailPair pKey,
                        final Iterable<IntWritable> pValues,
                        final Context pContext ){
        int sum = 0;
        for ( final IntWritable value : pValues ){
            sum += value.get();
        }
        BSONObject outDoc = new BasicDBObjectBuilder().start()
                                .add( "f", pKey.from )
                                .add( "t", pKey.to )
                                .get();
        BSONWritable pkeyOut = new BSONWritable(outDoc);
        pContext.write( pkeyOut, new IntWritable(sum) );
    }
• 54. Example 1 - Java MapReduce (cont). Read from MongoDB:
    mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
    mongo.input.uri=mongodb://my-db:27017/enron.messages
    Read from BSON:
    mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    mapred.input.dir=file:///tmp/messages.bson
    (or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
• 57. Example 1 - Java MapReduce (cont). Write output to MongoDB:
    mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
    mongo.output.uri=mongodb://my-db:27017/enron.results_out
    Write output to BSON:
    mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    mapred.output.dir=file:///tmp/results.bson
    (or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
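To wire the Mapper, the Reducer, and these properties together, a job also needs a small driver, which the slides don't show. The following is only a minimal sketch: EnronGraphJob, EnronMailMapper, and EnronMailReducer are hypothetical names standing in for the map()/reduce() code above, and it assumes the standard Hadoop Job API plus the connector classes already named on these slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;

    public class EnronGraphJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // same property keys as on the slides above
            conf.set("mongo.input.uri",  "mongodb://my-db:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

            Job job = Job.getInstance(conf, "enron sender/recipient graph");
            job.setJarByClass(EnronGraphJob.class);

            // read the input collection directly from MongoDB and write results back;
            // swap in BSONFileInputFormat/BSONFileOutputFormat (and the path properties)
            // to work with .bson backup files instead. Note that the key type your
            // mapper sees depends on the input format you pick.
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            // hypothetical class names for the map()/reduce() shown on the previous slides
            job.setMapperClass(EnronMailMapper.class);
            job.setReducerClass(EnronMailReducer.class);

            job.setMapOutputKeyClass(MailPair.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(BSONWritable.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Switching between a live MongoDB collection and a .bson dump is then just a matter of swapping the format classes and the corresponding URI or path property.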
• 58. Results: Output Data
    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
    ... has more
• 59. Example 2 - Hadoop Streaming: let's do the same Enron Map/Reduce job with Python instead of Java.
    $ pip install pymongo_hadoop
• 60. Example 2 - Hadoop Streaming (cont). Hadoop passes data to an external process over STDIN/STDOUT: the Hadoop JVM pipes the input documents for each map(k, v) call to a Python / Ruby / JS interpreter running your mapper (def mapper(documents): ...) and reads the emitted results back.
• 61. Example 2 - Hadoop Streaming (cont). The mapper:
    import sys
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            from_field = doc['headers']['From']
            to_field = doc['headers']['To']
            recips = [x.strip() for x in to_field.split(',')]
            for r in recips:
                yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
• 62. Example 2 - Hadoop Streaming (cont). The reducer:
    import sys
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
• 63. Surviving Hadoop: making MapReduce easier with Pig + Hive.
• 66. Example 3 - Mongo-Hadoop and Pig: let's do the same thing yet again, but this time using Pig. Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts, and can perform JOIN and GROUP operations and execute user-defined functions (UDFs).
• 67. Example 3 - Mongo-Hadoop and Pig (cont). Pig directives for loading data: BSONLoader and MongoLoader, e.g.
    data = LOAD 'mongodb://localhost:27017/db.collection'
        using com.mongodb.hadoop.pig.MongoLoader;
    Writing data out: BSONStorage and MongoInsertStorage, e.g.
    STORE records INTO 'file:///output.bson'
        using com.mongodb.hadoop.pig.BSONStorage;
• 68. Example 3 - Mongo-Hadoop and Pig (cont). Pig has its own special datatypes (Bags, Maps, and Tuples); the Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes.
• 73. Example 3 - Mongo-Hadoop and Pig (cont). The full Pig script:
    raw = LOAD 'hdfs:///messages.bson'
        using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');
    send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
    send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
    send_recip_split = FOREACH send_recip_filtered GENERATE from as from,
        TRIM(FLATTEN(TOKENIZE(to))) as to;
    send_recip_grouped = GROUP send_recip_split BY (from, to);
    send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count;
    STORE send_recip_counted INTO 'file:///enron_results.bson'
        using com.mongodb.hadoop.pig.BSONStorage;
• 74. Hive with Mongo-Hadoop: similar idea to Pig, process your data without needing to write Map/Reduce code from scratch, but with SQL as the language of choice.
• 75. Hive with Mongo-Hadoop. Sample data, db.users:
    db.users.find()
    { "_id": 1, "name": "Tom", "age": 28 }
    { "_id": 2, "name": "Alice", "age": 18 }
    { "_id": 3, "name": "Bob", "age": 29 }
    { "_id": 101, "name": "Scott", "age": 10 }
    { "_id": 104, "name": "Jesse", "age": 52 }
    { "_id": 110, "name": "Mike", "age": 32 }
    ...
    First, declare the collection to be accessible in Hive:
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
    TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
• 79. Hive with Mongo-Hadoop: ...then you can run SQL on it, like a table:
    SELECT name, age FROM mongo_users WHERE id > 100;
    You can use GROUP BY:
    SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
    Or JOIN multiple tables/collections together:
    SELECT * FROM mongo_users T1 JOIN user_emails T2 WHERE T1.id = T2.id;
• 82. Write the output of queries back into new tables:
    INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;
    Drop a table in Hive to delete the underlying collection in MongoDB:
    DROP TABLE mongo_users;
• 83. Usage with Amazon Elastic MapReduce: run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
• 84. Usage with Amazon Elastic MapReduce. First, make a "bootstrap" script that fetches the dependencies (the mongo-hadoop jar and the Java driver); it will get executed on each node in the cluster that EMR builds for us:
    #!/bin/sh
    wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
    wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoopcode/mongo-hadoop-core_1.1.2-1.1.0.jar
• 85. Example 4 - Usage with Amazon Elastic MapReduce. Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it:
    s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
    s3mod s3://$S3_BUCKET/bootstrap.sh public-read
    s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
    s3mod s3://$S3_BUCKET/enron-example.jar public-read
• 86. Example 4 - Usage with Amazon Elastic MapReduce. ...then launch the job from the command line, pointing to your S3 locations, and control the type and number of instances in the cluster:
    $ elastic-mapreduce --create --jobflow ENRON000
        --instance-type m1.xlarge
        --num-instances 5
        --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
        --log-uri s3://$S3_BUCKET/enron_logs
        --jar s3://$S3_BUCKET/enron-example.jar
        --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
        --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
        --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
        --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
        # (any additional parameters here)
• 90. Example 4 - Usage with Amazon Elastic MapReduce:
    - Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
    - Turn up the "num-instances" knob to make jobs complete faster
    - Logs get captured into S3 files
    - (Pig, Hive, and streaming work on EMR, too!)
• 91. Example 5 - new feature: MongoUpdateWritable. In the previous examples, we wrote job output data by inserting into a new collection, but we can also modify an existing output collection. It works by applying MongoDB update modifiers ($push, $pull, $addToSet, $inc, $set, etc.), and can be used to do incremental Map/Reduce or to "join" two collections.
• 97. Example 5 - MongoUpdateWritable. Let's say we have two collections, sensors and log events:
    sensors:
    {
      "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
      "name": "730LsRkX",
      "type": "pressure",
      "owner": "steve"
    }
    log events:
    {
      "_id": ObjectId("51b792d381c3e67b0a18d678"),
      "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),
      "value": 3328.5895416489802,
      "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
      "loc": [-175.13, 51.658]
    }
    The sensor_id field refers to which sensor logged the event. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
• 100. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. In plain English:
    Bob's sensors for temperature have stored 1300 readings
    Bob's sensors for pressure have stored 400 readings
    Alice's sensors for humidity have stored 600 readings
    Alice's sensors for temperature have stored 700 readings
    etc...
• 101. Stage 1 - Map/Reduce on the sensors collection. Data is read from the sensors collection in MongoDB; for each sensor, map() emits {key: owner+type, value: _id}; reduce() groups the data from map() under each key and outputs {key: owner+type, val: [list of _ids]}; the new records are insert()ed into a results collection in MongoDB.
• 105. After stage one, the output docs look like this, where _id is the sensor's owner and type, and sensors is the list of _ids of sensors with this owner and type:
    {
      "_id": "alice pressure",
      "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
      ]
    }
    Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
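A minimal sketch of what the stage-one Mapper and Reducer could look like in Java; the class names, the key encoding (owner and type joined with a space), and the exact Writable types are illustrative assumptions rather than code from the talk:

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;
    import org.bson.types.BasicBSONList;
    import org.bson.types.ObjectId;
    import com.mongodb.BasicDBObjectBuilder;
    import com.mongodb.hadoop.io.BSONWritable;

    // Stage 1 (sketch): group sensor _ids under an "owner type" key.
    public class SensorGroupingJob {

        public static class SensorMapper extends Mapper<Object, BSONObject, Text, Text> {
            @Override
            protected void map(Object key, BSONObject val, Context context)
                    throws IOException, InterruptedException {
                String owner = (String) val.get("owner");
                String type  = (String) val.get("type");
                // emit {key: owner+type, value: _id}
                context.write(new Text(owner + " " + type),
                              new Text(val.get("_id").toString()));
            }
        }

        public static class SensorReducer extends Reducer<Text, Text, NullWritable, BSONWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                BasicBSONList ids = new BasicBSONList();
                for (Text id : values) {
                    ids.add(new ObjectId(id.toString()));
                }
                // one result document per owner/type group: {_id: owner+type, sensors: [ids]}
                BSONObject outDoc = BasicDBObjectBuilder.start()
                        .add("_id", key.toString())
                        .add("sensors", ids)
                        .get();
                context.write(NullWritable.get(), new BSONWritable(outDoc));
            }
        }
    }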
• 107. Stage 2 - Map/Reduce on the log events collection. Log events are read from MongoDB; for each log event, map() emits {key: sensor_id, value: 1}; reduce() groups the data from map() under each key and, for each value in that key, issues update({sensors: key}, {$inc: {logs_count: 1}}), so the existing records in the results collection get update()d in place:
    context.write(null,
        new MongoUpdateWritable(
            query,   // which documents to modify
            update,  // how to modify ($inc)
            true,    // upsert
            false)); // multi
• 108. Example - MongoUpdateWritable. Result after stage 2, now populated with the correct count:
    {
      "_id": "1UoTcvnCTz temp",
      "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
      ],
      "logs_count": 1050616
    }
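A hedged sketch of a stage-two Reducer that would produce these updates. The query and update construction is an assumption based on the update({sensors: key}, {$inc: {logs_count: 1}}) call described above, and the MongoUpdateWritable constructor arguments simply follow the order annotated on the previous slide (query, update, upsert, multi); here the per-event increments are summed and applied as a single $inc, which is equivalent.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.types.ObjectId;
    import com.mongodb.BasicDBObject;
    import com.mongodb.hadoop.io.MongoUpdateWritable;

    // Stage 2 (sketch): count log events per sensor and fold the counts into the
    // stage-one result documents with a $inc update.
    public class LogCountReducer
            extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {
        @Override
        protected void reduce(Text sensorId, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // match the owner/type document whose "sensors" array contains this sensor _id...
            BasicDBObject query = new BasicDBObject("sensors", new ObjectId(sensorId.toString()));
            // ...and increment its logs_count by the number of events counted for that sensor
            BasicDBObject update = new BasicDBObject("$inc", new BasicDBObject("logs_count", sum));
            context.write(NullWritable.get(),
                    new MongoUpdateWritable(query, update,
                            true,     // upsert
                            false));  // multi
        }
    }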
• 109. Upcoming Features (v1.2 and beyond): performance improvements (lazy BSON), full-featured Hive support, support for multi-collection input sources, an API for adding custom splitter implementations, and more.
• 110. Recap: Mongo-Hadoop lets you use Hadoop to do massive computations on big data sets stored in Mongo/BSON; MongoDB becomes a Hadoop-enabled filesystem; and tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
• 111. Questions? Examples can be found on GitHub: https://github.com/mongodb/mongo-hadoop/tree/master/examples