Mongo-Hadoop Integration
Mike O’Brien, Software Engineer @ 10gen
We will cover:
The Mongo-Hadoop connector:
•what it is
•how it works
•a tour of what it can do
A quick briefing on what Mongo
and Hadoop are all about
(Q+A at the end)
Choosing the Right Tool for the Task
Upcoming Webinar:
MongoDB and Hadoop - Essential Tools for
Your Big Data Playbook
August 21st, 2013
10am PDT, 1pm EDT, 6pm BST
Register at 10gen.com/events/biz-hadoop
MongoDB: a document-oriented database with a dynamic schema
stores data in JSON-like documents:
{
  _id : "mike",
  age : 21,
  location : {
    state : "NY",
    zip : "11222"
  },
  favorite_colors : ["red", "green"]
}
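A minimal sketch (not part of the original deck) of inserting and reading back a document like this with the MongoDB Java driver 2.x API; the class and collection names are illustrative assumptions:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class InsertExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection people = client.getDB("test").getCollection("people");

        // build the document shown above
        DBObject doc = new BasicDBObject("_id", "mike")
                .append("age", 21)
                .append("location", new BasicDBObject("state", "NY").append("zip", "11222"))
                .append("favorite_colors", new String[]{"red", "green"});
        people.insert(doc);

        // dynamic schema: no table definition needed before inserting
        System.out.println(people.findOne(new BasicDBObject("_id", "mike")));
        client.close();
    }
}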
mongoDB scales horizontally with sharding to handle lots of data and load
Hadoop: a Java-based framework for Map/Reduce
Excels at batch processing on large data sets by taking advantage of parallelism
Mongo-Hadoop Connector - Why
Lots of people are using Hadoop and Mongo separately, but need integration
Custom code or slow, hacky import/export scripts are often used to get data in and out
Scalability and flexibility with changes in Hadoop or MongoDB configurations
Need to process data across multiple sources
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem: use as the input or output for Hadoop
New Feature: As of v1.1, also works with MongoDB backup files (.bson)
[diagram: input data (MongoDB or .BSON) -> Hadoop cluster -> output results (MongoDB or .BSON)]
Mongo-Hadoop Connector
Benefits + Features
Takes advantage of full multi-core parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local filesystem, HDFS, or S3
Mongo-Hadoop Connector
Benefits + Features
Vanilla Java MapReduce
or, if you don't want to use Java, support for Hadoop Streaming:
write MapReduce code in Ruby, Python, etc.
Mongo-Hadoop Connector
Benefits + Features
Support for Pig
high-level scripting language for data analysis and building map/reduce workflows
Support for Hive
SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:
Adapter examines the MongoDB input collection and calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally
Hadoop merges results and streams output back to MongoDB or BSON
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
Input Data: Enron e-mail corpus (501k records, 1.75 GB)
each document is one email

{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n ",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

(the "From" header is the sender; the "To" header lists the recipients)
Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair:

{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
Example 1 - Java MapReduce

Map phase - each input doc gets passed through a Mapper function
(the mongoDB document is passed into Hadoop MapReduce as a BSONObject)

@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
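The MailPair key type referenced above isn't shown in the deck; here is a minimal sketch of what such a custom Hadoop key could look like (an assumption — the class in the connector's examples repository may differ):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// hypothetical sketch of the {from, to} composite key used by the Enron example
public class MailPair implements WritableComparable<MailPair> {
    public String from;
    public String to;

    public MailPair() { }
    public MailPair(String from, String to) { this.from = from; this.to = to; }

    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }
    public int compareTo(MailPair o) {      // ordering groups identical {from, to} pairs in the reduce phase
        int cmp = from.compareTo(o.from);
        return cmp != 0 ? cmp : to.compareTo(o.to);
    }
    public boolean equals(Object o) {
        return o instanceof MailPair
            && from.equals(((MailPair) o).from) && to.equals(((MailPair) o).to);
    }
    public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }
}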
Example 1 - Java MapReduce (cont)

Reduce phase - outputs of Map are grouped together by key and passed to Reducer

public void reduce(final MailPair pKey,                  // the {to, from} key
                   final Iterable<IntWritable> pValues,  // list of all the values collected under the key
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));       // output written back to MongoDB
}
Example 1 - Java MapReduce (cont)

Read from MongoDB:
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON:
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
  (or hdfs:///tmp/messages.bson, s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)

Write output to MongoDB:
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON:
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
  (or hdfs:///tmp/results.bson, s3:///tmp/results.bson)
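Not shown in the deck: a minimal job-driver sketch (an illustrative assumption, not the official example code) that wires the Mapper/Reducer above to the connector's input/output formats, setting the same properties programmatically; class names like EnronJob, EnronMailMapper and EnronMailReducer are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class EnronJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // same properties as on the slides above
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

        Job job = new Job(conf, "enron-mail-graph");
        job.setJarByClass(EnronJob.class);
        job.setMapperClass(EnronMailMapper.class);      // the map() shown earlier (hypothetical class name)
        job.setReducerClass(EnronMailReducer.class);    // the reduce() shown earlier (hypothetical class name)
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}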
Results : Output Data
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..schroeder@enron.com" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming
Let’s do the same Enron Map/Reduce job
with Python instead of Java
$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process via STDOUT/STDIN
[diagram: hadoop (JVM) -> STDIN -> Python / Ruby / JS interpreter running def mapper(documents): ... -> STDOUT -> hadoop (JVM)]
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Surviving Hadoop:
making MapReduce easier
with Pig + Hive
Example 3 - Mongo-Hadoop and Pig
Let's do the same thing yet again, but this time using Pig
Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts
Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)

Pig directives for loading data: BSONLoader and MongoLoader

data = LOAD 'mongodb://localhost:27017/db.collection'
    using com.mongodb.hadoop.pig.MongoLoader;

Writing data out: BSONStorage and MongoInsertStorage

STORE records INTO 'file:///output.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
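(Setup note, not from the deck: the mongo-hadoop core and pig jars, plus the MongoDB Java driver, need to be on Pig's classpath before these LOAD/STORE statements will work — typically via Pig's REGISTER statement.)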
Example 3 - Mongo-Hadoop and Pig (cont)
Pig has its own special datatypes:
Bags, Maps, and Tuples
Mongo-Hadoop Connector intelligently
converts between Pig datatypes and
MongoDB datatypes
Example 3 - Mongo-Hadoop and Pig (cont)

raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ;
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
    from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
    group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop
Similar idea to Pig - process your data
without needing to write Map/Reduce
code from scratch
...but with SQL as the language of choice
Hive with Mongo-Hadoop

first, declare the collection to be accessible in Hive:

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");

Sample Data (db.users):

> db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...
Hive with Mongo-Hadoop

...then you can run SQL on it, like a table:

SELECT name, age FROM mongo_users WHERE id > 100;

you can use GROUP BY:

SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

or JOIN multiple tables/collections together:

SELECT * FROM mongo_users T1
    JOIN user_emails T2 ON (T1.id = T2.id);
Write the output of queries back into new tables:

INSERT OVERWRITE TABLE old_users
    SELECT id, name, age FROM mongo_users WHERE age > 100;

Drop a table in Hive to delete the underlying collection in MongoDB:

DROP TABLE mongo_users;
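(Not covered in the deck: if you only want to remove the Hive metadata and keep the MongoDB collection, the usual Hive approach is to declare the table as EXTERNAL; check the connector documentation for the exact behavior.)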
Usage with Amazon Elastic MapReduce
Run mongo-hadoop jobs without
needing to set up or manage your
own Hadoop cluster.
Usage with Amazon Elastic MapReduce

First, make a "bootstrap" script that fetches dependencies (mongo-hadoop jar and java drivers):

#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar

this will get executed on each node in the cluster that EMR builds for us.
Example 4 - Usage with Amazon Elastic MapReduce

Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it:

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce

...then launch the job from the command line, pointing to your S3 locations.
Control the type and number of instances in the cluster:

$ elastic-mapreduce --create --jobflow ENRON000
    --instance-type m1.xlarge
    --num-instances 5
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
    --log-uri s3://$S3_BUCKET/enron_logs
    --jar s3://$S3_BUCKET/enron-example.jar
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
Turn up the "num-instances" knob to make jobs complete faster
Logs get captured into S3 files
(Pig, Hive, and streaming work on EMR, too!)
Example 5 - new feature: MongoUpdateWritable
In previous examples, we wrote job output data by inserting into a new collection
... but we can also modify an existing output collection
Works by applying mongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc.
Can be used to do incremental Map/Reduce or "join" two collections
Example 5 - MongoUpdateWritable

Let's say we have two collections.

sensors:
{
  "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
  "name": "730LsRkX",
  "type": "pressure",
  "owner": "steve",
}

log events:
{
  "_id": ObjectId("51b792d381c3e67b0a18d678"),
  "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),
  "value": 3328.5895416489802,
  "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
  "loc": [-175.13,51.658]
}

(sensor_id refers to which sensor logged the event)

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
Plain english:
Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on sensors collection

[diagram: sensors (mongoDB collection) -> map/reduce -> Results (mongoDB collection); the log events collection is not used in this stage]

read from mongoDB
for each sensor, emit: {key: owner+type, value: _id}
group data from map() under each key, output: {key: owner+type, val: [ list of _ids ]}
insert() new records to mongoDB
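A rough Java sketch (an assumption, not code from the deck) of what the stage 1 map() could look like, emitting owner+type as the key and the sensor's _id as the value:

// hypothetical stage 1 mapper over the sensors collection
public void map(final NullWritable key, final BSONObject sensor, final Context context)
        throws IOException, InterruptedException {
    String ownerAndType = sensor.get("owner") + " " + sensor.get("type");   // e.g. "alice pressure"
    context.write(new Text(ownerAndType), new Text(sensor.get("_id").toString()));
}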
After stage one, the output docs look like:

{
  "_id": "alice pressure",
  "sensors": [
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    …
  ]
}

(the _id is the sensor's owner and type; "sensors" is the list of IDs of sensors with this owner and type)

Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on log events collection

[diagram: log events (mongoDB collection) -> map/reduce -> update() existing records in the Results (mongoDB collection)]

read from mongoDB
for each log event, emit: {key: sensor_id, value: 1}
group data from map() under each key
for each value in that key:
  update({sensors: key}, {$inc : {logs_count:1}})

context.write(null,
    new MongoUpdateWritable(
        query,    // which documents to modify
        update,   // how to modify ($inc)
        true,     // upsert
        false));  // multi
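A rough sketch (an assumption, not from the deck) of the stage 2 reduce() that builds the query and update documents wrapped by MongoUpdateWritable; it assumes the mapper emitted the sensor's _id as a Text key with one IntWritable(1) per log event:

// hypothetical stage 2 reducer
public void reduce(final Text pSensorId, final Iterable<IntWritable> pValues, final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable v : pValues) { sum += v.get(); }   // number of log events for this sensor
    DBObject query  = new BasicDBObject("sensors", new ObjectId(pSensorId.toString()));  // results docs whose "sensors" array contains this sensor
    DBObject update = new BasicDBObject("$inc", new BasicDBObject("logs_count", sum));   // same effect as $inc of 1 per event
    pContext.write(null, new MongoUpdateWritable(query, update, true, false));           // upsert=true, multi=false
}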
Example - MongoUpdateWritable
Result after stage 2:

{
  "_id": "1UoTcvnCTz temp",
  "sensors": [
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    …
  ],
  "logs_count": 1050616
}

(logs_count is now populated with the correct count)
Upcoming Features (v1.2 and beyond)
Full-featured Hive support
Performance Improvements - Lazy BSON
Support for multi-collection input sources
API for adding custom splitter implementations
and more
Recap
Mongo-Hadoop - use Hadoop to do massive computations
on big data sets stored in Mongo/BSON
Tools and APIs make it easier:
Streaming, Pig, Hive, EMR, etc.
MongoDB becomes a Hadoop-enabled filesystem
Questions?
Examples can be found on github:
https://github.com/mongodb/mongo-hadoop/tree/master/examples

  • 20. Mongo-Hadoop Connector Benefits + Features Thursday, August 8, 13
  • 21. Mongo-Hadoop Connector Vanilla Java MapReduce Benefits + Features Thursday, August 8, 13
  • 22. Mongo-Hadoop Connector Vanilla Java MapReduce or if you don’t want to use Java, support for Hadoop Streaming. Benefits + Features Thursday, August 8, 13
  • 23. Mongo-Hadoop Connector Vanilla Java MapReduce write MapReduce code in ruby or if you don’t want to use Java, support for Hadoop Streaming. Benefits + Features Thursday, August 8, 13
  • 24. Mongo-Hadoop Connector Vanilla Java MapReduce write MapReduce code in ruby or if you don’t want to use Java, support for Hadoop Streaming. Benefits + Features Thursday, August 8, 13
  • 25. Mongo-Hadoop Connector Vanilla Java MapReduce write MapReduce code in ruby python or if you don’t want to use Java, support for Hadoop Streaming. Benefits + Features Thursday, August 8, 13
  • 26. Mongo-Hadoop Connector Benefits + Features Thursday, August 8, 13
  • 27. Mongo-Hadoop Connector Support for Pig high-level scripting language for data analysis and building map/reduce workflows Benefits + Features Thursday, August 8, 13
  • 28. Mongo-Hadoop Connector Support for Pig high-level scripting language for data analysis and building map/reduce workflows Support for Hive SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems Benefits + Features Thursday, August 8, 13
  • 29. Mongo-Hadoop Connector How it works: Thursday, August 8, 13
  • 30. Mongo-Hadoop Connector How it works: Adapter examines the MongoDB input collection and calculates a set of splits from the data Thursday, August 8, 13
  • 31. Mongo-Hadoop Connector How it works: Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster Thursday, August 8, 13
  • 32. Mongo-Hadoop Connector How it works: Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally Thursday, August 8, 13
  • 33. Mongo-Hadoop Connector How it works: Adapter examines the MongoDB input collection and calculates a set of splits from the data Each split gets assigned to a node in Hadoop cluster In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally Hadoop merges results and streams output back to MongoDB or BSON Thursday, August 8, 13
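To make the split step concrete, here is a minimal illustrative sketch, assuming the 2.x-era MongoDB Java driver, of roughly what a single input split boils down to: a bounded query over the split key that one Hadoop node can stream independently. This is not the connector's actual splitter code; the database, collection, and min/max bounds below are placeholders.

    // Illustrative only: a single "split" is conceptually a bounded range query.
    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class SplitSketch {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DBCollection messages = client.getDB("enron").getCollection("messages");

            // One split covers the documents whose split key falls in [min, max);
            // the connector computes real bounds, args[0]/args[1] are placeholders.
            DBObject range = new BasicDBObject("$gte", args[0]).append("$lt", args[1]);
            DBCursor cursor = messages.find(new BasicDBObject("_id", range));

            while (cursor.hasNext()) {
                DBObject doc = cursor.next(); // what a Mapper later sees as a BSONObject
                // ... process doc locally on this Hadoop node ...
            }
            client.close();
        }
    }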
  • 34. Tour of Mongo-Hadoop, by Example Thursday, August 8, 13
  • 35. Tour of Mongo-Hadoop, by Example - Using Java MapReduce with Mongo-Hadoop Thursday, August 8, 13
  • 36. Tour of Mongo-Hadoop, by Example - Using Java MapReduce with Mongo-Hadoop - Using Hadoop Streaming Thursday, August 8, 13
  • 37. Tour of Mongo-Hadoop, by Example - Using Java MapReduce with Mongo-Hadoop - Using Hadoop Streaming - Pig and Hive with Mongo-Hadoop Thursday, August 8, 13
  • 38. Tour of Mongo-Hadoop, by Example - Using Java MapReduce with Mongo-Hadoop - Using Hadoop Streaming - Pig and Hive with Mongo-Hadoop - Elastic MapReduce + BSON Thursday, August 8, 13
  • 39. { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecastnn ", "filename" : "1.", "headers" : { "From" : "phillip.allen@enron.com", "Subject" : "Forecast Info", "X-bcc" : "", "To" : "tim.belden@enron.com", "X-Origin" : "Allen-P", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } } Input Data: Enron e-mail corpus (501k records, 1.75Gb) each document is one email Thursday, August 8, 13
  • 40. { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecastnn ", "filename" : "1.", "headers" : { "From" : "phillip.allen@enron.com", "Subject" : "Forecast Info", "X-bcc" : "", "To" : "tim.belden@enron.com", "X-Origin" : "Allen-P", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } } Input Data: Enron e-mail corpus (501k records, 1.75Gb) each document is one email sender Thursday, August 8, 13
  • 41. { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecastnn ", "filename" : "1.", "headers" : { "From" : "phillip.allen@enron.com", "Subject" : "Forecast Info", "X-bcc" : "", "To" : "tim.belden@enron.com", "X-Origin" : "Allen-P", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } } Input Data: Enron e-mail corpus (501k records, 1.75Gb) each document is one email sender recipients Thursday, August 8, 13
  • 43. Let’s use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair Thursday, August 8, 13
  • 44. Let’s use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair bob alice eve charlie 14 99 9 48 20 Thursday, August 8, 13
  • 45. {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14} {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9} {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99} {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48} {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20} Let’s use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair bob alice eve charlie 14 99 9 48 20 Thursday, August 8, 13
  • 46. Example 1 - Java MapReduce Map phase - each input doc gets passed through a Mapper function @Override public  void  map(NullWritable  key,  BSONObject  val,  final  Context  context){        BSONObject  headers  =  (BSONObject)val.get("headers");        if(headers.containsKey("From")  &&  headers.containsKey("To")){                String  from  =  (String)headers.get("From");                String  to  =  (String)headers.get("To");                String[]  recips  =  to.split(",");                for(int  i=0;i<recips.length;i++){                        String  recip  =  recips[i].trim();                        context.write(new  MailPair(from,  recip),  new  IntWritable(1));                }        } } Thursday, August 8, 13
  • 47. Example 1 - Java MapReduce mongoDB document passed into Hadoop MapReduce Map phase - each input doc gets passed through a Mapper function @Override public  void  map(NullWritable  key,  BSONObject  val,  final  Context  context){        BSONObject  headers  =  (BSONObject)val.get("headers");        if(headers.containsKey("From")  &&  headers.containsKey("To")){                String  from  =  (String)headers.get("From");                String  to  =  (String)headers.get("To");                String[]  recips  =  to.split(",");                for(int  i=0;i<recips.length;i++){                        String  recip  =  recips[i].trim();                        context.write(new  MailPair(from,  recip),  new  IntWritable(1));                }        } } Thursday, August 8, 13
  • 48. Example 1 - Java MapReduce (cont) Reduce phase - outputs of Map are grouped together by key and passed to Reducer        public  void  reduce(  final  MailPair  pKey,                                                final  Iterable<IntWritable>  pValues,                                                final  Context  pContext  ){                int  sum  =  0;                for  (  final  IntWritable  value  :  pValues  ){                        sum  +=  value.get();                }                BSONObject  outDoc  =  new  BasicDBObjectBuilder().start()                                                        .add(  "f"  ,  pKey.from) .add(  "t"  ,  pKey.to  ) .get();                BSONWritable  pkeyOut  =  new  BSONWritable(outDoc);                pContext.write(  pkeyOut,  new  IntWritable(sum)  );        } Thursday, August 8, 13
  • 49. Example 1 - Java MapReduce (cont) Reduce phase - outputs of Map are grouped together by key and passed to Reducer the {to, from} key        public  void  reduce(  final  MailPair  pKey,                                                final  Iterable<IntWritable>  pValues,                                                final  Context  pContext  ){                int  sum  =  0;                for  (  final  IntWritable  value  :  pValues  ){                        sum  +=  value.get();                }                BSONObject  outDoc  =  new  BasicDBObjectBuilder().start()                                                        .add(  "f"  ,  pKey.from) .add(  "t"  ,  pKey.to  ) .get();                BSONWritable  pkeyOut  =  new  BSONWritable(outDoc);                pContext.write(  pkeyOut,  new  IntWritable(sum)  );        } Thursday, August 8, 13
  • 50. Example 1 - Java MapReduce (cont) Reduce phase - outputs of Map are grouped together by key and passed to Reducer the {to, from} key list of all the values collected under the key        public  void  reduce(  final  MailPair  pKey,                                                final  Iterable<IntWritable>  pValues,                                                final  Context  pContext  ){                int  sum  =  0;                for  (  final  IntWritable  value  :  pValues  ){                        sum  +=  value.get();                }                BSONObject  outDoc  =  new  BasicDBObjectBuilder().start()                                                        .add(  "f"  ,  pKey.from) .add(  "t"  ,  pKey.to  ) .get();                BSONWritable  pkeyOut  =  new  BSONWritable(outDoc);                pContext.write(  pkeyOut,  new  IntWritable(sum)  );        } Thursday, August 8, 13
  • 51. output written back to MongoDB Example 1 - Java MapReduce (cont) Reduce phase - outputs of Map are grouped together by key and passed to Reducer the {to, from} key list of all the values collected under the key        public  void  reduce(  final  MailPair  pKey,                                                final  Iterable<IntWritable>  pValues,                                                final  Context  pContext  ){                int  sum  =  0;                for  (  final  IntWritable  value  :  pValues  ){                        sum  +=  value.get();                }                BSONObject  outDoc  =  new  BasicDBObjectBuilder().start()                                                        .add(  "f"  ,  pKey.from) .add(  "t"  ,  pKey.to  ) .get();                BSONWritable  pkeyOut  =  new  BSONWritable(outDoc);                pContext.write(  pkeyOut,  new  IntWritable(sum)  );        } Thursday, August 8, 13
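The map() and reduce() functions above emit a custom MailPair key that the slides never show. Below is a minimal sketch of what such a composite key might look like as a Hadoop WritableComparable; the from/to field names match the slide code, but this is a hypothetical reconstruction, not necessarily the example project's exact class.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical reconstruction of the composite {from, to} key used above.
    public class MailPair implements WritableComparable<MailPair> {
        String from;
        String to;

        public MailPair() { }                      // no-arg constructor for Hadoop serialization

        public MailPair(String from, String to) {
            this.from = from;
            this.to = to;
        }

        public void readFields(DataInput in) throws IOException {
            from = in.readUTF();
            to = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(from);
            out.writeUTF(to);
        }

        public int compareTo(MailPair o) {         // ordering used to group keys in the shuffle
            int c = from.compareTo(o.from);
            return c != 0 ? c : to.compareTo(o.to);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MailPair && compareTo((MailPair) o) == 0;
        }

        @Override
        public int hashCode() {
            return from.hashCode() * 31 + to.hashCode();
        }
    }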
  • 52. Example 1 - Java MapReduce (cont) mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat mongo.input.uri=mongodb://my-db:27017/enron.messages Read from MongoDB Thursday, August 8, 13
  • 53. Example 1 - Java MapReduce (cont) mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat mongo.input.uri=mongodb://my-db:27017/enron.messages Read from MongoDB Read from BSON mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat mapred.input.dir=file:///tmp/messages.bson Thursday, August 8, 13
  • 54. Example 1 - Java MapReduce (cont) mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat mongo.input.uri=mongodb://my-db:27017/enron.messages Read from MongoDB Read from BSON mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat mapred.input.dir=file:///tmp/messages.bson hdfs:///tmp/messages.bson s3:///tmp/messages.bson Thursday, August 8, 13
  • 55. Example 1 - Java MapReduce (cont) mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat mongo.output.uri=mongodb://my-db:27017/enron.results_out Write output to MongoDB Thursday, August 8, 13
  • 56. Example 1 - Java MapReduce (cont) mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat mongo.output.uri=mongodb://my-db:27017/enron.results_out Write output to MongoDB Write output to BSON mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat mapred.output.dir=file:///tmp/results.bson Thursday, August 8, 13
  • 57. Example 1 - Java MapReduce (cont) mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat mongo.output.uri=mongodb://my-db:27017/enron.results_out Write output to MongoDB Write output to BSON mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat mapred.output.dir=file:///tmp/results.bson hdfs:///tmp/results.bson s3:///tmp/results.bson Thursday, August 8, 13
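These properties can be supplied with -D flags or an XML config, or set programmatically in a job driver. A minimal driver sketch follows, assuming the 2013-era org.apache.hadoop.mapreduce Job API; EnronJob, EnronMapper, and EnronReducer are placeholder names for the job and the map/reduce classes shown on the earlier slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;

    public class EnronJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Same URIs as the properties above: read the live collection,
            // write the aggregated counts back out to MongoDB.
            conf.set("mongo.input.uri",  "mongodb://my-db:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

            Job job = new Job(conf, "enron sender-recipient counts");
            job.setJarByClass(EnronJob.class);

            job.setMapperClass(EnronMapper.class);   // the map() from the earlier slide
            job.setReducerClass(EnronReducer.class); // the reduce() from the earlier slide

            job.setMapOutputKeyClass(MailPair.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(BSONWritable.class);
            job.setOutputValueClass(IntWritable.class);

            // Equivalent to the mongo.job.*.format properties above:
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }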
  • 58. Results : Output Data mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/}) { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 } ... has more Thursday, August 8, 13
  • 59. Example 2 - Hadoop Streaming Let’s do the same Enron Map/Reduce job with Python instead of Java $ pip install pymongo_hadoop Thursday, August 8, 13
  • 60. Example 2 - Hadoop Streaming (cont) Hadoop passes data to an external process via STDOUT/STDIN map(k, v) map(k, v) map(k, v)map() JVM STDIN Python / Ruby / JS interpreter STDOUT hadoop (JVM) def mapper(documents): . . . Thursday, August 8, 13
  • 61. from pymongo_hadoop import BSONMapper def mapper(documents): i = 0 for doc in documents: i = i + 1 from_field = doc['headers']['From'] to_field = doc['headers']['To'] recips = [x.strip() for x in to_field.split(',')] for r in recips: yield {'_id': {'f':from_field, 't':r}, 'count': 1} BSONMapper(mapper) print >> sys.stderr, "Done Mapping." Example 2 - Hadoop Streaming (cont) Thursday, August 8, 13
  • 62. from pymongo_hadoop import BSONReducer def reducer(key, values): print >> sys.stderr, "Processing from/to %s" % str(key) _count = 0 for v in values: _count += v['count'] return {'_id': key, 'count': _count} BSONReducer(reducer) Example 2 - Hadoop Streaming (cont) Thursday, August 8, 13
  • 63. Surviving Hadoop: making MapReduce easier with Pig + Hive Thursday, August 8, 13
  • 64. Example 3 - Mongo-Hadoop and Pig Let’s do the same thing yet again, but this time using Pig Thursday, August 8, 13
  • 65. Example 3 - Mongo-Hadoop and Pig Let’s do the same thing yet again, but this time using Pig Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts Thursday, August 8, 13
  • 66. Example 3 - Mongo-Hadoop and Pig Let’s do the same thing yet again, but this time using Pig Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts Can perform JOIN, GROUP, and execute user-defined functions (UDFs) Thursday, August 8, 13
  • 67. Example 3 - Mongo-Hadoop and Pig (cont) Pig directives for loading data: BSONLoader and MongoLoader data = LOAD 'mongodb://localhost:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader; STORE records INTO 'file:///output.bson' using com.mongodb.hadoop.pig.BSONStorage; Writing data out BSONStorage and MongoInsertStorage Thursday, August 8, 13
  • 68. Example 3 - Mongo-Hadoop and Pig (cont) Pig has its own special datatypes: Bags, Maps, and Tuples Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes Thursday, August 8, 13
  • 69. Example 3 - Mongo-Hadoop and Pig (cont) raw = LOAD 'hdfs:///messages.bson' using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ; Thursday, August 8, 13
  • 70. Example 3 - Mongo-Hadoop and Pig (cont) raw = LOAD 'hdfs:///messages.bson' using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ; send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to; Thursday, August 8, 13
  • 71. Example 3 - Mongo-Hadoop and Pig (cont) raw = LOAD 'hdfs:///messages.bson' using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ; send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to; send_recip_filtered = FILTER send_recip BY to IS NOT NULL; send_recip_split = FOREACH send_recip_filtered GENERATE from as from, TRIM(FLATTEN(TOKENIZE(to))) as to; Thursday, August 8, 13
  • 72. Example 3 - Mongo-Hadoop and Pig (cont) raw = LOAD 'hdfs:///messages.bson' using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ; send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to; send_recip_filtered = FILTER send_recip BY to IS NOT NULL; send_recip_split = FOREACH send_recip_filtered GENERATE from as from, TRIM(FLATTEN(TOKENIZE(to))) as to; send_recip_grouped = GROUP send_recip_split BY (from, to); send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count; Thursday, August 8, 13
  • 73. Example 3 - Mongo-Hadoop and Pig (cont) raw = LOAD 'hdfs:///messages.bson' using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ; send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to; send_recip_filtered = FILTER send_recip BY to IS NOT NULL; send_recip_split = FOREACH send_recip_filtered GENERATE from as from, TRIM(FLATTEN(TOKENIZE(to))) as to; send_recip_grouped = GROUP send_recip_split BY (from, to); send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count; STORE send_recip_counted INTO 'file:///enron_results.bson' using com.mongodb.hadoop.pig.BSONStorage; Thursday, August 8, 13
  • 74. Hive with Mongo-Hadoop Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch ...but with SQL as the language of choice Thursday, August 8, 13
  • 75. Hive with Mongo-Hadoop CREATE TABLE mongo_users (id int, name string, age int) STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" ) TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users"); first, declare the collection to be accessible in Hive: Sample Data: db.users db.users.find() { "_id": 1, "name": "Tom", "age": 28 } { "_id": 2, "name": "Alice", "age": 18 } { "_id": 3, "name": "Bob", "age": 29 } { "_id": 101, "name": "Scott", "age": 10 } { "_id": 104, "name": "Jesse", "age": 52 } { "_id": 110, "name": "Mike", "age": 32 } ... Thursday, August 8, 13
  • 77. Hive with Mongo-Hadoop ...then you can run SQL on it, like a table. SELECT name,age FROM mongo_users WHERE id > 100 ; Thursday, August 8, 13
  • 78. Hive with Mongo-Hadoop ...then you can run SQL on it, like a table. SELECT name,age FROM mongo_users WHERE id > 100 ; SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age ; you can use GROUP BY: Thursday, August 8, 13
  • 79. Hive with Mongo-Hadoop ...then you can run SQL on it, like a table. SELECT name,age FROM mongo_users WHERE id > 100 ; SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age ; you can use GROUP BY: or JOIN multiple tables/collections together: SELECT * FROM mongo_users T1 JOIN user_emails T2 ON (T1.id = T2.id); Thursday, August 8, 13
  • 80. Write the output of queries back into new tables: INSERT OVERWRITE TABLE old_users SELECT id,name,age FROM mongo_users WHERE age > 100 ; Thursday, August 8, 13
  • 81. Write the output of queries back into new tables: INSERT OVERWRITE TABLE old_users SELECT id,name,age FROM mongo_users WHERE age > 100 ; DROP TABLE mongo_users; Thursday, August 8, 13
  • 82. Write the output of queries back into new tables: INSERT OVERWRITE TABLE old_users SELECT id,name,age FROM mongo_users WHERE age > 100 ; DROP TABLE mongo_users; Drop a table in Hive to delete the underlying collection in MongoDB Thursday, August 8, 13
  • 83. Usage with Amazon Elastic MapReduce Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster. Thursday, August 8, 13
  • 84. Usage with Amazon Elastic MapReduce First, make a “bootstrap” script that fetches dependencies (mongo-hadoop jar and java drivers) #!/bin/sh wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar this will get executed on each node in the cluster that EMR builds for us. Thursday, August 8, 13
  • 85. Example 4 - Usage with Amazon Elastic MapReduce Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it. s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh s3mod s3://$S3_BUCKET/bootstrap.sh public-read s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/ enron-example.jar s3mod s3://$S3_BUCKET/enron-example.jar public-read Thursday, August 8, 13
  • 86. $ elastic-mapreduce --create --jobflow ENRON000 --instance-type m1.xlarge --num-instances 5 --bootstrap-action s3://$S3_BUCKET/bootstrap.sh --log-uri s3://$S3_BUCKET/enron_logs --jar s3://$S3_BUCKET/enron-example.jar --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat # (any additional parameters here) Example 4 - Usage with Amazon Elastic MapReduce ...then launch the job from the command line, pointing to your S3 locations Control the type and number of instances in the cluster Thursday, August 8, 13
  • 87. Example 4 - Usage with Amazon Elastic MapReduce Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster Thursday, August 8, 13
  • 88. Example 4 - Usage with Amazon Elastic MapReduce Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster Turn up the “num-instances” knob to make jobs complete faster Thursday, August 8, 13
  • 89. Example 4 - Usage with Amazon Elastic MapReduce Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster Turn up the “num-instances” knob to make jobs complete faster Logs get captured into S3 files Thursday, August 8, 13
  • 90. Example 4 - Usage with Amazon Elastic MapReduce Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster Turn up the “num-instances” knob to make jobs complete faster (Pig, Hive, and streaming work on EMR, too!) Logs get captured into S3 files Thursday, August 8, 13
  • 91. Example 5 - new feature: MongoUpdateWritable ... but we can also modify an existing output collection Works by applying mongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc. Can be used to do incremental Map/Reduce or “join” two collections In previous examples, we wrote job output data by inserting into a new collection Thursday, August 8, 13
  • 92. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors Thursday, August 8, 13
  • 93. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d678"),    "sensor_id":  ObjectId("51b792d381c3e67b0a18d4a1"),    "value":  3328.5895416489802,    "timestamp":  ISODate("2013-­‐05-­‐18T13:11:38.709-­‐0400"),    "loc":  [-­‐175.13,51.658] } {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors Thursday, August 8, 13
  • 94. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d678"),    "sensor_id":  ObjectId("51b792d381c3e67b0a18d4a1"),    "value":  3328.5895416489802,    "timestamp":  ISODate("2013-­‐05-­‐18T13:11:38.709-­‐0400"),    "loc":  [-­‐175.13,51.658] } {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors log events Thursday, August 8, 13
  • 95. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d678"),    "sensor_id":  ObjectId("51b792d381c3e67b0a18d4a1"),    "value":  3328.5895416489802,    "timestamp":  ISODate("2013-­‐05-­‐18T13:11:38.709-­‐0400"),    "loc":  [-­‐175.13,51.658] } {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors log events refers to which sensor logged the event Thursday, August 8, 13
  • 96. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d678"),    "sensor_id":  ObjectId("51b792d381c3e67b0a18d4a1"),    "value":  3328.5895416489802,    "timestamp":  ISODate("2013-­‐05-­‐18T13:11:38.709-­‐0400"),    "loc":  [-­‐175.13,51.658] } {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors log events refers to which sensor logged the event Thursday, August 8, 13
  • 97. Example 5 - MongoUpdateWritable Let’s say we have two collections. {    "_id":  ObjectId("51b792d381c3e67b0a18d678"),    "sensor_id":  ObjectId("51b792d381c3e67b0a18d4a1"),    "value":  3328.5895416489802,    "timestamp":  ISODate("2013-­‐05-­‐18T13:11:38.709-­‐0400"),    "loc":  [-­‐175.13,51.658] } {    "_id":  ObjectId("51b792d381c3e67b0a18d0ed"),    "name":  "730LsRkX",    "type":  "pressure",    "owner":  "steve", } sensors log events refers to which sensor logged the event For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. Thursday, August 8, 13
  • 99. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. Thursday, August 8, 13
  • 100. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. Plain english: Bob’s sensors for temperature have stored 1300 readings Bob’s sensors for pressure have stored 400 readings Alice’s sensors for humidity have stored 600 readings Alice’s sensors for temperature have stored 700 readings etc... Thursday, August 8, 13
  • 101. sensors (mongoDB collection) Stage 1 -Map/Reduce on sensors collection Results (mongoDB collection) for each sensor, emit: {key: owner+type, value: _id} group data from map() under each key, output: {key: owner+type, val: [ list of _ids] } read from mongoDB insert() new records to mongoDB map/reduce log events (mongoDB collection) Thursday, August 8, 13
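A hedged Java sketch of what the stage-1 Mapper and Reducer described above might look like; SensorGroupJob and the inner class names are placeholders, and sensor _ids are kept as plain strings for simplicity.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;
    import org.bson.BasicBSONObject;
    import com.mongodb.hadoop.io.BSONWritable;

    // Stage 1 (illustrative): sensors collection -> {_id: "owner type", sensors: [ids]}
    public class SensorGroupJob {

        public static class SensorMapper extends Mapper<Object, BSONObject, Text, Text> {
            @Override
            protected void map(Object key, BSONObject val, Context ctx)
                    throws IOException, InterruptedException {
                String owner = (String) val.get("owner");
                String type  = (String) val.get("type");
                // emit {key: owner+type, value: _id}, as described on the slide
                ctx.write(new Text(owner + " " + type), new Text(val.get("_id").toString()));
            }
        }

        public static class SensorReducer extends Reducer<Text, Text, Text, BSONWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                List<String> ids = new ArrayList<String>();
                for (Text v : values) {
                    ids.add(v.toString());           // collect the sensor _ids under this key
                }
                // written back as {_id: "owner type", sensors: [ ... ]}
                ctx.write(key, new BSONWritable(new BasicBSONObject("sensors", ids)));
            }
        }
    }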
  • 102. After stage one, the output docs look like: Thursday, August 8, 13
  • 103. the sensor’s owner and type After stage one, the output docs look like: Thursday, August 8, 13
  • 104. the sensor’s owner and type After stage one, the output docs look like: list of ID’s of sensors with this owner and type {    "_id":  "alice  pressure",    "sensors":  [        ObjectId("51b792d381c3e67b0a18d475"),        ObjectId("51b792d381c3e67b0a18d16d"),        ObjectId("51b792d381c3e67b0a18d2bf"),        …    ] } Thursday, August 8, 13
  • 105. the sensor’s owner and type After stage one, the output docs look like: list of ID’s of sensors with this owner and type {    "_id":  "alice  pressure",    "sensors":  [        ObjectId("51b792d381c3e67b0a18d475"),        ObjectId("51b792d381c3e67b0a18d16d"),        ObjectId("51b792d381c3e67b0a18d2bf"),        …    ] } Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group. Thursday, August 8, 13
  • 106. sensors (mongoDB collection) Stage 2 -Map/Reduce on log events collection Results (mongoDB collection) read from mongoDB update() existing records in mongoDB map/reduce log events (mongoDB collection) for each sensor, emit: {key: sensor_id, value: 1} group data from map() under each key for each value in that key: update({sensors: key}, {$inc : {logs_count:1}}) Thursday, August 8, 13
  • 107. sensors (mongoDB collection) Stage 2 -Map/Reduce on log events collection Results (mongoDB collection) read from mongoDB update() existing records in mongoDB map/reduce log events (mongoDB collection) for each sensor, emit: {key: sensor_id, value: 1} group data from map() under each key for each value in that key: update({sensors: key}, {$inc : {logs_count:1}}) context.write(null,   new  MongoUpdateWritable(      query,  //which  documents  to  modify        update,  //how  to  modify  ($inc)      true,        //upsert      false) );  //  multi Thursday, August 8, 13
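Filling in around that context.write() call, here is a hedged sketch of the stage-2 Reducer. The query and update documents follow the slide; LogCountReducer is a placeholder name, and summing the 1s before issuing a single $inc is an equivalent simplification of the per-value update the slide describes.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.types.ObjectId;
    import com.mongodb.BasicDBObject;
    import com.mongodb.hadoop.io.MongoUpdateWritable;

    // Stage 2 (illustrative): key = sensor_id, values = one 1 per log event for that sensor.
    public class LogCountReducer
            extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {

        @Override
        protected void reduce(Text sensorId, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();                  // total log events for this sensor
            }

            // update({sensors: <sensor_id>}, {$inc: {logs_count: <count>}})
            BasicDBObject query  = new BasicDBObject("sensors", new ObjectId(sensorId.toString()));
            BasicDBObject update = new BasicDBObject("$inc", new BasicDBObject("logs_count", count));

            ctx.write(NullWritable.get(),
                      new MongoUpdateWritable(query, update,
                                              true,    // upsert
                                              false)); // multi
        }
    }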
  • 108. Example - MongoUpdateWritable Result after stage 2 {    "_id":  "1UoTcvnCTz  temp",    "sensors":  [        ObjectId("51b792d381c3e67b0a18d475"),        ObjectId("51b792d381c3e67b0a18d16d"),        ObjectId("51b792d381c3e67b0a18d2bf"),        …    ],    "logs_count":  1050616 } now populated with correct count Thursday, August 8, 13
  • 109. Upcoming Features (v1.2 and beyond) Full-featured Hive support Performance Improvements - Lazy BSON Support for multi-collection input sources API for adding custom splitter implementations and more Thursday, August 8, 13
  • 110. Recap Mongo-Hadoop - use Hadoop to do massive computations on big data sets stored in Mongo/BSON Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc. MongoDB becomes a Hadoop-enabled filesystem Thursday, August 8, 13