Mongo-Hadoop Integration
Mike O’Brien, Software Engineer @ 10gen

We will cover:
A quick briefing on what Mongo
and Hadoop are all about

The Mongo-Hadoop connector:
•what it is
•how it works
•a tour of what it can do
(Q+A at the end)
Choosing the Right Tool for the Task
Upcoming Webinar:
MongoDB and Hadoop - Essential Tools for
Your Big Data Playbook
August 21st, 2013
10am PDT, 1pm EDT, 6pm BST
Register at 10gen.com/events/biz-hadoop

document-oriented database with dynamic schema

stores data in JSON-like documents:

{
  _id : "mike",
  age : 21,
  location : {
    state : "NY",
    zip : "11222"
  },
  favorite_colors : ["red", "green"]
}
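
To make that concrete, here is a minimal sketch (not from the slides) of inserting and querying a document like the one above with pymongo; the server address and the database/collection names are placeholder assumptions.

# Minimal pymongo sketch (assumes a MongoDB instance on localhost:27017;
# database/collection names are placeholders, not from the slides).
from pymongo import MongoClient

people = MongoClient("localhost", 27017).test.people

# dynamic schema: no table/schema declaration needed before inserting
people.insert_one({
    "_id": "mike",
    "age": 21,
    "location": {"state": "NY", "zip": "11222"},
    "favorite_colors": ["red", "green"],
})

# query on a nested field with dot notation
print(people.find_one({"location.state": "NY"}))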
mongoDB scales horizontally with sharding to handle lots of data and load

(diagram: an app connected to a growing sharded mongoDB cluster)
Java-based framework for Map/Reduce
Excels at batch processing on large data sets
by taking advantage of parallelism

Mongo-Hadoop Connector - Why
Lots of people using Hadoop and Mongo
separately, but need integration
Need to process data across multiple sources
Custom code or slow, hacky import/export scripts are often used to get data in and out
Need scalability and flexibility as Hadoop or MongoDB configurations change
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem:
use as the input or output for Hadoop

New Feature: As of v1.1, also works with MongoDB
backup files (.bson)

(diagram: input data — from MongoDB or .BSON files — flows into a Hadoop cluster, and output results flow back out to MongoDB or .BSON files)
Mongo-Hadoop Connector
Benefits + Features

Takes advantage of full multi-core parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local filesystem, HDFS, or S3
Mongo-Hadoop Connector
Benefits + Features

Vanilla Java MapReduce
or, if you don't want to use Java, support for Hadoop Streaming:
write MapReduce code in Ruby or Python
Mongo-Hadoop Connector
Benefits + Features

Support for Pig
high-level scripting language for data analysis and building map/reduce workflows

Support for Hive
SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:

The adapter examines the MongoDB input collection and calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally
Hadoop merges the results and streams the output back to MongoDB or BSON
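
As a rough mental model of the split step (a conceptual sketch only, not the connector's actual splitter logic), imagine carving an _id range into fixed-size chunks that independent workers can scan in parallel:

# Conceptual sketch only -- not the connector's real split calculation.
def calculate_splits(min_id, max_id, split_size):
    """Yield (lower, upper) ranges covering [min_id, max_id)."""
    lower = min_id
    while lower < max_id:
        upper = min(lower + split_size, max_id)
        # a worker assigned this split would query documents with
        # {"_id": {"$gte": lower, "$lt": upper}}
        yield (lower, upper)
        lower = upper

print(list(calculate_splits(0, 1000, 250)))
# -> [(0, 250), (250, 500), (500, 750), (750, 1000)]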
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
Input Data: Enron e-mail corpus (501k records, 1.75 GB)

Each document is one email:
{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",           <- sender
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",                <- recipients
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair

(diagram: alice, bob, charlie, and eve as nodes, with edge weights 14, 9, 48, 99, and 20 for the message counts between pairs)

{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
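
Before reaching for Hadoop, the target output is easy to sanity-check in plain Python on a couple of hand-made documents (the sample docs below are hypothetical, shaped like the Enron records above):

# Plain-Python sketch of the target computation (hypothetical sample docs).
from collections import Counter

docs = [
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com, eve@enron.com"}},
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com"}},
]

counts = Counter()
for doc in docs:
    sender = doc["headers"]["From"]
    for recip in (r.strip() for r in doc["headers"]["To"].split(",")):
        counts[(sender, recip)] += 1

for (f, t), n in counts.items():
    print({"_id": {"t": t, "f": f}, "count": n})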
Example 1 - Java MapReduce
Map phase - each input doc gets passed through a Mapper function

@Override
public void map(NullWritable key, BSONObject val, final Context context){
    // "val" is the mongoDB document passed into Hadoop MapReduce
    BSONObject headers = (BSONObject) val.get("headers");
    if(headers.containsKey("From") && headers.containsKey("To")){
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for(int i = 0; i < recips.length; i++){
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Example 1 - Java MapReduce (cont)
Reduce phase - outputs of Map are grouped together by key and passed to Reducer

public void reduce( final MailPair pKey,                  // the {to, from} key
                    final Iterable<IntWritable> pValues,  // list of all the values collected under the key
                    final Context pContext ){
    int sum = 0;
    for ( final IntWritable value : pValues ){
        sum += value.get();
    }
    BSONObject outDoc = new BasicDBObjectBuilder().start()
                            .add( "f" , pKey.from )
                            .add( "t" , pKey.to )
                            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write( pkeyOut, new IntWritable(sum) );       // output written back to MongoDB
}
Example 1 - Java MapReduce (cont)
Read from MongoDB
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
(or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)
Write output to MongoDB
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
(or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
Results : Output Data
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming

Let’s do the same Enron Map/Reduce job
with Python instead of Java

$ pip install pymongo_hadoop

Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process via STDOUT/STDIN

(diagram: the Hadoop JVM streams input records over STDIN to a Python / Ruby / JS interpreter running your mapper — def mapper(documents): ... — and reads the results back over STDOUT)
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
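
The two functions above can also be exercised locally as a quick sanity check, without Hadoop at all (this assumes mapper and reducer are importable with the BSONMapper/BSONReducer wiring left out, and uses a hypothetical sample document):

# Local sanity check of the streaming mapper/reducer above -- no Hadoop needed.
sample = [{'headers': {'From': 'alice@enron.com',
                       'To': 'bob@enron.com, eve@enron.com'}}]

pairs = list(mapper(sample))
# -> one {'_id': {'f': ..., 't': ...}, 'count': 1} dict per recipient

key = {'f': 'alice@enron.com', 't': 'bob@enron.com'}
print(reducer(key, [p for p in pairs if p['_id'] == key]))
# -> {'_id': {'f': 'alice@enron.com', 't': 'bob@enron.com'}, 'count': 1}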
Surviving Hadoop:
making MapReduce easier

with Pig + Hive
Example 3 - Mongo-Hadoop and Pig
Let's do the same thing yet again, but this time using Pig
Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts
Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)
Pig directives for loading data:
BSONLoader and MongoLoader
data = LOAD 'mongodb://localhost:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;

Writing data out
BSONStorage and MongoInsertStorage
STORE records INTO 'file:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;

Example 3 - Mongo-Hadoop and Pig (cont)

Pig has its own special datatypes:
Bags, Maps, and Tuples
Mongo-Hadoop Connector intelligently
converts between Pig datatypes and
MongoDB datatypes

Example 3 - Mongo-Hadoop and Pig (cont)
raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
    from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
    group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop

Similar idea to Pig - process your data
without needing to write Map/Reduce
code from scratch

...but with SQL as the language of choice

Hive with Mongo-Hadoop
Sample Data: db.users
db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...

first, declare the collection to be accessible in Hive:
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
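
If you want to reproduce this locally, one way to load the sample users into test.users is a short pymongo script (a sketch; the connector itself does not require this step, and the localhost address is an assumption):

# Sketch: load the sample data above into test.users with pymongo
# (assumes MongoDB on localhost:27017).
from pymongo import MongoClient

users = MongoClient("localhost", 27017).test.users
users.insert_many([
    {"_id": 1,   "name": "Tom",   "age": 28},
    {"_id": 2,   "name": "Alice", "age": 18},
    {"_id": 3,   "name": "Bob",   "age": 29},
    {"_id": 101, "name": "Scott", "age": 10},
    {"_id": 104, "name": "Jesse", "age": 52},
    {"_id": 110, "name": "Mike",  "age": 32},
])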
Hive with Mongo-Hadoop
...then you can run SQL on it, like a table:
SELECT name, age FROM mongo_users WHERE id > 100;

you can use GROUP BY:
SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

or JOIN multiple tables/collections together:
SELECT * FROM mongo_users T1
JOIN user_emails T2
ON T1.id = T2.id;
Write the output of queries back into new tables:
INSERT OVERWRITE TABLE old_users SELECT id, name, age
FROM mongo_users WHERE age > 100;

Drop a table in Hive to delete the underlying collection in MongoDB:
DROP TABLE mongo_users;
Usage with Amazon Elastic MapReduce

Run mongo-hadoop jobs without
needing to set up or manage your
own Hadoop cluster.

Usage with Amazon Elastic MapReduce
First, make a “bootstrap” script that
fetches dependencies (mongo-hadoop
jar and java drivers)
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoopcode/mongo-hadoop-core_1.1.2-1.1.0.jar

this will get executed on each node in
the cluster that EMR builds for us.
Example 4 - Usage with Amazon Elastic MapReduce
Put the bootstrap script, and all your code,
into an S3 bucket where Amazon can see it.

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce
...then launch the job from the command line, pointing to your S3 locations
(--instance-type and --num-instances control the type and number of instances in the cluster)

$ elastic-mapreduce --create --jobflow ENRON000
  --instance-type m1.xlarge
  --num-instances 5
  --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
  --log-uri s3://$S3_BUCKET/enron_logs
  --jar s3://$S3_BUCKET/enron-example.jar
  --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
  --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
  --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
  --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
  # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
Turn up the "num-instances" knob to make jobs complete faster
Logs get captured into S3 files
(Pig, Hive, and streaming work on EMR, too!)
Example 5 - new feature: MongoUpdateWritable
In previous examples, we wrote job output data
by inserting into a new collection
... but we can also modify an existing output
collection
Works by applying mongoDB update modifiers:
$push, $pull, $addToSet, $inc, $set, etc.

Can be used to do incremental Map/Reduce or
“join” two collections
Example 5 - MongoUpdateWritable
Let's say we have two collections:

sensors:
{
  "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
  "name": "730LsRkX",
  "type": "pressure",
  "owner": "steve"
}

log events:
{
  "_id": ObjectId("51b792d381c3e67b0a18d678"),
  "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),    <- refers to which sensor logged the event
  "value": 3328.5895416489802,
  "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
  "loc": [-175.13, 51.658]
}

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
Plain English:
Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on the sensors collection
(flow: read the sensors collection from mongoDB → map/reduce → insert() new records into a Results collection in mongoDB)
map: for each sensor, emit {key: owner+type, value: _id}
reduce: group data from map() under each key, output {key: owner+type, val: [list of _ids]}
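
Stage 1 boils down to a group-by; here it is in miniature as plain Python (hypothetical sensor docs, no Hadoop) to show the shape of the output:

# Stage 1 in miniature: group sensor _ids under an "owner type" key
# (hypothetical sample docs; real _ids are ObjectIds).
from collections import defaultdict

sensors = [
    {"_id": "s1", "owner": "alice", "type": "pressure"},
    {"_id": "s2", "owner": "alice", "type": "pressure"},
    {"_id": "s3", "owner": "bob",   "type": "temp"},
]

groups = defaultdict(list)
for s in sensors:
    groups["%s %s" % (s["owner"], s["type"])].append(s["_id"])

for key, ids in groups.items():
    print({"_id": key, "sensors": ids})
# -> {'_id': 'alice pressure', 'sensors': ['s1', 's2']}
#    {'_id': 'bob temp', 'sensors': ['s3']}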
After stage one, the output docs look like:
{
  "_id": "alice pressure",                  <- the sensor's owner and type
  "sensors": [                              <- list of IDs of sensors with this owner and type
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    ...
  ]
}

Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on the log events collection
(flow: read the log events collection from mongoDB → map/reduce → update() existing records in the Results collection)
map: for each log event, emit {key: sensor_id, value: 1}
reduce: group data from map() under each key; for each value in that key:
    update({sensors: key}, {$inc : {logs_count: 1}})

context.write(null,
    new MongoUpdateWritable(
        query,    // which documents to modify
        update,   // how to modify ($inc)
        true,     // upsert
        false));  // multi
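
For comparison, each MongoUpdateWritable emitted above corresponds to an update you could issue yourself from a client; a pymongo sketch (collection and field names follow this example, and the localhost address is an assumption):

# Sketch of the equivalent client-side update with pymongo; it mirrors the
# MongoUpdateWritable arguments above: upsert=True, multi=False.
from pymongo import MongoClient

results = MongoClient("localhost", 27017).test.results

def count_log_event(sensor_id):
    # find a stage-1 result doc whose "sensors" array contains this id
    # and bump its counter ($inc creates the field if it is missing)
    results.update_one({"sensors": sensor_id},
                       {"$inc": {"logs_count": 1}},
                       upsert=True)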
Example 5 - MongoUpdateWritable
Result after stage 2:
{
  "_id": "1UoTcvnCTz temp",
  "sensors": [
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    ...
  ],
  "logs_count": 1050616                     <- now populated with the correct count
}
Upcoming Features (v1.2 and beyond)
Performance improvements - lazy BSON
Full-featured Hive support
Support for multi-collection input sources
API for adding custom splitter implementations
...and more
Recap
Mongo-Hadoop - use Hadoop to do massive computations
on big data sets stored in Mongo/BSON

MongoDB becomes a Hadoop-enabled filesystem

Tools and APIs make it easier:
Streaming, Pig, Hive, EMR, etc.
Questions?

Examples can be found on github:
https://github.com/mongodb/mongo-hadoop/tree/master/examples

Más contenido relacionado

La actualidad más candente

Elastic search integration with hadoop leveragebigdata
Elastic search integration with hadoop   leveragebigdataElastic search integration with hadoop   leveragebigdata
Elastic search integration with hadoop leveragebigdataPooja Gupta
 
Overview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationOverview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationDBOnto
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopIMC Institute
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big DataEdureka!
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsIMC Institute
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuMatthew Hayes
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)IMC Institute
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, LarusNeo4j
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchclintongormley
 

La actualidad más candente (20)

Elastic search integration with hadoop leveragebigdata
Elastic search integration with hadoop   leveragebigdataElastic search integration with hadoop   leveragebigdata
Elastic search integration with hadoop leveragebigdata
 
Overview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentationOverview of Dan Olteanu's Research presentation
Overview of Dan Olteanu's Research presentation
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big Data
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Insight_150115_Demo
Insight_150115_DemoInsight_150115_Demo
Insight_150115_Demo
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
Building Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFuBuilding Data Products at LinkedIn with DataFu
Building Data Products at LinkedIn with DataFu
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearch
 

Similar a Hadoop webinar-130808141030-phpapp01

Linked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolLinked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolGabriel Dragomir
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Tugdual Grall
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...xu liwei
 
Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Gabriel Dragomir
 
Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Nuxeo
 
One Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web AppOne Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web Apptechnicolorenvy
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselNuxeo
 
Presentationnosqlmah
PresentationnosqlmahPresentationnosqlmah
Presentationnosqlmahp3rnilla
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPaco Nathan
 
Mongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeMongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeSpyros Passas
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!NLJUG
 
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...New Relic
 
실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3NAVER D2
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Ashok Royal
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and OntarioBigData_Europe
 

Similar a Hadoop webinar-130808141030-phpapp01 (20)

Linked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache StanbolLinked data based semantic annotation using Drupal and Apache Stanbol
Linked data based semantic annotation using Drupal and Apache Stanbol
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...
 
Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?Drupal and Apache Stanbol. What if you could reliably do autotagging?
Drupal and Apache Stanbol. What if you could reliably do autotagging?
 
Couchbase
CouchbaseCouchbase
Couchbase
 
Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011Apache Stanbol 
and the Web of Data - ApacheCon 2011
Apache Stanbol 
and the Web of Data - ApacheCon 2011
 
One Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web AppOne Page, One App -or- How to Write a Crawlable Single Page Web App
One Page, One App -or- How to Write a Crawlable Single Page Web App
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogrisel
 
Presentationnosqlmah
PresentationnosqlmahPresentationnosqlmah
Presentationnosqlmah
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and Hadoop
 
Mongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappeMongo db php_shaken_not_stirred_joomlafrappe
Mongo db php_shaken_not_stirred_joomlafrappe
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
 
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
FUTURESTACK13: Mobile Apps, A DevOps Way from Jonathan Karon, Engineering Man...
 
실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3실시간 웹 협업도구 만들기 V0.3
실시간 웹 협업도구 만들기 V0.3
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 

Último

AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
• 25. Mongo-Hadoop Connector Benefits + Features: Vanilla Java MapReduce, or, if you don't want to use Java, support for Hadoop Streaming, so you can write MapReduce code in ruby or python.
• 28. Mongo-Hadoop Connector Benefits + Features: Support for Pig, a high-level scripting language for data analysis and building map/reduce workflows, and support for Hive, a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
• 33. Mongo-Hadoop Connector, how it works:
    - The adapter examines the MongoDB input collection and calculates a set of splits from the data
    - Each split gets assigned to a node in the Hadoop cluster
    - In parallel, Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally
    - Hadoop merges the results and streams the output back to MongoDB or BSON
• 38. Tour of Mongo-Hadoop, by Example:
    - Using Java MapReduce with Mongo-Hadoop
    - Using Hadoop Streaming
    - Pig and Hive with Mongo-Hadoop
    - Elastic MapReduce + BSON
• 41. Input Data: the Enron e-mail corpus (501k records, 1.75 GB), where each document is one email. The From header is the sender and the To header lists the recipients:
    {
      "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
      "body" : "Here is our forecast\n\n ",
      "filename" : "1.",
      "headers" : {
        "From" : "phillip.allen@enron.com",
        "Subject" : "Forecast Info",
        "X-bcc" : "",
        "To" : "tim.belden@enron.com",
        "X-Origin" : "Allen-P",
        "X-From" : "Phillip K Allen",
        "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        "X-To" : "Tim Belden ",
        "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
      }
    }
• 45. Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair (the slide shows these counts as a small directed graph between alice, bob, charlie, and eve). The output documents look like:
    {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
    {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
    {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
    {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
    {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
• 47. Example 1 - Java MapReduce. Map phase: each input document (a MongoDB document passed into Hadoop MapReduce) gets passed through a Mapper function:
    @Override
    public void map(NullWritable key, BSONObject val, final Context context){
        BSONObject headers = (BSONObject)val.get("headers");
        if(headers.containsKey("From") && headers.containsKey("To")){
            String from = (String)headers.get("From");
            String to = (String)headers.get("To");
            String[] recips = to.split(",");
            for(int i = 0; i < recips.length; i++){
                String recip = recips[i].trim();
                context.write(new MailPair(from, recip), new IntWritable(1));
            }
        }
    }
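The map() above emits a custom MailPair composite key, which the deck references but never shows. A minimal sketch of what such a key class could look like, assuming it implements Hadoop's WritableComparable (the field and constructor shapes are inferred from how the slides use it, not taken from the talk):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key pairing a sender with a single recipient.
    public class MailPair implements WritableComparable<MailPair> {
        public String from;
        public String to;

        public MailPair() {}  // no-arg constructor required by Hadoop for deserialization

        public MailPair(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(from);
            out.writeUTF(to);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            from = in.readUTF();
            to = in.readUTF();
        }

        @Override
        public int compareTo(MailPair other) {
            // sort by sender, then recipient, so identical pairs group together in reduce()
            int cmp = from.compareTo(other.from);
            return cmp != 0 ? cmp : to.compareTo(other.to);
        }

        @Override
        public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MailPair)) return false;
            MailPair p = (MailPair) o;
            return from.equals(p.from) && to.equals(p.to);
        }
    }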
• 51. Example 1 - Java MapReduce (cont). Reduce phase: the outputs of Map are grouped together by key (the {to, from} key) and passed to the Reducer, along with the list of all the values collected under that key; the output is written back to MongoDB:
    public void reduce( final MailPair pKey,
                        final Iterable<IntWritable> pValues,
                        final Context pContext ){
        int sum = 0;
        for ( final IntWritable value : pValues ){
            sum += value.get();
        }
        BSONObject outDoc = new BasicDBObjectBuilder().start()
                                .add( "f", pKey.from )
                                .add( "t", pKey.to )
                                .get();
        BSONWritable pkeyOut = new BSONWritable(outDoc);
        pContext.write( pkeyOut, new IntWritable(sum) );
    }
• 54. Example 1 - Java MapReduce (cont). Read from MongoDB:
    mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
    mongo.input.uri=mongodb://my-db:27017/enron.messages
    Read from BSON:
    mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    mapred.input.dir=file:///tmp/messages.bson
    (or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
• 57. Example 1 - Java MapReduce (cont). Write output to MongoDB:
    mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
    mongo.output.uri=mongodb://my-db:27017/enron.results_out
    Write output to BSON:
    mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    mapred.output.dir=file:///tmp/results.bson
    (or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
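To wire the Mapper, the Reducer, and these properties together, a job also needs a small driver, which the slides don't show. The following is only a minimal sketch: EnronGraphJob, EnronMailMapper, and EnronMailReducer are hypothetical names standing in for the map()/reduce() code above, and it assumes the standard Hadoop Job API plus the connector classes already named on these slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;

    public class EnronGraphJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // same property keys as on the slides above
            conf.set("mongo.input.uri",  "mongodb://my-db:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

            Job job = Job.getInstance(conf, "enron sender/recipient graph");
            job.setJarByClass(EnronGraphJob.class);

            // read the input collection directly from MongoDB and write results back;
            // swap in BSONFileInputFormat/BSONFileOutputFormat (and the path properties)
            // to work with .bson backup files instead. Note that the key type your
            // mapper sees depends on the input format you pick.
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            // hypothetical class names for the map()/reduce() shown on the previous slides
            job.setMapperClass(EnronMailMapper.class);
            job.setReducerClass(EnronMailReducer.class);

            job.setMapOutputKeyClass(MailPair.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(BSONWritable.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Switching between a live MongoDB collection and a .bson dump is then just a matter of swapping the format classes and the corresponding URI or path property.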
• 58. Results: Output Data
    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
    ... has more
• 59. Example 2 - Hadoop Streaming: let's do the same Enron Map/Reduce job with Python instead of Java.
    $ pip install pymongo_hadoop
• 60. Example 2 - Hadoop Streaming (cont). Hadoop passes data to an external process over STDIN/STDOUT: the Hadoop JVM pipes the input documents for each map(k, v) call to a Python / Ruby / JS interpreter running your mapper (def mapper(documents): ...) and reads the emitted results back.
• 61. Example 2 - Hadoop Streaming (cont). The mapper:
    import sys
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            from_field = doc['headers']['From']
            to_field = doc['headers']['To']
            recips = [x.strip() for x in to_field.split(',')]
            for r in recips:
                yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
• 62. Example 2 - Hadoop Streaming (cont). The reducer:
    import sys
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
• 63. Surviving Hadoop: making MapReduce easier with Pig + Hive.
• 66. Example 3 - Mongo-Hadoop and Pig: let's do the same thing yet again, but this time using Pig. Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts, and can perform JOIN and GROUP operations and execute user-defined functions (UDFs).
• 67. Example 3 - Mongo-Hadoop and Pig (cont). Pig directives for loading data: BSONLoader and MongoLoader, e.g.
    data = LOAD 'mongodb://localhost:27017/db.collection'
        using com.mongodb.hadoop.pig.MongoLoader;
    Writing data out: BSONStorage and MongoInsertStorage, e.g.
    STORE records INTO 'file:///output.bson'
        using com.mongodb.hadoop.pig.BSONStorage;
• 68. Example 3 - Mongo-Hadoop and Pig (cont). Pig has its own special datatypes (Bags, Maps, and Tuples); the Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes.
• 73. Example 3 - Mongo-Hadoop and Pig (cont). The full Pig script:
    raw = LOAD 'hdfs:///messages.bson'
        using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');
    send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
    send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
    send_recip_split = FOREACH send_recip_filtered GENERATE from as from,
        TRIM(FLATTEN(TOKENIZE(to))) as to;
    send_recip_grouped = GROUP send_recip_split BY (from, to);
    send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count;
    STORE send_recip_counted INTO 'file:///enron_results.bson'
        using com.mongodb.hadoop.pig.BSONStorage;
• 74. Hive with Mongo-Hadoop: similar idea to Pig, process your data without needing to write Map/Reduce code from scratch, but with SQL as the language of choice.
• 75. Hive with Mongo-Hadoop. Sample data, db.users:
    db.users.find()
    { "_id": 1, "name": "Tom", "age": 28 }
    { "_id": 2, "name": "Alice", "age": 18 }
    { "_id": 3, "name": "Bob", "age": 29 }
    { "_id": 101, "name": "Scott", "age": 10 }
    { "_id": 104, "name": "Jesse", "age": 52 }
    { "_id": 110, "name": "Mike", "age": 32 }
    ...
    First, declare the collection to be accessible in Hive:
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
    TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
• 79. Hive with Mongo-Hadoop: ...then you can run SQL on it, like a table:
    SELECT name, age FROM mongo_users WHERE id > 100;
    You can use GROUP BY:
    SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
    Or JOIN multiple tables/collections together:
    SELECT * FROM mongo_users T1 JOIN user_emails T2 WHERE T1.id = T2.id;
• 82. Write the output of queries back into new tables:
    INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;
    Drop a table in Hive to delete the underlying collection in MongoDB:
    DROP TABLE mongo_users;
• 83. Usage with Amazon Elastic MapReduce: run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
• 84. Usage with Amazon Elastic MapReduce. First, make a "bootstrap" script that fetches the dependencies (the mongo-hadoop jar and the Java driver); it will get executed on each node in the cluster that EMR builds for us:
    #!/bin/sh
    wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
    wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoopcode/mongo-hadoop-core_1.1.2-1.1.0.jar
• 85. Example 4 - Usage with Amazon Elastic MapReduce. Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it:
    s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
    s3mod s3://$S3_BUCKET/bootstrap.sh public-read
    s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
    s3mod s3://$S3_BUCKET/enron-example.jar public-read
• 86. Example 4 - Usage with Amazon Elastic MapReduce. ...then launch the job from the command line, pointing to your S3 locations, and control the type and number of instances in the cluster:
    $ elastic-mapreduce --create --jobflow ENRON000
        --instance-type m1.xlarge
        --num-instances 5
        --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
        --log-uri s3://$S3_BUCKET/enron_logs
        --jar s3://$S3_BUCKET/enron-example.jar
        --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
        --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
        --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
        --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
        # (any additional parameters here)
• 90. Example 4 - Usage with Amazon Elastic MapReduce:
    - Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
    - Turn up the "num-instances" knob to make jobs complete faster
    - Logs get captured into S3 files
    - (Pig, Hive, and streaming work on EMR, too!)
• 91. Example 5 - new feature: MongoUpdateWritable. In the previous examples, we wrote job output data by inserting into a new collection, but we can also modify an existing output collection. It works by applying MongoDB update modifiers ($push, $pull, $addToSet, $inc, $set, etc.), and can be used to do incremental Map/Reduce or to "join" two collections.
• 97. Example 5 - MongoUpdateWritable. Let's say we have two collections, sensors and log events:
    sensors:
    {
      "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
      "name": "730LsRkX",
      "type": "pressure",
      "owner": "steve"
    }
    log events:
    {
      "_id": ObjectId("51b792d381c3e67b0a18d678"),
      "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),
      "value": 3328.5895416489802,
      "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
      "loc": [-175.13, 51.658]
    }
    The sensor_id field refers to which sensor logged the event. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
• 100. For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. In plain English:
    Bob's sensors for temperature have stored 1300 readings
    Bob's sensors for pressure have stored 400 readings
    Alice's sensors for humidity have stored 600 readings
    Alice's sensors for temperature have stored 700 readings
    etc...
• 101. Stage 1 - Map/Reduce on the sensors collection. Data is read from the sensors collection in MongoDB; for each sensor, map() emits {key: owner+type, value: _id}; reduce() groups the data from map() under each key and outputs {key: owner+type, val: [list of _ids]}; the new records are insert()ed into a results collection in MongoDB.
• 105. After stage one, the output docs look like this, where _id is the sensor's owner and type, and sensors is the list of _ids of sensors with this owner and type:
    {
      "_id": "alice pressure",
      "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
      ]
    }
    Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
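A minimal sketch of what the stage-one Mapper and Reducer could look like in Java; the class names, the key encoding (owner and type joined with a space), and the exact Writable types are illustrative assumptions rather than code from the talk:

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;
    import org.bson.types.BasicBSONList;
    import org.bson.types.ObjectId;
    import com.mongodb.BasicDBObjectBuilder;
    import com.mongodb.hadoop.io.BSONWritable;

    // Stage 1 (sketch): group sensor _ids under an "owner type" key.
    public class SensorGroupingJob {

        public static class SensorMapper extends Mapper<Object, BSONObject, Text, Text> {
            @Override
            protected void map(Object key, BSONObject val, Context context)
                    throws IOException, InterruptedException {
                String owner = (String) val.get("owner");
                String type  = (String) val.get("type");
                // emit {key: owner+type, value: _id}
                context.write(new Text(owner + " " + type),
                              new Text(val.get("_id").toString()));
            }
        }

        public static class SensorReducer extends Reducer<Text, Text, NullWritable, BSONWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                BasicBSONList ids = new BasicBSONList();
                for (Text id : values) {
                    ids.add(new ObjectId(id.toString()));
                }
                // one result document per owner/type group: {_id: owner+type, sensors: [ids]}
                BSONObject outDoc = BasicDBObjectBuilder.start()
                        .add("_id", key.toString())
                        .add("sensors", ids)
                        .get();
                context.write(NullWritable.get(), new BSONWritable(outDoc));
            }
        }
    }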
• 107. Stage 2 - Map/Reduce on the log events collection. Log events are read from MongoDB; for each log event, map() emits {key: sensor_id, value: 1}; reduce() groups the data from map() under each key and, for each value in that key, issues update({sensors: key}, {$inc: {logs_count: 1}}), so the existing records in the results collection get update()d in place:
    context.write(null,
        new MongoUpdateWritable(
            query,   // which documents to modify
            update,  // how to modify ($inc)
            true,    // upsert
            false)); // multi
• 108. Example - MongoUpdateWritable. Result after stage 2, now populated with the correct count:
    {
      "_id": "1UoTcvnCTz temp",
      "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
      ],
      "logs_count": 1050616
    }
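A hedged sketch of a stage-two Reducer that would produce these updates. The query and update construction is an assumption based on the update({sensors: key}, {$inc: {logs_count: 1}}) call described above, and the MongoUpdateWritable constructor arguments simply follow the order annotated on the previous slide (query, update, upsert, multi); here the per-event increments are summed and applied as a single $inc, which is equivalent.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.types.ObjectId;
    import com.mongodb.BasicDBObject;
    import com.mongodb.hadoop.io.MongoUpdateWritable;

    // Stage 2 (sketch): count log events per sensor and fold the counts into the
    // stage-one result documents with a $inc update.
    public class LogCountReducer
            extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {
        @Override
        protected void reduce(Text sensorId, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // match the owner/type document whose "sensors" array contains this sensor _id...
            BasicDBObject query = new BasicDBObject("sensors", new ObjectId(sensorId.toString()));
            // ...and increment its logs_count by the number of events counted for that sensor
            BasicDBObject update = new BasicDBObject("$inc", new BasicDBObject("logs_count", sum));
            context.write(NullWritable.get(),
                    new MongoUpdateWritable(query, update,
                            true,     // upsert
                            false));  // multi
        }
    }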
• 109. Upcoming Features (v1.2 and beyond): performance improvements (lazy BSON), full-featured Hive support, support for multi-collection input sources, an API for adding custom splitter implementations, and more.
• 110. Recap: Mongo-Hadoop lets you use Hadoop to do massive computations on big data sets stored in Mongo/BSON; MongoDB becomes a Hadoop-enabled filesystem; and tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
• 111. Questions? Examples can be found on GitHub: https://github.com/mongodb/mongo-hadoop/tree/master/examples