MongoDB scales easily to store massive volumes of data. However, when it comes to making sense of it all, what options do you have? In this talk, we’ll take a look at three different ways of aggregating your data with MongoDB and examine the reasons why you might choose one over another. No matter what your big data needs are, you will find out how MongoDB, the big data store, is evolving to help you make sense of your data.
1. Codemotion Milano 2013
Data Processing and Aggregation
Massimo Brignoli
Solutions Architect, MongoDB Inc.
Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nc-sa/3.0/
2. Who Am I?
• Solutions Architect/Evangelist at MongoDB Inc.
• 20 years of experience in databases
• Former MySQL employee
• Previous life: web, web, web
3. Big Data
4. What is Big Data?
• Big Data is like teenage sex:
• everyone talks about it
• nobody really knows how to do it
• everyone thinks everyone else is doing it
• so everyone claims they are doing it…
5. Understanding Big Data – It’s Not Very “Big”
64% - Ingest diverse, new data in real-time
15% - More than 100TB of data
20% - Less than 100TB (average of all? <20TB)
from Big Data Executive Summary – 50+ top executives from Government and F500 firms
6. For over a decade
Big Data == Custom
Software
9. RDBMS Makes Development Hard
[Diagram: Code, XML Config and DB Schema; Application; Object Relational Mapping; Relational Database]
10. And Even Harder To Iterate
[Diagram labels: New Table, New Column, Name, Pet, Phone; 3 months later… Email]
11. From Complexity to Simplicity
MongoDB
RDBMS
{
    _id : ObjectId("4c4ba5e5e8aabf3"),
    employee_name: "Dunham, Justin",
    department : "Marketing",
    title : "Product Manager, Web",
    report_up: "Neray, Graham",
    pay_band: "C",
    benefits : [
        { type : "Health", plan : "PPO Plus" },
        { type : "Dental", plan : "Standard" }
    ]
}
12. In the past few years
Open source software has
emerged enabling the rest of
us to handle Big Data
14. Enterprise Big Data Stack
Applications: CRM, ERP, Collaboration, Mobile, BI
Data Management:
• Online Data: RDBMS, RDBMS
• Offline Data: Hadoop, EDW
Infrastructure: OS & Virtualization, Compute, Storage, Network
Security & Auditing
Management & Monitoring
15. Consideration – Online vs. Offline
Online
• Real-time
• Low-latency
• High availability
vs.
Offline
• Long-running
• High-Latency
• Availability is lower priority
16. How MongoDB Meets Our Requirements
• MongoDB is an operational database
• MongoDB provides high performance for storage and retrieval at large scale
• MongoDB has a robust query interface permitting intelligent operations
• MongoDB is not a data processing engine, but provides processing functionality
17. MongoDB data processing options
http://www.flickr.com/photos/torek/4444673930/
18. Getting Example Data
19. The “hello world” of MapReduce is counting words in a paragraph of text.
Let’s try something a little more interesting…
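For reference, the word-count "hello world" can be sketched in a few lines of plain Python. This is only an illustration of the map, group and reduce steps, not MongoDB code; the function names `map_words`, `reduce_counts` and `word_count` are made up for this example.

```python
from collections import defaultdict


def map_words(paragraph):
    """Map step: emit a (word, 1) pair for every word in the text."""
    for token in paragraph.lower().split():
        yield token.strip(".,!?;:\"'"), 1


def reduce_counts(word, counts):
    """Reduce step: add up all the 1s emitted for a single word."""
    return sum(counts)


def word_count(paragraph):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for word, one in map_words(paragraph):
        if word:  # skip tokens that were pure punctuation
            grouped[word].append(one)
    return {word: reduce_counts(word, ones) for word, ones in grouped.items()}


if __name__ == "__main__":
    print(word_count("The cat sat on the mat. The mat sat still."))
```

The pub-name question on the following slides has the same shape: the key becomes the pub name instead of the word.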
20. What is the most popular pub
name?
21. Open Street Map Data
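#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys
from imposm.parser import OSMParser
import pymongo

# Assumed setup, not shown on the slide: the target collection and a
# lookup of unnamed nodes. demo.pubs matches the Hadoop example later on.
collection = pymongo.MongoClient().demo.pubs
node_points = {}


class Handler(object):
    def nodes(self, nodes):
        if not nodes:
            return
        docs = []
        for node in nodes:
            osm_id, doc, (lon, lat) = node
            if "name" not in doc:
                # Unnamed nodes are remembered but not stored as pubs.
                node_points[osm_id] = (lon, lat)
                continue
            # NOTE: lstrip("The ") strips any leading run of the characters
            # "T", "h", "e" and " ", not only the literal prefix "The ".
            doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
            doc["_id"] = osm_id
            doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
            docs.append(doc)
        if docs:
            # PyMongo 2.x API; newer PyMongo releases use insert_many().
            collection.insert(docs)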
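One detail of the import script is worth a closer look: `str.lstrip("The ")` removes any leading run of the characters `T`, `h`, `e` and space, so it can eat into the name itself. The sketch below shows a prefix-based alternative; `normalise_name` is a hypothetical helper written for this note, not part of the original script.

```python
def normalise_name(raw):
    """Title-case a pub name, drop a literal leading "The ", and use "&"."""
    name = raw.title()
    if name.startswith("The "):
        name = name[len("The "):]
    # Like the original line, this replaces every "And" substring,
    # including one that sits inside a longer word.
    return name.replace("And", "&")


if __name__ == "__main__":
    # What lstrip does to a name whose second word also starts with "Th".
    print("the three tuns".title().lstrip("The "))
    # The prefix-based version keeps the second word intact.
    print(normalise_name("the three tuns"))
```

Whether the leading article should be dropped at all is a data-cleaning choice; the results later in the deck show names with "The" in place.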
22. Example Pub Data
{
    "_id" : 451152,
    "amenity" : "pub",
    "name" : "The Dignity",
    "addr:housenumber" : "363",
    "addr:street" : "Regents Park Road",
    "addr:city" : "London",
    "addr:postcode" : "N3 1DH",
    "toilets" : "yes",
    "toilets:access" : "customers",
    "location" : {
        "type" : "Point",
        "coordinates" : [-0.1945732, 51.6008172]
    }
}
25. Map Function
[Diagram: MongoDB map, reduce, finalize stages]
> var map = function() {
      emit(this.name, 1);
  }
26. Reduce Function
[Diagram: MongoDB map, reduce, finalize stages]
> var reduce = function (key, values) {
      var sum = 0;
      values.forEach( function (val) {sum += val;} );
      return sum;
  }
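To see how the map and reduce functions fit together, here is a small pure-Python emulation. It is a sketch under stated assumptions: `map_reduce`, `map_pub` and `reduce_pub` are illustrative names, and the `counts` dictionary only imitates the shape of the statistics MongoDB reports. MongoDB does not call reduce for a key with a single value, and it may call reduce several times for one key, so its own reduce count can differ from this simplified tally.

```python
from collections import defaultdict


def map_pub(doc, emit):
    """Mirror of the shell map function: emit(this.name, 1)."""
    emit(doc["name"], 1)


def reduce_pub(key, values):
    """Mirror of the shell reduce function: sum the emitted values."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Run map over every document, group by key, then reduce each group."""
    emitted = defaultdict(list)
    counts = {"input": 0, "emit": 0, "reduce": 0, "output": 0}

    def emit(key, value):
        counts["emit"] += 1
        emitted[key].append(value)

    for doc in docs:
        counts["input"] += 1
        map_fn(doc, emit)

    results = {}
    for key, values in emitted.items():
        if len(values) == 1:
            # A key with a single value needs no reduce call.
            results[key] = values[0]
        else:
            counts["reduce"] += 1
            results[key] = reduce_fn(key, values)
    counts["output"] = len(results)
    return results, counts


if __name__ == "__main__":
    pubs = [{"name": "The Red Lion"}, {"name": "The Crown"}, {"name": "The Red Lion"}]
    print(map_reduce(pubs, map_pub, reduce_pub))
```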
27. Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }
29. Pub Names in the Center of London
> db.pubs.mapReduce(map, reduce, { out: "pub_names",
      query: {
          location: {
              $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
          }}
  })
{
    "result" : "pub_names",
    "timeMillis" : 116,
    "counts" : {
        "input" : 643,
        "emit" : 643,
        "reduce" : 54,
        "output" : 537
    },
    "ok" : 1,
}
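The `2 / 3959` in the query is a radius in radians: 2 miles divided by the Earth's radius in miles. The sketch below works through that arithmetic and checks points against the same circle with the haversine formula. It is an approximation for illustration; `miles_to_radians`, `central_angle` and `within_center_sphere` are names invented here, and MongoDB's own geometry code is not reproduced.

```python
import math

EARTH_RADIUS_MILES = 3959  # the same constant the query divides by


def miles_to_radians(miles):
    """Convert a distance on the Earth's surface into an angle in radians."""
    return miles / EARTH_RADIUS_MILES


def central_angle(lon1, lat1, lon2, lat2):
    """Haversine central angle in radians between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """True when `point` lies inside the spherical cap around `center`."""
    lon, lat = point
    clon, clat = center
    return central_angle(lon, lat, clon, clat) <= radius_radians


if __name__ == "__main__":
    centre = (-0.12, 51.516)
    radius = miles_to_radians(2)
    print(radius)
    # The example pub from slide 22 sits several miles north of the centre.
    print(within_center_sphere((-0.1945732, 51.6008172), centre, radius))
```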
31. MongoDB MapReduce
• Real-time
• Output directly to document or collection
• Runs inside MongoDB on local data
− Adds load to your DB
− In JavaScript – debugging can be a challenge
− Translating in and out of C++
34. Aggregation Framework in 60
Seconds
35. Aggregation Framework Operators
• $project
• $match
• $limit
• $skip
• $sort
• $unwind
• $group
36. $match
• Filter documents
• Uses existing query syntax
• If using $geoNear it has to be first in pipeline
• $where is not supported
37. Matching Field Values
Input documents:
{
    "_id" : 271421,
    "amenity" : "pub",
    "name" : "Sir Walter Tyrrell",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.6192422,
            50.9131996
        ]
    }
}
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}
Stage:
{ "$match": {
    "name": "The Red Lion"
}}
Output:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}
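In plain Python terms, the equality `$match` above behaves like a filter over the input documents. The `match` function below is an illustrative stand-in that only handles top-level equality; the real stage accepts the full query syntax.

```python
def match(docs, criteria):
    """Keep documents whose fields equal every value in `criteria`."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    pubs = [
        {"_id": 271421, "amenity": "pub", "name": "Sir Walter Tyrrell"},
        {"_id": 271466, "amenity": "pub", "name": "The Red Lion"},
    ]
    print(match(pubs, {"name": "The Red Lion"}))
```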
38. $project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
39. Including and Excluding Fields
Input document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}
Stage:
{ "$project": {
    "_id": 0,
    "amenity": 1,
    "name": 1
}}
Output:
{
    "amenity" : "pub",
    "name" : "The Red Lion"
}
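The include and exclude rules can be mimicked in a few lines of Python. This is a simplified sketch: `project` is a made-up helper that handles only top-level fields with 0 and 1 flags, and it reproduces the rule that `_id` is kept unless it is explicitly set to 0.

```python
def project(doc, spec):
    """Apply an inclusion-style $project spec such as {"_id": 0, "name": 1}."""
    out = {}
    # _id is kept unless the spec explicitly sets it to 0.
    if spec.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for field, flag in spec.items():
        if field != "_id" and flag == 1 and field in doc:
            out[field] = doc[field]
    return out


if __name__ == "__main__":
    pub = {
        "_id": 271466,
        "amenity": "pub",
        "name": "The Red Lion",
        "location": {"type": "Point", "coordinates": [-1.5494749, 50.7837119]},
    }
    print(project(pub, {"_id": 0, "amenity": 1, "name": 1}))
```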
40. Reformatting Documents
Input document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}
Stage:
{ "$project": {
    "_id": 0,
    "name": 1,
    "meta": {
        "type": "$amenity" }
}}
Output:
{
    "name" : "The Red Lion",
    "meta" : {
        "type" : "pub"
    }
}
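The "$amenity" value in the stage above is a field reference: a string starting with `$` is replaced by the value of that field in the input document. A rough Python equivalent, with hypothetical helpers `resolve` and `reshape`, might look like this. It ignores the default inclusion of `_id` and handles only simple references and nested sub-documents.

```python
def resolve(expr, doc):
    """Resolve a projection value against `doc`."""
    if isinstance(expr, str) and expr.startswith("$"):
        return doc.get(expr[1:])  # "$amenity" becomes doc["amenity"]
    if isinstance(expr, dict):
        # Nested sub-document: resolve each value in turn.
        return {key: resolve(value, doc) for key, value in expr.items()}
    return expr


def reshape(doc, spec):
    """Build the output document described by a $project-style spec."""
    out = {}
    for field, expr in spec.items():
        if expr == 0:
            continue  # excluded field
        if expr == 1:
            if field in doc:
                out[field] = doc[field]  # included as-is
        else:
            out[field] = resolve(expr, doc)  # computed field
    return out


if __name__ == "__main__":
    pub = {"_id": 271466, "amenity": "pub", "name": "The Red Lion"}
    print(reshape(pub, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}}))
```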
41. Dealing with Arrays
Input document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "facilities" : [
        "toilets",
        "food"
    ]
}
Stages:
{ "$project": {
    "_id": 0,
    "name": 1,
    "facility": "$facilities"
}},
{"$unwind": "$facility"}
Output:
{ "name" : "The Red Lion",
  "facility" : "toilets" },
{ "name" : "The Red Lion",
  "facility" : "food" }
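The `$unwind` step turns one document with an array into one document per array element. The generator below is a minimal sketch of that behaviour; `unwind` is an illustrative name, and skipping documents whose field is not a list is a simplification chosen for this example, not a statement about every MongoDB version.

```python
def unwind(docs, path):
    """Emit one copy of each document per element of the array at `path`."""
    field = path.lstrip("$")  # "$facility" refers to the field "facility"
    for doc in docs:
        values = doc.get(field)
        if not isinstance(values, list):
            continue  # simplification: such documents produce no output here
        for value in values:
            copy = dict(doc)
            copy[field] = value
            yield copy


if __name__ == "__main__":
    pubs = [{"name": "The Red Lion", "facility": ["toilets", "food"]}]
    for row in unwind(pubs, "$facility"):
        print(row)
```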
42. $group
• Group documents by an ID
• Field reference, object, constant
• Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last
• Processes all data in memory
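As a rough mental model of `$group`, the sketch below buckets documents by a key and computes each output field from the bucket. The `group` helper and its keyword-argument accumulators are assumptions made for this illustration: `len` stands in for `{$sum: 1}` and a sorted set stands in for `$addToSet`. Note how every document is held in `buckets`, which echoes the in-memory point above.

```python
def group(docs, key_field, **accumulators):
    """Bucket documents by `key_field`, then compute one row per bucket."""
    buckets = {}
    for doc in docs:
        buckets.setdefault(doc.get(key_field), []).append(doc)
    rows = []
    for key, members in buckets.items():
        row = {"_id": key}
        for name, fn in accumulators.items():
            row[name] = fn(members)  # each accumulator sees the whole bucket
        rows.append(row)
    return rows


if __name__ == "__main__":
    pubs = [
        {"name": "The Crown", "city": "London"},
        {"name": "The Crown", "city": "Leeds"},
        {"name": "The Swan", "city": "London"},
    ]
    print(group(pubs, "name",
                value=len,
                cities=lambda ms: sorted({m["city"] for m in ms})))
```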
43. Back to the pub!
http://www.offwestend.com/index.php/theatres/pastshows/71
44. Popular Pub Names
> var popular_pub_names = [
      { $match : { location:
          { $within: { $centerSphere:
              [[-0.12, 51.516], 2 / 3959] }}}
      },
      { $group :
          { _id: "$name",
            value: {$sum: 1} }
      },
      { $sort : {value: -1} },
      { $limit : 10 }
  ]
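The same pipeline can be written as a Python list and passed to PyMongo's `Collection.aggregate`. The sketch below assumes a `collection` object that behaves like a PyMongo collection holding the pub documents; `PIPELINE` and `popular_pub_names` are names chosen for this example. It uses `$geoWithin`, the newer spelling of `$within`, which is an assumption about the server version in use.

```python
EARTH_RADIUS_MILES = 3959

# Match pubs within ~2 miles of the centre, count by name, keep the top 10.
PIPELINE = [
    {"$match": {"location": {"$geoWithin": {
        "$centerSphere": [[-0.12, 51.516], 2 / EARTH_RADIUS_MILES]}}}},
    {"$group": {"_id": "$name", "value": {"$sum": 1}}},
    {"$sort": {"value": -1}},
    {"$limit": 10},
]


def popular_pub_names(collection):
    """Run the pipeline; `collection` is assumed to be a PyMongo collection."""
    return list(collection.aggregate(PIPELINE))
```

With a live server this might be called as `popular_pub_names(pymongo.MongoClient().demo.pubs)`, where the database and collection names follow the Hadoop example later in the deck.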
45. Results
> db.pubs.aggregate(popular_pub_names)
{
    "result" : [
        { "_id" : "All Bar One", "value" : 11 },
        { "_id" : "The Slug & Lettuce", "value" : 7 },
        { "_id" : "The Coach & Horses", "value" : 6 },
        { "_id" : "The Green Man", "value" : 5 },
        { "_id" : "The Kings Arms", "value" : 5 },
        { "_id" : "The Red Lion", "value" : 5 },
        { "_id" : "Corney & Barrow", "value" : 4 },
        { "_id" : "O'Neills", "value" : 4 },
        { "_id" : "Pitcher & Piano", "value" : 4 },
        { "_id" : "The Crown", "value" : 4 }
    ],
    "ok" : 1
}
46. Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Declared in JSON, executes in C++
• Runs inside MongoDB on local data
− Adds load to your DB
− Limited Operators
− Data output is limited
47. Analyzing MongoDB Data in
External Systems
49. MongoDB with Hadoop
[Diagram: MongoDB, warehouse]
50. MongoDB with Hadoop
[Diagram: ETL, MongoDB]
51. Map Pub Names in Python
#!/usr/bin/env python
import sys

from pymongo_hadoop import BSONMapper


def mapper(documents):
    # get_bounds() and get_geo() are helper functions not shown on the slide.
    bounds = get_bounds()  # ~2 mile polygon
    for doc in documents:
        geo = get_geo(doc["location"])  # Convert the geo type
        if not geo:
            continue
        if bounds.intersects(geo):
            yield {'_id': doc['name'], 'count': 1}


BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")
52. Reduce Pub Names in Python
#!/usr/bin/env python
from pymongo_hadoop import BSONReducer


def reducer(key, values):
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'value': _count}


BSONReducer(reducer)
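Between the mapper and the reducer, Hadoop sorts the mapper output and groups it by key. The sketch below imitates that shuffle step locally so the reducer logic can be tried without a cluster. It is an assumption-laden stand-in: `run_locally` is a made-up harness, and neither `pymongo_hadoop` nor Hadoop is involved.

```python
from itertools import groupby
from operator import itemgetter


def reducer(key, values):
    """Same logic as the streaming reducer: sum the 'count' of each record."""
    count = 0
    for v in values:
        count += v["count"]
    return {"_id": key, "value": count}


def run_locally(mapped_records):
    """Sort mapper output by key, group it, and call the reducer once per key."""
    ordered = sorted(mapped_records, key=itemgetter("_id"))
    return [
        reducer(key, records)
        for key, records in groupby(ordered, key=itemgetter("_id"))
    ]


if __name__ == "__main__":
    mapped = [
        {"_id": "The Crown", "count": 1},
        {"_id": "All Bar One", "count": 1},
        {"_id": "The Crown", "count": 1},
    ]
    print(run_locally(mapped))
```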
53. Execute MapReduce
hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
    -mapper examples/pub/map.py \
    -reducer examples/pub/reduce.py \
    -mongo mongodb://127.0.0.1/demo.pubs \
    -outputURI mongodb://127.0.0.1/demo.pub_names
54. Popular Pub Names Nearby
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
{ "_id" : "The George", "value" : 4 }
{ "_id" : "The Green Man", "value" : 4 }
55. MongoDB and Hadoop
• Away from data store
• Can leverage existing data processing infrastructure
• Can horizontally scale your data processing
- Offline batch processing
- Requires synchronisation between store &
processor
- Infrastructure is much more complex
56. The Future of Big Data and
MongoDB
57. What is Big Data?
Big Data today will be
normal tomorrow
58. Exponential Data Growth
[Chart: billions of URLs indexed by Google, 2000 to 2012; y-axis from 0 to 10000]
59. MongoDB enables you to
scale big
60. MongoDB is evolving
so you can process the
big
61. Data Processing with MongoDB
• Process in MongoDB using Map/Reduce
• Process in MongoDB using Aggregation Framework
• Process outside MongoDB using Hadoop and other external tools
IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
This is helpful because as much as 95% of enterprise information is unstructured, and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schema.
The industry is talking about Hadoop and MongoDB for Big Data. So should you.
This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage and even analyze data in real-time (compared to Hadoop and data warehouses, which are used for offline, batch analytical workloads).
Another common use case we see is warehousing of data. Again, the connector allows you to utilize existing libraries via Hadoop.
The third most common use case is an ETL (extract, transform, load) function, then putting the aggregated data into MongoDB for further analysis.