Skyrocket your Analytics
MongoDB Meetup on December 10, 2012
www.precog.com
@precogio
Nov - Dec 2012
welcome & agenda

■ Welcome to the Precog & MongoDB Meetup!
    7:00 - 7:30
    Overview of Precog for MongoDB by Derek Chen-Becker

    7:30 - 7:45
    Break (grab a beer, drink and snacks)

    7:45 - 8:15
    Analyzing Big Data with Quirrel by John A. De Goes

    8:15 - 8:30
    Precog Challenge Problems! Win some prizes!



■ Questions? Please ask away!
who we are

■ Precog Team
Derek Chen-Becker, Lead Infrastructure Engineer
John A. De Goes, CEO/Founder
Kris Nuttycombe, Dir of Engineering
Nathan Lubchenco, Developer Evangelist

■ MongoDB Host
Clay Mcllrath

■ Thank you to Google for hosting us!
Current MongoDB Support for Analytics
Derek Chen-Becker
Precog Lead Infrastructure Engineer
@dchenbecker
current mongodb support for analytics

■ Mongo has support for a small set of simple aggregation primitives
  ○ count - returns the count of a given collection's documents, with optional
    filtering
  ○ distinct - returns the distinct values for given selector criteria
  ○ group - returns groups of documents based on given key criteria. Group
    cannot be used in sharded configurations
current mongodb support for analytics
> db.london_medals.group({
     key : {"Country":1},
     reduce : function(curr, result) { result.total += 1 },
     initial: { total : 0, fullTotal: db.london_medals.count() },
     finalize: function(result){ result.percent = result.total * 100 / result.fullTotal }
  })
[
    {"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},
    {"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},
    {"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},
  ...
■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty
     quickly
■ Group cannot be used in sharded configurations. For that you need...
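As an aside, the moving parts of the group call above (key, reduce, initial, finalize) can be imitated in plain JavaScript. This is only a rough in-memory sketch, not MongoDB's implementation: the groupBy helper and the sample medals array are hypothetical, and copying "initial" once per group is an assumption about how the pieces fit together.

```javascript
// Hypothetical in-memory stand-in for db.collection.group(); not MongoDB code.
function groupBy(docs, spec) {
  var groups = {};   // serialized key -> accumulated result
  var order = [];    // first-seen order of the keys
  docs.forEach(function (doc) {
    var keyDoc = {};
    Object.keys(spec.key).forEach(function (field) { keyDoc[field] = doc[field]; });
    var id = JSON.stringify(keyDoc);
    if (!(id in groups)) {
      // Assumption: each group starts from its own copy of `initial`,
      // merged with the key fields.
      var start = JSON.parse(JSON.stringify(spec.initial));
      Object.keys(keyDoc).forEach(function (field) { start[field] = keyDoc[field]; });
      groups[id] = start;
      order.push(id);
    }
    spec.reduce(doc, groups[id]);   // reduce mutates the running result
  });
  return order.map(function (id) {
    var result = groups[id];
    if (spec.finalize) { spec.finalize(result); }
    return result;
  });
}

// Made-up sample data, one document per medal.
var medals = [
  { Country: "Denmark" }, { Country: "Great Britain" },
  { Country: "Great Britain" }, { Country: "Great Britain" }
];

var summary = groupBy(medals, {
  key: { Country: 1 },
  reduce: function (curr, result) { result.total += 1; },
  initial: { total: 0, fullTotal: medals.length },
  finalize: function (result) { result.percent = result.total * 100 / result.fullTotal; }
});
console.log(summary);
```

Both reduce and finalize work by mutating the result object they are handed, which is why neither function returns anything.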
current mongodb support for analytics
■ Map/Reduce: Exactly what its name says.
■ You utilize JavaScript functions to map your documents' data, then reduce that
  data into a form of your choosing.



[Diagram: Input Collection ⇒ Mapping Function ⇒ Reducing Function ⇒ Output Collection or Result Document]
current mongodb support for analytics
■ Inside the mapping function, "this" refers to the current document
■ Output mapped keys and values are generated via the emit function
■ Emit can be called zero or more times for a single document


function () { emit(this.Countryname, { count : 1 }); }

function () {
  for (var i = 0; i < this.Pupils.length; i++) {
    emit(this.Pupils[i].name, { count : 1 });
  }
}

function () {
  if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); }
}
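One way to see how "this" and emit interact is to run a map function outside MongoDB with a home-made emit. The sketch below is an assumption-laden illustration, not part of MongoDB: runMap, the emitted array, and the sample documents are all hypothetical.

```javascript
// Hypothetical local harness: runs a map function against sample documents
// outside MongoDB by supplying our own emit().
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

function runMap(mapFn, docs) {
  emitted = [];
  docs.forEach(function (doc) {
    mapFn.call(doc);   // bind `this` to the current document
  });
  return emitted;
}

// Map function in the style of the examples above: one emit per pupil.
function mapPupils() {
  for (var i = 0; i < this.Pupils.length; i++) {
    emit(this.Pupils[i].name, { count: 1 });
  }
}

// Made-up documents; the second has no pupils, so it emits nothing.
var classes = [
  { Pupils: [{ name: "Ann" }, { name: "Bob" }] },
  { Pupils: [] },
  { Pupils: [{ name: "Ann" }] }
];

var pairs = runMap(mapPupils, classes);
console.log(pairs);
```

Three documents go in and three key/value pairs come out, but not one per document: the first document emits twice, the second not at all.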
current mongodb support for analytics
■ The reduction function is used to aggregate the outputs from the mapping
  function
■ The function receives two inputs: the key for the elements being reduced, and
  the values being reduced
■ The result of the reduction must be the same format as in the input elements,
  and must be idempotent
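function (key, values) {
  var count = 0;
  // values is an array, so iterate by index (for...in would yield the indices)
  for (var i = 0; i < values.length; i++) {
    count += values[i].count;
  }
  // return the same shape that the mapping function emits
  return { "count" : count };
}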
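The "same format" and "idempotent" requirements exist because partial results may be fed back into the reduce function. A small check in plain JavaScript, using a hypothetical reduceCounts function and made-up values, shows what that means in practice:

```javascript
// Reduce function in the (key, values) -> value shape described above.
function reduceCounts(key, values) {
  var count = 0;
  for (var i = 0; i < values.length; i++) {
    count += values[i].count;
  }
  return { count: count };
}

// Made-up mapper output for a single key.
var values = [{ count: 1 }, { count: 1 }, { count: 1 }, { count: 1 }, { count: 1 }];

// Reduce everything in one pass...
var onePass = reduceCounts("Denmark", values);

// ...or reduce two chunks separately, then reduce the partial results.
var partialA = reduceCounts("Denmark", values.slice(0, 2));
var partialB = reduceCounts("Denmark", values.slice(2));
var twoPass = reduceCounts("Denmark", [partialA, partialB]);

// Reducing an already-reduced value must leave it unchanged.
var again = reduceCounts("Denmark", [onePass]);

console.log(onePass, twoPass, again);
```

All three results should agree. A reduce function that returned, say, a bare number instead of { count: n } would break as soon as its own output came back in as an input.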
current mongodb support for analytics
■ Map/Reduce utilizes JavaScript to do all of its work
  ○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)
  ○ Using external JS libraries is cumbersome and doesn't play well with sharding
  ○ No matter what language you're actually using, you'll be writing/maintaining
    JavaScript
■ Troubleshooting the Map/Reduce functions is primitive. 10gen's advice: "write
  your own emit function" (!)
■ Output options are flexible, but have some caveats
  ○ Output to a result document must fit in a BSON doc (16MB limit)
  ○ For an output collection: if you want indices on the result set, you need to
    pre-create the collection, then use the merge output option
current mongodb support for analytics
■ The Aggregation Framework is designed to alleviate some of the issues with
  Map/Reduce for common analytical queries
■ New in 2.2
■ Works by constructing a pipeline of operations on data. Similar to M/R, but
  implemented in native code (higher performance, not single-threaded)




[Diagram: Input Collection ⇒ Match ⇒ Project ⇒ Group]
current mongodb support for analytics
■ Filtering/paging ops
  ○ $match - utilize Mongo selection syntax to choose input docs
  ○ $limit
  ○ $skip
■ Field manipulation ops
  ○ $project - select which fields are processed. Can add new fields
  ○ $unwind - flattens a doc with an array field into multiple documents, one per
    array value
■ Output ops
  ○ $group
  ○ $sort
■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
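A pipeline in the common $match ⇒ $project ⇒ $group form is just an ordered list of stage documents. The sketch below builds one from stage examples in the style of this deck; the field names come from the slides, while the "athletes" collection name in the comment is hypothetical.

```javascript
// Pipeline stages as plain objects, in $match => $project => $group order.
var pipeline = [
  { $match: { Countryname: { $ne: "Great Britain" } } },   // filter first
  { $project: { Countryname: 1, Sportname: 1 } },          // keep two fields
  { $group: { _id: "$Countryname", count: { $sum: 1 } } }  // count per country
];

// In the mongo shell the stages would typically be handed to aggregate(),
// for example: db.athletes.aggregate(pipeline)   (collection name assumed)
var stageNames = pipeline.map(function (stage) { return Object.keys(stage)[0]; });
console.log(stageNames.join(" => "));
```

Order matters: each stage consumes the documents produced by the stage before it.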
current mongodb support for analytics
■ $match is very important to getting good performance
■ Needs to be the first op in the pipeline, otherwise indices can't be used
■ Uses normal MongoDB query syntax, with two exceptions
  ○ Can't use a $where clause (this requires JavaScript)
  ○ Can't use Geospatial queries (just because)


{ $match : { "Name" : "Fred" } }
{ $match : { "Countryname" : { $ne : "Great Britain" } } }
{ $match : { "Income" : { $exists : 1 } } }
current mongodb support for analytics
■ $project is used to select/compute/augment the fields you want in the output
  documents
  { $project : { "Countryname" : 1, "Sportname" : 1 } }
■ Can reference input document fields in computations via "$"
  { $project : { "country_name" : "$Countryname" } } /* renames field */
■ Computation of field values is possible, but it's limited and can be quite painful
  { $project: {
   "_id":0, "height":1, "weight":1,
   "bmi": { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] } }
  } /* omit "_id" field, inflict pain and suffering on future maintainers... */
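To read a nested expression like the bmi one, it can help to evaluate it by hand. The following is a hypothetical mini-evaluator in plain JavaScript covering only the four arithmetic operators; it is an illustration of how the operator documents nest, not how MongoDB evaluates them, and it handles top-level field references only.

```javascript
// Hypothetical mini-evaluator for a few $project arithmetic expressions.
function evalExpr(expr, doc) {
  if (typeof expr === "string" && expr.charAt(0) === "$") {
    return doc[expr.slice(1)];            // "$weight" -> doc.weight
  }
  if (typeof expr !== "object" || expr === null) {
    return expr;                          // literal value
  }
  var op = Object.keys(expr)[0];
  var args = expr[op].map(function (arg) { return evalExpr(arg, doc); });
  switch (op) {
    case "$add":      return args.reduce(function (a, b) { return a + b; }, 0);
    case "$multiply": return args.reduce(function (a, b) { return a * b; }, 1);
    case "$subtract": return args[0] - args[1];
    case "$divide":   return args[0] / args[1];
    default: throw new Error("unsupported operator: " + op);
  }
}

var bmiExpr = { $divide: ["$weight", { $multiply: ["$height", "$height"] }] };

// Made-up athlete: 80 kg, 2 m tall.
var bmi = evalExpr(bmiExpr, { weight: 80, height: 2 });
console.log(bmi);   // 80 / (2 * 2) = 20
```

Every arithmetic step becomes its own nested document with an argument array, which is where the verbosity comes from.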
current mongodb support for analytics
■ $group, like the group command, collates and computes sets of values based
  on the identity field ("_id"), and whatever other fields you want
  { $group : { "_id" : "$Countryname" } } /* distinct list of countries */
■ Aggregation operators can be used to perform computation ($max, $min, $avg,
  $sum)
  { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
  { $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
  { $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }
■ Set-based operations ($addToSet, $push)
  { $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
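The difference between $sum, $push and $addToSet is easiest to see on a tiny data set. This is a hypothetical in-memory equivalent in plain JavaScript, with made-up documents; the groupByCountry helper is not a MongoDB function.

```javascript
// Hypothetical in-memory equivalent of a $group stage with three accumulators.
function groupByCountry(docs) {
  var byCountry = Object.create(null);
  docs.forEach(function (doc) {
    var g = byCountry[doc.Countryname];
    if (!g) {
      g = byCountry[doc.Countryname] =
        { _id: doc.Countryname, count: 0, sports: [], allSports: [] };
    }
    g.count += 1;                               // like { $sum : 1 }
    g.allSports.push(doc.sport);                // like { $push : "$sport" }
    if (g.sports.indexOf(doc.sport) === -1) {   // like { $addToSet : "$sport" }
      g.sports.push(doc.sport);
    }
  });
  return Object.keys(byCountry).map(function (k) { return byCountry[k]; });
}

// Made-up athlete documents.
var athletes = [
  { Countryname: "Denmark", sport: "rowing" },
  { Countryname: "Denmark", sport: "rowing" },
  { Countryname: "Denmark", sport: "sailing" },
  { Countryname: "Kenya", sport: "athletics" }
];

var grouped = groupByCountry(athletes);
console.log(JSON.stringify(grouped));
```

For Denmark the pushed list keeps the duplicate "rowing" entry, while the set-style list holds each sport once.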
current mongodb support for analytics
■ Aggregation framework has a limited set of operators
  ○ $project limited to $add/$subtract/$multiply/$divide, as well as some
    boolean, string, and date/time operations
  ○ $group limited to $min/$max/$avg/$sum
■ Some operators, notably $group and $sort, are required to operate entirely in
  memory
  ○ This may prevent aggregation on large data sets
  ○ Can't work around using subsetting like you can with M/R, because output is
    strictly a document (no collection option yet)
current mongodb support for analytics
■ Even with these tools, there are still limitations
  ○ MongoDB is not relational. This means a lot of work on your part if you have
     datasets representing different things that you'd like to correlate. Clicks vs
     views, for example
  ○ While the Aggregation Framework alleviates some of the performance issues
     of Map/Reduce, it does so by throwing away flexibility
  ○ The best approach for parallelization (sharding) is fraught with operational
     challenges (come see me for horror stories)
Overview of Precog for MongoDB
Derek Chen-Becker
Precog Lead Infrastructure Engineer
@dchenbecker
overview of precog for mongodb
■ Download file: http://www.precog.com/mongodb
■ Setup:
$ unzip precog.zip
$ cd precog
$ emacs -nw config.cfg   # adjust ports, etc.
$ ./precog.sh
overview of precog for mongodb
■ Precog for MongoDB allows you to perform sophisticated analytics utilizing
  existing mongo instances
■ Self-contained JAR bundling:
  ○ The Precog Analytics service
  ○ Labcoat IDE for Quirrel
■ Does not include the full Precog stack
  ○ Minimal authentication handling (single api key in config)
  ○ No ingest service (just add data directly to mongo)
overview of precog for mongodb
■ Some sample queries

-- histogram by country
data := //summer_games/athletes
solve 'country
  { country: 'country,
    count: count(data where data.Countryname = 'country) }
Analyzing Big Data with Quirrel
John A. De Goes
Precog CEO/Founder
@jdegoes
overview

Quirrel is a statistically-oriented query language
designed for the analysis of large-scale, potentially
heterogeneous data sets.
quirrel

●   Simple
●   Set-oriented
●   Statistically-oriented
●   Purely declarative
●   Implicitly parallel
sneak peek

pageViews := //pageViews
avg := mean(pageViews.duration)
bound := 1.5 * stdDev(pageViews.duration)
pageViews.userId where
  pageViews.duration > avg + bound
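For readers more at home in JavaScript, the query above can be read roughly as follows. This is a sketch on made-up data, not a statement of how Quirrel evaluates it; in particular, using the population standard deviation is an assumption, and Quirrel's stdDev may be defined differently.

```javascript
// Plain-JavaScript reading of the query: users whose page-view duration is
// more than 1.5 standard deviations above the mean.
function mean(xs) {
  return xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
}
// Assumption: population standard deviation.
function stdDev(xs) {
  var m = mean(xs);
  return Math.sqrt(mean(xs.map(function (x) { return (x - m) * (x - m); })));
}

// Made-up page views.
var pageViews = [
  { userId: "u1", duration: 10 },
  { userId: "u2", duration: 10 },
  { userId: "u3", duration: 10 },
  { userId: "u4", duration: 10 },
  { userId: "u5", duration: 60 }
];

var durations = pageViews.map(function (pv) { return pv.duration; });
var avg = mean(durations);              // 20
var bound = 1.5 * stdDev(durations);    // 1.5 * 20 = 30
var outliers = pageViews
  .filter(function (pv) { return pv.duration > avg + bound; })
  .map(function (pv) { return pv.userId; });
console.log(outliers);
```

The Quirrel version needs no explicit map or filter calls: the reductions and the where clause apply to the whole set at once.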
quirrel speaks json

1
true
[[1, 0, 0], [0, 1, 0], [0, 0, 1]]

"All work and no play makes jack a dull boy"

{"age": 23, "gender": "female",
"interests": ["sports", "tennis"]}
comments

-- Ignore me.
(- Ignore
   me,
   too -)
basic expressions

2 * 4

(1 + 2) * 3 / 9 > 23

3 > 2 & (1 != 2)

false & true | !false
named expressions

x := 2

square := x * x
loading data

//pageViews

load("/pageViews")

//campaigns/summer/2012
drilldown

pageViews := load("/pageViews")

pageViews.userId

pageViews.keywords[2]
reductions

count(//pageViews)

sum(//purchases.total)

stdDev(//purchases.total)
filtering

pageViews := //pageViews

pageViews.userId where
  pageViews.duration > 1000
augmentation


clicks with
  {dow: dayOfWeek(clicks.time)}
standard library

import std::stats::rank

rank(//pageViews.duration)
user-defined functions

ctr(day) :=
  count(clicks where
        clicks.day = day) /
  count(impressions where
        impressions.day = day)

ctrOnMonday := ctr(1)

ctrOnMonday
grouping - implicit constraints

solve 'day
  {day: 'day,
   ctr: count(clicks where
              clicks.day = 'day) /
        count(impressions where
              impressions.day =
                          'day)}
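A rough JavaScript reading of the solve expression above: compute one click-through rate per day. The data is made up, and the sketch assumes that only days appearing in both clicks and impressions produce a result; how Quirrel resolves the implicit constraints may differ.

```javascript
// Made-up events; `day` is a day-of-week number.
var clicks = [{ day: 1 }, { day: 1 }, { day: 2 }];
var impressions = [
  { day: 1 }, { day: 1 }, { day: 1 }, { day: 1 },
  { day: 2 }, { day: 2 },
  { day: 3 }
];

function countByDay(events) {
  var counts = Object.create(null);
  events.forEach(function (e) { counts[e.day] = (counts[e.day] || 0) + 1; });
  return counts;
}

var clickCounts = countByDay(clicks);
var impressionCounts = countByDay(impressions);

// Assumption: a day yields a result only if both data sets contain it.
var ctrByDay = Object.keys(clickCounts)
  .filter(function (day) { return day in impressionCounts; })
  .map(function (day) {
    return { day: Number(day), ctr: clickCounts[day] / impressionCounts[day] };
  });
console.log(ctrByDay);
```

Day 3 has impressions but no clicks, so under the stated assumption it is left out rather than reported with a rate of zero.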
grouping - explicit constraints

solve 'day = purchases.day
  {day: 'day,
   cummTotal:
     sum(purchases.total where
         purchases.day < 'day)}
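The explicit constraint ties 'day to the days that actually occur in purchases, and the body then sums everything bought strictly before that day. A JavaScript sketch on made-up data follows; one labeled assumption is that the earliest day gets a total of 0 here, whereas Quirrel may instead omit a day with nothing before it.

```javascript
// Made-up purchases; `day` is a day number, `total` an amount.
var purchases = [
  { day: 1, total: 10 }, { day: 1, total: 5 },
  { day: 2, total: 20 }, { day: 3, total: 7 }
];

// The distinct days play the role of the 'day variable.
var days = purchases
  .map(function (p) { return p.day; })
  .filter(function (d, i, all) { return all.indexOf(d) === i; })
  .sort(function (a, b) { return a - b; });

var cumulative = days.map(function (day) {
  var earlier = purchases.filter(function (p) { return p.day < day; });
  return {
    day: day,
    // Assumption: an empty set of earlier purchases sums to 0.
    cummTotal: earlier.reduce(function (sum, p) { return sum + p.total; }, 0)
  };
});
console.log(cumulative);
```

Because the comparison is strict, each day's figure excludes that day's own purchases.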
questions?




http://quirrel-lang.org
Now, it's your turn! Win some cool prizes!

Precog Challenge Problems
precog challenge #1

■ Using the conversions data, find the state with
  the highest average income.
■ Variable names: conversions.customers.state
  and conversions.customers.income
precog challenge #2

■ Use Labcoat to display a bar chart of the clicks
  per month.
■ Variable names: clicks.timestamp
precog challenge #3

■ What product has the worst overall sales to
  women? To men?
■ Variable names: billing.product.ID, billing.product.price,
  billing.customer.gender
precog challenge #1 possible solution

conversions := //conversions
results := solve 'state
 {state: 'state,
 aveIncome: mean(conversions.customer.income where
                      conversions.customer.state = 'state)}
results where results.aveIncome = max(results.aveIncome)
precog challenge #2 possible solution

clicks := //clicks
clicks' := clicks with {month: std::time::monthOfYear(clicks.timeStamp)}

solve 'month
 {month: 'month, clicks: count(clicks'.product.price where clicks'.month = 'month)}
precog challenge #3 possible solution
billing := //billing
results := solve 'product, 'gender
 {product: 'product,
 gender: 'gender,
 sales: sum(billing.product.price where
       billing.product.ID = 'product &
       billing.customer.gender = 'gender)}

worstSalesToWomen := results where results.gender = "female" &
          results.sales = min(results.sales where results.gender = "female")
worstSalesToMen := results where results.gender = "male" &
       results.sales = min(results.sales where results.gender = "male")

worstSalesToWomen union worstSalesToMen
Thank you!
Follow us on Twitter
@precogio
@jdegoes
@dchenbecker

Download Precog for MongoDB for FREE:
www.precog.com/mongodb

Try Precog for free and get a free account:
www.precog.com

Subscribe to our monthly newsletter:
www.precog.com/about/newsletter
Precog & MongoDB User Group: Skyrocket Your Analytics

  • 1. Skyrocket your Analytics MongoDB Meetup on December 10, 2012 www.precog.com @precogio Nov - Dec 2012
  • 2. welcome & agenda ■ Welcome to the Precog & MongoDB Meetup! 7:00 - 7:30 Overview of Precog for MongoDB by Derek Chen-Becker 7:30 - 7:45 Break (grab a beer, drink and snacks) 7:45 - 8:15 Analyzing Big Data with Quirrel by John A. De Goes 8:15 - 8:30 Precog Challenge Problems! Win some prizes! ■ Questions? Please ask away!
  • 3. who we are ■ Precog Team Derek Chen-Becker, Lead Infrastructure Engineer John A. De Goes, CEO/Founder Kris Nuttycombe, Dir of Engineering Nathan Lubchenco, Developer Evangelist ■ MongoDB Host Clay Mcllrath ■ Thank you to Google for hosting us!
  • 4. Current MongoDB Support for Analytics Derek Chen-Becker Precog Lead Infrastructure Engineer @dchenbecker Nov - Dec 2012
  • 5. current mongodb support for analytics ■ Mongo has support for a small set of simple aggregation primitives ○ count - returns the count of a given collection's documents with optional filtering ○ distinct - returns the distinct values for given selector criteria ○ group - returns groups of documents based on given key criteria. Group cannot be used in sharded configurations
  • 6. current mongodb support for analytics > db.london_medals.group({ key : {"Country":1}, reduce : function(curr, result) { result.total += 1 }, initial: { total : 0, fullTotal: db.london_medals.count() }, finalize: function(result){ result.percent = result.total * 100 / result.fullTotal } }) [ {"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414}, {"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393}, {"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115}, ... ■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly ■ Group cannot be used in sharded configurations. For that you need...
  • 7. current mongodb support for analytics ■ Map/Reduce: Exactly what its name says. ■ You utilize JavaScript functions to map your documents' data, then reduce that data into a form of your choosing. Output Collection Input Mapping Function Reducing Function Collection Result Document
  • 8. current mongodb support for analytics ■ The mapping function redefines this to be the current document ■ Output mapped keys and values are generated via the emit function ■ Emit can be called zero or more times for a single document function () { emit(this.Countryname, { count : 1 }); } function () { for (var i = 0; i < this.Pupils.length; i++) { emit(this.Pupils[i].name, { count : 1}); } function () { if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); } }
  • 9. current mongodb support for analytics ■ The reduction function is used to aggregate the outputs from the mapping function ■ The function receives two inputs: the key for the elements being reduced, and the values being reduced ■ The result of the reduction must be the same format as in the input elements, and must be idempotent function (key, values) { var count = 0; for (var item in values) { count += item.count } { "count" : count } }
  • 10. current mongodb support for analytics ■ Map/Reduce utilizes JavaScript to do all of its work ○ JavaScript in MongoDB is currently single-threaded (performance bottleneck) ○ Using external JS libraries is cumbersome and doesn't play well with sharding ○ No matter what language you're actually using, you'll be writing/maintaining JavaScript ■ Troubleshooting the Map/Reduce functions is primitive. 10Gen's advice: "write your own emit function" (!) ■ Output options are flexible, but have some caveats ○ Output to a result document must fit in a BSON doc (16MB limit) ○ For an output collection: if you want indices on the result set, you need to pre- create the collection then use the merge output option
  • 11. current mongodb support for analytics ■ The Aggregation Framework is designed to alleviate some of the issues with Map/Reduce for common analytical queries ■ New in 2.2 ■ Works by constructing a pipeline of operations on data. Similar to M/R, but implemented in native code (higher performance, not single-threaded) Input Match Project Group Collection
  • 12. current mongodb support for analytics ■ Filtering/paging ops ○ $match - utilize Mongo selection syntax to choose input docs ○ $limit ○ $skip ■ Field manipulation ops ○ $project - select which fields are processed. Can add new fields ○ $unwind - flattens a doc with an array field into multiple events, one per array value ■ Output ops ○ $group ○ $sort ■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
  • 13. current mongodb support for analytics ■ $match is very important to getting good performance ■ Needs to be the first op in the pipeline, otherwise indices can't be used ■ Uses normal MongoDB query syntax, with two exceptions ○ Can't use a $where clause (this requires JavaScript) ○ Can't use Geospatial queries (just because) { $match : { "Name" : "Fred" } } { $match : { "Countryname" : { $neq : "Great Britain" } } } { $match : { "Income" : { $exists : 1 } } }
  • 14. current mongodb support for analytics ■ $project is used to select/compute/augment the fields you want in the output documents { $project : { "Countryname" : 1, "Sportname" : 1 } } ■ Can reference input document fields in computations via "$" { $project : { "country_name" : "$Countryname" } } /* renames field */ ■ Computation of field values is possible, but it's limited and can be quite painful { $project: { "_id":0, "height":1, "weight":1, "bmi": { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] } } } /* omit "_id" field, inflict pain and suffering on future maintainers... */
  • 15. current mongodb support for analytics
    ■ $group, like the group command, collates and computes sets of values based on the identity field ("_id"), and whatever other fields you want
    { $group : { "_id" : "$Countryname" } } /* distinct list of countries */
    ■ Aggregation operators can be used to perform computation ($max, $min, $avg, $sum)
    { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
    { $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
    { $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }
    ■ Set-based operations ($addToSet, $push)
    { $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
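As a rough illustration of what these accumulators compute, the sketch below reproduces `$sum`, `$avg`, and `$addToSet` per country in plain JavaScript. `groupByCountry` and the sample documents are hypothetical; the output field names `count`, `totalWeight`, `avgWeight`, and `sports` are chosen for this example only.

```javascript
// Made-up input documents using the field names from the slide.
const athletes = [
  { Countryname: "Denmark", weight: 70, sport: "Rowing" },
  { Countryname: "Denmark", weight: 80, sport: "Sailing" },
  { Countryname: "Denmark", weight: 90, sport: "Rowing" },
  { Countryname: "Kenya", weight: 60, sport: "Athletics" },
];

// Hypothetical helper: one pass over the input, one accumulator object per key.
function groupByCountry(docs) {
  const groups = new Map();
  for (const d of docs) {
    if (!groups.has(d.Countryname)) {
      groups.set(d.Countryname, { _id: d.Countryname, count: 0, totalWeight: 0, sports: [] });
    }
    const g = groups.get(d.Countryname);
    g.count += 1;                                            // { $sum : 1 }
    g.totalWeight += d.weight;                               // { $sum : "$weight" }
    if (!g.sports.includes(d.sport)) g.sports.push(d.sport); // { $addToSet : "$sport" }
  }
  return [...groups.values()].map((g) => ({
    _id: g._id,
    count: g.count,
    totalWeight: g.totalWeight,
    avgWeight: g.totalWeight / g.count,                      // { $avg : "$weight" }
    sports: g.sports, // insertion order here; MongoDB does not promise an order for $addToSet
  }));
}

const result = groupByCountry(athletes);
console.log(result);
```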
  • 16. current mongodb support for analytics
    ■ Aggregation framework has a limited set of operators
      ○ $project limited to $add/$subtract/$multiply/$divide, as well as some boolean, string, and date/time operations
      ○ $group limited to $min/$max/$avg/$sum
    ■ Some operators, notably $group and $sort, are required to operate entirely in memory
      ○ This may prevent aggregation on large data sets
      ○ Can't work around using subsetting like you can with M/R, because output is strictly a document (no collection option yet)
  • 17. current mongodb support for analytics
    ■ Even with these tools, there are still limitations
      ○ MongoDB is not relational. This means a lot of work on your part if you have datasets representing different things that you'd like to correlate. Clicks vs views, for example
      ○ While the Aggregation Framework alleviates some of the performance issues of Map/Reduce, it does so by throwing away flexibility
      ○ The best approach for parallelization (sharding) is fraught with operational challenges (come see me for horror stories)
  • 18. Overview of Precog for MongoDB Derek Chen-Becker Precog Lead Infrastructure Engineer @dchenbecker
  • 19. overview of precog for mongodb
    ■ Download file: http://www.precog.com/mongodb
    ■ Setup:
        $ unzip precog.zip
        $ cd precog
        $ emacs -nw config.cfg (adjust ports, etc)
        $ ./precog.sh
  • 20. overview of precog for mongodb
    ■ Precog for MongoDB allows you to perform sophisticated analytics utilizing existing mongo instances
    ■ Self-contained JAR bundling:
      ○ The Precog Analytics service
      ○ Labcoat IDE for Quirrel
    ■ Does not include the full Precog stack
      ○ Minimal authentication handling (single api key in config)
      ○ No ingest service (just add data directly to mongo)
  • 21. overview of precog for mongodb
    ■ Some sample queries
        -- histogram by country
        data := //summer_games/athletes
        solve 'country
          { country: 'country, count: count(data where data.Countryname = 'country) }
  • 22. Analyzing Big Data with Quirrel John A. De Goes Precog CEO/Founder @jdegoes
  • 23. overview Quirrel is a statistically-oriented query language designed for the analysis of large-scale, potentially heterogeneous data sets.
  • 24. quirrel ● Simple ● Set-oriented ● Statistically-oriented ● Purely declarative ● Implicitly parallel
  • 25. sneak peek
        pageViews := //pageViews
        avg := mean(pageViews.duration)
        bound := 1.5 * stdDev(pageViews.duration)
        pageViews.userId where pageViews.duration > avg + bound
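The sneak-peek query selects the users whose page-view duration lies more than 1.5 standard deviations above the mean. A hedged JavaScript rendering of the same arithmetic is shown below; the sample data is made up, and it assumes a population standard deviation, since the slide does not say which variant Quirrel's `stdDev` computes.

```javascript
// Made-up page views; only userId and duration matter here.
const pageViews = [
  { userId: "u1", duration: 10 },
  { userId: "u2", duration: 12 },
  { userId: "u3", duration: 11 },
  { userId: "u4", duration: 9 },
  { userId: "u5", duration: 100 },
];

const durations = pageViews.map((p) => p.duration);

// avg := mean(pageViews.duration)
const avg = durations.reduce((a, b) => a + b, 0) / durations.length;

// bound := 1.5 * stdDev(pageViews.duration)
// Assumption: population standard deviation (divide by n, not n - 1).
const variance = durations.reduce((s, d) => s + (d - avg) ** 2, 0) / durations.length;
const bound = 1.5 * Math.sqrt(variance);

// pageViews.userId where pageViews.duration > avg + bound
const slowUsers = pageViews
  .filter((p) => p.duration > avg + bound)
  .map((p) => p.userId);

console.log(slowUsers);
```

With this sample only the 100-unit view clears the threshold, and it would do so under either standard-deviation variant.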
  • 26. quirrel speaks json
        1
        true
        [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
        "All work and no play makes jack a dull boy"
        {"age": 23, "gender": "female", "interests": ["sports", "tennis"]}
  • 27. comments
        -- Ignore me.
        (- Ignore me, too -)
  • 28. basic expressions
        2 * 4
        (1 + 2) * 3 / 9 > 23
        3 > 2 & (1 != 2)
        false & true | !false
  • 29. named expressions
        x := 2
        square := x * x
  • 33. filtering
        pageViews := //pageViews
        pageViews.userId where pageViews.duration > 1000
  • 34. augmentation clicks with {dow: dayOfWeek(clicks.time)}
  • 36. user-defined functions
        ctr(day) := count(clicks where clicks.day = day) / count(impressions where impressions.day = day)
        ctrOnMonday := ctr(1)
        ctrOnMonday
  • 37. grouping - implicit constraints
        solve 'day
          {day: 'day,
           ctr: count(clicks where clicks.day = 'day) /
                count(impressions where impressions.day = 'day)}
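For readers more used to imperative code, the `solve 'day` query above can be approximated as "for every day value, compute clicks divided by impressions". The sketch below is a rough JavaScript analogue rather than a statement of Quirrel's semantics: it assumes `'day` ranges over the day values present in both data sets, and the sample rows and the `countWhere` helper are invented for the example.

```javascript
// Made-up rows: two clicks on day 1, one click on day 2, four impressions per day.
const clicks = [{ day: 1 }, { day: 1 }, { day: 2 }];
const impressions = [
  { day: 1 }, { day: 1 }, { day: 1 }, { day: 1 },
  { day: 2 }, { day: 2 }, { day: 2 }, { day: 2 },
];

// count(rows where rows.day = day)
const countWhere = (rows, day) => rows.filter((r) => r.day === day).length;

// Assumption: 'day takes every value that appears in both sets.
const days = [...new Set(clicks.map((c) => c.day))]
  .filter((day) => impressions.some((i) => i.day === day));

// One result object per day, mirroring {day: 'day, ctr: ...}.
const ctrByDay = days.map((day) => ({
  day,
  ctr: countWhere(clicks, day) / countWhere(impressions, day),
}));

console.log(ctrByDay);
```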
  • 38. grouping - explicit constraints
        solve 'day = purchases.day
          {day: 'day,
           cummTotal: sum(purchases.total where purchases.day < 'day)}
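The explicit constraint binds `'day` to the values of `purchases.day`, and the body sums every purchase made strictly before that day. A hedged JavaScript analogue follows; the sample purchases are made up, and the sketch reports 0 for the earliest day, whereas the slide does not say how Quirrel treats a sum over an empty set.

```javascript
// Made-up purchases.
const purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 20 },
  { day: 2, total: 5 },
  { day: 3, total: 7 },
];

// 'day = purchases.day: one group per distinct day value.
const purchaseDays = [...new Set(purchases.map((p) => p.day))].sort((a, b) => a - b);

// cummTotal: sum(purchases.total where purchases.day < 'day)
// Note the strict "<": purchases made on the day itself are not included.
const cumulative = purchaseDays.map((day) => ({
  day,
  cummTotal: purchases
    .filter((p) => p.day < day)
    .reduce((sum, p) => sum + p.total, 0),
}));

console.log(cumulative);
```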
  • 40. Now, it's your turn! Win some cool prizes! Precog Challenge Problems
  • 41. precog challenge #1 ■ Using the conversions data, find the state with the highest average income. ■ Variable names: conversions.customers.state and conversions.customers.income
  • 42. precog challenge #2 ■ Use Labcoat to display a bar chart of the clicks per month. ■ Variable names: clicks.timestamp
  • 43. precog challenge #3 ■ What product has the worst overall sales to women? To men? ■ Variable names: billing.product.ID, billing.product.price, billing.customer.gender
  • 44. precog challenge #1 possible solution
        conversions := //conversions
        results := solve 'state
          {state: 'state,
           aveIncome: mean(conversions.customer.income where conversions.customer.state = 'state)}
        results where results.aveIncome = max(results.aveIncome)
  • 45. precog challenge #2 possible solution
        clicks := //clicks
        clicks' := clicks with {month: std::time::monthOfYear(clicks.timeStamp)}
        solve 'month
          {month: 'month,
           clicks: count(clicks'.product.price where clicks'.month = 'month)}
  • 46. precog challenge #3 possible solution
        billing := //billing
        results := solve 'product, 'gender
          {product: 'product,
           gender: 'gender,
           sales: sum(billing.product.price where
                      billing.product.ID = 'product &
                      billing.customer.gender = 'gender)}
        worstSalesToWomen := results where results.gender = "female" &
          results.sales = min(results.sales where results.gender = "female")
        worstSalesToMen := results where results.gender = "male" &
          results.sales = min(results.sales where results.gender = "male")
        worstSalesToWomen union worstSalesToMen
  • 47. Thank you!
    Follow us on Twitter: @precogio @jdegoes @dchenbecker
    Download Precog for MongoDB for FREE: www.precog.com/mongodb
    Try Precog for free and get a free account: www.precog.com
    Subscribe to our monthly newsletter: www.precog.com/about/newsletter