Building Social Analytics Tool with MongoDB - A Developer's Perspective
Agenda
1. Product Overview
2. Why MongoDB for us?
3. Aggregation Queries to the rescue
4. How Javascript helped us?
5. Experiences with Indexes
6. In-progress use-cases
7. Tips & Tricks
8. Demo
About me
Abhishek Tejpaul
Software Developer @ IntelliGrape Software
Loves Grails, Git and Linux
abhishek@intelligrape.com
Product Overview – Information Flow
Diagram labels: DataSift, Instagram, Web Crawler1, Web Crawler…, mongoDB
Product Overview – Results
Why MongoDB for us?
• Schema-less data. Typical data sources
• Adding new social platforms in future
• Needed fast read-write operations
Aggregation Queries – Getting Insights
• Combination of queries chained together
• At every stage, we can filter/chain/massage data
Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
Our use-case (esp. for graphs)
• Sentiment Analysis
• Demographic Analysis
• Article Analysis
• Plan
    • Creation of Intelligence tables in advance
• Reality
    • On-the-fly analysis using Aggregation queries
How to go about it?
• Operates on a single collection
• Think about the data you have and the insights you want
• Focus on reducing data size early on
• $match
• $project
• $sort
• $limit, $skip
• Example
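    db.collectionName.aggregate([
        // Filter first so later stages see fewer documents
        { "$match"   : { fieldName : matchingValue } },
        { "$project" : { oldOrNewField : fieldValue } },
        // $group requires an _id; here it groups on the projected field
        { "$group"   : { "_id" : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
        { "$sort"    : { "sum" : -1 } },
        { "$limit"   : 20 }
    ])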
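To make the stage order concrete, here is a minimal plain-JavaScript sketch that mimics $match, $group, $sort and $limit on an in-memory array. It is not the MongoDB implementation and needs no server; the sample documents, the field names `platform` and `sentiment`, and the helper `topSentiments` are assumptions made up for illustration.

```javascript
// Hypothetical sample documents; the field names are illustrative only.
const posts = [
  { platform: "twitter", sentiment: "positive" },
  { platform: "twitter", sentiment: "negative" },
  { platform: "instagram", sentiment: "positive" },
  { platform: "twitter", sentiment: "positive" },
  { platform: "web", sentiment: "neutral" },
];

// Mimics: $match -> $group (count per sentiment) -> $sort (desc) -> $limit.
function topSentiments(docs, platform, limit) {
  // $match: shrink the working set as early as possible.
  const matched = docs.filter((doc) => doc.platform === platform);

  // $group: one bucket per distinct sentiment, counting documents ({ $sum: 1 }).
  const counts = new Map();
  for (const doc of matched) {
    counts.set(doc.sentiment, (counts.get(doc.sentiment) || 0) + 1);
  }

  // $sort by count descending, then $limit.
  return Array.from(counts, ([sentiment, sum]) => ({ _id: sentiment, sum }))
    .sort((a, b) => b.sum - a.sum)
    .slice(0, limit);
}

console.log(topSentiments(posts, "twitter", 20));
```

Because the filter runs first, the grouping step only touches the documents that survive the match, which is the point of reducing data size early.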
  
	
  
Javascript Capabilities
• All the programming capabilities of the Javascript language at your disposal
• Taking business logic / processing to your data-store
Javascript – Our use-cases
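• Remove garbage data at DB level
    • Wrong results from Twitter
    • Filtering out STOP keywords

    // For each stop word, pull matching topics from active articles
    db.IgnoreList.findOne().stopWords.forEach(function(data) {
        // Note: the first pass sets isIgnored on every matched document, so the
        // filter below no longer matches them for the remaining stop words.
        db.ProcessedArticle.update(
            { "isActive" : true, "isIgnored" : { "$ne" : true } },
            {
                "$pull" : { "topicOfDiscussion" : { "name" : data } },
                "$set"  : { "isIgnored" : true }
            },
            { "multi" : true }
        )
    });
    return true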
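The effect of the `$pull` and `$set` in the update above can be pictured in plain JavaScript on in-memory objects. This only imitates the update for illustration and does not talk to MongoDB; the helper `removeStopWords` and the sample articles are hypothetical, and the sketch removes all stop words in a single pass rather than issuing one update per word.

```javascript
// Imitates { $pull: { topicOfDiscussion: { name: <stop word> } } } together with
// { $set: { isIgnored: true } } for active articles that are not yet ignored.
function removeStopWords(articles, stopWords) {
  const stop = new Set(stopWords);
  for (const article of articles) {
    // Same filter as the update: isActive is true and isIgnored is not true.
    if (article.isActive !== true || article.isIgnored === true) continue;
    article.topicOfDiscussion = article.topicOfDiscussion.filter(
      (topic) => !stop.has(topic.name)
    );
    article.isIgnored = true;
  }
  return articles;
}

// Hypothetical sample data.
const articles = [
  { isActive: true, topicOfDiscussion: [{ name: "the" }, { name: "mongodb" }] },
  { isActive: false, topicOfDiscussion: [{ name: "the" }] },
];
removeStopWords(articles, ["the", "and"]);
console.log(JSON.stringify(articles));
```

Only the active article loses its stop-word topic and gets flagged; the inactive one is left untouched.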
  
	
  
Javascript – Caveats
• Takes up read-write locks on the entire database
    • Can be run with the { nolock : true } option

    db.runCommand({
        eval   : <function>,
        args   : <args>,
        nolock : <true/false>
    })

• Can be replaced by mapreduce in most cases
• Treat it as a one-off case
Indexes – Our use-cases
• dropDups
    {dropDups : true}
• background
    {background : true}
• Time to Live
    {expireAfterSeconds : 3600}
• Compound Indexing
    {key1 : 1, key2 : 1} != {key1 : 1}
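As a rough illustration of what `expireAfterSeconds` means, the sketch below checks in plain JavaScript whether a document is old enough for a TTL index to remove it. It does not talk to MongoDB; the helper `isExpired`, the `createdAt` field name and the sample dates are assumptions for the example, and the real server deletes expired documents in a periodic background task rather than at the exact second.

```javascript
// A document is eligible for removal once the indexed date field is at least
// expireAfterSeconds in the past.
function isExpired(doc, dateField, expireAfterSeconds, now) {
  const value = doc[dateField];
  // Documents without a Date in the indexed field never expire in this sketch.
  if (!(value instanceof Date)) return false;
  return now.getTime() - value.getTime() >= expireAfterSeconds * 1000;
}

const now = new Date("2014-01-01T12:00:00Z");
const fresh = { createdAt: new Date("2014-01-01T11:30:00Z") }; // 30 minutes old
const stale = { createdAt: new Date("2014-01-01T10:00:00Z") }; // 2 hours old

console.log(isExpired(fresh, "createdAt", 3600, now)); // false
console.log(isExpired(stale, "createdAt", 3600, now)); // true
```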
  	
  
Our current state
• Faster write operations
    • Under high data load from different sources
• Faster read operations
    • Graph rendering up to 10x quicker
• Ease of scalability
    • Though yet to reach there
Work In Progress
• Full-text search implementation
    • A text index can be created only on strings or arrays of strings
    • db.collectionName.ensureIndex( { fieldName : "text" } )
• Capped Collections
    • Widgets for last-run jobs / event log tables
    • Very fast writes possible
    • db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
    • size argument is always required
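The `max` behaviour of a capped collection can be pictured with a small plain-JavaScript class that keeps only the newest entries. This is an in-memory analogy, not how MongoDB stores data; the class name `CappedLog` and the job names are hypothetical, and the byte-based `size` limit is deliberately ignored.

```javascript
// Keeps at most `max` entries in insertion order; the oldest entry is
// discarded when a new one would exceed the cap.
class CappedLog {
  constructor(max) {
    this.max = max;
    this.entries = [];
  }

  insert(entry) {
    this.entries.push(entry);
    if (this.entries.length > this.max) {
      this.entries.shift(); // drop the oldest entry
    }
  }

  // Entries come back in insertion order, oldest first.
  find() {
    return this.entries.slice();
  }
}

const jobLog = new CappedLog(3);
["job-1", "job-2", "job-3", "job-4"].forEach((name) => jobLog.insert({ name }));
console.log(jobLog.find().map((entry) => entry.name));
```

A "last-run jobs" widget only ever needs the newest few records, which is why a fixed-size, insertion-ordered store fits that use-case.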
  
Tips / Tricks – Things we learnt
• cloneCollection
    • No more ssh/scp to remote systems
    • db.runCommand({cloneCollection: <nsCollection>, from: <remote>, query: {}})
    • db.cloneCollection(from, collectionName, query)
• db.CollectionName.copyTo
    • does not copy indexes
Tips / Tricks – Things we learnt
• remove() vs drop()
    • Can't use remove() for capped collections
    • remove() keeps indexes while drop() clears them
    • To remove all the documents in a collection, use drop()
    • To remove the better part of a large collection, use Javascript
• pretty() output for find() by default
    • DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)
DEMO
I am not a MongoDB expert though :)
Thank You!!

 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Building Social Analytics Tool with MongoDB

  • 1. Building Social Analytics Tool with MongoDB - A Developer's Perspective
  • 2. Agenda
      1. Product Overview
      2. Why MongoDB for us?
      3. Aggregation Queries to the rescue
      4. How JavaScript helped us?
      5. Experiences with Indexes
      6. In-progress use-cases
      7. Tips & Tricks
      8. Demo
  • 3. About me
      Abhishek Tejpaul
      Software Developer @ IntelliGrape Software
      Loves Grails, Git and Linux
      abhishek@intelligrape.com
  • 4. Product Overview – Information Flow (diagram labels: DataSift, Instagram, Web Crawler1, Web Crawler…, mongoDB)
  • 8. Why MongoDB for us?
      • Schema-less data. Typical data sources
      • Adding new social platforms in future
      • Needed fast read-write operations
  • 9. Aggregation Queries – Getting Insights
      • Combination of queries chained together
      • At every stage, we can filter/chain/massage data
      Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
  • 10. Our use-case (esp. for graphs)
      • Sentiment Analysis
      • Demographic Analysis
      • Article Analysis
      • Plan
          • Creation of Intelligence tables in advance
      • Reality
          • On-the-fly analysis using Aggregation queries
  • 11. How to go about it?
      • Operates on a single collection
      • Think about the data you have and the insights you want
      • Focus on reducing data size early on
          • $match
          • $project
          • $sort
          • $limit, $skip
      • Example:
          db.collectionName.aggregate(
            { "$match" : { fieldName : matchingValue } },
            { "$project" : { oldOrNewField : fieldValue } },
            { "$group" : { "_id" : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
            { "$sort" : { "sum" : -1 } },
            { "$limit" : 20 }
          )
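The stage order on this slide can be sketched as a small helper that assembles the pipeline as plain JavaScript objects. This is a minimal sketch, not the author's code: the helper name `buildTopValuesPipeline` and the `source` field are hypothetical, while `isActive` and `ProcessedArticle` are borrowed from a later slide purely for illustration.

```javascript
// Builds the stages for a "top N values of a field" aggregation.
// The helper and its parameter names are hypothetical; adapt them to your documents.
function buildTopValuesPipeline(matchField, matchValue, groupField, limit) {
  return [
    // Filter first so later stages see as few documents as possible.
    { $match: { [matchField]: matchValue } },
    // Keep only the grouped field (plus _id, which $project keeps by default).
    { $project: { [groupField]: 1 } },
    // $group requires the grouping key to be named _id.
    { $group: { _id: "$" + groupField, sum: { $sum: 1 } } },
    // Largest counts first, then cut the result down to the top entries.
    { $sort: { sum: -1 } },
    { $limit: limit }
  ];
}

const pipeline = buildTopValuesPipeline("isActive", true, "source", 20);
// In the mongo shell: db.ProcessedArticle.aggregate(pipeline)
```

Keeping the pipeline as data makes it easy to inspect or log before handing it to `aggregate`.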
  • 12. JavaScript Capabilities
      • All the programming capabilities of the JavaScript language at your disposal
      • Taking business logic / processing to your data-store
  • 13. JavaScript – Our use-cases
      • Remove garbage data at DB level
          • Twitter wrong results
          • Filtering out STOP keywords
      db.IgnoreList.findOne().stopWords.forEach( function(data) {
        db.ProcessedArticle.update(
          { "isActive" : true, "isIgnored" : { "$ne" : true } },
          {
            "$pull" : { "topicOfDiscussion" : { "name" : data } },
            "$set" : { "isIgnored" : true }
          },
          { "multi" : true }
        )
      });
      return true
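As a possible variation on the stop-word loop above, the same cleanup can be expressed as a single multi-document update that uses `$in` inside `$pull`. This is a sketch under assumptions, not the author's method: the helper `buildStopWordUpdate` is hypothetical, and the collection and field names are taken from the slide. One reason to consider it: because the slide's filter skips documents already flagged `isIgnored`, it appears that passes after the first stop word would match few or no documents, whereas a single update handles all stop words in one pass.

```javascript
// Builds one multi-document update that strips every stop word at once.
// Field names follow the slide; the helper itself is hypothetical.
function buildStopWordUpdate(stopWords) {
  return {
    filter: { isActive: true, isIgnored: { $ne: true } },
    update: {
      // Pull any topic whose name appears in the stop-word list.
      $pull: { topicOfDiscussion: { name: { $in: stopWords } } },
      // Flag the document so it is not processed again.
      $set: { isIgnored: true }
    },
    options: { multi: true }
  };
}

const spec = buildStopWordUpdate(["the", "and", "of"]);
// In the mongo shell:
// db.ProcessedArticle.update(spec.filter, spec.update, spec.options)
```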
  • 14. JavaScript – Caveats
      • Takes up read-write locks on the entire database
      • Can be run with the { nolock : true } option
          db.runCommand({
            eval: <function>,
            args: <args>,
            nolock: <true/false>
          })
      • Can be replaced by mapReduce in most cases
      • Treat it as a one-off case
  • 15. Indexes – Our use-cases
      • dropDups: { dropDups : true }
      • background: { background : true }
      • Time to Live: { expireAfterSeconds : 3600 }
      • Compound Indexing: { key1 : 1, key2 : 1 } != { key1 : 1 }
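One way to read the compound-index bullet above is through the prefix rule: a compound index generally helps queries that filter on its leading keys, so `{ key1 : 1, key2 : 1 }` is not interchangeable with `{ key1 : 1 }`, and a query on `key2` alone generally cannot use it efficiently. The sketch below is a deliberately simplified model of that rule, not MongoDB's query planner, and the function name `usableIndexPrefix` is hypothetical.

```javascript
// Simplified model of the compound-index prefix rule; not MongoDB's query planner.
// indexSpec: an index definition such as { key1: 1, key2: 1 }
// queryFields: names of the fields the query filters on
function usableIndexPrefix(indexSpec, queryFields) {
  const prefix = [];
  for (const key of Object.keys(indexSpec)) {
    // Stop at the first index key the query does not filter on.
    if (!queryFields.includes(key)) break;
    prefix.push(key);
  }
  return prefix;
}

const compound = { key1: 1, key2: 1 };
const onBoth = usableIndexPrefix(compound, ["key1", "key2"]); // ["key1", "key2"]
const onFirst = usableIndexPrefix(compound, ["key1"]);        // ["key1"]
const onSecond = usableIndexPrefix(compound, ["key2"]);       // []
```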
  • 16. Our current state
      • Faster write operations
          • Under high data load from different sources
      • Faster read operations
          • Graph rendering up to 10x quicker
      • Ease of scalability
          • Though yet to reach there
  • 17. Work In Progress
      • Full-text search implementation
          • Can be created only on strings or arrays of strings
          • db.collectionName.ensureIndex( { fieldName : "text" } )
      • Capped Collections
          • Widgets for last-run jobs / event log tables
          • Very fast writes possible
          • db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
          • size argument is always required
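The `size` value in the capped-collection example is given in bytes, and 5242880 is 5 × 1024 × 1024, i.e. 5 MiB. A minimal sketch of building those options follows; the helper `cappedOptions` and its parameter names are hypothetical.

```javascript
// Builds the options document for db.createCollection(name, options).
// sizeMiB and maxDocs are hypothetical parameter names for this sketch.
function cappedOptions(sizeMiB, maxDocs) {
  // size is expressed in bytes and is required for capped collections.
  const options = { capped: true, size: sizeMiB * 1024 * 1024 };
  // max is an optional cap on the number of documents.
  if (maxDocs !== undefined) options.max = maxDocs;
  return options;
}

const eventLogOptions = cappedOptions(5, 5000);
// eventLogOptions.size is 5242880, the value used on the slide.
// In the mongo shell: db.createCollection("cName", eventLogOptions)
```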
  • 18. Tips / Tricks – Things we learnt
      • cloneCollection
          • No more ssh/scp to remote systems
          • db.runCommand({ cloneCollection: <nsCollection>, from: <remote>, query: {} })
          • db.cloneCollection(from, collectionName, query)
      • db.CollectionName.copyTo
          • Does not copy indexes
  • 19. Tips / Tricks – Things we learnt
      • remove() vs drop()
          • Can't use remove() for capped collections
          • remove() keeps indexes while drop() clears them
          • To remove all the documents in a collection, use drop()
          • To remove the better part of a large collection, use JavaScript
      • pretty() output for find by default
          • DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)
  • 21. I am not a MongoDB expert though :)