An Open Talk at DeveloperWeek Austin 2017 by Kimberly Wilkins (@dba_denizen), Principal Engineer - Databases at ObjectRocket. Featuring new use cases like Bitcoin, AI, IoT, and all the cool things.
8. www.objectrocket.com
Data is Coming From Everywhere
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”
-Dan Ariely, Duke University
Remember
• Hold the data
• Find the data fast
• Stream the data between data stores
• Process the data along the way
• Analyze the data
• Understand where the data comes from
Why?
• Faster, more flexible development
• Lower $ (hardware, software, deployment)
• Performance (faster writes, faster reads)
• Developers (“Schemaless”, cool toys)
• More devs than DBAs, DevOps, SREs…
• Variety of NoSQL technologies
MongoDB
"MongoDB (from humongous) is a free and open-source
cross-platform document-oriented database program.
Classified as a NoSQL database program, MongoDB
uses JSON-like documents with schemas."
– straight from Wikipedia
• #1 NoSQL
• #5 Overall
Features: MongoDB
Document store
collections vs tables; document or objectId’s
Easy for developers – more devs than DBA’s and Ops
flexible data types
Unstructured & structured data
De-normalized
Duplicate data is OK
Index intersection, partial indexes, aggregation pipelines with $lookup
Improvements coming in 3.6 (November): single DB call; updating arrays
Scales vertically or horizontally - sharding
MongoDB Architectural Basics
• Faster, more flexible development
• Built-in Replication via Replica sets
• HA/DR throughout stack, components
• Scaling via Sharding
• DR via use of Multiple Data Centers
• Delayed and/or hidden secondaries
• https://www.objectrocket.com/files/objectrocket-for-mongodb-white-paper.pdf
MongoDB Architecture - Advanced
• Multiple Storage Engine Options
• HA/DR throughout stack, components
• Scaling via Sharding
• DR via use of multiple data centers, delayed/hidden secondaries
• Percona Server for MongoDB - has security features from MongoDB Enterprise edition
Best Use Cases
• User Data - games, chat, social media
• Mobile Analytics, Engagement/Campaigns
• Aggregation Summaries
• Product Catalogs
• Inventory Management
• Shopping Carts
• Content Management Systems - Sitecore
Elasticsearch
“Elasticsearch is a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management.”
– straight from the Elastic.co website
Elasticsearch Building Blocks
● Cluster - A collection of Elasticsearch nodes of various roles
↳ Nodes - Elasticsearch processes that perform one or more roles
● Roles are: master, data, ingest, coordinating-only (client)
● Nodes can operate in any combination or all roles
↳ Indexes - A collection of data (like databases/collections)
● Can be combined in queries with wildcards and aliases
● Fields in an index have an unchangeable data type (mapping)
↳ Shards - Slices of the index data
● Unlike many databases, automatically constructed (not key based)
● A replica is just a read-only copy of a shard
↳ Segments - Lucene’s chunk of data
● Automatically built as data is indexed.
● Docs are not deleted, just marked as deleted (can be optimized/merged)
↳ Documents - A JSON entry in the index
Elasticsearch vs. Elastic Stack
• Don’t be confused!
• Elasticsearch vs. Elastic Stack
• The Open Source Elastic Stack is a suite of
tools/apps associated with and working in
conjunction with Elasticsearch to complete a variety
of analytics tasks.
Basic Elastic Architecture
• 3 nodes, 1 replica, 1 master: fewer nodes, more resources per node, each shard performs better
• 3 nodes, 2 replicas, 1 master: more nodes, needs more HW resources but increases search performance for the index and improves redundancy
Best Use Cases
• Full and fuzzy text searches (the true strength: speed)
• Geo and range related searches
• Visualizing data with other Elastic Stack components, e.g. Kibana
• Logging and log analysis (replacing Splunk)
• Scraping and Combining Public Data Sources
• Event and Data Metrics
Visualization with Kibana
MongoDB vs. Elastic (Elasticsearch)
• MongoDB: General-purpose document store DB, server-side scripts, some aggregation pipelines. OLTP = good, REPORTING = not as good. Simple = good, Complex = good, Very Complex = not as good.
  Elasticsearch: Full-text search engine, fuzzy text search, geo near, keyword, real-time analytics, indexer, distributed, Java based w/Lucene under the covers.
• MongoDB: Current version: 3.4.10 *Halloween! Recommended: 3.4.8 or 3.4.9
  Elasticsearch: Current version: 5.6.1, September 18, 2017 *New, kinks from 5.5.3 release from September 11, 2017. Recommended and available: 5.5.1, July 25, 2017
• MongoDB: Schemaless*; structured, unstructured, semi-structured
  Elasticsearch: Schemaless*; structured, unstructured, semi-structured
• MongoDB: JSON, BSON docs
  Elasticsearch: JSON
• MongoDB: Sharding to scale
  Elasticsearch: Sharding/nodes to scale
• MongoDB: HA via replica sets (1 Primary, 2 Secondaries, or more with quorum)
  Elasticsearch: HA via replica sets (1 MASTER, x REPLICAS)
• MongoDB: Limited index intersection v2.6+, very large indexes still ehh
  Elasticsearch: 1 query can use multiple indexes
• MongoDB: Great general-purpose NoSQL DB, for processing and filtering during query & data retrieval
  Elasticsearch: Processing via index builds, stores in multiple versions. Great at indexing; great at searching big datasets
Combining – in general
• Database has many indexes or very large indexes
• Data has lots of arrays - to perform queries that require many different $and clauses on a field with an array as a value
• SPEED up fuzzy and/or full text searches – ‘chicken’
ex. db.articles.find({ $text: { $search: "chi" } })
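A rough sketch of why the 'chicken' case gets handed to Elasticsearch: MongoDB's $text matches whole (stemmed) words, so a partial term like "chi" is not expected to find "chicken". On the Elasticsearch side, a `match_phrase_prefix` query covers the partial-word case and a `match` query with `fuzziness` covers typos. The field name `body` is an assumption made up for illustration, not something from the talk.

```javascript
// Hypothetical field name; adjust to your own mapping.
const FIELD = "body";

// Prefix-style search body: "chi" should find "chicken".
function prefixQuery(text) {
  return { query: { match_phrase_prefix: { [FIELD]: text } } };
}

// Typo-tolerant search body: "chiken" should find "chicken".
function fuzzyQuery(text) {
  return { query: { match: { [FIELD]: { query: text, fuzziness: "AUTO" } } } };
}

// Print the request bodies that would be sent to the _search endpoint.
console.log(JSON.stringify(prefixQuery("chi")));
console.log(JSON.stringify(fuzzyQuery("chiken")));
```

Either body would be sent to the `_search` endpoint of whichever index holds the articles.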
MongoDB & Elasticsearch +
Elasticsearch
• Primarily search engine
• Scalable, distributed
• Horizontal scaling
• JSON
• Schemaless*
• Based on Lucene
• Support for Python, JS, .Net, Scala, Perl, PHP, Ruby
• 3rd party product integration
Kafka, others
• Primarily for streaming, for moving data between data stores; used with other components and data techs to create near real time and very near real time event analytics; append only
• Horizontal scaling
• JSON
• Schemaless*
• Parallel processing
• 3rd party product integration
MongoDB
• Primarily OLTP
• Scalable, distributed
• Vertical or horizontal scaling
• Binary JSON
• Schemaless*
• Rapid prototyping
• Event logging
• Social media
• Content management
• User data and actions
• NOT in-depth analysis
MongoDB & Elasticsearch @ObjectRocket
Currently:
• MongoDB metrics
• Centralized logging
• MongoDB data visualization
• Network monitoring
• Website search
• Business metrics
• Elasticsearch metrics
Potential New Use 1 – Bitcoin Time Interval Tracking
Bitcoin ticker data interval tracking and analysis…
MongoDB
• Simple and complex queries
• Aggregations at any stage
Elasticsearch
• Speed up queries – faster results
• Store frequent queries for re-use via indexes
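To make "interval tracking" concrete, here is a minimal sketch of a MongoDB aggregation pipeline that rolls ticker documents up into fixed time buckets. The document shape (`ts` as epoch milliseconds, `price` as a number) and the collection name in the usage note are assumptions for illustration; the talk does not specify a schema.

```javascript
// Start of the bucket a timestamp falls into (plain JS mirror of the $group key).
function bucketStart(ts, intervalMs) {
  return ts - (ts % intervalMs);
}

// Build an aggregation pipeline that groups ticks into intervalMs-wide buckets.
// Assumed (hypothetical) document shape: { ts: <epoch millis>, price: <number> }
function intervalPipeline(intervalMs) {
  return [
    { $sort: { ts: 1 } }, // so $first/$last mean open/close
    {
      $group: {
        _id: { $subtract: ["$ts", { $mod: ["$ts", intervalMs] }] },
        open: { $first: "$price" },
        high: { $max: "$price" },
        low: { $min: "$price" },
        close: { $last: "$price" },
        ticks: { $sum: 1 },
      },
    },
    { $sort: { _id: 1 } }, // buckets in time order
  ];
}
```

In the mongo shell this would be run as something like `db.ticks.aggregate(intervalPipeline(60000))` for one-minute bars, with `ticks` being a made-up collection name. The per-interval results are the kind of frequently requested summary that could then be indexed in Elasticsearch.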
Potential New Use 2 – Cryptocurrency Platform/Trading
• Cryptocurrency Trading Platform - ex. tribeca
• node.js – v7.8 or higher
• MongoDB database – for persistence, aggregations
• Elasticsearch – the ‘need for speed’ rapid-fire
executions required – sub millisecond trades & cancellations
Potential New Use 3 – Social Media App Searching
• Searching large Social Media Apps for frequently
searched items – popular quarterbacks & receivers
on fantasy football sites, wines in comments
• MongoDB’s $text operator is special - cannot be used more than once in a query; no use with $nor, etc.
ex. db.comments.find({ $and: [{ $text: { $search: "win" } }, { $text: { $search: "red" } }] }) – WON’T WORK in MongoDB, but combine it with Elasticsearch.
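A hedged sketch of the two alternatives to the failing query. Within MongoDB, a single $text expression can carry several space-separated terms, but those terms are OR-ed rather than AND-ed. On the Elasticsearch side, a `bool` query with one `match` clause per term under `must` requires every term. The field name `text` and the helper names are assumptions for illustration.

```javascript
// MongoDB: only one $text expression per query; space-separated terms are OR-ed.
function mongoTextQuery(terms) {
  return { $text: { $search: terms.join(" ") } };
}

// Elasticsearch: one match clause per term inside bool.must, so all terms are required.
function esAllTermsQuery(field, terms) {
  return {
    query: {
      bool: { must: terms.map((term) => ({ match: { [field]: term } })) },
    },
  };
}

// "text" is a hypothetical field name for the comment body.
console.log(JSON.stringify(mongoTextQuery(["wine", "red"])));
console.log(JSON.stringify(esAllTermsQuery("text", ["wine", "red"])));
```

The first object is what would be passed to `db.comments.find(...)`; the second is a `_search` request body.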
Potential New Use 4 – Machine Learning, Deep Learning
Architecture and Streaming Platform – Jay Kreps
• Apps/DB’s->data in
• Aggregations at any stage
• Further Queries
• Faster Queries via ES
• Results back into DB’s
• Algorithms applied
• Endless … Limitless …
Device events, time series,
event logs, AR/VR/MR
Links
• MongoDB to analyze cryptocurrency price swings and intervals:
https://medium.com/@serbanmihai/aggregate-mongodb-data-with-node-js-and-mongoose-cryptocurrency-financial-time-series-ae739b4c9485
• MongoDB with node.js – cryptocurrency trading platform:
https://github.com/michaelgrosner/tribeca
• Arctic, MongoDB and Python – cryptocurrency database:
https://mxbu.github.io/logbook/2017/06/04/use-arctic-to-create-cryptocurrency-database/
• AI ML DL - Jay Kreps article: architecture and streaming platform for AI, deep learning, database pipelines, models, events, etc.:
https://www.oreilly.com/ideas/apache-kafka-and-the-four-challenges-of-production-machine-learning-systems
MongoDB is somewhat the de facto general-purpose NoSQL DB, and it has added enough new features
and made enough improvements to stay there at the top of NoSQL offerings
Elastic is moving up and it can do things fast
As our world expands and changes, the potential use cases for combining data stores – MongoDB and Elasticsearch – also grow.
But before we can talk about those current and potential use cases for combining them, we should take a quick look at what each of them is and when to use each individually.
2 mins
People wanted Big Data to go away, they wanted to call it other things
or NOT call it things or whatever… EOT IOT IIOT
But it’s not going to…
-Internet of Things / Everything / Industrial IIOT - logs, events, - 2019 ~$1.7 TRILLION $$
-Monitoring and managing those has sprung up whole companies now –
-Augmented Reality AR VR MR - THE FUTURE – the next iphone level CHANGE
Manufacturing, Training,
Sorry, not sorry - still love this quote after all of the years -
But the truth remains – more and more and more Data Points
Requires THINGS (applications, Data Store) to manage them
We NEED Something to hold the data, to find the data fast,
to SHARE the data and MOVE it from one APP to another
Process and transform along the way, Analyze it MEANINGS
NEVER truly schemaless though…
If you are NOT thinking about app design before you actually start designing it, you FAIL
You are just storing data that will likely never be used and your new shiny NoSQL datastore
will just become a data wasteland
= MongoDB and Elastic, then MONGODB solo next
Keep them tied together here –
MongoDB is somewhat the de facto general-purpose NoSQL DB, and it has added enough new features
and made enough improvements to stay there at the top of NoSQL offerings
Elastic is moving up and it can do things fast
IF something comes straight from wikipedia it HAS to be true
MongoDB is the defacto general purpose NoSQL DB
#5 datastore technology overall and holding steady there
#1 NoSQL Database product
MongoDB has the market share and the community buy-in to make the difference in supportability to
usually take the prize unless you have a really really heavy write application
Community Support and Development efforts - drivers, etc.
Built-in sharding/scaling, with replication via replica sets
High writes and heavy reads – can be somewhat mutually exclusive
MongoDB scales nearly linearly for heavy read workloads
3.4.10 as of Halloween - since released on Halloween, would avoid ;-)
no tricks please - 3.4.9 considered a minor release overall but …
But what does it look like really? Architecture overview next
1 Primary, 2 Secondaries - heartbeat communication for up/down state, replication to secondaries via oplog
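A small sketch of the arithmetic behind "1 Primary, 2 Secondaries (or more with quorum)": electing a primary takes a strict majority of voting members, which is why three members is the smallest set that survives one failure. The function names are made up for illustration.

```javascript
// Votes needed to elect a primary: a strict majority of voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members can be down while a primary can still be elected.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

// Compare common replica set sizes.
for (const n of [3, 4, 5]) {
  console.log(n + " members: majority " + majority(n) + ", tolerates " + faultTolerance(n) + " down");
}
```

Note that going from three to four members does not buy any extra fault tolerance, which is one reason odd member counts are the usual choice.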
MongoDB has same kind of potential to scale UP instead of OUT –
**NOTE - many people run MongoDB on dedicated larger bare metal hosts and grow by scaling up vertically
However, if they continue to grow, they will run into many of the same challenges that traditional RDBMS's have
So what about scaling OUT with Mongo? Religious War here
MongoD’s – the data nodes – the shards - the Replica Sets (primary and 2 secondary members)
MongoS’s – Query routers – talk to config servers and MongoD data nodes - get location metadata from config servers to route queries to the correct shard to satisfy a query and return the result
Good design to have multiple mongoS query routers in sharded clusters – our environments have 4
Config servers – the data dictionary of Mongo – contain cluster/shard metadata (the mapping of the data set). In 3.0 and below, always keep exactly and ONLY 3 for PROD envs.
In 3.2 and up, the config servers are by default a replica set, which is NOW required to be WT – improves consistency of info in the chunk map, aka where data extents reside
If you lose or corrupt your configs, the mongoS will not know where the data resides - so can’t retrieve it …so effectively lost
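A toy sketch of the routing idea in these notes, not MongoDB's actual implementation: the query router consults a chunk map (the metadata held by the config servers) to decide which shard or shards must answer. The shard names and key ranges are invented for illustration.

```javascript
// Toy chunk map: each chunk owns shard-key values in [min, max) and lives on one shard.
// Shard names and ranges are made up for illustration.
const chunkMap = [
  { min: -Infinity, max: 1000, shard: "shard0" },
  { min: 1000, max: 5000, shard: "shard1" },
  { min: 5000, max: Infinity, shard: "shard2" },
];

// Targeted query: the shard key is known, so exactly one shard can answer.
function routeByKey(map, key) {
  const chunk = map.find((c) => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null;
}

// Range query on [lo, hi]: every shard owning an overlapping chunk must be asked.
function routeByRange(map, lo, hi) {
  const shards = map.filter((c) => lo < c.max && hi >= c.min).map((c) => c.shard);
  return [...new Set(shards)];
}

console.log(routeByKey(chunkMap, 42));
console.log(routeByRange(chunkMap, 900, 1200));
console.log(routeByKey([], 42)); // no metadata: nothing can be located
```

The last call mirrors the warning above: with an empty or lost chunk map, the router has no way to locate data that still physically exists on the shards.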
Too much to cover other than mention for you to look up later
WT – new default, also required for the config server replica set vs. 3 single DBs as before
MMAP - still good for larger result sets or smaller, more frequent write activities, specifically updates
Unless you have a lot of CPU and cores to throw at it for WT usage
= reminder to talk about percona version that allows us to offer security features that usually only come with the more expensive Enterprise version
SSL kerberos LDAP integration *** our experience there
User Data in Games
Inventory Management – update, decrease, increase inventory
Shopping carts - tales of the long query and 1000 pairs of shoes
CMS – Our expertise running Sitecore on Azure
A search engine but a whole lot more
MUCH more powerful than JUST a search engine
GeoAnalytics - Geo near me
Basically Clusters with Nodes holding Indexes
then split across hosts with Shards
Holding slices of data held in segments at the lucene chunk level
Composed of the data via documents written in JSON
There are lots of reasons to use multiple components of the Elastic Stack
Including for Visualization which we will talk about a bit later.
But 1st let’s talk about just elasticsearch
With Elastic, to increase in scale and add more performance, you increase the Replication Factor
Basically ADD NODES -this increases HW resources to improve search performance and improve redundancy
The number of replica shards can be changed dynamically on a live cluster, allowing us to scale up or down as demand requires.
And Elastic will automatically redistribute as needed
With two replicas per primary we have nine shards: three primaries and six replicas. This means that we can scale out to a total of nine nodes, again with one shard per node. This would allow us to triple search performance compared to our original three-node cluster.
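The shard arithmetic can be sketched as below, together with the request body that changes the replica count on a live index through the update index settings API (`PUT /<index>/_settings`). The function names are assumptions for illustration.

```javascript
// Total shard copies = primaries * (1 + replicas per primary).
function totalShards(primaries, replicasPerPrimary) {
  return primaries * (1 + replicasPerPrimary);
}

// Request body for the update index settings API (PUT /<index>/_settings).
function replicaSettings(replicasPerPrimary) {
  return { index: { number_of_replicas: replicasPerPrimary } };
}

console.log(totalShards(3, 1)); // three primaries, one replica each
console.log(totalShards(3, 2)); // three primaries, two replicas each
console.log(JSON.stringify(replicaSettings(2)));
```

Only the replica count can be changed this way on an existing index; the number of primary shards is fixed when the index is created.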
here Logging and Log Analysis
Basically taking over for Splunk which has become too expensive
Elasticsearch has made massive improvements to its geospatial capabilities in the last 2 releases
It way outperforms the geospatial abilities of MongoDB’s $geoNear and within operators
Which is why you would look to combine them – which we will talk about later on
But other good uses of Elasticsearch combined with elements of its Elastic STACK
BUT Now to Summarize those 2 – MongoDB and Elasticsearch
Both store data objects that have key-value pairs, and both allow querying that body of objects.
But both come from 2 different camps and are made for different purposes.
Elastic - Great with full and fuzzy text searching
Slow when adding ‘new’ Data - aka creating new indexes
Uses indexes to help you find the data - fast
Completes complex search queries quickly
Interacts well directly with other associated technologies – Kibana, Beats, Logstash, etc. – and other NoSQL and SQL DBs
In the end, it is about the ability to store data, aggregate things,
pass it along. Then ANALYZE and USE that data analysis for whatever purpose you desire
So let’s look at these 2 together now
- When your data has a lot of arrays - to perform queries that require many different $and clauses on a field with an array as a value.
MANY Smaller shards as they need additional write scopes
2nd case - Fuzzy - If you want to do a search on the word chicken in a menu application:
Examples of how we combine MongoDB and Elasticsearch CURRENTLY at ObjectRocket
POTENTIAL and or Theoretical New Use Cases
Possibilities and Potential Combination uses are very broad –
New emerging markets and areas – from cryptocurrency peripherals for persistence to
Use MongoDB to Analyze cryptocurrency price swings and intervals - https://medium.com/@serbanmihai/aggregate-mongodb-data-with-node-js-and-mongoose-cryptocurrency-financial-time-series-ae739b4c9485
node.js (v7.8 or greater) Persistence is achieved using MongoDB
tribeca - very low latency cryptocurrency market making trade bot with a
full featured web client, backtester, and supports direct connectivity to several crypto coin exchanges
- reacts to market data by placing and canceling orders in under a millisecond
Fantasy Football wine sites -If you want to do a search and possibly a match on the words wine & red
db.comments.find( { $and: [ { $text: { $search: "win" } }, { $text: { $search: "red" } } ] } ) WON’T work
$text special MongoDB operator - only use once per query,
Endless opportunities here to combine with other data stores - grab those result sets,
store the primary results in MongoDB, perform additional aggregations to further refine them
Post online for massive around the world use by colleagues
Use Elasticsearch again to keep frequently searched combinations nearby/fast