2. Agenda – Day 1
• Day 1 – Theory / Demo
o Introduction to NoSQL – 3 hours
➢ What Is Meant by NoSQL?
➢ Distributed and Decentralized
➢ Elastic Scalability
➢ High Availability and Fault Tolerance
➢ Brewer's CAP Theorem
➢ Row-Oriented
➢ Schema-Free
➢ High Performance
➢ Types of NoSQL Databases
➢ Introduction to Redis (Key-Value Pair)
➢ Introduction to HBase (Column Oriented Hadoop)
➢ Introduction to Cassandra (Column-Oriented)
➢ Introduction to MongoDB (Document Oriented)
➢ Introduction to Neo4j (Graph DB)
o Remaining 5 hours
➢ Aggregation – 30 mins
➢ Map Reduce – 30 mins
➢ Compatibility with SPARK – 30 mins
➢ Query Optimisation based on 3.6 (Advanced Functions) – 30 mins
➢ Deep Diving into the functions of MongoDB 3.4 – 2.5 hours
➢ Indexing techniques – 30 mins
4. What Is Meant by NoSQL?
• A NoSQL database provides a mechanism for storage
and retrieval of data that is modeled in means other
than the tabular relations used in relational databases.
6. Driving Trends - Data Size
• Data size is increasing exponentially year after year
7. Driving Trend: Semi-Structured Information
• Content is becoming more unique
• We can blame Generation Y for this!
• Before: Job Title - Software Engineer
• After: Job Title - ZOMG Awesome Core Repository Developer
9. Social Network Performance
Why is RDBMS performance horrible?
• To find all friends at depth 5, MySQL will create a Cartesian product on the t_user_friend table 5 times
• Resulting in 50,000^5 records, of which all but 1,000 are discarded
• Neo4j will simply traverse through the nodes in the database until there are no more nodes to traverse
10. Social Network Performance
The power of traversals
• Graph data structures are localized
• Count all of the people around you
• Adding more people to the room may only slightly impact your performance in counting your neighbors
11. Distributed and Decentralized
• Horizontal (MongoDB) and vertical (Oracle) scaling
• Distributed database
• Decentralized database
• e.g. MongoDB
• Application: Blockchain
12. Elastic Scalability
• Scales to multiple clusters
• Private Cloud
• Public Cloud (MongoDB Atlas)
• Virtually infinite scale
• Keep adding resources
• Tune the database
• NoSQL scales better than RDBMS
• Provision resources in seconds
13. High Availability and Fault Tolerance
• Always available, provided the configuration is OK
• Look at the load on the system: number of users, sessions, memory, CPU, network
• Fault tolerance built in
• If one node fails, others take over without affecting the entire system
• Classic example: Hadoop
• Cloud is a catalyst. And the future.
15. Brewer's CAP Theorem Explained
• Consistency: any change to a particular record stored in the database (insert, update or delete) is seen as-is by all other users accessing that record at that time. If readers may temporarily see stale data, the system is termed eventually consistent.
• Availability: the system continues to work and serve data in spite of node failures.
• Partition Tolerance: the system continues to operate even when the network between its nodes is partitioned; the database is stored on a distributed architecture such as Hadoop (HDFS).
16. Brewer's CAP Theorem Examples
• RDBMS systems such as Oracle, MySQL etc. support Consistency and Availability.
• NoSQL datastores such as HBase support Consistency and Partition Tolerance.
• NoSQL datastores such as Cassandra and CouchDB support Availability and Partition Tolerance.
17. Brewer's CAP Theorem Notes – Which DB to Choose?
CP-based database system: when it is critical that all clients see a consistent view of the database, the users of one node will have to wait for the other nodes to come into agreement before being able to read or write to the database. Availability takes a back seat to consistency, and one may want to choose a database such as HBase that supports CP (Consistency and Partition Tolerance).
AP-based database system: when there is a requirement that the database remain available at all times, one could choose a DB system which allows clients to write data to one node of the database without waiting for the other nodes to come into agreement. The DB system then takes care of data reconciliation a little later; this is the state of eventual consistency. In applications which can sacrifice data consistency in return for huge performance, one could select databases such as CouchDB or Cassandra.
18. Row-Oriented
• A column-oriented DBMS (or columnar database management system) is a database management system (DBMS) that stores data tables by column rather than by row. Practical use of a column store versus a row store differs little in the relational DBMS world.
• RDBMS
• Document-based store - it stores documents made up of tagged elements (example: CouchDB)
• Column-based store - each storage block contains data from only one column (examples: HBase, Cassandra)
• Graph-based - a network database that uses edges and nodes to represent and store data
19. Row-Oriented
• MongoDB is schema-free, allowing you to create documents without having to first create the structure for that document. At the same time, it still has many of the features of a relational database, including strong consistency and an expressive query language.
• CouchDB: views in CouchDB are similar to indexes in SQL.
23. Schema-Free
• On schema-free databases such as MongoDB, you can simply add records without any previous structure. Moreover, you can group records that do not have the same structure: a collection (something like a table in a relational database, where you group records) can hold records of various structures; in other words, they do not need to have the same columns (properties).
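As a minimal illustration with PyMongo (the database, collection and field names are assumptions, not from the deck):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["training"]["people"]   # hypothetical database/collection names

# Two records with different structures can live in the same collection.
people.insert_one({"name": "Asha", "city": "Pune"})
people.insert_one({"name": "Ravi", "skills": ["java", "mongodb"], "experience_years": 7})

# An ad hoc query still works; documents missing the queried field are simply not matched.
print(people.count_documents({"skills": "mongodb"}))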
24. Schema Free – Mongo DB Notes
MongoDB is a JSON-style data store. The documents stored in the database can have
varying sets of fields, with different types for each field.
And that’s true. But it doesn’t mean that there is no schema. There are in fact various
schemas:
•The one in your head when you designed the data structures
•The one that your database really implemented to store your data structures
•The one you should have implemented to fulfill your requirements
Every time you realise that you made a mistake (see point three above), or when your
requirements change, you will need to migrate your data.
Let’s review again MongoDB’s point of view here:
With a schemaless database, 90% of the time adjustments to the database become
transparent and automatic.
For example, if we wish to add GPA to the student objects, we add the attribute, resave,
and all is well — if we look up an existing student and reference GPA,
we just get back null. Further, if we roll back our code, the new GPA fields in the existing
objects are unlikely to cause problems if our code was well written.
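A minimal PyMongo sketch of that GPA scenario (the collection and field names are illustrative assumptions):
from pymongo import MongoClient

students = MongoClient()["school"]["students"]   # hypothetical names
students.insert_one({"name": "Kim"})             # created before GPA existed

# Later we start writing GPA on new documents.
students.insert_one({"name": "Lee", "gpa": 3.7})

# Old documents simply return None (null) for the new field.
doc = students.find_one({"name": "Kim"})
print(doc.get("gpa"))   # -> None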
25. Schema Free - Hadoop
What Hadoop, NoSQL databases and other modern
big data tools allow is for each application or user to
come to the raw data with a different schema. Take
call center logs as an example. Someone performing
a columnar analysis on time and call length has a
different interpretation of the schema than someone
doing a row search for a specific call. But they aren't
imposing a schema-on-read; rather, they're flexibly
addressing different components of the schema to
maximize their individual query performance.
So, forget schema-less, schema-on-read and other
nonsense that is of use only to theorists and niche
players. Focus instead on providing ways for flexible
database schemas to be integrated into the full
business information pipeline.
27. High Performance
• mongostat is the most powerful utility. It
reports real-time statistics about connections,
inserts, queries, updates, deletes, queued reads
and writes, flushes, memory usage, page faults,
and much more. It is useful for quickly spot-checking database activity, verifying that values are not abnormally high, and making sure you have enough capacity.
• mongotop returns the amount of time a
MongoDB instance spends performing read and
write operations. It is broken down by collection
(namespace). This allows you to make sure
there is no unexpected activity and see where
resources are consumed. All active namespaces
are reported. (frequency – every second)
30. Key-Value Stores
• Most are based on the Dynamo white paper
• Dynamo: Amazon's Highly Available Key-value Store (2007)
• Data Model
• Global key-value mapping
• Massively scalable HashMap
• Highly Fault Tolerant
• Examples
• Riak, Redis, Voldemort
31. Key Value Stores: Strengths and Weaknesses
• Strengths
• Simple data model
• Horizontally scalable
• Weaknesses
• Simple data model
• Poor at handling complex data
32. Column Family
• Based on Google's Bigtable white paper
• Bigtable: A Distributed Storage System for Structured Data (2006)
• Tuple of key-value pairs where the key maps to a set of columns
• MapReduce for querying and processing
• Examples
• Cassandra, HBase, Hypertable
36. Graph Databases
Data Model
• Nodes with properties
• Named relationships with properties
Examples
• Neo4j, Sones GraphDB, OrientDB, InfiniteGraph, AllegroGraph
39. Typical Use Cases for Graph Databases
• Recommendations
• Business Intelligence
• Social Computing
• Master Data Management
• Geospatial
• Genealogy (Past and Present)
• Time Series Data
• Web Analytics
• Bioinformatics
• Indexing RDBMS
40. Maturity of Data Models – NOSQL, RDBMS, Graph Stores
• Most NOSQL: ~6 years
• Relational: 42 years
• Graph Theory: 276 years
44. Graph Data Model
[Diagram of a property graph: Person, City and Event nodes connected by Is Attending, Hosted In, Is Located In and Rated relationships, with node properties such as firstName: kyle, lastName: adams, name: DevCon, name: San Jose, country: USA, and a Rated relationship carrying score: 11 out of 10, comment: Amazing!!!]
45. What is Neo4j?
•Leading Open Source graph database
•Embeddable and Server
•ACID compliant
•White board friendly
•Stable
•Has been in 24/7 operation since 2003
46. More Reasons Why Neo4j is Great
•High performance graph operations
•Traverse 1,000,000+ relationships/sec on commodity
hardware
•32 billion nodes & relationships per Neo4j instance
•64 billion properties per Neo4j instance
•Small footprint
• Standalone server is ~65 MB
47. If NOSQL stands for Not Only SQL, ... then how do we execute queries?!
50. Social Network Performance
• First rule of fight club:
• Run a friends of friends query
• Second rule of fight club:
• 1,000 Users
• Third rule of fight club:
• Average of 50 friends per user
• Fourth rule of fight club:
• Limit the depth to 5
• Fifth rule of fight club:
• Intel i7 commodity laptop w/8GB RAM
The Experiment: Round 1
55. Social Network Performance
Neo4j Traversal API
TraversalDescription traversalDescription = Traversal.description()
    // follow only outgoing IS_FRIEND_OF relationships
    .relationships(DynamicRelationshipType.withName("IS_FRIEND_OF"), Direction.OUTGOING)
    // stop the walk at the requested depth
    .evaluator(Evaluators.atDepth(2))
    // never visit the same node twice
    .uniqueness(Uniqueness.NODE_GLOBAL);
Iterable<Node> nodes = traversalDescription.traverse(nodeById).nodes();
56. Social Network Performance
Neo4j Results: Round 1- 1,000 Users
Depth Execution Time (sec) Records Returned
2 0.04 ~900
3 0.06 ~999
4 0.07 ~999
5 0.07 ~999
57. Social Network Performance
• First rule of fight club:
• Run a friends of friends query
• Second rule of fight club:
• 1,000,000 Users
• Third rule of fight club:
• Average of 50 friends per user
• Fourth rule of fight club:
• Limit the depth to 5
• Fifth rule of fight club:
• Intel i7 commodity laptop w/8GB RAM
The Experiment: Round 2
58. Social Network Performance
MySQL Results: Round 2 – 1,000,000 Users
Depth Execution Time (sec) Records Returned
2 0.016 ~2,500
3 30.267 ~125,000
4 1,543.505 ~600,000
5 Did not finish after an hour N/A
59. Social Network Performance
Neo4j Results: Round 2 – 1,000,000 Users
Depth Execution Time (sec) Records Returned
2 0.010 ~2,500
3 0.168 ~110,000
4 1.359 ~600,000
5 2.132 ~800,000
60. Introduction to Redis (Key-Value Pair)
The Redis architecture contains two main processes: the Redis client and the Redis server. The client and server can be on the same computer or on two different computers. The Redis server is responsible for storing data in memory; it handles all kinds of management and forms the major part of the architecture.
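A minimal sketch of that client/server key-value interaction using the redis-py client (the host, keys and values are illustrative assumptions):
import redis

r = redis.Redis(host="localhost", port=6379)   # client talking to a Redis server

# SET/GET a simple key-value pair held in the server's memory.
r.set("session:42", "rajesh")
print(r.get("session:42"))           # b'rajesh'

# Keys can also expire automatically.
r.set("otp:42", "913572", ex=60)     # value disappears after 60 seconds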
62. Performance of Redis
Redis does a lot with very little CPU
utilization. In a non-scientific test, I fired up
50 JVMs (on four machines) subscribing to
the topic on which the TwitterClient publishes
tweets with matched percolation queries.
Then I changed the tracked term from
the Twitter Streaming API to “love”, which
reliably maxes out the rate of tweets
permitted. Typically, with this term I see
around 60 to 70 tweets per second. With 50
connected processes, 3000 to 3500 tweets
were delivered per second overall, yet the
CPU utilization of Redis idled somewhere
between 1.7% and 2.3%.
63. Introduction to HBase (Column Oriented Hadoop)
HBase is a distributed, NoSQL, open-source database,
initially conceived as an open-source alternative to
Google’s proprietary BigTable. Originally, HBase was
part of the Hadoop project, but was eventually spun
off as a subproject. Given this legacy, it is not surprising that HBase is most often deployed on top of a Hadoop cluster (it uses HDFS as its underlying storage); however, a case study suggests that it can run on top of Amazon Elastic Block Store (EBS) as well.
These days HBase is used by companies such as
Adobe, Facebook, Twitter and Yahoo – and many
others to process large amounts of data in real time,
since it is ideally placed to store the input and/or the
output of MapReduce jobs.
64. Introduction to HBase (Column Oriented Hadoop)
• HDFS is a distributed filesystem; one can do most regular FS operations on it, such as listing files in a directory, writing a regular file, reading a part of a file, etc. It is not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
• HBase is an in-memory key-value store which may persist to HDFS (this isn't a hard requirement; you can run HBase on any distributed filesystem). For any read request, HBase will first check its runtime memory caches to see if it has the value cached, and otherwise visit its stored files on HDFS to seek and read out the specific value. HBase offers various configurations to control how the cache is utilised, but HBase's speed comes from a combination of caching and indexed persistence (faster, seeked file reads).
• HBase's file-based persistence on HDFS does the key indexing automatically when it writes, so there is no manual indexing needed by its users. These files are regular HDFS files, but specialised in format for HBase's usage, known as HFiles.
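To make that key-value read/write path concrete, here is a minimal sketch with the happybase Python client (this assumes an HBase Thrift server is running; the table, column family and row key are illustrative, not from the deck):
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("users")                # hypothetical table with column family 'info'

# Writes land in the in-memory store first and are later flushed to HFiles on HDFS.
table.put(b"user#1001", {b"info:name": b"Asha", b"info:city": b"Pune"})

# A read checks the caches first, then seeks into HFiles if needed.
row = table.row(b"user#1001")
print(row[b"info:name"])   # b'Asha'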
65. Introduction to HBase
(Column Oriented
Hadoop)
HBase is a NoSQL, column-oriented database built on top of Hadoop to overcome the drawbacks of HDFS, as it allows fast random writes and reads in an optimized way.
https://s3.amazonaws.com/files.dezyre.com/images/blog/Overview+of+HBase+Architecture+and+its+Components/HBase+Architecture.jpg
66. Introduction to Cassandra (Column-Oriented)
Apache Cassandra is a free and open-source distributed NoSQL
database management system designed to handle large amounts of
data across many commodity servers, providing high availability with
no single point of failure.
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data
model on Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database
management systems.
• Cassandra implements a Dynamo-style replication model with no
single point of failure, but adds a more powerful “column family”
data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
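A minimal sketch with the DataStax cassandra-driver for Python, showing the keyspace/table ("column family") style of data model (the contact point, keyspace and table names are illustrative assumptions):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point for the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text, city text)
""")

# Writes are distributed across nodes by partition key; there is no single point of failure.
session.execute("INSERT INTO demo.users (user_id, name, city) VALUES (%s, %s, %s)", (1, "Asha", "Pune"))
print(session.execute("SELECT name FROM demo.users WHERE user_id = 1").one())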
69. Introduction to MongoDB (Document Oriented)
• Document Oriented and NoSQL database.
• Supports Aggregation
• Uses BSON format
• Sharding (Helps in Horizontal Scalability)
• Supports Ad Hoc Queries
• Schema Less
• Capped Collection
• Indexing (Any field in MongoDB can be indexed)
• MongoDB Replica Set (Provides high availability)
• Supports Multiple Storage Engines
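A minimal PyMongo sketch touching a few of these features – schema-less documents, an ad hoc query and an index (the database, collection and field names are illustrative assumptions):
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]   # hypothetical collection

orders.insert_one({"customer": "Asha", "total": 450, "items": ["pen", "notebook"]})

# Ad hoc query: no schema had to be declared up front.
print(orders.find_one({"total": {"$gt": 100}}))

# Any field can be indexed.
orders.create_index([("customer", ASCENDING)])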
71. Why choose MongoDB?
1. Growth of MongoDB
2. Flexible Data Model
3. MongoDB features
4. Rich set of drivers and connectivity
5. Availability & Uptime
6. Elastic Scalability
7. Security
http://www.habilelabs.io/choose-mongodb-databases/
72. Introduction to Neo4j (Graph DB)
• SQL-like, easy query language: Neo4j CQL
• It follows the Property Graph data model
• It supports indexes by using Apache Lucene
• It supports UNIQUE constraints
• It contains a UI to execute CQL commands: Neo4j Data Browser
• It supports full ACID (Atomicity, Consistency, Isolation and Durability) rules
• It uses native graph storage with a native GPE (Graph Processing Engine)
• It supports exporting of query data to JSON and XLS formats
• It provides a REST API that can be accessed from any programming language such as Java, Spring, Scala etc.
• It provides JavaScript access for any UI MVC framework such as Node JS
• It supports two kinds of Java API: the Cypher API and the Native Java API, for developing Java applications
76. Aggregation
MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group.
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
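As an illustration of the aggregation pipeline, a minimal PyMongo sketch (the collection and field names are assumptions, not from the deck):
from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # hypothetical collection

pipeline = [
    {"$match": {"status": "shipped"}},                                 # filter documents
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},    # group and aggregate
    {"$sort": {"total": -1}},                                          # order the grouped result
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])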
84. Map Reduce
MapReduce is a two-step process for breaking a problem statement down into a solution.
Map: process each input record and emit intermediate key-value pairs; the framework then collates and sorts these pairs by key.
Reduce: take each individual key with its grouped values and aggregate them into the result.
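A framework-free Python sketch of the same two steps, using word count as the example (the input lines and variable names are illustrative):
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]

# Map: emit (word, 1) for every word in every input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce: aggregate the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'be': 3, 'to': 4, ...}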
85. Big Data 101
What is Big Data?
It is a new set of approaches for analysing data sets that were not previously accessible because they posed challenges across one or more of the "3 Vs" of Big Data:
• Volume - too big: terabytes and more of credit card transactions, web usage data, system logs
• Variety - too complex: truly unstructured data such as social media, customer reviews, call center records
• Velocity - too fast: sensor data, live web traffic, mobile phone usage, GPS data
86. Big Data 101
Hadoop is just a File System - HDFS
Read Optimised & Failure Tolerant
[Diagram: a head node distributing a single file across multiple data nodes]
87. Big Data 101
Map + Reduce = Extract, Load + Transform
[Diagram: raw data feeds parallel mappers; their output data is passed to a reducer, which produces the final output]
89. HDInsight hands on
Hadoop Streaming with C#
C:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd
jar C:\apps\dist\hadoop-1.1.0-SNAPSHOT\lib\hadoop-streaming.jar
"-D mapred.output.compress=true"
"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
-files "asv://container@storage/user/hadoop/code/Sentiment_v2.exe"
-numReduceTasks 0
-mapper "Sentiment_v2.exe"
-input "asv://container@storage.blob.core.windows.net/user/hadoop/data/"
-output "asv://container@storage.blob.core.windows.net/user/hadoop/output/Sentiment"
90. HDInsight hands on
Hadoop Streaming with C#
276.0|5|bob|government
276.0|5|bob|telling
276.0|5|bob|opposed
276.0|5|bob|liberty
276.0|5|bob|obviously
276.0|5|bob|fail
276.0|5|bob|comprehend
276.0|5|bob|qualifier
276.0|5|bob|legalized
276.0|5|bob|curtis
91. HDInsight hands on
Using Pig to Enrich the data
• Pig is a query language which shares
some concepts with SQL
• Invoked from the Hadoop command shell
• No GUI
• Does not do any work until it has to output a resultset
• Under the hood executes Map/reduce jobs
92. HDInsight hands on
Using Pig to Enrich the data with Sentiment
scores
• Load sentiment word lists and assign scores
• Loading the data
• Preprocess to get some key fields
• Count words in various contexts and add sentiment value
• Dump results to Azure Blob Storage
93. Using Pig to Enrich the data
Code sample: LOAD Operation
data_raw =
    LOAD '<filename>'
    USING PigStorage('|')
    AS (filename:chararray, message_id:chararray, author_id:chararray, word:chararray);
94. Using Pig to Enrich the data
Code sample: JOIN Statement
words_count_sentiment =
JOIN words_count_flat
BY words LEFT,
sentiment BY sentiment_word;
95. Using Pig to Enrich the data
Code sample: SUM Operation
message_sum_sentiment =
FOREACH messages_grouped
GENERATE group
AS message_details,
SUM(messages_joined.sentiment_value) AS
sentiment;
96. HDInsight hands on
Outputting results to Hive
• Hive is a near SQL compliant
language with a lot of similarities
• Again, under the hood issues MapReduce queries
• Exposed to ODBC
97. HDInsight hands on
Outputting results to Hive
• Create some Hive tables to reference the Pig Output
• Use the Interactive console
98. Outputting data to Hive
Code review: CREATE EXTERNAL TABLE
CREATE EXTERNAL TABLE words
( word STRING,
counts INT,
sentiment INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '124'
STORED AS TEXTFILE
LOCATION
'asv://westburycorpus@westburycorpusnoreur.blob.core.windows.net/user/hadoop/pig_out/words';
104. The Current Solutions
[Chart: gigabytes of data created (in billions), 2005–2015 – structured data is only about 10% of the total; the rest is unstructured]
Current database solutions are designed for structured data:
• Optimized to answer known questions quickly
• Schemas dictate form/context
• Difficult to adapt to new data types and new questions
• Expensive at petabyte scale
105. Main Big Data Technologies
Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest
NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types: document stores, key-value stores, graph databases
Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types: column-oriented, MPP, in-memory
107. Major Hadoop Utilities
• Apache Hive – SQL-like language and metadata repository
• Apache Pig – high-level language for expressing data analysis programs
• Apache HBase – the Hadoop database; random, real-time read/write access
• Sqoop – integrating Hadoop with RDBMS
• Oozie – server-based workflow engine for Hadoop activities
• Hue – browser-based desktop interface for interacting with Hadoop
• Flume – distributed service for collecting and aggregating log and event data
• Apache Whirr – library for running Hadoop in the cloud
• Apache Zookeeper – highly reliable distributed coordination service
112. Compatibility with SPARK
• The MongoDB Connector for Spark provides integration between MongoDB and Apache Spark.
• With the connector, you have access to all Spark libraries for use with MongoDB datasets:
Datasets for analysis with SQL (benefiting from automatic schema inference), streaming, machine
learning, and graph APIs. You can also use the connector with the Spark Shell.
• The MongoDB Connector for Spark is compatible with the following versions of Apache Spark and
MongoDB:
• MongoDB Connector for Spark / Spark version / MongoDB version:
• 2.2.0 – Spark 2.2.x – MongoDB 2.6 or later
• 2.1.0 – Spark 2.1.x – MongoDB 2.6 or later
• 2.0.0 – Spark 2.0.x – MongoDB 2.6 or later
• 1.1.0 – Spark 1.6.x – MongoDB 2.6 or later
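A minimal PySpark sketch of reading a MongoDB collection through the connector (the URI, database and collection names are assumptions, and the connector package must be available on the Spark classpath):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-spark-demo")
         .config("spark.mongodb.input.uri", "mongodb://localhost/shop.orders")  # hypothetical URI
         .getOrCreate())

# The schema is inferred automatically from sampled documents.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.createOrReplaceTempView("orders")
spark.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()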
121. Query Optimisation based on 3.6 (Advanced Functions)
• Retryable Writes
• Causal Consistency
• Change Streams
• New Aggregation Pipeline Stages and Operators
• Performance Advisor
• Default Bind to localhost
• Array Updates
122. Retryable Writes
There’s always room for error when writing to a database
even when you think you’ve got all your bases covered.
With MongoDB 3.6, you no longer run the risk of executing
an update twice because of network glitches and the like,
thanks to the new Retryable Writes feature.
Instead of the developer or the application, it’s now the
driver itself that handles these system flukes. The
MongoDB driver that comes with 3.6 can automatically
retry certain failed write operations once, to recover from
transient network errors or replica set failovers.
The benefit here is all in the feature name: your writes will
automatically be retried by MongoDB itself, so you don’t
have to worry about any write inconsistencies.
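A minimal PyMongo sketch of opting in to retryable writes (the URI and collection names are assumptions; a replica set is required, and in 3.6 the option is off by default):
from pymongo import MongoClient

# retryWrites can be set in the connection string (or as a keyword argument).
client = MongoClient("mongodb://localhost:27017/?retryWrites=true")
profiles = client["app"]["profiles"]   # hypothetical collection

# If a transient network error or a replica-set failover interrupts this write,
# the driver retries it once on our behalf.
profiles.update_one({"_id": 42}, {"$set": {"plan": "premium"}}, upsert=True)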
123. Causal Consistency
Prior to MongoDB 3.6, reading from primaries was
the only reliable way to go. Causal relationships
between read and write operations as they occurred
on primaries (and got replicated to secondaries)
weren’t guaranteed. These could result in lags (e.g.
writes to the primary not replicated to the
secondaries, multiple secondaries writing updates
at different times, etc.) which could make reading
from secondaries inconsistent.
This all changes with MongoDB 3.6, which in tl;dr
format is: you can now also reliably read from
secondaries. You can find the longer technical
explanation here.
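A minimal PyMongo sketch of a causally consistent session (the database, collection and read preference are illustrative assumptions):
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017")
coll = client["app"]["events"]   # hypothetical collection

# Operations in this session are causally ordered: the read below is guaranteed
# to observe the preceding write, even when it is served by a secondary.
with client.start_session(causal_consistency=True) as session:
    coll.insert_one({"_id": 1, "state": "created"}, session=session)
    secondary_coll = coll.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
    print(secondary_coll.find_one({"_id": 1}, session=session))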
124. Change Streams
Just like you get notified about real-time changes to almost anything these days, MongoDB is now also able to do the same through a feature called Change Streams.
The benefit of Change Streams is immediately visible. You
can now subscribe to changes in a collection and get
notified. A new method, called watch, listens for these
changes, notifies you, and can even trigger an automatic
series of events, as defined in your change stream.
Change streams can “listen” to five events for now (Insert,
Delete, Replace, Update and Invalidate) and can only be set
up based on user roles, which means that only those who
have read access to collections can create change streams
in those collections.
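A minimal PyMongo sketch of a change stream (requires a replica set; the database and collection names are assumptions):
from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # hypothetical collection on a replica set

# watch() opens a change stream; the loop blocks and yields one document per change.
with orders.watch() as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))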
125. New Aggregation Pipeline Stages and Operators
MongoDB users can feel a bit more
empowered by an aggregation pipeline
that boasts new operators, stages, and
an improved $lookup operator with
even more powerful join capabilities.
Studio 3T’s Aggregation Editor will of
course support these new additions,
the full list of which you can find in the
MongoDB 3.6 Release Notes.
126. Performance Advisor
MongoDB’s Ops Manager comes bundled
with Performance Advisor, a feature that
alerts you about slow queries – meaning
queries that take longer than the default
slowOpThresholdMs of 100 milliseconds –
and suggests new indexes to improve query
performance.
Indexes help speed up queries significantly,
so having automated suggestions on how to
optimize them is quite a leg-up. But there is a
tradeoff to consider: the more indexes you
have, the worse your write performance. And
it’s still up to you – and not Performance
Advisor – to strike the right balance.
127. Default Bind to localhost
In an effort to enforce security,
MongoDB 3.6 now by default binds to
localhost if no authentication is
enabled, so that only connections
from clients running on the same
machine are accepted in such a case.
Only users from whitelisted IP addresses can externally connect to your unsecured databases; everything else will be denied.
128. Array Updates
Nested arrays are easier to manipulate than ever in MongoDB 3.6. The query $type: "array" now detects fields that are arrays themselves, unlike before, when it would only return documents whose array fields contained an element of BSON type array.
MongoDB also introduced new operators which make updating all elements in an array much easier, with less code.
We already made showing nested fields and exploring arrays easier with Studio 3T.
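A minimal PyMongo sketch of the new array-update operators, $[] and the filtered positional $[identifier] (the collection and field names are assumptions):
from pymongo import MongoClient

students = MongoClient()["school"]["students"]   # hypothetical collection

students.insert_one({"_id": 1, "grades": [55, 72, 90]})

# $[] updates every element; $[g] plus arrayFilters updates only the matching ones.
students.update_one({"_id": 1}, {"$inc": {"grades.$[]": 1}})
students.update_one(
    {"_id": 1},
    {"$set": {"grades.$[g]": 60}},
    array_filters=[{"g": {"$lt": 60}}],
)
print(students.find_one({"_id": 1})["grades"])   # [60, 73, 91]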
131. Amazon Cloud Features
• Elastic Web-Scale Computing
• Completely Controlled
• Flexible Cloud Hosting Services
• Designed for use with other Amazon Web Services
• Reliable
• Secure
• Inexpensive
• Easy to Start
133. Google AppEngine Features
• Popular languages and frameworks
• Focus on your code
• Multiple storage options
• Powerful built-in services
• Familiar development tools
• Deploy at Google scale
150. Indexing Techniques
Create Indexes to Support Your Queries
An index supports a query when the index contains all the fields scanned by the query. Creating indexes that support your queries results in greatly increased query performance.
Use Indexes to Sort Query Results
To support efficient queries, use these strategies when you specify the sequential order and sort order of index fields.
Ensure Indexes Fit in RAM
When your index fits in RAM, the system can avoid reading the index from disk and you get the fastest processing.
Create Queries that Ensure Selectivity
Selectivity is the ability of a query to narrow results using the index. Selectivity allows MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
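A minimal PyMongo sketch tying these strategies together – a compound index that supports both the filter and the sort, checked with explain() (the field names are assumptions):
from pymongo import MongoClient, ASCENDING, DESCENDING

orders = MongoClient()["shop"]["orders"]   # hypothetical collection

# The index supports queries filtering on customer and sorting by order_date.
orders.create_index([("customer", ASCENDING), ("order_date", DESCENDING)])

cursor = orders.find({"customer": "Asha"}).sort("order_date", DESCENDING)
plan = cursor.explain()
print(plan["queryPlanner"]["winningPlan"])   # should show an IXSCAN stage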
151. Indexing Methods in the mongo Shell
• db.collection.createIndex() – builds an index on a collection.
• db.collection.dropIndex() – removes a specified index on a collection.
• db.collection.dropIndexes() – removes all indexes on a collection.
• db.collection.getIndexes() – returns an array of documents that describe the existing indexes on a collection.
• db.collection.reIndex() – rebuilds all existing indexes on a collection.
• db.collection.totalIndexSize() – reports the total size used by the indexes on a collection; a wrapper around the totalIndexSize field of the collStats output.
• cursor.explain() – reports on the query execution plan for a cursor.
• cursor.hint() – forces MongoDB to use a specific index for a query.
• cursor.max() – specifies an exclusive upper index bound for a cursor; for use with cursor.hint().
• cursor.min() – specifies an inclusive lower index bound for a cursor; for use with cursor.hint().
152. Indexing Database Commands
• createIndexes – builds one or more indexes for a collection.
• dropIndexes – removes indexes from a collection.
• compact – defragments a collection and rebuilds the indexes.
• reIndex – rebuilds all indexes on a collection.
• validate – internal command that scans a collection's data and indexes for correctness.
• geoNear – performs a geospatial query that returns the documents closest to a given point.
• geoSearch – performs a geospatial query that uses MongoDB's haystack index functionality.
• checkShardingIndex – internal command that validates the index on the shard key.
153. Geospatial Query Selectors
• $geoWithin – selects geometries within a bounding GeoJSON geometry. The 2dsphere and 2d indexes support $geoWithin.
• $geoIntersects – selects geometries that intersect with a GeoJSON geometry. The 2dsphere index supports $geoIntersects.
• $near – returns geospatial objects in proximity to a point. Requires a geospatial index. The 2dsphere and 2d indexes support $near.
• $nearSphere – returns geospatial objects in proximity to a point on a sphere. Requires a geospatial index. The 2dsphere and 2d indexes support $nearSphere.
154. Indexing Query Modifiers
• $explain – forces MongoDB to report on query execution plans. See explain().
• $hint – forces MongoDB to use a specific index. See hint().
• $max – specifies an exclusive upper limit for the index to use in a query. See max().
• $min – specifies an inclusive lower limit for the index to use in a query. See min().
• $returnKey – forces the cursor to only return fields included in the index.
155. Thanks !!!
Keep in touch
Rajesh30menon
@YAHOO, GMAIL, HOTMAIL, SKYPE, TWITTER, INSTAGRAM, PINTEREST
My blog : http://www.technospirituality.com
MY BOOKS : Link : https://goo.gl/bQ8cnM (Amazon.com)
Link : https://goo.gl/owgMxT (Amazon.in)
http://www.technospirituality.com
Editor's notes
The following trends make it increasingly difficult to perform analytics with relational databases
And more importantly, the following trends make it nearly impossible to perform these analytics within the click stream (i.e. on-the-fly analysis and results)
Creating more data year after year.
Storing and processing this data is becoming increasingly difficult for relational databases
The total amount of data grows and becomes more connected. However it’s losing some of its predictable structure.
Blame generation Y! Yes, me. I don't want my information to fit into a 1970s-style database anymore; I want it to be all about me. This causes data to become more morphable.
Before we start talking about NOSQL let’s give relational databases a little credit.
Relational database are still great for tabular data
Performance degrades as data becomes more deeply connected and voluminous
I'm not telling you to shy away from relational databases, but in this polyglot-persistence world different use cases require different ways of storing and processing today's data
To find all friends at depth 5, MySQL will create a Cartesian product on the t_user_friend table 5 times, resulting in 50,000^5 records, out of which all but 1,000 are discarded. Neo4j, on the other hand, will simply visit nodes in the database, and when there are no more nodes to visit, it will stop the traversal.
It's not magic; it's all about the data structures and how they're localized.
Let's say we have about 50 people in the room and I ask you to count the people around you. It may take a few seconds to complete the task. But if we add 100 more people to the room, your ability to count the people around you is only slightly affected by the increase in the total number of people.
4 types of databases in the NOSQL universe:
K-V Stores
Column Family Store
Document Databases
Graph Databases
Who here has worked with NOSQL stores before?
For the people that raised their hand how many used...
KV Stores?
Column Family?
Document DBs?
Graph DBs?
If you raised your hand for Graph DBs, then pat yourself on the back b/c that’s where I spend most of my time.
Let’s look at each of the types
It's a massively scalable HashMap
Strengths: Again... it's a HashMap! If you can understand how HashMaps work, then KV stores are relatively easy to adopt.
Weaknesses: At the same time the simple data structure is a weakness. It's difficult to represent complex and interconnected data.
Essentially K-VVVVVVVVVV stores
Strengths: Supports semi-structured data
Weaknesses: Does not handle interconnected data well. You may pull your hair out trying to write code against these stores. However, the Spring Data project aims to reduce some of that complexity
These are becoming more popular today
Contains documents and a document is simply a key-value collection
Usually have great index support!!!
Is there anyone out there that's still using Notes? Please say no. Notes was actually one of the early document databases. I suppose you can say that's one thing that isn't completely terrible from the Lotus products.
Again we see this trend where all of these NOSQL stores do not handle interconnected data well. I wonder where this is going
Finally we have graph databases. My little section of the NOSQL universe
Has the richest data model of all of the NOSQL types
Graphs are naturally mutable, which makes them extremely hard to shard. You can shard based on domains, but you would need to reduce the chances of creating relationships between the two graphs.
In the following graph we see that KV stores are the best at scaling due to their simplistic data model and Graph databases are the worst at scaling because of the complexity and interconnectedness of the data.
Even though Graphs DBs are the worst at scaling out of all of the NOSQL types, we’re still able to cover 90% of today’s use cases.
Indexing relational DBs: Some people classify SOLR as a NOSQL store
The relational model is quite mature, but Graph theory is much older.
So when your boss says that you can't use a graph database because they're not mature enough, just tell him that he needs to check his facts.
This is my homeboy Leonhard Euler. Inventor of Graph Theory, swiss math ninja, Volvo lover, and apparently from his choice in clothing, he’s also the original hipster. But I’ll let that one slide.
What you draw on the white board is what you implement in your code. And truthfully, this was the main reason why I was attracted to graph databases in the first place.
I constantly found myself in the position where I would map out my domain on a white board, spend a ton of time normalizing my tables thinking I was this total SQL badass ninja, then I would deploy to production and performance would be horrible. Then I would have to denormalize the crap out of my database and before I knew it a week had already passed.
And more specifically how to query a Graph database?
Some of you already know this comic, but I have to give credit to the Basho Riak team for having a nerdy sense of humor.
The real answer for the Graph db world is traversals
This brings us to an experiment in which Neo Technology has benchmarked performance of MySQL and Neo4j in a social graph
We want to run a query that find all of the friends of Kyle. then the friends of his friends and so on.
We have a table that stores all users and another table that stores primary and foreign keys that map the friendships
This is an example of the SQL query used as depth 3. find friends of friends of friends of a particular user
find friends of friends of friends of the user
We see a dramatic decrease in performance the more inner joins we add to the query.
For Neo4j the social network is a typical graph
Neo4j’s traversal API is used to return a result set.
IS_FRIEND_OF = traverse relationships that a typed “IS_FRIEND_OF”
Evaluator.atDepth(2) = is how you limit the depth
Uniqueness.NODE_GLOBAL = means a node cannot be traversed more than once
traverse(nodeById) = is the id of the node where we want to start our traversal
So let look at Neo4j’s performance
We see that performance is relatively unaffected as we increase the depth of traversal
We perform the same queries but we increase the total amount of users to 1 million.
In MySQL we will have 1,000,000 records in t_user table, and approximately 1,000,000 X
50 = 50,000,000 records in t_user_friend table.
1,543.505 ~ 25 minutes
Depth five didn’t finish after running for an hour
For Neo4j we have a linear increase in execution time.
TAKE-AWAYS
Pentaho provides complete integrated DI+BI for every leading big data platform.
Big Data solutions are not databases. They don’t provide the capabilities that BI toolsets expect of a database.
Hadoop also has a high latency. This means the smallest query possible has an execution time that is much slower than that of a database
Hadoop is optimized for executing very intensive data processing tasks on very large amounts of data. It is not optimized for quick queries. Some Hadoop experts recommend configuring the workloads so that Hadoop jobs take an hour or more. This conflicts with OLAP performance criteria of 5-10 seconds per query.
There are database implementations within the Hadoop world, Hive, HBase etc.
Unfortunately for developers who are used to working with data transformation tools, the productivity within the Hadoop environment is not what they are used to.
TAKE-AWAYS
The better choice is obviously visual development