2. Agenda – Day 1
• Day 1 – Theory / Demo
o Introduction to NoSQL – 3 hours
➢ What Is Meant by NoSQL?
➢ Distributed and Decentralized
➢ Elastic Scalability
➢ High Availability and Fault Tolerance
➢ Brewer's CAP Theorem
➢ Row-Oriented
➢ Schema-Free
➢ High Performance
➢ Types of NoSQL Databases
➢ Introduction to Redis (Key-Value Pair)
➢ Introduction to HBase (Column Oriented Hadoop)
➢ Introduction to Cassandra (Column-Oriented)
➢ Introduction to MongoDB (Document Oriented)
➢ Introduction to Neo4j (Graph DB)
o Remaining 5 hours
➢ Aggregation – 30 mins
➢ Map Reduce – 30 mins
➢ Compatibility with SPARK – 30 mins
➢ Query Optimisation based on 3.6 (Advanced Functions) – 30 mins
➢ Deep Diving into the functions of MongoDB 3.4 – 2.5 hours
➢ Indexing techniques – 30 mins
4. What Is Meant by NoSQL?
• A NoSQL database provides a mechanism for storage
and retrieval of data that is modeled in means other
than the tabular relations used in relational databases.
6. Driving Trends - Data Size
• Data size is increasing exponentially year after year
7. Driving Trend: Semi-Structured Information
• Content is becoming more unique
• We can blame Generation Y for this!
• Before: Job Title - Software Engineer
• After: Job Title - ZOMG Awesome Core Repository Developer
9. Social Network Performance
Why is RDBMS performance horrible?
• To find all friends at depth 5, MySQL will create a Cartesian product on the t_user_friend table 5 times
• Resulting in 50,000^5 records, of which all but 1,000 are discarded
• Neo4j will simply traverse through the nodes in the database until there are no more nodes to traverse
10. Social Network Performance
The power of traversals
• Graph data structures are localized
• Count all of the people around you
• Adding more people to the room may only slightly impact your performance in counting your neighbors
11. Distributed and Decentralized
• Horizontal (MongoDB) and vertical (Oracle) scaling
• Distributed database
• Decentralized database
• e.g. MongoDB
• Application: Blockchain
12. Elastic Scalability
• Scales to multiple clusters
• Private Cloud
• Public Cloud (MongoDB Atlas)
• Virtually infinite scale
• Keep adding resources
• Tune the database
• NoSQL scales better than RDBMS
• Provision resources in seconds
13. High Availability and Fault Tolerance
• Always available, provided the configuration is OK
• Look at the load on the system: number of users, sessions, memory, CPU, network
• Fault tolerance built in
• If one node fails, others take over without affecting the entire system
• Classic example: Hadoop
• Cloud is a catalyst. And the future.
15. Brewer's CAP Theorem Explained
• Consistency: any change to a particular record stored in the database (insert, update or delete) is seen as-is by all other users accessing that record at that time. If readers may temporarily see stale data, the system is termed eventually consistent.
• Availability: the system continues to work and serve data in spite of node failures.
• Partition Tolerance: the system continues to operate even when the network between its nodes is partitioned; the database is stored on a distributed architecture such as Hadoop (HDFS).
16. Brewer's CAP Theorem Examples
• RDBMS systems such as Oracle, MySQL etc. support Consistency and Availability.
• NoSQL datastores such as HBase support Consistency and Partition Tolerance.
• NoSQL datastores such as Cassandra and CouchDB support Availability and Partition Tolerance.
17. Brewer's CAP Theorem Notes – Which DB to Choose?
CP-based database system: when it is critical that all clients see a consistent view of the database, the users of one node will have to wait for the other nodes to come into agreement before being able to read or write to the database. Availability takes a back seat to consistency, and one may want to choose a database such as HBase that supports CP (Consistency and Partition Tolerance).
AP-based database system: when there is a requirement that the database remain available at all times, one could choose a DB system which allows clients to write data to one node of the database without waiting for the other nodes to come into agreement. The DB system then takes care of data reconciliation a little later; this is the state of eventual consistency. In applications which can sacrifice data consistency in return for huge performance, one could select databases such as CouchDB or Cassandra.
18. Row-Oriented
• A column-oriented DBMS (or columnar database management system) is a database management system (DBMS) that stores data tables by column rather than by row. Practical use of a column store versus a row store differs little in the relational DBMS world.
• RDBMS
• Document-based store - it stores documents made up of tagged elements (example: CouchDB)
• Column-based store - each storage block contains data from only one column (examples: HBase, Cassandra)
• Graph-based - a network database that uses edges and nodes to represent and store data
19. Row-Oriented
• MongoDB is schema-free, allowing you to create documents without having to first create the structure for that document. At the same time, it still has many of the features of a relational database, including strong consistency and an expressive query language.
• CouchDB: views in CouchDB are similar to indexes in SQL.
23. Schema-Free
• On schema-free databases such as MongoDB, you can simply add records without any previous structure. Moreover, you can group records that do not have the same structure: a collection (something like a table in a relational database, where you group records) can hold records of various structures; in other words, they do not need to have the same columns (properties).
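As a minimal illustration with PyMongo (the database, collection and field names are assumptions, not from the deck):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["training"]["people"]   # hypothetical database/collection names

# Two records with different structures can live in the same collection.
people.insert_one({"name": "Asha", "city": "Pune"})
people.insert_one({"name": "Ravi", "skills": ["java", "mongodb"], "experience_years": 7})

# An ad hoc query still works; documents missing the queried field are simply not matched.
print(people.count_documents({"skills": "mongodb"}))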
24. Schema Free – Mongo DB Notes
MongoDB is a JSON-style data store. The documents stored in the database can have
varying sets of fields, with different types for each field.
And that’s true. But it doesn’t mean that there is no schema. There are in fact various
schemas:
•The one in your head when you designed the data structures
•The one that your database really implemented to store your data structures
•The one you should have implemented to fulfill your requirements
Every time you realise that you made a mistake (see point three above), or when your
requirements change, you will need to migrate your data.
Let’s review again MongoDB’s point of view here:
With a schemaless database, 90% of the time adjustments to the database become
transparent and automatic.
For example, if we wish to add GPA to the student objects, we add the attribute, resave,
and all is well — if we look up an existing student and reference GPA,
we just get back null. Further, if we roll back our code, the new GPA fields in the existing
objects are unlikely to cause problems if our code was well written.
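A minimal PyMongo sketch of that GPA scenario (the collection and field names are illustrative assumptions):
from pymongo import MongoClient

students = MongoClient()["school"]["students"]   # hypothetical names
students.insert_one({"name": "Kim"})             # created before GPA existed

# Later we start writing GPA on new documents.
students.insert_one({"name": "Lee", "gpa": 3.7})

# Old documents simply return None (null) for the new field.
doc = students.find_one({"name": "Kim"})
print(doc.get("gpa"))   # -> None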
25. Schema Free - Hadoop
What Hadoop, NoSQL databases and other modern
big data tools allow is for each application or user to
come to the raw data with a different schema. Take
call center logs as an example. Someone performing
a columnar analysis on time and call length has a
different interpretation of the schema than someone
doing a row search for a specific call. But they aren't
imposing a schema-on-read; rather, they're flexibly
addressing different components of the schema to
maximize their individual query performance.
So, forget schema-less, schema-on-read and other
nonsense that is of use only to theorists and niche
players. Focus instead on providing ways for flexible
database schemas to be integrated into the full
business information pipeline.
27. High Performance
• mongostat is the most powerful utility. It
reports real-time statistics about connections,
inserts, queries, updates, deletes, queued reads
and writes, flushes, memory usage, page faults,
and much more. It is useful for quickly spot-checking database activity, verifying that values are not abnormally high, and making sure you have enough capacity.
• mongotop returns the amount of time a
MongoDB instance spends performing read and
write operations. It is broken down by collection
(namespace). This allows you to make sure
there is no unexpected activity and see where
resources are consumed. All active namespaces
are reported. (frequency – every second)
30. Key-Value Stores
• Most are based on the Dynamo white paper
• Dynamo: Amazon's Highly Available Key-value Store (2007)
• Data Model
• Global key-value mapping
• Massively scalable HashMap
• Highly Fault Tolerant
• Examples
• Riak, Redis, Voldemort
31. Key Value Stores: Strengths and Weaknesses
• Strengths
• Simple data model
• Horizontally scalable
• Weaknesses
• Simple data model
• Poor at handling complex data
32. Column Family
• Based on Google's Bigtable white paper
• Bigtable: A Distributed Storage System for Structured Data (2006)
• Tuple of key-value pairs where the key maps to a set of columns
• MapReduce for querying and processing
• Examples
• Cassandra, HBase, Hypertable
36. Graph Databases
Data Model
• Nodes with properties
• Named relationships with properties
Examples
• Neo4j, Sones GraphDB, OrientDB, InfiniteGraph, AllegroGraph
39. Typical Use Cases for Graph Databases
• Recommendations
• Business Intelligence
• Social Computing
• Master Data Management
• Geospatial
• Genealogy (Past and Present)
• Time Series Data
• Web Analytics
• Bioinformatics
• Indexing RDBMS
40. Maturity of Data Models – NOSQL, RDBMS, Graph Stores
• Most NOSQL: ~6 years
• Relational: 42 years
• Graph Theory: 276 years
44. Graph Data Model
[Diagram of a property graph: Person, City and Event nodes connected by Is Attending, Hosted In, Is Located In and Rated relationships, with node properties such as firstName: kyle, lastName: adams, name: DevCon, name: San Jose, country: USA, and a Rated relationship carrying score: 11 out of 10, comment: Amazing!!!]
45. What is Neo4j?
•Leading Open Source graph database
•Embeddable and Server
•ACID compliant
•White board friendly
•Stable
•Has been in 24/7 operation since 2003
46. More Reasons Why Neo4j is Great
•High performance graph operations
•Traverse 1,000,000+ relationships/sec on commodity
hardware
•32 billion nodes & relationships per Neo4j instance
•64 billion properties per Neo4j instance
•Small footprint
• Standalone server is ~65 MB
47. If NOSQL stands for Not Only SQL, ... then how do we execute queries?!
50. Social Network Performance
• First rule of fight club:
• Run a friends of friends query
• Second rule of fight club:
• 1,000 Users
• Third rule of fight club:
• Average of 50 friends per user
• Fourth rule of fight club:
• Limit the depth to 5
• Fifth rule of fight club:
• Intel i7 commodity laptop w/8GB RAM
The Experiment: Round 1
55. Social Network Performance
Neo4j Traversal API
TraversalDescription traversalDescription = Traversal.description()
    // follow only outgoing IS_FRIEND_OF relationships
    .relationships(DynamicRelationshipType.withName("IS_FRIEND_OF"), Direction.OUTGOING)
    // stop the walk at the requested depth
    .evaluator(Evaluators.atDepth(2))
    // never visit the same node twice
    .uniqueness(Uniqueness.NODE_GLOBAL);
Iterable<Node> nodes = traversalDescription.traverse(nodeById).nodes();
56. Social Network Performance
Neo4j Results: Round 1- 1,000 Users
Depth Execution Time (sec) Records Returned
2 0.04 ~900
3 0.06 ~999
4 0.07 ~999
5 0.07 ~999
57. Social Network Performance
• First rule of fight club:
• Run a friends of friends query
• Second rule of fight club:
• 1,000,000 Users
• Third rule of fight club:
• Average of 50 friends per user
• Fourth rule of fight club:
• Limit the depth to 5
• Fifth rule of fight club:
• Intel i7 commodity laptop w/8GB RAM
The Experiment: Round 2
58. Social Network Performance
MySQL Results: Round 2 – 1,000,000 Users
Depth Execution Time (sec) Records Returned
2 0.016 ~2,500
3 30.267 ~125,000
4 1,543.505 ~600,000
5 Did not finish after an hour N/A
59. Social Network Performance
Neo4j Results: Round 2 – 1,000,000 Users
Depth Execution Time (sec) Records Returned
2 0.010 ~2,500
3 0.168 ~110,000
4 1.359 ~600,000
5 2.132 ~800,000
60. Introduction to Redis (Key-Value Pair)
The Redis architecture contains two main processes: the Redis client and the Redis server. The client and server can be on the same computer or on two different computers. The Redis server is responsible for storing data in memory; it handles all kinds of management and forms the major part of the architecture.
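A minimal sketch of that client/server key-value interaction using the redis-py client (the host, keys and values are illustrative assumptions):
import redis

r = redis.Redis(host="localhost", port=6379)   # client talking to a Redis server

# SET/GET a simple key-value pair held in the server's memory.
r.set("session:42", "rajesh")
print(r.get("session:42"))           # b'rajesh'

# Keys can also expire automatically.
r.set("otp:42", "913572", ex=60)     # value disappears after 60 seconds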
62. Performance of Redis
Redis does a lot with very little CPU
utilization. In a non-scientific test, I fired up
50 JVMs (on four machines) subscribing to
the topic on which the TwitterClient publishes
tweets with matched percolation queries.
Then I changed the tracked term from
the Twitter Streaming API to “love”, which
reliably maxes out the rate of tweets
permitted. Typically, with this term I see
around 60 to 70 tweets per second. With 50
connected processes, 3000 to 3500 tweets
were delivered per second overall, yet the
CPU utilization of Redis idled somewhere
between 1.7% and 2.3%.
63. Introduction to HBase (Column Oriented Hadoop)
HBase is a distributed, NoSQL, open-source database,
initially conceived as an open-source alternative to
Google’s proprietary BigTable. Originally, HBase was
part of the Hadoop project, but was eventually spun
off as a subproject. Given this legacy, it is not surprising that HBase is most often deployed on top of a Hadoop cluster (it uses HDFS as its underlying storage); however, a case study suggests that it can run on top of Amazon Elastic Block Store (EBS) as well.
These days HBase is used by companies such as
Adobe, Facebook, Twitter and Yahoo – and many
others to process large amounts of data in real time,
since it is ideally placed to store the input and/or the
output of MapReduce jobs.
64. Introduction to HBase (Column Oriented Hadoop)
• HDFS is a distributed filesystem; one can do most regular FS operations on it, such as listing files in a directory, writing a regular file, reading a part of a file, etc. It is not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
• HBase is an in-memory key-value store which may persist to HDFS (this isn't a hard requirement; you can run HBase on any distributed filesystem). For any read request, HBase will first check its runtime memory caches to see if it has the value cached, and otherwise visit its stored files on HDFS to seek and read out the specific value. HBase offers various configurations to control how the cache is utilised, but HBase's speed comes from a combination of caching and indexed persistence (faster, seeked file reads).
• HBase's file-based persistence on HDFS does the key indexing automatically when it writes, so there is no manual indexing needed by its users. These files are regular HDFS files, but specialised in format for HBase's usage, known as HFiles.
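To make that key-value read/write path concrete, here is a minimal sketch with the happybase Python client (this assumes an HBase Thrift server is running; the table, column family and row key are illustrative, not from the deck):
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("users")                # hypothetical table with column family 'info'

# Writes land in the in-memory store first and are later flushed to HFiles on HDFS.
table.put(b"user#1001", {b"info:name": b"Asha", b"info:city": b"Pune"})

# A read checks the caches first, then seeks into HFiles if needed.
row = table.row(b"user#1001")
print(row[b"info:name"])   # b'Asha'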
65. Introduction to HBase
(Column Oriented
Hadoop)
HBase is a NoSQL, column-oriented database built on top of Hadoop to overcome the drawbacks of HDFS, as it allows fast random writes and reads in an optimized way.
https://s3.amazonaws.com/files.dezyre.com/images/blog/Overview+of+HBase+Architecture+and+its+Components/HBase+Architecture.jpg
66. Introduction to Cassandra (Column-Oriented)
Apache Cassandra is a free and open-source distributed NoSQL
database management system designed to handle large amounts of
data across many commodity servers, providing high availability with
no single point of failure.
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data
model on Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database
management systems.
• Cassandra implements a Dynamo-style replication model with no
single point of failure, but adds a more powerful “column family”
data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
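A minimal sketch with the DataStax cassandra-driver for Python, showing the keyspace/table ("column family") style of data model (the contact point, keyspace and table names are illustrative assumptions):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point for the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text, city text)
""")

# Writes are distributed across nodes by partition key; there is no single point of failure.
session.execute("INSERT INTO demo.users (user_id, name, city) VALUES (%s, %s, %s)", (1, "Asha", "Pune"))
print(session.execute("SELECT name FROM demo.users WHERE user_id = 1").one())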
69. Introduction to MongoDB (Document Oriented)
• Document Oriented and NoSQL database.
• Supports Aggregation
• Uses BSON format
• Sharding (Helps in Horizontal Scalability)
• Supports Ad Hoc Queries
• Schema Less
• Capped Collection
• Indexing (Any field in MongoDB can be indexed)
• MongoDB Replica Set (Provides high availability)
• Supports Multiple Storage Engines
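A minimal PyMongo sketch touching a few of these features – schema-less documents, an ad hoc query and an index (the database, collection and field names are illustrative assumptions):
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]   # hypothetical collection

orders.insert_one({"customer": "Asha", "total": 450, "items": ["pen", "notebook"]})

# Ad hoc query: no schema had to be declared up front.
print(orders.find_one({"total": {"$gt": 100}}))

# Any field can be indexed.
orders.create_index([("customer", ASCENDING)])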
71. Why choose MongoDB?
1. Growth of MongoDB
2. Flexible Data Model
3. MongoDB features
4. Rich set of drivers and connectivity
5. Availability & Uptime
6. Elastic Scalability
7. Security
http://www.habilelabs.io/choose-mongodb-databases/
72. Introduction to Neo4j (Graph DB)
• SQL-like, easy query language: Neo4j CQL
• It follows the Property Graph data model
• It supports indexes by using Apache Lucene
• It supports UNIQUE constraints
• It contains a UI to execute CQL commands: Neo4j Data Browser
• It supports full ACID (Atomicity, Consistency, Isolation and Durability) rules
• It uses native graph storage with a native GPE (Graph Processing Engine)
• It supports exporting of query data to JSON and XLS formats
• It provides a REST API that can be accessed from any programming language such as Java, Spring, Scala etc.
• It provides JavaScript access for any UI MVC framework such as Node JS
• It supports two kinds of Java API: the Cypher API and the Native Java API, for developing Java applications
76. Aggregation
MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group.
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
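As an illustration of the aggregation pipeline, a minimal PyMongo sketch (the collection and field names are assumptions, not from the deck):
from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # hypothetical collection

pipeline = [
    {"$match": {"status": "shipped"}},                                 # filter documents
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},    # group and aggregate
    {"$sort": {"total": -1}},                                          # order the grouped result
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])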
84. Map Reduce
MapReduce is a two-step process for breaking a problem statement down into a solution.
Map: process each input record and emit intermediate key-value pairs; the framework then collates and sorts these pairs by key.
Reduce: take each individual key with its grouped values and aggregate them into the result.
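A framework-free Python sketch of the same two steps, using word count as the example (the input lines and variable names are illustrative):
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]

# Map: emit (word, 1) for every word in every input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce: aggregate the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'be': 3, 'to': 4, ...}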
85. Big Data 101
What is Big Data?
It is a new set of approaches for analysing data sets that were not previously accessible because they posed challenges across one or more of the "3 Vs" of Big Data:
• Volume - too big: terabytes and more of credit card transactions, web usage data, system logs
• Variety - too complex: truly unstructured data such as social media, customer reviews, call center records
• Velocity - too fast: sensor data, live web traffic, mobile phone usage, GPS data
86. Big Data 101
Hadoop is just a File System - HDFS
Read Optimised & Failure Tolerant
[Diagram: a head node distributing a single file across multiple data nodes]
87. Big Data 101
Map + Reduce = Extract, Load + Transform
[Diagram: raw data feeds parallel mappers; their output data is passed to a reducer, which produces the final output]
89. HDInsight hands on
Hadoop Streaming with C#
C:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd
jar C:\apps\dist\hadoop-1.1.0-SNAPSHOT\lib\hadoop-streaming.jar
"-D mapred.output.compress=true"
"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
-files "asv://container@storage/user/hadoop/code/Sentiment_v2.exe"
-numReduceTasks 0
-mapper "Sentiment_v2.exe"
-input "asv://container@storage.blob.core.windows.net/user/hadoop/data/"
-output "asv://container@storage.blob.core.windows.net/user/hadoop/output/Sentiment"
90. HDInsight hands on
Hadoop Streaming with C#
276.0|5|bob|government
276.0|5|bob|telling
276.0|5|bob|opposed
276.0|5|bob|liberty
276.0|5|bob|obviously
276.0|5|bob|fail
276.0|5|bob|comprehend
276.0|5|bob|qualifier
276.0|5|bob|legalized
276.0|5|bob|curtis
91. HDInsight hands on
Using Pig to Enrich the data
• Pig is a query language which shares
some concepts with SQL
• Invoked from the Hadoop command shell
• No GUI
• Does not do any work until it has to output a resultset
• Under the hood executes Map/reduce jobs
92. HDInsight hands on
Using Pig to Enrich the data with Sentiment
scores
• Load sentiment word lists and assign scores
• Loading the data
• Preprocess to get some key fields
• Count words in various contexts and add sentiment value
• Dump results to Azure Blob Storage
93. Using Pig to Enrich the data
Code sample: LOAD Operation
data_raw =
    LOAD '<filename>'
    USING PigStorage('|')
    AS (filename:chararray, message_id:chararray, author_id:chararray, word:chararray);
94. Using Pig to Enrich the data
Code sample: JOIN Statement
words_count_sentiment =
JOIN words_count_flat
BY words LEFT,
sentiment BY sentiment_word;
95. Using Pig to Enrich the data
Code sample: SUM Operation
message_sum_sentiment =
FOREACH messages_grouped
GENERATE group
AS message_details,
SUM(messages_joined.sentiment_value) AS
sentiment;
96. HDInsight hands on
Outputting results to Hive
• Hive is a near SQL compliant
language with a lot of similarities
• Again, under the hood issues MapReduce queries
• Exposed to ODBC
97. HDInsight hands on
Outputting results to Hive
• Create some Hive tables to reference the Pig Output
• Use the Interactive console
98. Outputting data to Hive
Code review: CREATE EXTERNAL TABLE
CREATE EXTERNAL TABLE words
( word STRING,
counts INT,
sentiment INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '124'
STORED AS TEXTFILE
LOCATION
'asv://westburycorpus@westburycorpusnoreur.blob.core.windows.net/user/hadoop/pig_out/words';
104. The Current Solutions
[Chart: gigabytes of data created (in billions), 2005–2015 – structured data is only about 10% of the total; the rest is unstructured]
Current database solutions are designed for structured data:
• Optimized to answer known questions quickly
• Schemas dictate form/context
• Difficult to adapt to new data types and new questions
• Expensive at petabyte scale
105. Main Big Data Technologies
Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest
NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types: document stores, key-value stores, graph databases
Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types: column-oriented, MPP, in-memory
107. Major Hadoop Utilities
• Apache Hive – SQL-like language and metadata repository
• Apache Pig – high-level language for expressing data analysis programs
• Apache HBase – the Hadoop database; random, real-time read/write access
• Sqoop – integrating Hadoop with RDBMS
• Oozie – server-based workflow engine for Hadoop activities
• Hue – browser-based desktop interface for interacting with Hadoop
• Flume – distributed service for collecting and aggregating log and event data
• Apache Whirr – library for running Hadoop in the cloud
• Apache Zookeeper – highly reliable distributed coordination service
112. Compatibility with SPARK
• The MongoDB Connector for Spark provides integration between MongoDB and Apache Spark.
• With the connector, you have access to all Spark libraries for use with MongoDB datasets:
Datasets for analysis with SQL (benefiting from automatic schema inference), streaming, machine
learning, and graph APIs. You can also use the connector with the Spark Shell.
• The MongoDB Connector for Spark is compatible with the following versions of Apache Spark and
MongoDB:
• MongoDB Connector for Spark / Spark version / MongoDB version:
• 2.2.0 – Spark 2.2.x – MongoDB 2.6 or later
• 2.1.0 – Spark 2.1.x – MongoDB 2.6 or later
• 2.0.0 – Spark 2.0.x – MongoDB 2.6 or later
• 1.1.0 – Spark 1.6.x – MongoDB 2.6 or later
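A minimal PySpark sketch of reading a MongoDB collection through the connector (the URI, database and collection names are assumptions, and the connector package must be available on the Spark classpath):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-spark-demo")
         .config("spark.mongodb.input.uri", "mongodb://localhost/shop.orders")  # hypothetical URI
         .getOrCreate())

# The schema is inferred automatically from sampled documents.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.createOrReplaceTempView("orders")
spark.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()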
121. Query Optimisation based on 3.6 (Advanced Functions)
• Retryable Writes
• Causal Consistency
• Change Streams
• New Aggregation Pipeline Stages and Operators
• Performance Advisor
• Default Bind to localhost
• Array Updates
122. Retryable Writes
There’s always room for error when writing to a database
even when you think you’ve got all your bases covered.
With MongoDB 3.6, you no longer run the risk of executing
an update twice because of network glitches and the like,
thanks to the new Retryable Writes feature.
Instead of the developer or the application, it’s now the
driver itself that handles these system flukes. The
MongoDB driver that comes with 3.6 can automatically
retry certain failed write operations once, to recover from
transient network errors or replica set failovers.
The benefit here is all in the feature name: your writes will
automatically be retried by MongoDB itself, so you don’t
have to worry about any write inconsistencies.
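A minimal PyMongo sketch of opting in to retryable writes (the URI and collection names are assumptions; a replica set is required, and in 3.6 the option is off by default):
from pymongo import MongoClient

# retryWrites can be set in the connection string (or as a keyword argument).
client = MongoClient("mongodb://localhost:27017/?retryWrites=true")
profiles = client["app"]["profiles"]   # hypothetical collection

# If a transient network error or a replica-set failover interrupts this write,
# the driver retries it once on our behalf.
profiles.update_one({"_id": 42}, {"$set": {"plan": "premium"}}, upsert=True)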
123. Causal Consistency
Prior to MongoDB 3.6, reading from primaries was
the only reliable way to go. Causal relationships
between read and write operations as they occurred
on primaries (and got replicated to secondaries)
weren’t guaranteed. These could result in lags (e.g.
writes to the primary not replicated to the
secondaries, multiple secondaries writing updates
at different times, etc.) which could make reading
from secondaries inconsistent.
This all changes with MongoDB 3.6, which in tl;dr
format is: you can now also reliably read from
secondaries. You can find the longer technical
explanation here.
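A minimal PyMongo sketch of a causally consistent session (the database, collection and read preference are illustrative assumptions):
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017")
coll = client["app"]["events"]   # hypothetical collection

# Operations in this session are causally ordered: the read below is guaranteed
# to observe the preceding write, even when it is served by a secondary.
with client.start_session(causal_consistency=True) as session:
    coll.insert_one({"_id": 1, "state": "created"}, session=session)
    secondary_coll = coll.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
    print(secondary_coll.find_one({"_id": 1}, session=session))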
124. Change Streams
Just like you get notified about real-time changes to almost anything these days, MongoDB is now also able to do the same through a feature called Change Streams.
The benefit of Change Streams is immediately visible. You
can now subscribe to changes in a collection and get
notified. A new method, called watch, listens for these
changes, notifies you, and can even trigger an automatic
series of events, as defined in your change stream.
Change streams can “listen” to five events for now (Insert,
Delete, Replace, Update and Invalidate) and can only be set
up based on user roles, which means that only those who
have read access to collections can create change streams
in those collections.
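A minimal PyMongo sketch of a change stream (requires a replica set; the database and collection names are assumptions):
from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # hypothetical collection on a replica set

# watch() opens a change stream; the loop blocks and yields one document per change.
with orders.watch() as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))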
125. New Aggregation Pipeline Stages and Operators
MongoDB users can feel a bit more
empowered by an aggregation pipeline
that boasts new operators, stages, and
an improved $lookup operator with
even more powerful join capabilities.
Studio 3T’s Aggregation Editor will of
course support these new additions,
the full list of which you can find in the
MongoDB 3.6 Release Notes.
126. Performance Advisor
MongoDB’s Ops Manager comes bundled
with Performance Advisor, a feature that
alerts you about slow queries – meaning
queries that take longer than the default
slowOpThresholdMs of 100 milliseconds –
and suggests new indexes to improve query
performance.
Indexes help speed up queries significantly,
so having automated suggestions on how to
optimize them is quite a leg-up. But there is a
tradeoff to consider: the more indexes you
have, the worse your write performance. And
it’s still up to you – and not Performance
Advisor – to strike the right balance.
127. Default Bind to localhost
In an effort to enforce security,
MongoDB 3.6 now by default binds to
localhost if no authentication is
enabled, so that only connections
from clients running on the same
machine are accepted in such a case.
Only users from whitelisted IP addresses can externally connect to your unsecured databases; everything else will be denied.
128. Array Updates
Nested arrays are easier to manipulate than ever in MongoDB 3.6. The query $type: "array" now detects fields that are arrays themselves, unlike before, when it would only return documents whose array fields contained an element of BSON type array.
MongoDB also introduced new operators which make updating all elements in an array much easier, with less code.
We already made showing nested fields and exploring arrays easier with Studio 3T.
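A minimal PyMongo sketch of the new array-update operators, $[] and the filtered positional $[identifier] (the collection and field names are assumptions):
from pymongo import MongoClient

students = MongoClient()["school"]["students"]   # hypothetical collection

students.insert_one({"_id": 1, "grades": [55, 72, 90]})

# $[] updates every element; $[g] plus arrayFilters updates only the matching ones.
students.update_one({"_id": 1}, {"$inc": {"grades.$[]": 1}})
students.update_one(
    {"_id": 1},
    {"$set": {"grades.$[g]": 60}},
    array_filters=[{"g": {"$lt": 60}}],
)
print(students.find_one({"_id": 1})["grades"])   # [60, 73, 91]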
131. Amazon Cloud Features
• Elastic Web-Scale Computing
• Completely Controlled
• Flexible Cloud Hosting Services
• Designed for use with other Amazon Web Services
• Reliable
• Secure
• Inexpensive
• Easy to Start
133. Google AppEngine Features
• Popular languages and frameworks
• Focus on your code
• Multiple storage options
• Powerful built-in services
• Familiar development tools
• Deploy at Google scale
150. Indexing Techniques
Create Indexes to Support Your Queries
An index supports a query when the index contains all the fields scanned by the query. Creating indexes that support your queries results in greatly increased query performance.
Use Indexes to Sort Query Results
To support efficient queries, use these strategies when you specify the sequential order and sort order of index fields.
Ensure Indexes Fit in RAM
When your index fits in RAM, the system can avoid reading the index from disk and you get the fastest processing.
Create Queries that Ensure Selectivity
Selectivity is the ability of a query to narrow results using the index. Selectivity allows MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
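A minimal PyMongo sketch tying these strategies together – a compound index that supports both the filter and the sort, checked with explain() (the field names are assumptions):
from pymongo import MongoClient, ASCENDING, DESCENDING

orders = MongoClient()["shop"]["orders"]   # hypothetical collection

# The index supports queries filtering on customer and sorting by order_date.
orders.create_index([("customer", ASCENDING), ("order_date", DESCENDING)])

cursor = orders.find({"customer": "Asha"}).sort("order_date", DESCENDING)
plan = cursor.explain()
print(plan["queryPlanner"]["winningPlan"])   # should show an IXSCAN stage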
151. Indexing Methods in the mongo Shell
• db.collection.createIndex() – builds an index on a collection.
• db.collection.dropIndex() – removes a specified index on a collection.
• db.collection.dropIndexes() – removes all indexes on a collection.
• db.collection.getIndexes() – returns an array of documents that describe the existing indexes on a collection.
• db.collection.reIndex() – rebuilds all existing indexes on a collection.
• db.collection.totalIndexSize() – reports the total size used by the indexes on a collection; a wrapper around the totalIndexSize field of the collStats output.
• cursor.explain() – reports on the query execution plan for a cursor.
• cursor.hint() – forces MongoDB to use a specific index for a query.
• cursor.max() – specifies an exclusive upper index bound for a cursor; for use with cursor.hint().
• cursor.min() – specifies an inclusive lower index bound for a cursor; for use with cursor.hint().
152. Indexing Database Commands
• createIndexes – builds one or more indexes for a collection.
• dropIndexes – removes indexes from a collection.
• compact – defragments a collection and rebuilds the indexes.
• reIndex – rebuilds all indexes on a collection.
• validate – internal command that scans a collection's data and indexes for correctness.
• geoNear – performs a geospatial query that returns the documents closest to a given point.
• geoSearch – performs a geospatial query that uses MongoDB's haystack index functionality.
• checkShardingIndex – internal command that validates the index on the shard key.
153. Geospatial Query Selectors
• $geoWithin – selects geometries within a bounding GeoJSON geometry. The 2dsphere and 2d indexes support $geoWithin.
• $geoIntersects – selects geometries that intersect with a GeoJSON geometry. The 2dsphere index supports $geoIntersects.
• $near – returns geospatial objects in proximity to a point. Requires a geospatial index. The 2dsphere and 2d indexes support $near.
• $nearSphere – returns geospatial objects in proximity to a point on a sphere. Requires a geospatial index. The 2dsphere and 2d indexes support $nearSphere.
154. Indexing Query Modifiers
• $explain – forces MongoDB to report on query execution plans. See explain().
• $hint – forces MongoDB to use a specific index. See hint().
• $max – specifies an exclusive upper limit for the index to use in a query. See max().
• $min – specifies an inclusive lower limit for the index to use in a query. See min().
• $returnKey – forces the cursor to only return fields included in the index.
155. Thanks !!!
Keep in touch
Rajesh30menon
@YAHOO, GMAIL, HOTMAIL, SKYPE, TWITTER, INSTAGRAM, PINTEREST
My blog : http://www.technospirituality.com
MY BOOKS : Link : https://goo.gl/bQ8cnM (Amazon.com)
Link : https://goo.gl/owgMxT (Amazon.in)
http://www.technospirituality.com
Editor's notes
The following trends make it increasingly difficult to perform analytics with relational databases
And more importantly, the following trends make it nearly impossible to perform these analytics within the click stream (i.e. on-the-fly analysis and results)
Creating more data year after year.
Storing and processing this data is becoming increasingly difficult for relational databases
The total amount of data grows and becomes more connected. However it’s losing some of its predictable structure.
Blame generation Y! Yes, me. I don't want my information to fit into a 1970s-style database anymore; I want it to be all about me. This causes data to become more morphable.
Before we start talking about NOSQL let’s give relational databases a little credit.
Relational database are still great for tabular data
Performance degrades as data becomes more deeply connected and voluminous
I'm not telling you to shy away from relational databases, but in this polyglot-persistence world different use cases require different ways of storing and processing today's data
To find all friends at depth 5, MySQL will create a Cartesian product on the t_user_friend table 5 times, resulting in 50,000^5 records, out of which all but 1,000 are discarded. Neo4j, on the other hand, will simply visit nodes in the database, and when there are no more nodes to visit, it will stop the traversal.
It's not magic; it's all about the data structures and how they're localized.
Let's say we have about 50 people in the room and I ask you to count the people around you. It may take a few seconds to complete the task. But if we add 100 more people to the room, your ability to count the people around you is only slightly affected by the increase in the total number of people.
4 types of databases in the NOSQL universe:
K-V Stores
Column Family Store
Document Databases
Graph Databases
Who here has worked with NOSQL stores before?
For the people that raised their hand how many used...
KV Stores?
Column Family?
Document DBs?
Graph DBs?
If you raised your hand for Graph DBs, then pat yourself on the back b/c that’s where I spend most of my time.
Let’s look at each of the types
It's a massively scalable HashMap
Strengths: Again... it's a HashMap! If you can understand how HashMaps work, then KV stores are relatively easy to adopt.
Weaknesses: At the same time the simple data structure is a weakness. It's difficult to represent complex and interconnected data.
Essentially K-VVVVVVVVVV stores
Strengths: Supports semi-structured data
Weaknesses: Does not handle interconnected data well. You may pull your hair out trying to write code against these stores. However, the Spring Data project aims to reduce some of that complexity
These are becoming more popular today
Contains documents and a document is simply a key-value collection
Usually have great index support!!!
Is there anyone out there that's still using Notes? Please say no. Notes was actually one of the early document databases. I suppose you can say that's one thing that isn't completely terrible from the Lotus products.
Again we see this trend where all of these NOSQL stores do not handle interconnected data well. I wonder where this is going
Finally we have graph databases. My little section of the NOSQL universe
Has the richest data model of all of the NOSQL types
Graphs are naturally mutable, which makes them extremely hard to shard. You can shard based on domains, but you would need to reduce the chances of creating relationships between the two graphs.
In the following graph we see that KV stores are the best at scaling due to their simplistic data model and Graph databases are the worst at scaling because of the complexity and interconnectedness of the data.
Even though Graphs DBs are the worst at scaling out of all of the NOSQL types, we’re still able to cover 90% of today’s use cases.
Indexing relational DBs: Some people classify SOLR as a NOSQL store
The relational model is quite mature, but Graph theory is much older.
So when your boss says that you can't use a graph database because they're not mature enough, just tell him that he needs to check his facts.
This is my homeboy Leonhard Euler. Inventor of Graph Theory, swiss math ninja, Volvo lover, and apparently from his choice in clothing, he’s also the original hipster. But I’ll let that one slide.
What you draw on the white board is what you implement in your code. And truthfully, this was the main reason why I was attracted to graph databases in the first place.
I constantly found myself in the position where I would map out my domain on a white board, spend a ton of time normalizing my tables thinking I was this total SQL badass ninja, then I would deploy to production and performance would be horrible. Then I would have to denormalize the crap out of my database and before I knew it a week had already passed.
And more specifically how to query a Graph database?
Some of you already know this comic, but I have to give credit to the Basho Riak team for having a nerdy sense of humor.
The real answer for the Graph db world is traversals
This brings us to an experiment in which Neo Technology has benchmarked performance of MySQL and Neo4j in a social graph
We want to run a query that find all of the friends of Kyle. then the friends of his friends and so on.
We have a table that stores all users and another table that stores primary and foreign keys that map the friendships
This is an example of the SQL query used as depth 3. find friends of friends of friends of a particular user
find friends of friends of friends of the user
We see a dramatic decrease in performance the more inner joins we add to the query.
For Neo4j the social network is a typical graph
Neo4j’s traversal API is used to return a result set.
IS_FRIEND_OF = traverse relationships that a typed “IS_FRIEND_OF”
Evaluator.atDepth(2) = is how you limit the depth
Uniqueness.NODE_GLOBAL = means a node cannot be traversed more than once
traverse(nodeById) = is the id of the node where we want to start our traversal
So let look at Neo4j’s performance
We see that performance is relatively unaffected as we increase the depth of traversal
We perform the same queries but we increase the total amount of users to 1 million.
In MySQL we will have 1,000,000 records in t_user table, and approximately 1,000,000 X
50 = 50,000,000 records in t_user_friend table.
1,543.505 ~ 25 minutes
Depth five didn’t finish after running for an hour
For Neo4j we have a linear increase in execution time.
TAKE-AWAYS
Pentaho provides complete integrated DI+BI for every leading big data platform.
Big Data solutions are not databases. They don’t provide the capabilities that BI toolsets expect of a database.
Hadoop also has a high latency. This means the smallest query possible has an execution time that is much slower than that of a database
Hadoop is optimized for executing very intensive data processing tasks on very large amounts of data. It is not optimized for quick queries. Some Hadoop experts recommend configuring the workloads so that Hadoop jobs take an hour or more. This conflicts with OLAP performance criteria of 5-10 seconds per query.
There are database implementations within the Hadoop world, Hive, HBase etc.
Unfortunately for developers who are used to working with data transformation tools, the productivity within the Hadoop environment is not what they are used to.
TAKE-AWAYS
The better choice is obviously visual development