3. CAP Theorem
Consistency, Availability, Partition Tolerance (CAP)
A distributed system cannot simultaneously guarantee perfect
consistency, availability, and partition tolerance.
CAP is defined by:
Consistency: all nodes see the same data at the same time
Availability: every request receives a response indicating
whether it succeeded or failed
Partition tolerance: the system continues to operate despite
arbitrary message loss
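The trade-off can be illustrated with a toy simulation (plain
Python, not a real database; the replica model and the "CP"/"AP"
mode names are invented for illustration): once a partition cuts
a replica off, it must either refuse requests, preserving
consistency, or answer with possibly stale data, preserving
availability.

# Toy model: during a partition, a replica chooses between
# refusing requests (consistency) and serving stale data
# (availability). Not a real database.
class Replica:
    def __init__(self):
        self.value = None        # local copy of the data
        self.in_sync = True      # False once a partition blocks replication

    def read(self, mode):
        if self.in_sync:
            return self.value
        if mode == "CP":         # favor consistency: refuse to answer
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.value        # favor availability: answer, may be stale

primary, replica = Replica(), Replica()
primary.value = replica.value = "v1"

# A partition occurs; a write reaches the primary but not the replica.
replica.in_sync = False
primary.value = "v2"

print(replica.read("AP"))        # -> "v1" (available, but stale)
try:
    replica.read("CP")           # -> error (consistent, but unavailable)
except RuntimeError as err:
    print(err)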
5. NoSQL Databases
NoSQL databases are next-generation databases, mostly addressing
some of the following points:
Being non-relational,
distributed,
open-source, and
horizontally scalable
Often more characteristics apply to NoSQL databases, such as:
Schema-free, easy replication support, simple API, eventually
consistent/BASE (basically available, soft state, eventual
consistency)
Not ACID but BASE
6. Properties of NoSQL Databases
Non-relational
Distributed
Open-source
Horizontally scalable
Schema-free
Easy replication support
Simple API
BASE not ACID
There are currently more than 225 NoSQL databases.
NoSQL databases are widely used by many well-known enterprises
such as Google, Yahoo, Facebook, Twitter, Taobao, and Amazon.
7. Categories of NoSQL Databases
● Here are the four main types of NoSQL databases:
● Document databases
● Key-value stores
● Column-oriented databases
● Graph databases
● According to the statistics of the DB-Engines Ranking
website, Apache Cassandra and Apache HBase are the most widely
discussed wide-column store databases.
8. Document based
● A document database stores data in JSON, BSON, or XML
documents.
● In a document database, documents can be nested. Particular
elements can be indexed for faster querying.
● The most widely adopted document databases are usually
implemented with a scale-out architecture, providing a clear
path to scalability of both data volumes and traffic.
● Examples of document stores are MongoDB and CouchDB.
9. Cont’d
● A collection is a group of documents. The documents within a
collection usually relate to the same subject, such as
employees, products, and so on.
● A document is a set of ordered key-value pairs, where the key
is a string used to reference a particular value, and the
value can be either a string or a document.
● JSON (JavaScript Object Notation), BSON (Binary JSON), and
XML (eXtensible Markup Language) are formats commonly used to
define documents.
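As a minimal sketch of these ideas (assuming a MongoDB server on
localhost and the pymongo driver; the database, collection, and
field names are illustrative, not from the slides):

# A minimal sketch using the pymongo driver; assumes a MongoDB
# server running on localhost. Names ("company", "employees")
# are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
employees = client["company"]["employees"]     # database -> collection

# Documents are JSON-like, and a value can itself be a document.
employees.insert_one({
    "name": "Alice",
    "role": "engineer",
    "address": {"city": "Addis Ababa", "zip": "1000"},  # nested document
})

# Index a particular element for faster querying.
employees.create_index("name")
print(employees.find_one({"name": "Alice"}))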
11. KEY-VALUE STORES
● Key-value stores are the least complex of the NoSQL databases.
They are, as the name suggests, a collection of key-value
pairs.
● The data in this category of NoSQL databases is stored in the
format “Key → Value”, where
– Key is a string used to identify a unique value;
– Value is an object that can be a simple string, a numeric
value, or a complex object such as a BLOB, a JSON document,
an image, audio, and so on.
● According to the statistics of the DB-Engines Ranking website,
Redis and DynamoDB are among the most widely used key-value
stores.
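A minimal sketch using the redis-py client (assumes a Redis
server on localhost:6379; the keys and values are illustrative):

# Key -> Value storage with redis-py; assumes a local Redis server.
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("user:42:name", "Alice")     # key -> simple string value
r.set("page:home:hits", 0)
r.incr("page:home:hits")           # numeric values support atomic increments

print(r.get("user:42:name"))       # b"Alice"
print(r.get("page:home:hits"))     # b"1"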
13. Graph Databases
● The most complex type, geared toward storing relations
between entities in an efficient manner.
● The graph database model (GDM) is composed of vertices and
edges [5], where
– A vertex is an entity instance, which is equivalent to a
tuple in the relational data model (RDM);
– An edge is used to define the relationship between vertices;
– Each vertex and edge can contain any number of attributes
that store the actual data values.
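A toy in-memory example of this model (plain Python, no graph
database required; the people and the relationship are invented
for illustration):

# Vertices and edges each carry arbitrary attributes, as
# described above.
vertices = {
    "v1": {"label": "Person", "name": "Alice"},
    "v2": {"label": "Person", "name": "Bob"},
}
edges = [
    # (source vertex, target vertex, edge attributes)
    ("v1", "v2", {"type": "KNOWS", "since": 2019}),
]

# Traverse the graph: whom does Alice know?
for src, dst, attrs in edges:
    if vertices[src]["name"] == "Alice" and attrs["type"] == "KNOWS":
        print(vertices[dst]["name"], attrs)   # Bob {'type': 'KNOWS', ...}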
18. Basics
● The major challenges associated with big data are as follows:
– Capturing data
– Curation
– Storage
– Searching
– Sharing
– Transfer
– Analysis
– Presentation
● To address these challenges, organizations normally rely on
enterprise solutions built as layered frameworks.
19. Hadoop Ecosystem
● Apache Hadoop is an open-source framework.
● Hadoop gives businesses the ability to store data in a
distributed fashion and process it in parallel, handling data
of higher volume, velocity, variety, value, and veracity.
● The Hadoop ecosystem is a platform, or suite, that provides
various services to solve big data problems. It includes many
Apache projects:
– HDFS: Hadoop Distributed File System
– YARN: Yet Another Resource Negotiator
– MapReduce: programming-based data processing
– Spark: in-memory data processing
– Pig, Hive: query-based processing of data services
– HBase: NoSQL database
– Mahout, Spark MLlib: machine learning algorithm libraries
– Solr, Lucene: searching and indexing
– ZooKeeper: cluster management
– Flume, Chukwa, Scribe, Kafka, Sqoop: data collection
21. Cont’d
● All these toolkits and components revolve around one thing:
data.
● That is the beauty of Hadoop: because everything revolves
around data, its synthesis becomes easier.
● There are four major elements of Hadoop:
– HDFS,
– MapReduce,
– YARN, and
– Hadoop Common.
● Let’s study each in more detail.
22. HDFS
● HDFS is responsible for storing large data sets of structured
or unstructured data across various nodes, maintaining the
metadata in the form of log files.
● HDFS consists of two core components:
– Name Node
– Data Node
● The Name Node is the prime node: it contains the metadata
(data about data) and requires comparatively fewer resources
than the Data Nodes, which store the actual data.
● The Data Nodes are commodity hardware in the distributed
environment, which undoubtedly makes Hadoop cost-effective.
● HDFS maintains all the coordination between the clusters and
the hardware, thus working at the heart of the system.
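A toy model of this split (plain Python, not the real HDFS API;
the file path, block IDs, and node names are invented): the Name
Node holds only metadata about which blocks make up a file and
where the replicas live, while the Data Nodes hold the actual
block contents.

# Name Node: metadata only (file -> blocks -> replica locations).
namenode = {
    "/logs/app.log": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
# Data Nodes: the actual block contents, replicated across nodes.
datanodes = {
    "dn1": {"blk_1": b"first 128MB of data..."},
    "dn2": {"blk_1": b"first 128MB of data...", "blk_2": b"rest..."},
    "dn3": {"blk_2": b"rest..."},
}

def read_file(path):
    """Ask the Name Node for block locations, then fetch each
    block from one of its Data Nodes (here: the first replica)."""
    return b"".join(
        datanodes[replicas[0]][block_id]
        for block_id, replicas in namenode[path]
    )

print(read_file("/logs/app.log"))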
23. MapReduce
● By making use of distributed and parallel algorithms,
MapReduce carries the processing logic to the data and helps
developers write applications that transform big data sets
into manageable ones.
● MapReduce makes use of two functions, Map() and Reduce(),
whose tasks are:
– Map() performs sorting and filtering of the data, organizing
it into groups. Map() generates key-value pairs that are
later processed by the Reduce() method.
– Reduce(), as the name suggests, performs summarization by
aggregating the mapped data. In short, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
26. A Word Count Example of MapReduce
● Let us understand how MapReduce works by taking an example:
a text file called example.txt whose contents are as follows:
● Dear, Bear, River, Car, Car, River, Deer, Car and Bear
● Now, suppose we have to perform a word count on example.txt
using MapReduce. So, we will find the unique words and the
number of occurrences of those unique words, as in the sketch
below.
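A minimal, single-machine sketch of that flow (plain Python, not
the Hadoop API): Map() emits a (word, 1) pair per word, the
framework's shuffle step is imitated by grouping pairs by key,
and Reduce() sums each group.

# Word count in the MapReduce style, on one machine.
from collections import defaultdict

text = "Dear Bear River Car Car River Deer Car Bear"

# Map: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in text.split()]

# Shuffle: group the pairs by key (Hadoop does this between phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each unique word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}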