Data Storage and Management
Project
on
Comparison of HBase and Cassandra
Shrikant Uday Samarth
X18129137
MSc in Data Analytics {2019-2020}
Submitted to: Muhammad Iqbal
Contents
Abstract
Introduction
Key Characteristic:
HBase:
Cassandra:
Architecture
HBase Architecture:
Cassandra Architecture:
Comparison - HBase and Cassandra
Scalability:
Availability:
Reliability:
Transaction Management:
Learning from Literature Survey
Performance Test Plan:
Evaluation and Results:
Workload Result:
Conclusion and Discussion
References
Abstract
With heavy internet usage, huge amounts of data are being generated by
sources such as the banking sector, government organizations, IoT
devices and web applications. The volume of generated data now exceeds
petabytes, which gave rise to the term 'Big Data'. 'Big Data' describes
innovative techniques and technologies to capture, store, distribute,
manage and analyse petabyte-scale or larger datasets that arrive at
high velocity and in diverse structures. The complexity of Big Data
requires new techniques, algorithms and architectures. A traditional
relational database such as MySQL struggles to process, manage and
analyse such an enormous amount of data. Companies are therefore moving
towards platforms such as Spark and Hadoop, which offer advantages like
fast execution, real-time analytics, and parallel and distributed
computing in a cost-effective manner. Two popular NoSQL databases are
HBase, which runs on top of Hadoop, and Cassandra. In this project, we
have studied the performance of both the HBase and Cassandra databases
using various operations performed on Ubuntu.
Introduction
In recent years, a huge amount of data has been generated as internet usage
has increased, and data volumes now exceed petabytes. Huge datasets, or
collections of datasets, are referred to as Big Data when their size
(volume), complexity (variability) and rate of growth (velocity) make them
hard to capture, manage, process or analyse with traditional or
conventional technologies. A few years ago, organizations were able to
handle their data with a relational database management system (RDBMS).
As social media and the internet emerged, it became nearly impossible for
these conventional databases to handle such enormous data, resulting in
poor data transfer rates, slow processing times and low scalability.
Because of these disadvantages the cost of data started to rise and it
became difficult for organizations to cope. NoSQL databases then started
to become popular because they are more scalable, provide fast execution,
real-time analytics, parallel and distributed computing, and are cost
effective. NoSQL can handle huge amounts of structured, semi-structured
and unstructured data. Moreover, NoSQL does not need to follow an
established relational schema, which helps organizations to narrow down
their operational goals (TechTarget, 2017).
Types of NoSQL databases are as follows:
• Key-Value Store - A large hash table of keys and values (example:
Riak).
• Document-based Store - Stores documents made up of tagged
elements.
• Column-based Store - Each storage block contains data from only a
single column.
• Graph-based - A network database that uses edges and nodes to
represent and store data (KUMAR, 2018).
Although NoSQL data stores have surged in popularity, there has always
been doubt about the performance of NoSQL databases and about which
database is better suited to which use case. In this project we compare
two such popular databases, HBase and Cassandra, on various
parameters.
Key Characteristics:
HBase:
HBase is an open-source database written in Java and developed as part of
the Apache Software Foundation's Apache Hadoop project (Wikipedia,
2019). The key characteristics of HBase are as follows:
• HBase is modelled on Google's BigTable.
• It is a NoSQL data store built on top of the HDFS file system.
• It can use the MapReduce computational framework to retrieve
and store data.
• It supports continuous updates as fresh data streams in.
• It is column-family oriented.
• HBase uses a master-slave design, and standby masters reduce the risk
of a single point of failure.
• The ZooKeeper service coordinates the HMaster and the region servers
and handles service discovery.
• It is used to process huge amounts of data.
• Various operations can be performed and data can be modified in place
inside the HBase database.
Cassandra:
Cassandra is another efficient NoSQL database, originally developed at
Facebook. It was then released as a free open-source project on Google
Code and was later adopted by the Apache Software Foundation
(Wikimedia Foundation, 2019). The key characteristics of Cassandra are as
follows:
• Cassandra is a Java-based system that can be managed and
monitored through JMX.
• Cassandra is a free, open-source NoSQL database that stores values
in the form of key-value pairs.
• Cassandra works on the principle of clustering with master-less
replication: a query sent to a Cassandra cluster can be served by any
node that holds the data. This lets it handle huge amounts of data and
makes it a highly robust, fault-tolerant database.
• It offers two types of consistency: eventual and strong. With eventual
consistency the data becomes consistent, but only after some delay.
Whatever conflicts arise are resolved, because keeping writes available
is the priority.
• Cassandra runs on a number of nodes and provides high fault tolerance
while keeping data in sync across them. It follows a distributed design
in which all nodes communicate with one another.
• A bloom filter is an extremely fast way to test whether an element is
present in a set. A bloom filter can tell that an item may exist in the
set or that it definitely does not exist in the set.
• An SSTable is a sorted, immutable key-value map. It is essentially an
efficient way of storing large sorted data segments in a file.
• Data distribution in Cassandra is configurable: replication and
replication strategies decide how data is replicated across data
centres, racks and nodes (Mehra, 2015).
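The bloom filter behaviour described above ("may exist" versus "definitely does not exist") can be illustrated with a minimal Python sketch. This is not Cassandra's implementation; the class name, the bit-array size and the use of SHA-256 as the hash function are assumptions chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative bloom filter (hypothetical, not Cassandra's code)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A key that was added always reports `True`; a key that was never added usually reports `False`, but may occasionally report `True` (a false positive), which is why a positive answer still requires reading the SSTable.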
Architecture
HBase Architecture:
HBase is a column-oriented NoSQL data store built on top of the HDFS file
system. Compared with plain HDFS, HBase's master-slave design with standby
masters reduces the risk of a single point of failure. One of the
interesting capabilities of HBase is auto-sharding, which simply means
that tables are dynamically distributed by the system when they become too
large. In HDFS there is a single active master, and the data cannot be
recovered from the slaves if that master fails. HBase mitigates this
condition: it can have several HMasters, but only one can be active at a
time (the HMaster plays a role similar to the NameNode in HDFS) (Sinha,
2019).
ZooKeeper, the HMaster and the region servers are the three fundamental
components of HBase, with HDFS as the underlying storage, as shown in the
diagram below:
Region Servers:
Region servers serve data for reads and writes. Clients therefore
communicate directly with HBase region servers when accessing data, while
the HBase Master process handles region assignment as well as DDL (create,
delete tables) operations.
A region consists of all the rows between the start key and the end key
assigned to that region. Regions are assigned to the nodes of the HBase
cluster, which are called "region servers" (TEAM, 2018). A region server
manages the regions of HBase tables that are assigned to it. Every region
of an HBase table may have several column families, and each column family
of a region is held in a Store. The MemStore holds the in-memory
modifications to the Store. StoreFiles (HFiles) are the files where the
actual data is stored as key-value pairs of columns and their values
(Pethuru Raj, 2014).
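The idea that each region covers the rows between a start key and an end key can be sketched as a sorted lookup. The region boundaries and server names below are hypothetical, and the sketch only illustrates the key-range principle, not how HBase actually locates regions.

```python
import bisect

# Hypothetical layout: each region starts at the given key (inclusive) and
# ends where the next region starts (exclusive). "" is the lowest key.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def locate_region(row_key):
    """Return (region index, region server) responsible for row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]
```

For example, `locate_region("apple")` falls into the first region, while a row key equal to a boundary such as `"g"` belongs to the region that starts at that key.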
HMaster:
The HBase master is in charge of region assignment as well as DDL (create,
delete tables) operations. The HMaster coordinates the region servers and
handles administrative functions. It also monitors all region server
instances in the HBase cluster. The master assigns regions at start-up and
reassigns them for recovery or load balancing. It also acts as the
interface for creating, deleting and updating tables in HBase.
Zookeeper:
ZooKeeper is a centralized service that maintains the configuration of
HBase, coordinates the processes between the HBase clients and the
HMaster, and handles distributed synchronization when several HBase
clients are connected to HBase and accessing shared resources.
Cassandra Architecture:
Cassandra is designed to handle very large volumes of data on a
distributed architecture. Its main feature is to store data on multiple
nodes with no single point of failure. Cassandra works on the assumption
that hardware fails and that any node can be down at any time. When that
happens, the stored data is served from another node. Cassandra stores
data on multiple nodes in a peer-to-peer distributed architecture, and the
gossip protocol is used for communication between nodes (Guru99, 2019).
The architecture for Cassandra is shown below:
Node: This is the basic component of Cassandra, where data is stored in
the cluster. In case of node failure, other nodes take its place.
Datacenter: A centralized location that houses computer and networking
systems to help meet an organization's information technology needs.
Cluster: The collection of all data centers. All nodes participating in a
cluster share the same cluster name. Seed nodes are used during start-up
to help discover all participating nodes. Seed nodes have no special
purpose other than helping to bootstrap the cluster using the gossip
protocol. When a node starts up, it looks at its seed list to obtain
information about the other nodes in the cluster.
Commit-Log: Every write operation is written to the commit log. The commit
log is used for crash recovery.
Mem-table: A memtable is a write-back cache residing in memory that has
not been flushed to disk yet. When the content held in the memtable
reaches a threshold value, it is flushed to a disk file called an SSTable.
Data Replication: Data is replicated across nodes so that a failure causes
no data loss. Partitioning data on a shared-nothing system introduces a
single point of failure: if one of the nodes goes down, the data it holds
becomes unavailable. This limitation is overcome by making copies of the
data, known as replicas. Replication of data ensures fault tolerance and
reliability.
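The interplay of commit log, memtable and SSTable described above can be sketched in a few lines of Python. This is a simplified, in-memory illustration under stated assumptions: the class `TinyStore`, its flush threshold and its `recover` method are hypothetical names, and real Cassandra persists the commit log and SSTables on disk.

```python
class TinyStore:
    """Toy write path: commit log -> memtable -> immutable sorted SSTable."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # in-memory write-back cache
        self.sstables = []     # oldest first; each is a sorted tuple of pairs

    def write(self, key, value):
        # Log first, so the write can be replayed after a crash.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []   # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None

    def recover(self):
        # Rebuild the memtable by replaying the commit log.
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value
```

Clearing `memtable` and then calling `recover()` mimics a node restart: writes that had not yet been flushed are restored from the commit log.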
Comparison - Hbase and Cassandra
NoSQL databases are generally discussed in terms of the CAP theorem. CAP
stands for Consistency, Availability and Partition tolerance. The essential point
to note here is that a database can achieve only two of the three at a time.
Scalability:
The scalability of a database is the ability to add computational resources to the
database in order to gain throughput. There are two types of scalability:
horizontal and vertical. Vertical scaling means moving from one machine
to another that has more RAM or CPU. Vertical scaling of a database
is an expensive process. When managing a large amount of data, moving to a new
storage infrastructure is necessary. In terms of resource commitment, if there are
any requirements for maintaining uptime, significant operational planning and
effort are normally required to migrate to the new system. If the volume of data is
huge, the physical transfer from the old system to the new can
take a lot of time, depending on the load. A system is horizontally scalable if
hardware can be added incrementally: when more capacity is needed, additional
hardware is added. Ideally, each hardware addition yields a linear increase in
available capacity without reconfiguration or downtime of the existing nodes.
Apache Cassandra meets the requirements of an ideal horizontally scalable
system by allowing seamless addition of nodes. As you need more capacity,
you add nodes to the cluster and the cluster uses the new resources automatically.
HBase is highly scalable because, as the data in the database grows,
tables are distributed horizontally. This follows from HBase being based on
Google's BigTable. Horizontal scalability in HBase can be observed across
the region servers, which act as the slaves in the cluster. HBase also
offers strong row-level consistency, and "coprocessors" that provide the equivalent of
triggers and stored procedures (PFEIL, 2010).
Availability:
This condition states that every request receives a response indicating success or
failure. Achieving availability in a distributed system requires that the
system remains operational 100% of the time. Every client gets a response,
regardless of the state of any individual node in the system. This metric
is trivial to measure: either you can submit read or write
commands, or you cannot. The nodes therefore need to be online and reachable
at all times.
HBase uses a master-slave relationship. One of its interesting capabilities is
auto-sharding, which simply means that tables are dynamically distributed by the
system when they become too large. In HDFS there is a single active master, and
the data cannot be recovered from the slaves if that master fails. HBase mitigates
this condition: it can have several HMasters, but only one can be active at a time
(the HMaster plays a role similar to the NameNode in HDFS) (Sinha, 2019).
In HBase, incoming data is also copied to a secondary node. In case of a master
node failure, the secondary node is available to act as master. The data nodes
only process the data, and the failure of one of these nodes is not a big
concern, because synchronization, scheduling and communication
between the other nodes make it possible to retrieve the data easily.
Cassandra has many nodes, and data written through one node is replicated,
typically to three replicas. One replica is kept on the same rack and another copy
on a different rack, so in case of a node or rack failure the data is still available
from the other rack or another node. Keeping one of the copies on another node of
the same rack reduces inter-rack communication and hence the cost. Hence,
the Cassandra system stays operational all the time.
Reliability:
The reliability of a database is marked by how consistently it performs as expected
and according to its defined specification. When the system conditions change or
a fault occurs in the system and the database still shows the same or better
performance, we can say that the database is highly reliable.
In HBase, reliability is supported by ZooKeeper, a centralized service
that maintains the configuration of HBase, coordinates the
processes between the HBase clients and the HMaster, and handles
distributed synchronization when several HBase clients are
connected to HBase and accessing shared resources. Hence, HBase is up to
the mark on all parameters of consistency.
Cassandra is designed on a distributed architecture. Its main feature is to store
data on multiple nodes with no single point of failure. Cassandra works on
the assumption that hardware fails and that any node can be down at any time. When
that happens, the stored data is served from another node. Cassandra stores data on
multiple nodes in a peer-to-peer distributed architecture, and the gossip protocol is
used for communication between nodes. Hence, Cassandra is highly reliable.
Transaction Management:
HBase provides atomicity of mutations (puts/writes) on a per-row basis,
even when the put operation spans multiple column families.
However, no transactional guarantee is given for changes across rows.
In HBase, although the data is stored on storage such as HDFS,
writes always go through a set of servers called region servers. Each table is
divided into key-space partitions called regions, and a region server is responsible
for serving traffic for a subset of regions. When a write occurs for a particular
row in a region served by a region server, that region server takes a row lock across
all column families for that row and blocks any other concurrent writes to that
row. It then persists the write operation to a WAL (write-ahead log) on HDFS. Only
after that is the put applied to every column family involved in the
Put. This is how HBase guarantees row-level atomicity of write operations.
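The sequence described above (take a row lock, persist to the WAL, then apply the change to every column family) can be sketched as follows. This is a minimal in-memory illustration, not HBase code: the class `RowAtomicTable` and its attributes are hypothetical, and the WAL here is just a Python list rather than a file on HDFS.

```python
import threading
from collections import defaultdict


class RowAtomicTable:
    """Toy table whose put() is atomic per row across column families."""

    def __init__(self):
        self._row_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock dictionary
        self.wal = []                         # stands in for the WAL on HDFS
        self.rows = defaultdict(dict)

    def put(self, row_key, cells):
        """cells maps (column_family, qualifier) -> value."""
        with self._locks_guard:
            row_lock = self._row_locks[row_key]
        with row_lock:
            # 1. Persist the mutation to the write-ahead log.
            self.wal.append((row_key, dict(cells)))
            # 2. Only then apply it to all column families of the row.
            self.rows[row_key].update(cells)
```

Because the lock is held per row, two writers to the same row are serialized, while writers to different rows do not block each other. Nothing in this sketch coordinates changes across rows, which mirrors the lack of cross-row guarantees noted above.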
In Cassandra, a write or update that touches two or more rows is treated as
one single write operation. This means that write operations are atomic at
partition level, and atomicity is also upheld at row level. Cassandra writes the data
to the replica nodes and waits for acknowledgement from the
nodes. Whenever changes are made to a column, Cassandra
uses timestamps to determine the most recent change. In terms of isolation,
Cassandra performs operations at full row level. Once the write has been performed
at the node, access is given to the client. Durability means that once a write is
completed it survives even if the server crashes. If the server
crashes before the memtable is flushed to disk, the commit log is replayed when
the node reboots to recover the lost writes.
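The use of timestamps to decide which change wins can be sketched as a "last write wins" merge. This is a simplified illustration: the function names are hypothetical, and ties are resolved here by keeping the existing version, which is an assumption of this sketch rather than a description of Cassandra's tie-breaking rules.

```python
def merge_cell(existing, incoming):
    """Each version is a (timestamp, value) pair; the newer timestamp wins."""
    if existing is None or incoming[0] > existing[0]:
        return incoming
    return existing


def reconcile(replica_versions):
    """Pick the winning version among the copies returned by replicas."""
    winner = None
    for version in replica_versions:
        winner = merge_cell(winner, version)
    return winner
```

For instance, if three replicas return versions written at timestamps 10, 30 and 20, `reconcile` returns the version with timestamp 30.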
Learning from Literature Survey
Big Data is characterized by the 3Vs, namely Velocity, Variety and Volume. Fifteen
years ago, the size of data was manageable. Data centres could store data in a
structured format, and the data did not grow so quickly. With the introduction of
mobile devices and technological advancements in the internet, the amount of
data has grown in the last 5 years. The data generated in the last 5 years was
equivalent to the data generated in the 10 years before that. As the quantity of data
increased, it became clear that data comes not only in
structured form but also in unstructured and semi-structured form.
The volume of data is increasing day by day; Thomson Reuters stated in its
annual report that in 2010 the world's data amounted to more than 800 exabytes
and counting. Over the last decade, the volume of data has increased with the
heavy daily use of social media platforms. Beyond volume, data now comes in
different formats as well. Traditional data types
were structured and consisted of fields such as time, date and amount, to take
a banking scenario as an example. Today, we have data in all formats, such as
images, text documents, movies and so on. This is unstructured data, which
brings variety to Big Data.
Velocity is defined as the speed at which data is recorded in databases.
The frequency with which data is registered counts towards the velocity of
data. A good example is streaming on Amazon Prime or Netflix, which
broadcasts shows and also records the live feed of comments as feedback on
the shows; this feedback is used for sentiment analysis and to decide the rating of
the show.
One more V that plays its part here is Veracity, which means
the trustworthiness of the data. Data coming from an
unknown source may be unreliable or false. This
in turn produces unreliable patterns that give the data analyst incorrect
results. So even though it is not one of the core Vs of Big Data, veracity plays
an equally important role in Big Data analysis (Williamson, 2018).
Among the studies available online, a technical report by Hiren
Patel titled "HBase: A NoSQL Database" describes HBase
as a robust NoSQL storage system inspired by Google's BigTable. Each
row in an HBase table is associated with a key, and this key is used to access
the row. Its architecture is robust and capable of storing sparse
data in real time. The three main components, the master server, the region server
and ZooKeeper, are an essential part of its architecture and enable fast operations
on data. In Patel's view, HBase has lost its grip on ever-increasing data and
is not recommended for present-day Big Data operations. For some features it
has the upper hand over MongoDB and Cassandra, but overall both perform
better than HBase (Patel, 2017).
Cassandra is a distributed database that is highly scalable in terms of
storage and throughput. Cassandra's design brings together the data model
described in Google's Bigtable paper and the behaviour of Amazon Dynamo (Ref:
Dynamo: Amazon's Highly Available Key-value Store, in Proceedings of the twenty-first
ACM SIGOPS Symposium on Operating Systems Principles (2008)) (Chang, 2008).
Cassandra, along with its remote Thrift API (Ref: Thrift: Scalable Cross-Language
Services Implementation, Facebook) (Slee, 8), was initially developed by Facebook
as a data platform to build services such as Inbox Search that scale to serve
hundreds of millions of users (Lakshman, 35-40). After it was released to the Apache
Software Foundation Incubator in 2009, Cassandra was accepted as a top-level
Apache project in March 2010 (Featherston, 2010).
HBase and Cassandra have been compared on the same grounds by several data
analysts, notably Rajith Kumar and Roseline Mary. In their comparison
of NoSQL databases including Cassandra, HBase and MongoDB, they evaluated
the databases with the Yahoo Cloud Serving Benchmark under different variations of
read and update workloads and found that Cassandra works better and faster than
HBase (Mary, 2017). This motivates my research to run an experiment with
different amounts of workload on HBase and Cassandra and validate the
performance.
Performance Test Plan:
The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source
specification and program suite for evaluating the retrieval and maintenance
capabilities of computer programs. It is often used to compare the relative
performance of NoSQL database management systems. The original benchmark
was developed by engineers in the research division of Yahoo!, who released it
in 2010. In this project we have used the YCSB benchmarking tool to evaluate
HBase and Cassandra performance. The figure below shows the architecture of
the YCSB benchmarking framework.
The YCSB client generates the data to be loaded into the database and
generates the operations that make up the workload. The workload executor
drives multiple client threads. Each thread executes a sequential
series of operations by making calls to the database interface layer, both
to load the database and to execute the workload. The threads
also measure the latency and achieved throughput of their
operations and report these measurements to the statistics module. Finally, the
statistics module aggregates the measurements and reports average, 95th and 99th
percentile latencies, and either a histogram or time series of the
latencies. Workload executor: the workload executor contains the various
workload scenarios, which are collections of write, read and
update operations (Brian F. Cooper).
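The flow described above (executor threads call the database layer, time each operation, and a statistics module aggregates average and percentile latencies) can be sketched in Python. This is not YCSB's code: `run_workload`, `Stats` and the nearest-rank `percentile` helper are hypothetical names, and `db_call` stands in for whatever database operation is being measured.

```python
import math
import threading
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


class Stats:
    """Collects per-operation latencies (in microseconds) from all threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.latencies_us = []

    def record(self, latency_us):
        with self._lock:
            self.latencies_us.append(latency_us)

    def summary(self):
        values = sorted(self.latencies_us)
        return {
            "count": len(values),
            "average": sum(values) / len(values),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }


def run_workload(db_call, operations_per_thread, num_threads):
    """Drive db_call from several threads and report latency and throughput."""
    stats = Stats()

    def worker():
        for _ in range(operations_per_thread):
            start = time.perf_counter()
            db_call()
            stats.record((time.perf_counter() - start) * 1e6)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = max(time.perf_counter() - started, 1e-9)

    result = stats.summary()
    result["throughput_ops_per_sec"] = result["count"] / elapsed
    return result
```

Calling `run_workload(lambda: None, operations_per_thread=50, num_threads=4)` exercises the harness with a no-op in place of a real database call; the measured numbers then reflect only the overhead of the harness itself.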
The system configuration on which the tests were performed is as follows:
Processor: 1.7GHz dual-core Intel Core i5-4210U CPU, 2 cores; 12 GB RAM;
Windows 8 operating system; system type: 64-bit operating
system, x64-based processor.
• Virtualization Tool: OpenStack cloud server.
• HBase Virtual Machine: Ubuntu (64-bit) operating system, 4 GB RAM,
2-core virtual processor.
• Cassandra Virtual Machine: Ubuntu (64-bit) operating system, 4 GB
RAM, 2-core virtual processor.
• Workload Parameters: Workload A & Workload B
The databases selected for this project are HBase and Cassandra. After
successful installation of HBase and Cassandra, the YCSB benchmarking tool
was installed to carry out the performance evaluation of the selected databases.
A test harness was used to run the various operations of the
performance test. A test harness is an automation tool for running tests. It refers
to the system test drivers and other supporting tools required to
execute tests. It provides stubs and drivers, which are small programs that
interact with the software under test.
Workload A: Update-heavy workload, with a 50/50 mix of read and
write operations.
Workload B: Read-mostly workload, with a 95/5 mix of read and write
operations.
The operation counts used for the performance evaluations are
100000, 250000, 350000 and 500000. Tests are performed for both Workload A
and Workload B on the HBase and Cassandra databases.
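The two workload mixes can be illustrated with a short sketch that draws each operation at random according to the read proportion. The function and dictionary names are hypothetical and this is not how YCSB itself is configured; it only shows why a run of a 50/50 or 95/5 workload produces read and update counts that are close to, but not exactly, the nominal split.

```python
import random

# Read proportions taken from the workload descriptions above.
WORKLOADS = {"A": 0.50, "B": 0.95}


def generate_operations(workload, operation_count, seed=0):
    """Return a list of 'READ'/'UPDATE' labels for the chosen workload."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    read_proportion = WORKLOADS[workload]
    return [
        "READ" if rng.random() < read_proportion else "UPDATE"
        for _ in range(operation_count)
    ]
```

For example, `generate_operations("B", 10000)` yields a list in which roughly 95% of the entries are reads and the rest are updates.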
Evaluation and Results:
For the HBase and Cassandra performance evaluation we ran tests for
Workload A and Workload B using the benchmarking
tool (YCSB). The specifications are as follows:
Workload A: Update-heavy workload, with 50% reads and 50%
updates.
Workload B: Read-mostly workload, with 95% reads and 5%
updates.
Workload Result:
Below are the findings for Workload A for both HBase and Cassandra:

Database   Workload  Operations  Avg latency  Read        Avg latency  Update      Overall     Load avg latency  Load overall
                                 (Read)       operations  (Update)     operations  throughput  (Insert)          throughput
HBase      1         100000      248.88       50237       304.18       49763       3000.57     355.02            2449.06
HBase      2         250000      254.00       125120      294.21       124880      3527.78     359.82            2600.59
HBase      3         350000      1163.17      174384      351.78       175616      1261.51     347.59            2817.90
HBase      4         500000      2367.44      249951      550.20       250049      674.19      353.89            2687.72
Cassandra  1         100000      445.95       37399       392.75       50090       2201.19     534.64            1282.09
Cassandra  2         250000      436.84       125285      369.35       124715      2310.28     476.96            1969.90
Cassandra  3         350000      444.61       127737      371.16       175480      2345.36     474.47            1881.75
Cassandra  4         500000      509.43       249917      368.29       250083      2193.01     454.61            2122.84

Below are the findings for Workload B for both HBase and Cassandra:

Database   Workload  Operations  Avg latency  Record      Avg latency  Update      Overall     Load avg latency  Load overall
                                 (Read)       read        (Update)     operations  throughput  (Insert)          throughput
HBase      1         100000      225.89       95005       522.51       4995        3920.95     373.27            2330.19
HBase      2         250000      328.28       237454      489.27       12546       2793.89     341.53            2858.48
HBase      3         350000      464.38       332546      492.77       17454       2061.19     359.49            2727.11
HBase      4         500000      2321.91      475098      769.06       24902       442.97      382.34            2535.05
Cassandra  1         100000      428.17       60518       435.19       4983        2075.08     482.66            1547.50
Cassandra  2         250000      408.05       237434      373.82       12566       2298.66     459.94            2044.25
Cassandra  3         350000      401.77       214270      375.02       17491       2463.87     458.73            1929.41
Cassandra  4         500000      401.39       328559      364.79       25237       2456.92     455.86            2038.63
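One simple way to quantify how steady each database is across the four operation counts is the ratio of the highest to the lowest overall throughput. The sketch below uses the overall throughput values from the Workload A table above; the `spread` helper is a hypothetical name introduced here, not a metric reported by YCSB.

```python
# Overall throughput for Workload A, in the order 100000, 250000,
# 350000 and 500000 operations (values copied from the table above).
THROUGHPUT_A = {
    "HBase": [3000.57, 3527.78, 1261.51, 674.19],
    "Cassandra": [2201.19, 2310.28, 2345.36, 2193.01],
}


def spread(values):
    """Ratio of the largest to the smallest value; 1.0 means perfectly flat."""
    return max(values) / min(values)


if __name__ == "__main__":
    for database, values in THROUGHPUT_A.items():
        print(f"{database}: max/min throughput ratio = {spread(values):.2f}")
```

A ratio close to 1 indicates throughput that stays nearly constant as the operation count grows, while a larger ratio indicates that throughput varies strongly between runs.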
For Workload A
HBase and Cassandra are compared below on the basis of average
latency, read operations, throughput and number of operations taken from the
run files, and overall load throughput, average load latency and load operations
taken from the load files. Workloads 1, 2, 3 and 4
correspond to 100000, 250000, 350000 and 500000 operations respectively.
1. Average Latency Vs Record Read Operations for Workload A:
The above figure compares HBase and Cassandra with
respect to records read and average read latency. The average latency of
Cassandra is consistent across read record counts, whereas for
HBase a spike can be seen at the fourth read record count. So,
we can say that the average latency of HBase is more consistent for low numbers
of read operations than for high numbers. In contrast, there is no major
difference in the average latency of Cassandra.
2. Overall Throughput Vs Operations for Workload A:
Throughput is the amount of data transferred from one
node to another in a given time frame. From the graph of
throughput versus operations it can be seen that HBase behaves differently
from Cassandra. For low workloads the throughput of HBase is high
and it continues to increase until 250000 operations, but for high workloads it
plummets. Cassandra's throughput, on the other hand, is lower than
HBase's at low workloads but remains consistent throughout
the workloads.
3. Average latency Vs Update Operations:
The above graph shows average latency against update operations for HBase
and Cassandra under Workload A. For update
operations, the average latency of Cassandra is initially higher than that of HBase.
As the workload increases, Cassandra's average latency decreases slightly,
whereas for HBase the average latency rises from workload 2 (i.e. 250000
operations) onwards and goes above Cassandra's average
latency. Thus, we can say that at high workloads the average update latency
of HBase is higher than that of Cassandra.
4. Load Latency Read Vs Load Operations:
The above graph compares HBase and Cassandra on the basis of average load
latency and load operations for Workload A. In terms of average latency, it
can be seen that the average load latency for HBase decreases as the
workload increases. Cassandra, on the other hand, shows consistency
throughout the workloads.
5. Load Overall Throughput Vs Load Operations-Run
Throughput measures the amount of work passing through a system in a given
time. From the below comparison graph of HBase and Cassandra it can be seen
that the number of operations performed per second in HBase is higher than
in Cassandra. Cassandra's throughput improves as the workload increases, but
it does not overtake the HBase throughput. So, HBase is more efficient than
Cassandra in terms of operations performed per second.
For Workload B
The HBase and Cassandra workloads are compared below on average latency,
read operations, throughput and number of operations (from the run files),
and on overall load throughput, average load latency and load operations
(from the load files). Workloads 1, 2, 3 and 4 correspond to 100000, 250000,
350000 and 500000 operations respectively.
6. Average Latency Vs Record Read Operations for Workload B
The above graph shows the average read latency and read operations. We can
clearly see that the average latency is consistent for Cassandra, as in
Workload A, whereas HBase shows a steep increase after workload 2. Thus we
can say that for Workload B the average latency increases with the number of
read operations for HBase, whereas it remains constant for Cassandra.
7. Overall Throughput Vs Operations for Workload B
The graph below shows the overall throughput against the operations
performed for workload B. It can be seen that for HBase the throughput (i.e.
operations performed per second) decreases as the number of operations
increases, whereas Cassandra's performance changes little: its operations
per second increase slightly up to workload 3 and decrease slightly after
that.
8. Avg. Latency Vs Update Operations for Workload B
The above graph evaluates the performance of HBase and Cassandra on the
basis of average latency and number of update operations for workload B. As
seen earlier, the average latency of Cassandra is consistent throughout the
operations. HBase's average latency, on the other hand, is a little higher
than Cassandra's at low operation counts, rises as the load increases, and
keeps increasing from workload 3 onwards. Thus we can say that Cassandra is
consistent throughout the workloads, while for HBase the average latency
increases with the workload. Hence, HBase performs better at low workloads
than at high ones.
9. Average Latency Vs Number of Operations
The above graph compares HBase and Cassandra on the basis of average load
latency and load operations for Workload B. In terms of average latency, it
can be seen that the average load latency for HBase decreases as the
workload increases. Cassandra, on the other hand, shows consistency
throughout the workloads. Thus we can say that Cassandra is more stable in
handling operations ranging from small to large workloads, whereas HBase is
good with the large workloads.
10. Overall Load Throughput Vs Load Operations for Workload B
The above graph shows the overall throughput against the number of
operations for HBase and Cassandra for workload B. It can be seen that HBase
throughput is higher than Cassandra's, and both follow a similar trend as
the load increases. Moreover, the results for workload B follow the same
pattern as those for workload A. As the number of operations increases,
HBase throughput remains higher than Cassandra's, but it declines after
workload 2; Cassandra's throughput, by contrast, increases slightly after
workload 3. Hence, based on the graph, we can conclude that HBase throughput
is higher at low workloads but drops slightly at higher workloads.
Conclusion and Discussion
In conclusion, the experiment comparing the performance of HBase and
Cassandra was carried out successfully on the basis of average latency, read
operations, throughput and number of operations from the run files, and
overall load throughput, average load latency and load operations from the
load files. The architecture of both databases and their key characteristics
were examined, and both were tested with Workload A and Workload B; the
output has been presented in this paper.

From the above examination, it can be seen that Cassandra is consistent
throughout the tests. For load operations, Cassandra's load throughput holds
up well, which means Cassandra copes well with loading data as the workload
increases. Its availability and average latency have also been consistent
throughout the examination. This is in line with Cassandra's model of
trading consistency for availability and latency, as framed by the CAP
theorem (Featherston, 2010).

On the other hand, the experiment shows that HBase handles higher workloads
efficiently on some parameters. Its throughput (operations per second) is
good for small operation counts but decreases with higher workloads. It
would therefore not be accurate to say that HBase is efficient only at high
workloads, although for some parameters (update records and average latency)
the Workload A findings show HBase working well at high workloads. In
contrast, for the average latency of Workload B, HBase performs worse than
Cassandra. Previous studies suggest that HBase performance generally
improves, but our examination of Workload A and Workload B shows that HBase
does not perform as expected as the workloads increase.

From the overall evaluation we can say that Cassandra is predictable, which
is a requirement of any organisation, and performed much better than HBase.
HBase, by contrast, showed high inconsistency throughout the analysis. This
study can be extended to higher workloads to check how these two NoSQL
databases perform and which one is more suitable for big organisations.
References
Brian F. Cooper, A. S. (n.d.). Benchmarking Cloud Serving Systems with YCSB. ACM, 143-154. Retrieved from https://www2.cs.duke.edu/courses/fall13/cps296.4/838-CloudPapers/ycsb.pdf
Chang, F. a. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 4.
Featherston, D. (2010). Cassandra: Principles and Application. unitn, 17. Retrieved from http://disi.unitn.it/~montreso/ds/papers/Cassandra.pdf
Guru99. (2019). Cassandra Architecture & Replication Factor Strategy. Retrieved from www.guru99.com: https://www.guru99.com/cassandra-architecture.html
Kumar, G. (2018). Exploring the Different Types of NoSQL Databases Part II. Retrieved from 3pillarglobal: https://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases
Lakshman, A. a. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 35-40.
Mary, R. K. (2017). Comparative Performance Analysis of various NoSQL Databases: MongoDB, Cassandra and HBase on Yahoo Cloud Server. Imperial Journal of Interdisciplinary Research (IJIR), 5.
Mehra, A. (2015, June 06). Introduction to Apache Cassandra's Architecture. Retrieved from dzone.com: https://dzone.com/articles/introduction-apache-cassandras
Patel, H. (2017). HBase: A NoSQL Database. researchgate, 15. doi:10.13140/RG.2.2.22974.28480
Pethuru Raj, G. C. (2014). Handbook of research on cloud infrastructures for big data analytics. IGI Global.
Pfeil, M. (2010, Oct 29). Why does Scalability matter, and how does Cassandra scale? Retrieved from Datastax: https://www.datastax.com/dev/blog/why-does-scalability-matter-and-how-does-cassandra-scale
Sinha, S. (2019, Feb). HBase Architecture: HBase Data Model & HBase Read/Write Mechanism. Retrieved from www.edureka.com: https://www.edureka.co/blog/hbase-architecture/
Slee, M. a. (2007). Thrift: Scalable cross-language services implementation. Facebook White Paper, 8.
Team, D. (2018, June). HBase Architecture – Regions, Hmaster, Zookeeper. Retrieved from data-flair.training: https://data-flair.training/blogs/hbase-architecture/
TechTarget. (2017, March). NoSQL (Not Only SQL database). Retrieved from searchdatamanagement.techtarget.com: https://searchdatamanagement.techtarget.com/definition/NoSQL-Not-Only-SQL
Wikimedia Foundation, I. (2019, March). Apache Cassandra. Retrieved from en.wikipedia.org: https://en.wikipedia.org/wiki/Apache_Cassandra
Wikipedia, t. f. (2019, Mar 19). Apache HBase. Retrieved from en.wikipedia.org: https://en.wikipedia.org/wiki/Apache_HBase
Williamson, J. (2018). The 4 V's of Big Data. Retrieved from Dummies: https://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data/

 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Comparison of HBase and Cassandra for Data Storage

  • 1. Data Storage and Management Project on Comparison of HBase and Cassandra Shrikant Uday Samarth X18129137 MSc in Data Analytics {2019-2020} Submitted to: Muhammad Iqbal
  • 2. Contents
    • Abstract
    • Introduction
    • Key Characteristic: HBase, Cassandra
    • Architecture: HBase Architecture, Cassandra Architecture
    • Comparison - Hbase and Cassandra: Scalability, Availability, Reliability, Transaction Management
    • Learning from Literature Survey
    • Performance Test Plan
    • Evaluation and Results: Workload Result
    • Conclusion and Discussion
    • References
  • 3. Comparison of HBase and Cassandra
Shrikant Uday Samarth, X18129137
Abstract
In the current era of heavy internet usage, a huge amount of data is generated by sources such as the banking sector, government organizations, IoT devices and web applications. The volume of generated data now exceeds petabytes, which gave rise to the term 'Big Data'. 'Big Data' describes innovative techniques and technologies to capture, store, distribute, manage and analyse petabyte-scale or larger datasets that arrive at high speed and in diverse structures. The complexity of Big Data calls for new techniques, algorithms and architectures. A traditional relational database such as MySQL cannot process, manage and analyse such an enormous amount of data. Companies are therefore moving towards platforms such as Spark and Hadoop, which offer quick execution, real-time analytics, and parallel and distributed computing in a cost-effective manner. HBase, which runs on Hadoop, and Cassandra are two popular NoSQL databases. In this project, we study the performance of HBase and Cassandra using various operations performed on Ubuntu.
Introduction
In recent years, a huge amount of data has been generated as internet usage has increased, and the volume now exceeds petabytes. Huge datasets, or collections of datasets, are referred to as Big Data when their size (volume), complexity (variability) and rate of growth (velocity) make them hard to capture, manage, process or analyse with traditional technologies.
A few years ago, organizations could manage their data with a relational database management system (RDBMS). As social media and the internet grew, it became nearly impossible for these conventional databases to handle such huge volumes of data, resulting in poor data transfer rates, slow processing and low scalability. Because of these drawbacks, the cost of data started to rise and it became difficult for organizations to keep up. NoSQL databases then started gaining popularity because they are more scalable and cost effective, and provide quick execution, real-time analytics, and parallel and distributed computing.
  • 4. NoSQL can handle huge amounts of structured, semi-structured and unstructured data. Moreover, NoSQL does not need to follow a fixed relational schema, which helps organizations narrow down their operational goals (TechTarget, 2017). The types of NoSQL databases are as follows:
    • Key-Value Store: a large hash table of keys and values (example: Riak).
    • Document-based Store: stores documents made up of tagged elements.
    • Column-based Store: each storage block contains data from only one column.
    • Graph-based: a network database that uses edges and nodes to represent and store data (KUMAR, 2018).
Despite the tremendous surge in the popularity of NoSQL data stores, there has always been doubt about the performance of NoSQL databases and about which database performs better than another. In this project we compare two such popular databases, HBase and Cassandra, on various parameters.
Key Characteristic:
HBase:
HBase is an open-source database written in Java, developed as part of the Apache Software Foundation's Apache Hadoop project (Wikipedia, 2019). The key characteristics of HBase are as follows:
    • HBase is modelled on Google's BigTable.
    • It is a NoSQL data store built on top of the HDFS file system.
    • It can use the MapReduce computational framework to retrieve and store data.
    • It supports continuous updates as fresh data streams in.
    • It is column-family oriented.
    • HBase uses a master-slave design that reduces the risk of a single point of failure.
    • The ZooKeeper service handles service discovery in HBase and coordinates the HMaster and the region servers.
    • It is used to process huge amounts of data.
    • Various operations can be performed on the data inside an HBase database, and the data can be modified there.
  • 5. Cassandra:
Cassandra is also an efficient NoSQL database. It was developed at Facebook and then released as a free open-source project on Google Code. Later it was supported by the Apache Software Foundation (Wikimedia Foundation, 2019). The key characteristics of Cassandra are as follows:
    • Cassandra is a Java-based system that can be managed and monitored through JMX.
    • Cassandra is a free, open-source NoSQL database that stores values as key-value pairs.
    • Cassandra works on the principle of clustering with masterless replication, so a query sent to a Cassandra cluster can be served by any one of the nodes holding the data. It can handle huge amounts of data, which makes it a highly robust database. Cassandra is fault tolerant.
    • It offers two types of consistency: eventual and strong. With eventual consistency, the data becomes consistent after some delay. Any conflicts that arise are resolved, because write operations are given priority.
    • Cassandra runs on a number of nodes and provides high fault tolerance, keeping data in sync across them. It follows a distributed design in which all nodes communicate with one another.
    • A bloom filter is an extremely fast way to test whether an element is present in a set. It can tell that an item may exist in the set or that it definitely does not exist in the set.
    • An SSTable is a sorted, immutable key-value map. It is essentially an efficient way of storing large sorted data segments in a file.
    • Data distribution in Cassandra is configurable per node and uses replication and replication strategies to decide how data is replicated across data centres, racks and nodes (Mehra, 2015).
Architecture
HBase Architecture:
HBase is a column-oriented NoSQL data store built on top of the HDFS file system. Its advantage over HDFS is that HBase uses a master-slave design that reduces the risk of a single point of failure.
One of the interesting capabilities of HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large. In HDFS there is only one master, and the data cannot be recovered from the slaves if the master fails. HBase removes that limitation because it can have several HMasters, although only one can be active at a time (the HMaster plays a role similar to the NameNode in HDFS) (Sinha, 2019).
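The auto-sharding idea described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HBase code: the `Table` class, the `max_rows_per_region` threshold and the middle-key split rule are all hypothetical names and choices made for illustration. It only shows how a table can be kept as sorted key ranges that split once they grow too large.

```python
from bisect import bisect_right


class Table:
    """Toy model of a table kept as key-range regions (hypothetical, not HBase code)."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region  # assumed split threshold
        self.start_keys = [""]               # sorted start key of each region
        self.regions = [{}]                  # one dict of rows per region

    def _region_index(self, row_key):
        # The owning region is the last one whose start key <= row_key.
        return bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._region_index(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Split the oversized region at its middle row key.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        old = self.regions[i]
        self.regions[i] = {k: v for k, v in old.items() if k < mid}
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, {k: v for k, v in old.items() if k >= mid})

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)


table = Table(max_rows_per_region=4)
for n in range(10):
    table.put(f"row{n:02d}", n)
print(table.start_keys)  # start key of every region after the automatic splits
```

In a real cluster the resulting regions would be assigned to different region servers; the sketch keeps them in one process so that only the splitting and lookup logic is visible.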
  • 6. ZooKeeper, the HMaster and the region servers, along with HDFS as the underlying storage for HBase, are the three fundamental components of HBase, as shown in the diagram below:
Region Servers: Region servers serve data for reads and writes, which means clients can communicate directly with HBase region servers when accessing data. The HBase Master process handles region assignment as well as DDL operations (creating and deleting tables). A region consists of all the rows between the start key and the end key assigned to that region. The nodes of the HBase cluster to which regions are assigned are called region servers (TEAM, 2018). A region server manages the regions of HBase tables assigned to it. Every region in an HBase table may have several column families. Each column family of a region is kept in a Store. The MemStore holds in-memory modifications to the Store. StoreFiles (HFiles) are the files where the actual data is stored as key-value pairs of columns and their values (Pethuru Raj, 2014).
HMaster: The HBase master is in charge of region assignment as well as DDL operations (creating and deleting tables). The HMaster coordinates with the region servers and keeps track of administrative functions. It also monitors all region server instances in the HBase cluster. The master assigns regions at start-up and reassigns them for recovery or load balancing. It also acts as the interface for creating, deleting and updating tables in HBase.
Zookeeper: It is a centralized service that manages the configuration of HBase, coordinates the processes between HBase clients and the HMaster, and handles distributed synchronization when several HBase clients are connected to HBase and accessing shared resources.
  • 7. Cassandra Architecture:
Cassandra is designed to handle very large amounts of data on a distributed architecture. Its main feature is storing data on multiple nodes with no single point of failure. Cassandra works on the assumption that hardware can fail and any node can go down at any time. To cope with this, the stored data is served from another node. Cassandra stores data on multiple nodes in a peer-to-peer distributed architecture. The gossip protocol is used for communication between nodes (Guru99, 2019). The architecture of Cassandra is shown below:
Node: The basic component of Cassandra, where data is stored in the cluster. In case of node failure, other nodes take its place.
Datacenter: The centralized location that houses computer and networking systems to help meet an organization's information technology needs.
Cluster: The collection of all data centres. All nodes participating in a cluster have the same cluster name. Seed nodes are used during start-up to help discover all participating nodes. Seed nodes have no special purpose other than helping bootstrap the cluster using the gossip protocol. When a node starts up, it looks at its seed list to obtain information about the other nodes in the cluster.
Commit-Log: Every write operation is written to the commit log. The commit log is used for crash recovery.
Mem-table: A memtable is a write-back cache residing in memory that has not been flushed to disk yet. When the content held in the memtable reaches a threshold value, it is flushed to a disk file called an SSTable.
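The commit log, memtable and SSTable steps described above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not Cassandra's implementation: the `Node` class, the `flush_threshold` parameter and the in-memory lists standing in for on-disk files are hypothetical.

```python
class Node:
    """Toy write path: commit log -> memtable -> SSTable (hypothetical sketch)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumed memtable size limit
        self.commit_log = []                    # stands in for the on-disk log
        self.memtable = {}                      # in-memory write-back cache
        self.sstables = []                      # immutable sorted tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. log the write for crash recovery
        self.memtable[key] = value              # 2. apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. Persist the memtable as a sorted, immutable table, then reset.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []                    # flushed writes no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # the newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def recover(self):
        # After a crash the memtable is lost; rebuild it from the commit log.
        self.memtable = dict(self.commit_log)


node = Node(flush_threshold=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    node.write(key, value)
print(len(node.sstables), node.read("a"), node.read("b"))
```

The design point the sketch tries to show is that a write is acknowledged after two cheap steps (an append and an in-memory update), while the expensive sorted write to disk happens later in bulk.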
  • 8. Data Replication: Replication copies data across nodes so that a failure does not cause data loss. Partitioning data on a shared-nothing system results in a single point of failure: if one of the nodes goes down, the data it holds becomes unavailable. This limitation is overcome by creating copies of the data, known as replicas. Replication of data ensures fault tolerance and reliability.
Comparison - Hbase and Cassandra
NoSQL databases generally follow the CAP theorem. CAP stands for Consistency, Availability and Partition tolerance. The important point to note is that a database can achieve only two of the three at a time.
Scalability:
The scalability of a database is the ability to add computational resources to it in order to gain throughput. There are two kinds of scalability: horizontal and vertical. Vertical scaling means moving from one machine to another that has more RAM or CPU. This approach is expensive. When managing a lot of data, moving to a new storage infrastructure is necessary. In terms of resource commitment, if uptime must be maintained, significant operational planning and effort are normally required to migrate to the new system. If the volume of data is large, the physical transfer from the old system to the new one can take a long time, depending on the load. A system is horizontally scalable if hardware can be added incrementally: when more capacity is needed, extra hardware is added. Ideally, each added machine provides a linear increase in available capacity without reconfiguration or downtime for the existing nodes.
Apache Cassandra meets the requirements of an ideal horizontally scalable system by allowing seamless addition of nodes. As you need more capacity, you add nodes to the cluster and the cluster uses the new resources automatically. HBase is also highly scalable: as the data in the database grows, it is distributed horizontally across the tables. This follows from HBase being based on Google's BigTable. Horizontal scalability in HBase is achieved through the region servers, which act as the slaves in the cluster. HBase also offers strong row-level consistency, and "coprocessors" that provide the equivalent of triggers and stored procedures (PFEIL, 2010).
Availability:
Availability means that every request receives a response indicating success or failure. Achieving availability in a distributed system requires that the system remain operational 100% of the time. Every client gets a response, regardless of the state of any individual node in the system. This metric is trivial to measure: either you can submit read or write commands, or you cannot. Consequently, the databases are time independent, as the nodes need to be available online at all times.
  • 9. HBase uses a master-slave relationship. One of the interesting capabilities of HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large. In HDFS there is only one master, and the data cannot be recovered from the slaves if the master fails. HBase removes that limitation because it can have several HMasters, although only one can be active at a time (the HMaster plays a role similar to the NameNode in HDFS) (Sinha, 2019). In HBase, a copy of incoming data is kept on a secondary node, so if the master node fails the secondary node is available to act as the master. The data nodes only process the data, and the failure of one of them is not a big concern, because synchronization, scheduling and communication between the other nodes make data retrieval easy.
Cassandra has many nodes, and even if a client is working with one node, the data is by default kept in three replicas. One replica is in the same cluster and a second copy is in another cluster, so in case of node or cluster failure the data is still available from another rack or node. Keeping a copy on another node of the same rack reduces communication overhead and therefore cost. Hence, the Cassandra system stays operational all the time.
Reliability:
The reliability of a database is marked by how consistently it performs as expected and according to its defined specification. If the system conditions change or a fault occurs and the database still shows the same or better performance, we can say that the database is highly reliable.
In HBase, reliability is ensured by ZooKeeper, a centralized service that manages the configuration of HBase, coordinates the processes between HBase clients and the HMaster, and handles distributed synchronization when several HBase clients are connected to HBase and accessing shared resources. Hence, HBase is up to the mark on all parameters of consistency.
Cassandra is designed on a distributed architecture. Its main feature is storing data on multiple nodes with no single point of failure. Cassandra works on the assumption that hardware can fail and any node can go down at any time. To cope with this, the stored data is served from another node. Cassandra stores data on multiple nodes in a peer-to-peer distributed architecture. The gossip protocol is used for communication between nodes. Hence, Cassandra is highly consistent.
Transaction Management:
HBase provides atomicity of mutations (puts/writes) on a per-row basis, even if the put operation spans multiple column families. However, no transactional guarantee is given for mutations across rows.
  • 10. In HBase, although the data is stored on HDFS, writes always go through a set of servers called region servers. Each table is divided into key-space partitions called regions, and a region server is responsible for serving traffic for a subset of regions. When a write occurs for a particular row in a region served by a region server, that region server takes a row lock across all column families for that row and blocks any other concurrent writes to that row. It then persists the write operation to a WAL (write-ahead log) on HDFS. Only after that is the put operation applied to every column family involved in the Put. This is how HBase guarantees row-level atomicity of write operations.
In Cassandra, write and update operations on a row, even when there are two or more of them, are treated as one single write operation. This means the write operation is atomic at the partition level. Atomicity is enforced at the row level. Cassandra writes the data and creates replicas on the nodes, then waits for acknowledgement from the nodes. Whenever changes are made to a column, Cassandra uses timestamps to determine the latest change. For isolation, Cassandra operates at the full row level. Once the write has been performed on the node, access is given to the client. Durability means that once a write is complete it persists even if the server crashes. If the server crashes before the memtable is flushed to disk, the commit log is replayed when the node reboots to recover the lost transactions.
Learning from Literature Survey
Big Data is characterized by the 3 Vs: Velocity, Variety and Volume. Fifteen years ago, the size of data was manageable. Data centres could store data in a structured format, and the data did not have to scale as much.
After the introduction of mobile phones and advances in internet technology, the amount of data has grown over the last 5 years. The data generated in the last 5 years was equivalent to the data generated in the 10 years before that. As the quantity of data increased, it also became clear that data is not only structured but also unstructured and semi-structured. The volume of data increases day by day; Thomson Reuters stated in its annual report that in 2010 the world's data amounted to more than 800 exabytes and counting. Over the last decade, the volume of data has increased with the intensive daily use of social media platforms. Beyond volume, data now also comes in different formats. Traditional data types were structured and, taking a banking scenario as an example, consisted of information such as time, date and amount. Today, however, data comes in all formats such as image, reader, writer, movies and so on. This accounts for unstructured data, which brings variety to Big Data. Velocity is defined as the speed at which data is recorded in databases. The frequency with which data is registered determines the Velocity of data.
  • 11. The best example is live streaming on Amazon Prime or Netflix, which broadcasts shows and also records live feeds of comments as feedback on the shows; this feedback is used for sentiment analysis and decides the rating of the show. One more V that plays a part is Veracity, which means the authenticity of the data. Data from an unknown source may mislead the user with false information. This in turn produces unreliable patterns that give the data analyst incorrect results. So even though it is not one of the core Vs of Big Data, veracity plays an equally important role in Big Data analysis (Williamson, 2018).
Among the research available online is a technical report by Hiren Patel titled "HBase: A NoSQL Database". He describes HBase as a strong NoSQL storage system inspired by Google's BigTable. Each row in an HBase table is associated with a key, and this key is used to access the row. Its architecture is robust and capable of storing sparse data in real time. Its three main components, the master server, region server and ZooKeeper, are an essential part of the architecture and enable fast operations on data. In Patel's view, HBase has lost its grip on ever-increasing data and is not recommended for present Big Data operations. For some features it has the upper hand over MongoDB and Cassandra, but overall both perform better than HBase (Patel, 2017).
Cassandra is a distributed database that is highly scalable in terms of storage and throughput. Cassandra's design brings together the data model described in Google's Bigtable paper and the behaviour of Amazon Dynamo (Ref: Dynamo: Amazon's Highly Available Key-value Store, in Proceedings of the twenty-first ACM SIGOPS Symposium on Operating Systems Principles (2008)) (Chang, 2008).
Cassandra, along with its remote Thrift API (Ref: Thrift: Scalable Cross-Language Services Implementation, Facebook) (Slee, 8), was initially developed by Facebook as a data platform to build services such as Inbox Search that scale to serve hundreds of millions of users (Lakshman, 35-40). After it was released to the Apache Software Foundation Incubator in 2009, Cassandra was accepted as a top-level Apache project in March 2010 (Featherston, 2010).
HBase and Cassandra have been compared on the same grounds by several analysts, notably Rajith Kumar and Roseline Mary. In their comparison of NoSQL databases including Cassandra, HBase and MongoDB, they evaluated the databases with the Yahoo Cloud Serving Benchmark using different mixes of read and update workloads and found that Cassandra works better and faster than HBase (Mary, 2017). This motivates my research, which runs an experiment with different workload sizes on HBase and Cassandra to validate their performance.
  • 12. Performance Test Plan:
The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating the retrieval and maintenance capabilities of computer programs. It is commonly used to compare the relative performance of NoSQL database management systems. The original benchmark was developed by workers in the research division of Yahoo!, who released it in 2010. In this project we use the YCSB benchmarking tool to evaluate the performance of HBase and Cassandra. The figure below shows the architecture of the YCSB benchmarking tool.
The YCSB client generates the data to be loaded into the database and generates the operations that make up the workload. The workload executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database and to execute the workload. The threads also measure the latency and achieved throughput of their operations and report these measurements to the statistics module. Finally, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or a time series of the latencies.
Workload executor: The workload executor contains the different workload scenarios, which are collections of write, read and update operations (Brian F. Cooper).
The system configuration on which the tests were performed is as follows: Processor: 1.7 GHz dual-core Intel Core i5-4210U CPU, 2 cores; 12 GB RAM; Windows 8 operating system; system type: 64-bit operating system, x64-based processor.
  • 13.
    • Virtualization Tool: OpenStack Cloud Server.
    • HBase Virtual Machine: Ubuntu (64-bit) operating system, 4 GB RAM, 2-core virtual processor.
    • Cassandra Virtual Machine: Ubuntu (64-bit) operating system, 4 GB RAM, 2-core virtual processor.
    • Workload Parameters: Workload A and Workload B.
The databases selected for this project are HBase and Cassandra. After successful installation of HBase and Cassandra, the YCSB benchmarking tool was installed to evaluate the performance of the selected databases. The Testharness tool was used to run the various operations of the performance test. Testharness is an automation tool for running tests. The term refers to the test drivers and other supporting tools required to execute tests. It provides stubs and drivers, which are small programs that interact with the software under test.
Workload A: Update-heavy workload with a 50/50 mix of read and write operations.
Workload B: Read-mostly workload with a 95/5 mix of read and write operations.
The operation counts used for the performance evaluation are 100000, 250000, 350000 and 500000. Tests were performed for both Workload A and Workload B on the HBase and Cassandra databases.
Evaluation and Results:
For the HBase and Cassandra performance evaluation, we ran tests for Workload A and Workload B using the benchmarking tool (YCSB). The specifications are as follows:
Workload A: Update-heavy workload with 50% reads and 50% updates.
Workload B: Read-mostly workload with 95% reads and 5% updates.
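The two operation mixes can be pictured with a short Python sketch. It only imitates the idea of drawing each operation at random according to a read proportion; it is not YCSB itself, and the names `WORKLOADS`, `generate_operations` and `count_operations`, as well as the fixed random seed, are assumptions made for illustration.

```python
import random

# Read proportion of each workload, as stated in the text (A: 50%, B: 95%).
WORKLOADS = {"A": 0.50, "B": 0.95}


def generate_operations(workload, operation_count, seed=0):
    """Return a list of 'READ'/'UPDATE' labels drawn with the workload's read proportion."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same mix
    read_proportion = WORKLOADS[workload]
    return [
        "READ" if rng.random() < read_proportion else "UPDATE"
        for _ in range(operation_count)
    ]


def count_operations(operations):
    """Return (reads, updates) for a generated operation list."""
    reads = operations.count("READ")
    return reads, len(operations) - reads


for name in WORKLOADS:
    reads, updates = count_operations(generate_operations(name, 100000))
    print(f"Workload {name}: {reads} reads, {updates} updates")
```

Because each operation is drawn independently, the counts land close to, but not exactly on, the nominal split, which is in line with read and update counts such as 50237 and 49763 reported for a 100000-operation Workload A run.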
  • 14. Workload Result:
Below are the findings for Workload A for both HBase and Cassandra:
Database   Workload    Operations  Avg read latency  Read ops  Avg update latency  Update ops  Overall throughput  Load avg insert latency  Load overall throughput
HBase      Workload 1  100000      248.88            50237     304.18              49763       3000.57             355.02                   2449.06
HBase      Workload 2  250000      254.00            125120    294.21              124880      3527.78             359.82                   2600.59
HBase      Workload 3  350000      1163.17           174384    351.78              175616      1261.51             347.59                   2817.90
HBase      Workload 4  500000      2367.44           249951    550.20              250049      674.19              353.89                   2687.72
Cassandra  Workload 1  100000      445.95            37399     392.75              50090       2201.19             534.64                   1282.09
Cassandra  Workload 2  250000      436.84            125285    369.35              124715      2310.28             476.96                   1969.90
Cassandra  Workload 3  350000      444.61            127737    371.16              175480      2345.36             474.47                   1881.75
Cassandra  Workload 4  500000      509.43            249917    368.29              250083      2193.01             454.61                   2122.84
Below are the findings for Workload B for both HBase and Cassandra:
Database   Workload    Operations  Avg read latency  Record read  Avg update latency  Update ops  Overall throughput  Load avg insert latency  Load overall throughput
HBase      Workload 1  100000      225.89            95005        522.51              4995        3920.95             373.27                   2330.19
HBase      Workload 2  250000      328.28            237454       489.27              12546       2793.89             341.53                   2858.48
HBase      Workload 3  350000      464.38            332546       492.77              17454       2061.19             359.49                   2727.11
HBase      Workload 4  500000      2321.91           475098       769.06              24902       442.97              382.34                   2535.05
Cassandra  Workload 1  100000      428.17            60518        435.19              4983        2075.08             482.66                   1547.50
Cassandra  Workload 2  250000      408.05            237434       373.82              12566       2298.66             459.94                   2044.25
Cassandra  Workload 3  350000      401.77            214270       375.02              17491       2463.87             458.73                   1929.41
Cassandra  Workload 4  500000      401.39            328559       364.79              25237       2456.92             455.86                   2038.63
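To make the throughput trend in the result tables easier to read, the following Python sketch computes the relative change in overall run throughput between the smallest and the largest operation count. The figures are copied from the Workload A and Workload B results; the `THROUGHPUT` dictionary and the `percent_change` helper are hypothetical names, and the units are assumed to be operations per second as YCSB normally reports them, since the source does not state units.

```python
# Overall run throughput for 100000, 250000, 350000 and 500000 operations,
# copied from the Workload A and Workload B result tables.
THROUGHPUT = {
    ("HBase", "A"): [3000.57, 3527.78, 1261.51, 674.19],
    ("Cassandra", "A"): [2201.19, 2310.28, 2345.36, 2193.01],
    ("HBase", "B"): [3920.95, 2793.89, 2061.19, 442.97],
    ("Cassandra", "B"): [2075.08, 2298.66, 2463.87, 2456.92],
}


def percent_change(first, last):
    """Relative change from first to last, in percent."""
    return (last - first) / first * 100.0


for (database, workload), values in THROUGHPUT.items():
    change = percent_change(values[0], values[-1])
    print(f"{database} workload {workload}: {change:+.1f}% from smallest to largest run")
```

A strongly negative value indicates that throughput collapses as the operation count grows, while a value near zero indicates throughput that stays roughly level across run sizes.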
  • 15. For Workload A, HBase and Cassandra are compared below on average latency, read operations, throughput, and number of operations (from the run files), and on overall load throughput, average load latency, and load operations (from the load files). Workloads 1, 2, 3, and 4 correspond to 100000, 250000, 350000, and 500000 operations respectively.

1. Average Latency vs. Record Read Operations for Workload A: The figure compares HBase and Cassandra on read operations and average read latency. Cassandra's average read latency stays nearly constant as the number of reads grows. HBase's average read latency is low for the first two workloads, rises sharply at the third, and peaks at the fourth. HBase's read latency is therefore stable only at low read counts, whereas Cassandra's shows no major change across the range.

2. Overall Throughput vs. Operations for Workload A: Throughput is the number of operations completed per second. The Throughput vs. Operations graph shows that HBase behaves differently from Cassandra. At low workloads HBase's throughput is high, and it keeps increasing up to 250000 operations, but at higher workloads it plummets.
  • 16. Cassandra's throughput, by contrast, starts lower than HBase's but remains nearly constant across all four workloads, so it exceeds HBase's from workload 3 onward.

3. Average Latency vs. Update Operations:
  • 17. The graph shows average update latency against update operations for HBase and Cassandra under Workload A. For the first three workloads, Cassandra's average update latency is somewhat higher than HBase's. As the workload grows, however, Cassandra's latency declines slightly, whereas HBase's rises after workload 2 (250000 operations) and overtakes Cassandra's at workload 4. At high workloads, then, HBase's average update latency is higher than Cassandra's.

4. Average Load Latency vs. Load Operations: The graph compares HBase and Cassandra on average load (insert) latency and load operations for Workload A. HBase's load latency is lower and more consistent than Cassandra's, staying nearly constant across the workloads, while Cassandra's average load latency decreases as the workload increases.

5. Load Overall Throughput vs. Load Operations: Throughput here is the number of insert operations completed per second during the load phase. The comparison graph shows that HBase completes more operations per second than Cassandra. Cassandra's load throughput improves as the workload increases, but it does not overtake HBase's.
  • 18. HBase is therefore more efficient than Cassandra in operations performed per second during loading.
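The words "consistent" and "spike" in the Workload A discussion can be made more concrete with the coefficient of variation (standard deviation divided by mean) of the average read latency across the four workloads. This is a sketch added for illustration, not a statistic computed in the original study; the latency values are copied from the Workload A results, and `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# Average read latency for Workload A, workloads 1 to 4, from the results.
READ_LATENCY_A = {
    "HBase": [248.88, 254.00, 1163.17, 2367.44],
    "Cassandra": [445.95, 436.84, 444.61, 509.43],
}


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    A value near 0 means the metric barely changes between workloads;
    a value near 1 means the spread is as large as the average itself.
    """
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    for database, latencies in READ_LATENCY_A.items():
        cv = coefficient_of_variation(latencies)
        print(f"{database:10s} coefficient of variation = {cv:.3f}")
```

A much smaller coefficient for Cassandra than for HBase supports the reading that Cassandra's read latency is stable while HBase's grows with the workload. With only four data points per database this is a rough indicator rather than a statistical test.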
  • 19. For Workload B, HBase and Cassandra are compared below on average latency, read operations, throughput, and number of operations (from the run files), and on overall load throughput, average load latency, and load operations (from the load files). Workloads 1, 2, 3, and 4 correspond to 100000, 250000, 350000, and 500000 operations respectively.

6. Average Latency vs. Record Read Operations for Workload B: The graph shows average read latency against read operations. As in Workload A, Cassandra's average read latency is consistent, whereas HBase's rises steadily after workload 2 and jumps steeply at workload 4. For Workload B, then, HBase's average read latency increases with the number of read operations while Cassandra's remains nearly constant.

7. Overall Throughput vs. Operations for Workload B: The graph shows overall throughput against the number of operations for Workload B. HBase's throughput (operations performed per second) decreases as the number of operations increases. Cassandra's performance changes little: its throughput increases slightly up to workload 3 and decreases slightly after that.
  • 20. 8. Avg. Latency vs. Update Operations for Workload B: The graph evaluates HBase and Cassandra on average update latency and number of update operations for Workload B.
  • 21. As seen earlier, Cassandra's average latency is consistent across the operation counts. HBase's average update latency is already somewhat higher than Cassandra's at low operation counts, and it rises further as the load grows, most sharply after workload 3. Cassandra is thus consistent throughout the workloads, while HBase's average latency increases with the workload. Hence, HBase's update performance degrades at high workloads.

9. Average Load Latency vs. Number of Load Operations: The graph compares HBase and Cassandra on average load (insert) latency and load operations for Workload B. HBase's load latency is lower than Cassandra's and stays within a narrow range across the workloads, while Cassandra's average load latency decreases as the workload increases. Cassandra's load latency therefore improves steadily from small to large workloads, while HBase keeps a lower load latency even at the large workloads.
  • 22. 10. Overall Load Throughput vs. Load Operations for Workload B: The graph shows overall load throughput against the number of load operations for HBase and Cassandra under Workload B. HBase's load throughput is higher than Cassandra's at every workload, and both follow a similar pattern as the load increases. The Workload B results also follow the same pattern as those of Workload A. HBase's throughput is higher than Cassandra's as the operation count increases, but it declines after workload 2; Cassandra's throughput, in contrast, increases slightly after workload 3. Based on the graph, we conclude that HBase's load performance is higher at low workloads and drops slightly at higher workloads.
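Because throughput is operations per second, the run-phase throughput figures for Workload B also imply how long each run took: run time is approximately the operation count divided by the throughput. The sketch below works through that arithmetic using the overall (run-phase) throughput values from the Workload B results. The run times themselves are not reported in the study, so the output is a derived estimate, and `implied_runtime_seconds` is a hypothetical helper name.

```python
# Operation counts for workloads 1 to 4.
OPERATIONS = [100000, 250000, 350000, 500000]

# Run-phase overall throughput (operations per second) for Workload B.
RUN_THROUGHPUT_B = {
    "HBase": [3920.95, 2793.89, 2061.19, 442.97],
    "Cassandra": [2075.08, 2298.66, 2463.87, 2456.92],
}


def implied_runtime_seconds(operations, throughput):
    """Run time implied by throughput = operations / run time."""
    return operations / throughput


def runtime_table(operations, throughput_by_db):
    """Map each database to its implied run time per operation count."""
    return {
        database: [
            implied_runtime_seconds(count, rate)
            for count, rate in zip(operations, rates)
        ]
        for database, rates in throughput_by_db.items()
    }


if __name__ == "__main__":
    for database, times in runtime_table(OPERATIONS, RUN_THROUGHPUT_B).items():
        for count, seconds in zip(OPERATIONS, times):
            print(f"{database:10s} {count:>7d} operations  ~{seconds:8.1f} s")
```

Read this way, the throughput drop at the largest workload means the 500000-operation run takes several times longer on HBase than on Cassandra, even though HBase is faster at the smallest workload.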
  • 23. Conclusion and Discussion

In conclusion, the experiment to compare the performance of HBase and Cassandra was carried out successfully, on the basis of average latency, read operations, throughput, and number of operations from the run files, and overall load throughput, average load latency, and load operations from the load files. The architecture of both databases and their key characteristics were also examined, and the results for Workload A and Workload B have been presented in this paper.

The results show that Cassandra is consistent throughout the tests. For load operations, Cassandra's load throughput holds up well, which means Cassandra copes well with loading data as the workload increases. Its average latency was also consistent throughout the tests. This is in line with Cassandra's design of trading consistency for availability and latency, as framed by the CAP theorem (Featherston, 2010).

HBase, on the other hand, handles higher workloads efficiently only on some parameters. Its throughput (operations per second) is good for small operation counts but decreases at higher workloads. It would not be correct to say that HBase is efficient only at high workloads, but the findings for Workload A do indicate some parameters (update operations and average latency) where HBase works well at high workloads. In contrast, for the average latency of Workload B, HBase's performance is lower than Cassandra's. Previous studies suggest that HBase's performance generally improves with workload, but in our tests with Workload A and Workload B HBase did not perform as expected as the workloads increased.
Overall, Cassandra was predictable, which is a key requirement for any organisation, and it performed considerably better than HBase, while HBase showed high variability throughout the analysis. This study can be extended to higher workloads to check how these two NoSQL databases perform and which one is more suitable for large organisations.
  • 24. References

Brian F. Cooper, A. S. (n.d.). Benchmarking Cloud Serving Systems with YCSB. ACM, 143-154. Retrieved from https://www2.cs.duke.edu/courses/fall13/cps296.4/838-CloudPapers/ycsb.pdf

Chang, F. a. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 4.

Featherston, D. (2010). Cassandra: Principles and Application. unitn, 17. Retrieved from http://disi.unitn.it/~montreso/ds/papers/Cassandra.pdf

Guru99. (2019). Cassandra Architecture & Replication Factor Strategy. Retrieved from www.guru99.com: https://www.guru99.com/cassandra-architecture.html

Kumar, G. (2018). Exploring the Different Types of NoSQL Databases Part II. Retrieved from 3pillarglobal: https://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases

Lakshman, A. a. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 35-40.

Mary, R. K. (2017). Comparative Performance Analysis of various NoSQL Databases: MongoDB, Cassandra and HBase on Yahoo Cloud Server. Imperial Journal of Interdisciplinary Research (IJIR), 5.

Mehra, A. (2015, June 06). Introduction to Apache Cassandra's Architecture. Retrieved from dzone.com: https://dzone.com/articles/introduction-apache-cassandras

Patel, H. (2017). HBase: A NoSQL Database. 10.13140/RG.2.2.22974.28480. researchgate, 15.

Pethuru Raj, G. C. (2014). Handbook of research on cloud infrastructures for big data analytics. IGI Global.

Pfeil, M. (2010, Oct 29). Why does Scalability matter, and how does Cassandra scale? Retrieved from Datastax: https://www.datastax.com/dev/blog/why-does-scalability-matter-and-how-does-cassandra-scale

Sinha, S. (2019, Feb). HBase Architecture: HBase Data Model & HBase Read/Write Mechanism. Retrieved from www.edureka.com: https://www.edureka.co/blog/hbase-architecture/

Slee, M. a. (2007). Thrift: Scalable cross-language services implementation. Facebook White Paper, 8.

Team, D. (2018, June). HBase Architecture – Regions, Hmaster, Zookeeper. Retrieved from data-flair.training: https://data-flair.training/blogs/hbase-architecture/

TechTarget. (2017, March). NoSQL (Not Only SQL database). Retrieved from searchdatamanagement.techtarget.com: https://searchdatamanagement.techtarget.com/definition/NoSQL-Not-Only-SQL

Wikimedia Foundation, I. (2019, March). Apache Cassandra. Retrieved from en.wikipedia.org: https://en.wikipedia.org/wiki/Apache_Cassandra

Wikipedia, t. f. (2019, Mar 19). Apache HBase. Retrieved from en.wikipedia.org: https://en.wikipedia.org/wiki/Apache_HBase

  • 25. Williamson, J. (2018). The 4 V's of Big Data. Retrieved from Dummies: https://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data/