Devise and implement a test strategy in order to perform a comparative analysis of the capabilities of two database management systems (Cassandra and HBase) in terms of performance.
Approach: Instances of the two data storage and management systems were installed and configured. The Yahoo! Cloud Serving Benchmark (YCSB) was used to compare the performance of HBase and Cassandra, with average latency and throughput as the comparison metrics. The results obtained from YCSB were then analyzed and visualized in Tableau.
Findings: HBase inserts, reads and updates records faster than Cassandra, but only when the operation count is low. At heavier loads, Cassandra performs better than HBase.
Tools: HBase, Cassandra, Hadoop, Tableau, YCSB
Comparison of HBase and Cassandra for Data Storage
Data Storage and Management
Project on Comparison of HBase and Cassandra
Shrikant Uday Samarth
X18129137
MSc in Data Analytics {2019-2020}
Submitted to: Muhammad Iqbal
Contents
Abstract
Introduction
Key Characteristics:
HBase:
Cassandra:
Architecture
HBase Architecture:
Cassandra Architecture:
Comparison - HBase and Cassandra
Scalability:
Availability:
Reliability:
Transaction Management:
Learning from Literature Survey
Performance Test Plan:
Evaluation and Results:
Workload Result:
Conclusion and Discussion
References
Abstract
In the current era of heavy internet usage, huge amounts of data are
being generated from sources such as the banking sector, government
organizations, IoT devices and web applications. The volume of
generated data now exceeds petabytes, which gave rise to the term
‘Big Data’. The term describes innovative techniques and technologies
to capture, store, distribute, manage and analyse petabyte-scale or
larger datasets that arrive at high velocity and in diverse
structures. The complexity of Big Data requires new techniques,
algorithms and architectures. Traditional relational databases such
as MySQL are incapable of processing, managing and analysing such
enormous amounts of data. Companies are therefore moving towards
platforms such as Spark and Hadoop, which offer advantages like quick
execution, real-time analytics, and parallel and distributed computing
in a cost-effective manner. Two popular NoSQL databases are HBase,
which runs on top of Hadoop, and Cassandra. In this project, we
studied the performance of both databases using various operations
performed on Ubuntu.
Introduction
In recent years, a huge amount of data has been generated as internet usage
has increased, and data volumes now exceed the petabyte scale. Big Data refers
to huge datasets, or collections of datasets, whose size (volume), complexity
(variability) and rate of growth (velocity) make them hard to capture, manage,
process or analyse with traditional technologies. A few years ago,
organizations were able to handle their data with a relational database
management system (RDBMS). As social media and the internet grew, it became
nearly impossible for these conventional databases to handle such enormous
volumes of data, resulting in poor data transfer rates, slow processing and
low scalability. Because of these disadvantages, the cost of data started to
rise and it became difficult for organizations to keep up. NoSQL databases
then started to become popular because they are more scalable, provide quick
execution, real-time analytics, and parallel and distributed computing, and
are cost effective. NoSQL can handle huge amounts of structured,
semi-structured and unstructured data. Moreover, NoSQL does not need to follow
an established relational schema, which helps organizations to narrow down
their operational goals (TechTarget, 2017).
Types of NoSQL databases are as follows:
• Key-Value Store - Holds a large hash table of keys and values
(example: Riak).
• Document-based Store - Stores documents made up of tagged
elements.
• Column-based Store - Each storage block contains data from only
one column.
• Graph-based - A network database that uses edges and nodes to
represent and store data (KUMAR, 2018).
Although NoSQL data stores have surged in popularity, there have
always been doubts about the performance of NoSQL databases and about
which database is better suited to which use case. In this project we
compare two such popular databases, HBase and Cassandra, on various
parameters.
Key Characteristics:
HBase:
HBase is an open-source database written in Java and developed as part of
the Apache Software Foundation’s Apache Hadoop project (Wikipedia,
2019). Key characteristics of HBase are as follows:
• HBase is modelled on Google’s BigTable.
• It is a NoSQL data store built on top of the HDFS file system.
• It can use the MapReduce computational framework to retrieve
and store data.
• It supports continuous updates as fresh data streams in.
• It is column-family oriented.
• HBase uses a master-slave design that reduces the risk of a single
point of failure.
• The ZooKeeper service handles service discovery in HBase and
coordinates the HMaster and the region servers.
• It is used to process huge amounts of data.
• Various operations can be performed on the data, and the data can be
modified, inside the HBase database.
Cassandra:
Cassandra is another efficient NoSQL database. It was originally developed
at Facebook, then released as a free open-source project on Google Code, and
was later adopted by the Apache Software Foundation (Wikimedia Foundation,
2019). Key characteristics of Cassandra are as follows:
• Cassandra is written in Java and can be managed and monitored through
JMX.
• Cassandra is a free, open-source NoSQL database that stores values in
the form of key-value pairs.
• Cassandra works on the principle of clustering with masterless
replication, so a query sent to a Cassandra cluster can be served by any
of the nodes that hold the data. It can handle huge amounts of data and
is fault tolerant, which makes it a highly robust database.
• It offers two types of consistency: eventual and strong. With eventual
consistency the data becomes consistent, but after some delay. Any
conflicts that arise are settled afterwards, because write performance is
given priority.
• Cassandra runs on a number of nodes and provides high fault tolerance
while keeping data in sync across them. It follows a distributed design
in which all nodes communicate with one another.
• A Bloom filter is an extremely fast way to test whether an element is a
member of a set. A Bloom filter can tell whether an item may exist in the
set or definitely does not exist in the set.
• An SSTable (Sorted String Table) is an ordered, immutable key-value map.
It is essentially an efficient way of storing large sorted data segments
in a file.
• Data distribution in Cassandra is configurable: a replication factor and
a replication strategy determine how data is replicated across data
centers, racks and nodes (Mehra, 2015).
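The Bloom filter behaviour mentioned in the list above ("may exist" versus "definitely does not exist") can be illustrated with a minimal sketch. This is not Cassandra's implementation; the class name, bit-array size and hash scheme are assumptions chosen only for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing are assumptions, not Cassandra's."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A key that was added always reports as possibly present, while a filter that has never seen any key reports every key as definitely absent. False positives become more likely as more keys are added to a fixed-size filter.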
Architecture
HBase Architecture:
HBase is a column-oriented NoSQL data store built on top of the HDFS file
system. Its advantage over plain HDFS is that HBase uses a master-slave
design that reduces the risk of a single point of failure. One of the
interesting capabilities of HBase is auto-sharding, which simply means that
tables are dynamically distributed by the system when they become too large.
In HDFS there is only one master, and the data cannot be recovered from the
slaves if the master fails. HBase removes that condition: a cluster can have
several HMasters, but only one is active at a time (Sinha, 2019).
ZooKeeper, HMaster and the region servers are the three fundamental
components of HBase, with HDFS as the underlying storage, as shown in the
diagram below:
Region Servers:
Region servers serve data for reads and writes. This means clients
communicate directly with HBase region servers when accessing data. The
HBase Master process, in turn, handles region assignment as well as DDL
(create, delete tables) operations.
A region consists of all the rows between the start key and the end key
assigned to that region. The nodes in the HBase cluster to which regions are
assigned are called "Region Servers" (TEAM, 2018). A region server manages
the regions of a table that are assigned to it. Every region of an HBase
table may have several column families, and each column family of the region
is kept in a Store. The MemStore holds the in-memory modifications to the
Store. StoreFiles (HFiles) are the files where the actual data is stored as
key-value pairs of columns and their values (Pethuru Raj, 2014).
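Because each region covers a contiguous range of row keys, finding the region for a given row is a search over sorted start keys. The sketch below shows that idea only; the start keys, server names and the function `locate_region` are hypothetical and are not part of the HBase API.

```python
import bisect

# Hypothetical region boundaries: region i covers [start_key[i], start_key[i+1]).
REGION_START_KEYS = ["", "g", "n", "t"]        # "" marks the first region
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # hypothetical server hosting each region


def locate_region(row_key):
    """Return (region_index, region_server) for the region containing row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]
```

For example, `locate_region("apple")` falls into the first region and `locate_region("zebra")` into the last one, so a client can go straight to the responsible server.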
HMaster:
The HBase master is in charge of region assignment as well as DDL (create,
delete tables) operations. HMaster coordinates with the region servers and
keeps track of admin functions. Moreover, the master monitors all region
server instances in the HBase cluster. The master assigns regions on
start-up and reassigns them for recovery or load balancing. It also acts as
the interface for creating, deleting and updating tables in HBase.
Zookeeper:
ZooKeeper is a centralized service that maintains the configuration of
HBase, coordinates the processes between the HBase clients and the HMaster,
and handles distributed synchronization when several HBase clients are
connected to HBase and accessing shared resources.
Cassandra Architecture:
Cassandra is designed to handle very large amounts of data on a distributed
architecture. Its main feature is storing data on multiple nodes with no
single point of failure. Cassandra works on the assumption that hardware
fails and any node can be down at any time. To cope with this, the stored
data is served from another node. Cassandra stores data on multiple nodes in
a peer-to-peer distributed architecture, and the nodes exchange information
using the gossip protocol (Guru99, 2019).
The architecture for Cassandra is shown below:
Node: This is the basic component of Cassandra, where data is stored in the
cluster. In case of node failure, other nodes take its place.
Datacenter: It is the centralized location that houses computing and
networking systems to help meet an organization's information technology
needs.
Cluster: It is the collection of all data centers. All nodes participating
in a cluster have the same cluster name. Seed nodes are used during start-up
to help discover all participating nodes. Seed nodes have no special purpose
other than helping to bootstrap the cluster using the gossip protocol. When
a node starts up, it looks to its seed list to obtain information about the
other nodes in the cluster.
Commit-Log: Every write operation is written to the commit log. The commit
log is used for crash recovery.
Mem-table: A memtable is a write-back cache residing in memory that has not
been flushed to disk yet. When the contents held in the memtable reach a
threshold value, they are flushed to a disk file called an SSTable.
Data Replication: Nodes hold replicas of the data so that a failure does not
cause data loss. Partitioning data on a shared-nothing system on its own
results in a single point of failure: if one of the nodes goes down, the
data it holds becomes unavailable. This limitation is overcome by creating
copies of the data, which are known as replicas. Replication of data ensures
fault tolerance and reliability.
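The interplay of commit log, memtable and SSTable described above can be sketched as a toy write path. This is a simplification under stated assumptions: the class `TinyStore`, its in-memory lists standing in for disk files, and the flush threshold are all hypothetical and do not reflect Cassandra's actual storage engine.

```python
class TinyStore:
    """Toy write path: commit log, then memtable, then flush to an SSTable."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical threshold
        self.commit_log = []  # stands in for the on-disk append-only log
        self.memtable = {}    # in-memory writes not yet flushed
        self.sstables = []    # each flush yields one immutable sorted table

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table, then clear it.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed entries no longer need replaying

    def recover(self):
        # After a crash the memtable is lost; rebuild it from the commit log.
        self.memtable = dict(self.commit_log)
```

Writes that have not yet been flushed survive a simulated crash because `recover` replays the commit log, which is the role the commit log plays in crash recovery.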
Comparison - HBase and Cassandra
Most NoSQL databases are designed around the CAP theorem. CAP stands for
Consistency, Availability and Partition tolerance. The essential point to note here
is that a database can achieve at most two of the three at a time.
Scalability:
The scalability of a database is the ability to add computational resources to the
database in order to gain throughput. There are two types of scalability:
horizontal and vertical. Vertical scaling means moving from one machine to another
that has more RAM or CPU. Scaling a database vertically is an expensive process.
When managing a lot of data, moving to a different storage infrastructure becomes
necessary. In terms of resource commitment, if there are any requirements for
maintaining uptime, significant operational planning and effort are normally
required to migrate to the new system. If the volume of data is large, the physical
transfer from the old system to the new one can take a lot of time, depending on
the load. A system is horizontally scalable if hardware can be added
incrementally: if more capacity is needed, extra hardware can be added. Ideally,
each added piece of hardware provides a linear increase in available capacity
without reconfiguration or downtime of the existing nodes.
Apache Cassandra meets the requirements of an ideal horizontally scalable system
by allowing seamless addition of nodes. As you need more capacity, you add nodes
to the cluster and the cluster uses the new resources automatically.
HBase is also highly scalable: as the data in the database grows, it is
distributed horizontally across the tables. This follows from HBase being based on
Google's BigTable. Horizontal scalability in HBase can be observed across the
region servers, which act as the slaves in the cluster. HBase also offers strong
row-level consistency, and "coprocessors" that provide the equivalent of triggers
and stored procedures (PFEIL, 2010).
Availability:
This property states that every request receives a response indicating success or
failure. Achieving availability in a distributed system requires that the system
remains operational 100% of the time. Every client gets a response, regardless of
the state of any individual node in the system. This metric is trivial to measure:
either you can submit read or write commands, or you cannot. Consequently, the
databases are time independent, as the nodes need to be available online at all
times.
HBase uses a master-slave relationship. One of the interesting capabilities of
HBase is auto-sharding, which simply means that tables are dynamically distributed
by the system when they become too large. In HDFS there is only one master, and
the data cannot be recovered from the slaves if the master fails. HBase removes
that condition: a cluster can have several HMasters, but only one is active at a
time (Sinha, 2019).
In HBase, a copy of whatever data comes in is kept on a secondary node. So, in
case of master node failure, a secondary node is available to act as the master.
The data nodes only process the data, and the failure of one of these nodes is not
a big concern, because synchronization, scheduling and communication between the
other nodes keep data retrieval possible.
Cassandra has many nodes, and data written through any one node is typically
stored as three replicas. One replica is in the same cluster and a second copy is
in another cluster, so in case of node or cluster failure the data is still
available from another rack or another node. Keeping a copy on another node of the
same rack reduces communication overhead and hence cost. Hence, the Cassandra
system stays operational all the time.
Reliability:
The reliability of a database is marked by how consistently it performs as
expected and according to its defined specification. When the system conditions
change or a fault occurs in the system and the database still shows the same or
even improved performance, we can say that the database is highly reliable.
In HBase, reliability is supported by ZooKeeper, a centralized service that
maintains the configuration of HBase, coordinates the processes between the HBase
clients and the HMaster, and handles distributed synchronization when several
HBase clients are connected to HBase and accessing shared resources. Hence, HBase
is up to the mark on all parameters of consistency.
Cassandra is designed on a distributed architecture. Its main feature is storing
data on multiple nodes with no single point of failure. Cassandra works on the
assumption that hardware fails and any node can be down at any time. To cope with
this, the stored data is served from another node. Cassandra stores data on
multiple nodes in a peer-to-peer distributed architecture, and the nodes exchange
information using the gossip protocol. Hence, Cassandra is highly consistent.
Transaction Management:
HBase provides atomicity of mutations (puts/writes) on a per-row basis, even if
the put operation spans several column families. However, no transactional
guarantee is given for changes across rows.
In HBase, although the data is stored on storage like HDFS, writes always go
through a set of servers called region servers. Each table is divided into
key-space partitions called regions, and a region server is in charge of serving
traffic for a subset of regions. When a write occurs for a specific row in a
region served by a region server, that region server takes a row lock across all
column families for that row and blocks any other concurrent writes to that row.
It then persists the write operation to a WAL (write-ahead log) on HDFS. Only
after that is the put applied to every column family involved in the Put. This is
how HBase ensures row-level atomicity of write operations.
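The write sequence described above (lock the row, append to the WAL, then apply the put to every column family) can be sketched in a few lines. This is an illustration under assumptions, not HBase code: the class `RegionSketch`, the in-memory list standing in for the WAL on HDFS, and the cell layout are all hypothetical.

```python
import threading
from collections import defaultdict


class RegionSketch:
    """Toy region: one lock per row, WAL append before the in-memory update."""

    def __init__(self):
        self.row_locks = defaultdict(threading.Lock)
        self.wal = []                  # stands in for the WAL on HDFS
        self.rows = defaultdict(dict)  # row key -> {(family, qualifier): value}

    def put(self, row_key, cells):
        """Apply all cells of one row atomically with respect to other puts."""
        with self.row_locks[row_key]:  # blocks concurrent writers to this row
            self.wal.append((row_key, dict(cells)))  # persist the intent first
            self.rows[row_key].update(cells)         # then apply to every family
```

A single `put` that touches two column families, such as `put("row1", {("cf1", "a"): 1, ("cf2", "b"): 2})`, is applied under one row lock, so another writer to the same row never observes only half of it. Writes to different rows use different locks, which mirrors the absence of cross-row guarantees.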
In Cassandra, inserting or updating columns in a row, even two or more of them, is
treated as one single write operation. This means that a write is atomic at the
partition level; atomicity is enforced at the row level. Cassandra writes the data
and creates replicas on the relevant nodes, then waits for acknowledgement from
those nodes. Whenever changes are made to a column, Cassandra uses the timestamp
to determine the most recent change. In terms of isolation, Cassandra performs
operations at the full row level. Once a write has been performed on the node, the
client is given access to it. Durability means that once a write is completed it
will survive even if the server crashes. If the server crashes before the memtable
is flushed to disk, the commit log is replayed when the node reboots to recover
the lost writes.
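The timestamp-based resolution mentioned above amounts to "the newest write wins" for each column. The sketch below shows that rule on plain dictionaries; the function `merge_rows` and the `(timestamp, value)` layout are assumptions for illustration, and ties are simply left with the first value seen, which is a simplification.

```python
def merge_rows(*replica_rows):
    """Merge replicas of a row column by column, keeping the newest write.

    Each replica row maps a column name to a (timestamp, value) pair.
    """
    merged = {}
    for row in replica_rows:
        for column, (timestamp, value) in row.items():
            if column not in merged or timestamp > merged[column][0]:
                merged[column] = (timestamp, value)
    return merged
```

If one replica holds a newer `name` and another holds a newer `city`, the merged row takes each column from whichever replica wrote it last, so the result can combine columns from several replicas.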
Learning from Literature Survey
Big Data is characterized by the 3Vs, namely Velocity, Variety and Volume. Fifteen
years ago, the size of data was manageable. Data centres could store data in a
structured format and the data did not need to scale. After the introduction of
mobile phones and advances in internet technology, the amount of data has grown
over the last 5 years. The data generated in the last 5 years was equivalent to
the data generated in the 10 years before them. As the quantity of data increased,
it also became clear that data comes not only in structured format but also in
unstructured and semi-structured formats.
The volume of data is increasing day by day. Thomson Reuters stated in its annual
report that in 2010 the world's data amounted to more than 800 exabytes and
counting. In the last decade, the volume of data has increased with the heavy
daily use of social media platforms. Beyond volume, data now arrives in different
formats as well. Traditional data types were structured, consisting of information
such as time, date and amount, to take a banking scenario as an example. Today we
have data in all formats, such as images, text, video and so on. This unstructured
data is what brings variety to Big Data.
Velocity is defined as the speed with which data is recorded in databases; the
frequency with which data is registered contributes to the velocity of data. A
good example is streaming on Amazon Prime or Netflix, which delivers shows and
also records live feeds of comments as feedback on the shows; these are used for
sentiment analysis and to determine the rating of a show.
One more V that plays its part here is Veracity, which means the trustworthiness
of the data. Data coming from an unknown source may be false and mislead its
users. This in turn produces unreliable patterns that give the data analyst
incorrect results. So even though it is not among the core Vs of Big Data,
veracity plays an equally important role in Big Data analysis (Williamson, 2018).
Among the research available online is a technical report by Hiren Patel titled
"HBase: A NoSQL Database". He describes HBase as a strong NoSQL storage system
inspired by Google’s BigTable. Each row in an HBase table is associated with a
key, and this key is used to access the row. Its architecture is robust and
capable of storing sparse data in real time. The three main components, the master
server, the region server and ZooKeeper, are an essential part of its architecture
and enable fast operations on data. In Patel's view, HBase has lost its grip on
ever-increasing data and is not recommended for present-day Big Data operations.
For some features it has the upper hand over MongoDB and Cassandra, but overall
both perform better than HBase (Patel, 2017).
Cassandra is a distributed database that is highly scalable in terms of storage
and throughput. Cassandra’s design brings together the data model described in
Google’s Bigtable paper and the behaviour of Amazon Dynamo (Ref: Dynamo: Amazon's
Highly Available Key-value Store, in Proceedings of the Twenty-first ACM SIGOPS
Symposium on Operating Systems Principles (2008)) (Chang, 2008). Cassandra, along
with its remote Thrift API (Ref: Thrift: Scalable Cross-Language Services
Implementation, Facebook) (Slee, 8), was initially developed by Facebook as a data
platform to build services such as Inbox Search that scale to serve hundreds of
millions of users (Lakshman, 35-40). After it was released to the Apache Software
Foundation Incubator in 2009, Cassandra was accepted as a top-level Apache project
in March of 2010 (Featherston, 2010).
HBase and Cassandra have been compared on the same grounds by several data
analysts. Notably, Rajith Kumar and Roseline Mary, in their comparison of NoSQL
databases including Cassandra, HBase and MongoDB, evaluated the databases with the
Yahoo Cloud Serving Benchmark under different read and update workloads and found
that Cassandra works better and faster than HBase (Mary, 2017). This motivates my
research: an experiment with different amounts of workload on HBase and Cassandra
to validate the performance.
Performance Test Plan:
The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification
and program suite for evaluating the retrieval and maintenance capabilities
of computer programs. It is often used to compare the relative performance
of NoSQL database management systems. The original benchmark was developed
by workers in the research division of Yahoo!, who released it in 2010. In
this project we used the YCSB benchmarking tool to evaluate the performance
of HBase and Cassandra. The figure below shows the architecture of the YCSB
benchmarking framework.
The YCSB client generates the data to be loaded into the database and
generates the operations that make up the workload. The workload executor
drives multiple threads. Each thread executes a sequential series of
operations by making calls to the database interface layer, both to load the
database and to execute the workload. The threads also measure the latency
and achieved throughput of their operations and report these measurements to
the statistics module. Finally, the statistics module aggregates the
measurements and reports average, 95th and 99th percentile latencies, and
either a histogram or a time series of the latencies. Workload executor: the
workload executor contains the individual workload scenarios, which are
collections of write, read and update operations (Brian F. Cooper).
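To make the role of the statistics module concrete, the sketch below aggregates a list of per-operation latencies into the figures named above. It is not YCSB's code: the function `summarize`, the microsecond unit and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math


def summarize(latencies_us, elapsed_seconds):
    """Aggregate per-operation latencies into YCSB-style summary figures."""
    ordered = sorted(latencies_us)
    count = len(ordered)

    def percentile(p):
        # Nearest-rank method: the value at rank ceil(p% of the sample count).
        rank = max(1, math.ceil(p * count / 100))
        return ordered[rank - 1]

    return {
        "operations": count,
        "throughput_ops_per_sec": count / elapsed_seconds,
        "average_latency_us": sum(ordered) / count,
        "p95_latency_us": percentile(95),
        "p99_latency_us": percentile(99),
    }
```

Throughput here is simply the operation count divided by the elapsed run time, which is how the results in the evaluation relate the number of operations to operations per second.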
The configuration of the system on which the tests were performed is as
follows:
Processor: 1.7 GHz dual-core Intel Core i5-4210U CPU, 2 cores; 12 GB RAM;
Windows 8 operating system; system type: 64-bit operating system, x64-based
processor.
• Virtualization Tool: OpenStack Cloud Server.
• HBase Virtual Machine: Ubuntu (64-bit) operating system, 4 GB RAM,
2-core virtual processor.
• Cassandra Virtual Machine: Ubuntu (64-bit) operating system, 4 GB RAM,
2-core virtual processor.
• Workload Parameters: Workload A & Workload B
The databases selected for this project are HBase and Cassandra. After
successful installation of HBase and Cassandra, the YCSB benchmarking tool
was installed to evaluate the performance of the selected databases. The
Testharness tool was used to run the various operations of the performance
test. Testharness is an automation tool for running tests: a test harness
comprises the test drivers and other supporting tools required to execute
tests, and provides stubs and drivers, which are small programs that
interact with the software under test.
Workload A: Update-heavy workload with a 50/50 mix of read and update
operations.
Workload B: Read-mostly workload with a 95/5 mix of read and update
operations.
The operation counts used for the performance evaluations are 100000,
250000, 350000 and 500000. Tests were performed with both Workload A and
Workload B on the HBase and Cassandra databases.
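The operation mixes of Workload A and Workload B can be pictured as a weighted random choice per operation. The sketch below is not how YCSB is implemented or configured; the dictionary layout, the function `generate_operations` and the fixed seed are assumptions, and the second operation type is labelled "update" as in the YCSB workload definitions.

```python
import random

# Operation proportions taken from the test plan above.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
}
OPERATION_COUNTS = [100000, 250000, 350000, 500000]


def generate_operations(workload, operation_count, seed=0):
    """Return a list of 'read'/'update' labels drawn with the workload's mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so a run can be repeated
    return rng.choices(list(mix), weights=list(mix.values()), k=operation_count)
```

Calling `generate_operations("B", 10000)` yields a sequence in which roughly 95% of the entries are reads, so the same generator can drive each of the operation counts listed in `OPERATION_COUNTS`.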
Evaluation and Results:
For the HBase and Cassandra performance evaluation, we ran tests for
Workload A and Workload B using the benchmarking tool (YCSB). The
specifications are as follows:
Workload A: Update-heavy workload, 50% read and 50% update.
Workload B: Read-mostly workload, 95% read and 5% update.
For Workload A
The comparison of HBase and Cassandra is shown below on the basis of average
latency, read operations, throughput and number of operations from the Run
files, and overall load throughput, average load latency and load operations
from the Load files. Workloads 1, 2, 3 and 4 correspond to 10000, 250000,
350000 and 500000 operations respectively.
1. Average Latency Vs Record Read Operations for Workload A:
The above figure compares HBase and Cassandra with respect to records read
and average read latency. The average latency of Cassandra stays consistent
as the read record count grows, whereas the average latency of HBase shows a
spike at the fourth read record count. So the average latency of HBase is
more consistent at low read operation counts than at high ones. In contrast,
there is no major difference in the average latency of Cassandra.
2. Overall Throughput Vs Operations for Workload A:
Throughput is the measured rate at which data is transferred from one node
to another in a given time frame. From the graph of Throughput Vs Operations
below, it can be seen that HBase behaves differently from Cassandra. At low
workloads the throughput of HBase is high and it continues to increase until
250000 operations, but at high workloads it plummets. For Cassandra, on the
other hand, the throughput rises slightly but stays lower than that of
HBase; notably, Cassandra is consistent throughout the workloads.
3. Average latency Vs Update Operations:
The above graph shows the average latency against update operations for
HBase and Cassandra under Workload A. For update operations, the average
latency of Cassandra is initially higher than that of HBase. As the workload
increases, Cassandra’s average latency keeps decreasing, whereas from
workload 2 (i.e. the 250000 workload) onwards the average latency of HBase
jumps and rises above Cassandra’s. Thus, at high workloads the average
latency of HBase is higher than that of Cassandra.
4. Load Latency Read Vs Load Operations:
The above graph compares HBase and Cassandra on the basis of average load
latency and load operations for Workload A. In terms of average latency, it
can be clearly seen that HBase is more consistent than Cassandra: as the
workload increases, the average load latency of Cassandra decreases, whereas
HBase shows consistency throughout the workloads.
5. Load Overall Throughput Vs Load Operations-Run
Throughput measures how much work passes through a system in a given time.
From the below comparison graph of HBase and Cassandra it can be seen that
the number of operations performed by HBase is comparatively higher than
that of Cassandra. Cassandra improves as the workload increases, but it is
not able to cross over the HBase throughput. So, HBase is more efficient
than Cassandra in terms of operations performed per second.
For Workload B,
The comparison of HBase and Cassandra is shown below on the basis of average
latency, read operations, throughput and number of operations from the Run
files, and overall load throughput, average load latency and load operations
from the Load files. Workload 1, workload 2, workload 3 and workload 4
are for 10000, 250000, 350000 and 500000 respectively.
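The metrics listed above come from the summary that YCSB prints at the end of each Load and Run phase, in lines of the form `[SECTION], Metric(unit), value`. The following is a rough sketch of how such a summary could be read into a dictionary before plotting. The helper name `parse_ycsb_summary` and the sample values are hypothetical and are not taken from this experiment's files.

```python
import re
from collections import defaultdict

# Matches summary lines such as "[READ], AverageLatency(us), 800.5".
LINE_RE = re.compile(
    r"^\[(?P<section>[^\]]+)\],\s*(?P<metric>[^,]+),\s*(?P<value>[-+0-9.eE]+)\s*$"
)


def parse_ycsb_summary(text):
    """Return {section: {metric: value}} for every summary line in text."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines that are not summary lines are skipped
            metrics[match["section"]][match["metric"].strip()] = float(match["value"])
    return dict(metrics)


# Hypothetical sample output, for illustration only.
SAMPLE = """\
[OVERALL], RunTime(ms), 50000
[OVERALL], Throughput(ops/sec), 5000.0
[READ], Operations, 125000
[READ], AverageLatency(us), 800.5
[UPDATE], Operations, 125000
[UPDATE], AverageLatency(us), 950.25
"""

parsed = parse_ycsb_summary(SAMPLE)
print(parsed["READ"]["AverageLatency(us)"])
print(parsed["OVERALL"]["Throughput(ops/sec)"])
```

Collecting one such dictionary per database and per workload size gives the table of values that the graphs in this section are drawn from.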
6. Average Latency Vs Record Read Operations for Workload B
The above graph shows the average read latency against read operations. We
can clearly see that the average latency is consistent for Cassandra, as it
was for Workload A, whereas HBase shows a steep increase after workload 2.
Thus we can say that as the read operations increase, the average latency
increases for HBase whereas it remains constant for Cassandra in Workload B.
7. Overall Throughput Vs Operations for Workload B
The graph below shows the performance in terms of overall throughput and
operations performed for Workload B. It can be seen that, for HBase, the
throughput (i.e. operations performed per second) decreases as the number
of operations increases. There is not much difference in the performance of
Cassandra: as the workload increases, the operations performed per second
increase a little until workload 3, with a slight decrease after that.
8. Avg. Latency Vs Update Operations for Workload B
The above graph evaluates HBase and Cassandra on the basis of average
latency and number of update operations for Workload B. As we have seen
earlier, the average latency of Cassandra is consistent throughout the
operations. The average latency of HBase, on the other hand, is a little
higher than Cassandra's for fewer operations, rises as the load increases,
and keeps increasing from workload 3. Thus we can say that Cassandra is
consistent throughout the workloads, while for HBase the average latency
increases as the workload increases. Hence, HBase does not give good
performance for high workloads.
9. Average Latency Vs Number of Operations
The above graph compares HBase and Cassandra on the basis of average load
latency and load operations for Workload B. In terms of average latency, it
can be clearly seen that HBase is more consistent than Cassandra: as the
workload increases, the average load latency for HBase decreases, and HBase
shows this consistency throughout the workloads.
Thus we can say that Cassandra is more stable in handling operations
ranging from small to large workloads. On the other hand, HBase is good
with the large workloads.
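The terms "consistent" and "stable" are read off the graphs in this report. One possible way to put a number on them, offered here only as a sketch, is the coefficient of variation (standard deviation divided by mean) of the average latency across the four workload sizes. This measure is not part of the original analysis, and the latency lists below are hypothetical values, not results.

```python
from statistics import mean, pstdev


def coefficient_of_variation(latencies):
    """Return stdev / mean; smaller values mean latency changes less with load."""
    average = mean(latencies)
    if average == 0:
        raise ValueError("mean latency must be non-zero")
    return pstdev(latencies) / average


# Hypothetical average latencies (microseconds) for workloads 1 to 4.
steady_db = [900.0, 910.0, 905.0, 915.0]
spiky_db = [600.0, 650.0, 1400.0, 2100.0]

print(coefficient_of_variation(steady_db) < coefficient_of_variation(spiky_db))  # True
```

A database whose latency barely moves between the smallest and largest workload gets a value close to zero, while a spike at the higher workloads pushes the value up.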
10. Overall Load Throughput Vs Load Operations for Workload B
The above graph shows overall load throughput against the number of load
operations for HBase and Cassandra for Workload B. It can be seen that the
throughput of HBase is comparatively higher than that of Cassandra, and
both follow the same pattern as the load increases. Moreover, the output of
Workload B follows the same pattern as that of Workload A. We can clearly
see that as the operations increase, the throughput of HBase stays higher
than that of Cassandra, but after workload 2 it keeps decreasing; on the
contrary, the throughput of Cassandra increases slightly after workload 3.
Hence, based on the facts presented in the graph, we can conclude that the
performance of HBase is higher for smaller workloads but drops slightly for
higher workloads.
Conclusion and Discussion
In conclusion, the experiment comparing the performance of HBase and
Cassandra was carried out successfully on the basis of average latency, read
operations, throughput and number of operations from the Run files, and
overall load throughput, average load latency and load operations from the
Load files. The architecture of both databases, along with their key
characteristics, was reviewed, and their behaviour was examined for
Workload A and Workload B. The output has been presented in this paper.
From this examination, it can be seen that Cassandra is consistent
throughout the tests. For load operations, Cassandra's throughput holds up
well, which means Cassandra has a good capability of loading data when the
workload increases; its average latency has also been consistent throughout
the examination. This is in line with Cassandra's model of trading
consistency for availability and latency, which has been established with
reference to the CAP theorem (Featherston, 2010). On the other hand, the
experiment shows that HBase handles the higher workloads efficiently only
for some parameters: its throughput (operations per second) is good for
small operation counts but decreases at the higher workloads. It would not
be accurate to say that HBase is efficient only at high workloads, but the
findings for Workload A do show some parameters (update records and average
latency) where HBase works well at the high workloads. In contrast, for the
average latency of Workload B, the performance of HBase is lower than that
of Cassandra. Thus, although previous studies suggest that the performance
of HBase generally tends to improve, our examination of Workload A and
Workload B indicates that HBase does not perform as expected as the
workloads increase. From the overall evaluation we can say that Cassandra
is predictable, which is one of the requirements of any organisation, and
that it performed much better than HBase, whereas HBase showed high
inconsistency throughout the analysis. This study can be further extended
to higher workloads to check how these two NoSQL databases perform, and
which one is more suitable for big organisations.
References
Cooper, B. F., et al. (2010). Benchmarking Cloud Serving Systems with YCSB. Proceedings of the 1st
ACM Symposium on Cloud Computing (SoCC '10), 143-154. Retrieved from
https://www2.cs.duke.edu/courses/fall13/cps296.4/838-CloudPapers/ycsb.pdf
Chang, F., et al. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on
Computer Systems (TOCS), 26(2), 4.
Featherston, D. (2010). Cassandra: Principles and Application. unitn, 17. Retrieved from
http://disi.unitn.it/~montreso/ds/papers/Cassandra.pdf
Guru99. (2019). Cassandra Architecture & Replication Factor Strategy. Retrieved from
www.guru99.com: https://www.guru99.com/cassandra-architecture.html
Kumar, G. (2018). Exploring the Different Types of NoSQL Databases Part II. Retrieved from
3pillarglobal: https://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases
Lakshman, A., & Malik, P. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS
Operating Systems Review, 44(2), 35-40.
Mary, R. K. (2017). Comparative Performance Analysis of various NoSQL Databases: MongoDB,
Cassandra and HBase on Yahoo Cloud Server. Imperial Journal of Interdisciplinary Research
(IJIR), 5.
Mehra, A. (2015, june 06). Introduction to Apache Cassandra's Architecture. Retrieved from
dzone.com: https://dzone.com/articles/introduction-apache-cassandras
Patel, H. (2017). HBase: A NoSQL Database. 10.13140/RG.2.2.22974.28480. researchgate, 15.
Pethuru Raj, G. C. (2014). Handbook of research on cloud infrastructures for big data analytics. IGI
Global.
Pfeil, M. (2010, Oct 29). Why does Scalability matter, and how does Cassandra scale? Retrieved
from Datastax: https://www.datastax.com/dev/blog/why-does-scalability-matter-and-how-does-cassandra-scale
Sinha, S. (2019, Feb). HBase Architecture: HBase Data Model & HBase Read/Write Mechanism.
Retrieved from www.edureka.com: https://www.edureka.co/blog/hbase-architecture/
Slee, M., Agarwal, A., & Kwiatkowski, M. (2007). Thrift: Scalable cross-language services implementation. Facebook White Paper, 8.
DataFlair Team. (2018, June). HBase Architecture – Regions, Hmaster, Zookeeper. Retrieved from
data-flair.training: https://data-flair.training/blogs/hbase-architecture/
TechTarget. (2017, march). NoSQL (Not Only SQL database). Retrieved from
searchdatamanagement.techtarget.com:
https://searchdatamanagement.techtarget.com/definition/NoSQL-Not-Only-SQL
Wikimedia Foundation, I. (2019, March). Apache Cassandra. Retrieved from en.wikipedia.org:
https://en.wikipedia.org/wiki/Apache_Cassandra
Wikipedia, t. f. (2019, Mar 19). Apache HBase. Retrieved from en.wikipedia.org:
https://en.wikipedia.org/wiki/Apache_HBase
Williamson, J. (2018). The 4 V’s of Big Data. Retrieved from Dummies:
https://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data/