Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node is no longer sufficient, which has led to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first carried out a theoretical study of the architecture and operation of some DHT-MapReduce systems. Then, we compared, by simulation, the data collected on their load balancing and task allocation strategies.
Finally, the simulation results show that CLOAK-Reduce with C5R5 replication provides better load balancing efficiency and MapReduce job submission, with 10% churn or no churn.
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES (ijdpsjournal)
By programming both the data plane and the control plane, network operators can adapt their networks to their needs. Thanks to research over the past decade, this concept has become more formalized and more technologically feasible. However, since control plane programmability came first, it has already been successfully implemented in real networks and is beginning to pay off. Today, data plane programmability is evolving very rapidly to reach this level, attracting the attention of researchers and developers: designing data plane languages, developing applications on top of them, formalizing the software switches and architectures that can run data plane code and applications, increasing the performance of software switches, and so on. As the control plane and data plane become more open, many new innovations and technologies are emerging, but some experts warn that consumers may be confused as to which of the many technologies to choose. This is a testament to how much innovation is emerging in networking. This paper outlines some emerging applications on the data plane and offers opportunities for further improvement and optimization. Our observations show that most of the implementations are done in a test environment and have not been tested thoroughly enough in terms of performance, but there are many interesting works; for example, previous control plane solutions are being implemented in the data plane.
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION (ijdms)
Distributed databases and data replication are effective ways to increase the accessibility and reliability of unstructured, semi-structured and structured data in order to extract new knowledge. Replication offers better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging.
To meet these challenges, Hadoop and DHTs compete in the storage domain, and MapReduce and other models in distributed processing, each with its strengths and weaknesses.
We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT. We evaluate their performance through a comparative study of data from simulations. The results show that radial replication performs better for storage, while circular replication gives better search results.
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ... (IRJET Journal)
This document proposes a new HDFS architecture that eliminates the single point of failure of the NameNode by distributing metadata storage using blockchain technology. In the traditional HDFS, the NameNode stores all metadata, but in the new architecture this is replaced by blockchain miners that securely store encrypted metadata across data nodes. Blockchain links data blocks in a serial manner with cryptographic hashes to ensure integrity. The key components are HDFS clients, data nodes for storage, and specially designated miner nodes that help create and store metadata blocks in an encrypted and distributed fashion similar to how transactions are recorded in a blockchain. This architecture aims to provide reliable, secure and faster metadata access without a single point of failure.
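The document does not include code, but a minimal Python sketch of the hash-chained metadata idea it describes might look as follows; the MetadataBlock class and its fields are illustrative assumptions, not the paper's actual design.

```python
import hashlib
import json
import time

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

class MetadataBlock:
    """One block of HDFS-style file metadata, linked to its predecessor by hash (hypothetical)."""
    def __init__(self, metadata: dict, prev_hash: str):
        self.metadata = metadata          # e.g. file name, block ids, permissions
        self.prev_hash = prev_hash        # hash of the previous block in the chain
        self.timestamp = time.time()
        self.hash = self.compute_hash()

    def compute_hash(self) -> str:
        payload = json.dumps(
            {"metadata": self.metadata, "prev": self.prev_hash, "ts": self.timestamp},
            sort_keys=True,
        ).encode()
        return sha256_hex(payload)

def verify_chain(chain):
    """Check that every block's stored hash and back-link are still consistent."""
    for i, block in enumerate(chain):
        if block.hash != block.compute_hash():
            return False
        if i > 0 and block.prev_hash != chain[i - 1].hash:
            return False
    return True

# Build a tiny chain of metadata blocks, as a miner node might.
genesis = MetadataBlock({"file": "/data/part-00000", "blocks": ["blk_1"]}, prev_hash="0" * 64)
second = MetadataBlock({"file": "/data/part-00001", "blocks": ["blk_2", "blk_3"]}, genesis.hash)
print(verify_chain([genesis, second]))   # True unless a block was tampered with
```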
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc.
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners (ijcsit)
This document summarizes a research paper that proposes using in-node combiners to improve the performance of Hadoop MapReduce jobs. It discusses how MapReduce jobs are I/O intensive and describes two common bottlenecks: during the map phase when data is loaded from disks, and during the shuffle phase when intermediate results are transferred over the network. The paper introduces an in-node combiner approach to optimize I/O by locally aggregating intermediate results within nodes to reduce network traffic between mappers and reducers. It evaluates this approach through an experiment counting word occurrences in Twitter messages.
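As a rough illustration of the in-node (in-mapper) combining idea, outside Hadoop itself, the following Python sketch aggregates intermediate counts locally before emitting them, so only one key-value pair per distinct word leaves the node; the function name and sample input are assumptions.

```python
from collections import defaultdict

def mapper_with_in_node_combiner(lines):
    """Aggregate word counts locally instead of emitting one ('word', 1)
    pair per occurrence, reducing what must be shuffled over the network."""
    local_counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            local_counts[word] += 1
    # Only one pair per distinct word leaves this node.
    for word, count in local_counts.items():
        yield word, count

tweets = ["big data big clusters", "big data everywhere"]
print(dict(mapper_with_in_node_combiner(tweets)))
# {'big': 3, 'data': 2, 'clusters': 1, 'everywhere': 1}
```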
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ... (Kyong-Ha Lee)
This document proposes HadoopXML, a system for efficiently processing massive XML data and multiple twig pattern queries in parallel using Hadoop. Key features of HadoopXML include:
1) It partitions large XML files and processes them in parallel across nodes while preserving structural information.
2) It simultaneously processes multiple twig pattern queries with a shared input scan, without needing separate MapReduce jobs for each query (a minimal illustration of the shared-scan idea follows this list).
3) It enables query processing tasks to share input scans and intermediate results like path solutions, reducing redundant processing and I/O.
4) It provides load balancing to fairly distribute twig join operations across nodes.
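HadoopXML's actual twig-join processing is considerably more involved; as a rough Python sketch of the shared-scan idea only, the snippet below evaluates several made-up path queries against each record during a single pass over the input, rather than scanning once per query.

```python
# Hypothetical "queries": predicates over simple path strings, standing in
# for real twig patterns. Each element is a (path, text) pair for brevity.
elements = [
    ("/site/people/person/name", "Alice"),
    ("/site/regions/asia/item/name", "Widget"),
    ("/site/people/person/name", "Bob"),
]

queries = {
    "q1": lambda path, text: path.endswith("/person/name"),
    "q2": lambda path, text: "/regions/" in path,
}

def shared_scan(elements, queries):
    """One pass over the input serves every registered query."""
    results = {qid: [] for qid in queries}
    for path, text in elements:          # the single, shared input scan
        for qid, predicate in queries.items():
            if predicate(path, text):
                results[qid].append(text)
    return results

print(shared_scan(elements, queries))
# {'q1': ['Alice', 'Bob'], 'q2': ['Widget']}
```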
Large amounts of data are produced daily in various fields such as science, economics, engineering and health. The main challenge of pervasive computing is to store and analyze these large amounts of data. This has led to the need for usable and scalable data applications and storage clusters. In this article, we examine the Hadoop architecture developed to deal with these problems. The Hadoop architecture consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which enable storage and computation on a set of commodity computers. In this study, a Hadoop cluster consisting of four nodes was created. Pi and Grep MapReduce applications, which show the effect of different data sizes and numbers of nodes in the cluster, were run and their results examined.
Survey of Parallel Data Processing in Context with MapReduce (cscpconf)
MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation by two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallel computing in large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications. The author describes the workflow of the MapReduce process, studies some important issues, such as fault tolerance, in more detail, and gives an illustration of how MapReduce works. The data locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, the author addresses the placement of data across nodes so that each node carries a balanced data-processing load stored in a parallel manner. Given a data-intensive application running on a Hadoop MapReduce cluster, the author shows how data placement is done in the Hadoop architecture and the role of MapReduce within it. The amount of data stored in each node to achieve improved data-processing performance is also explained.
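The survey's examples are not reproduced here, but a minimal single-process Python sketch of the programming model it describes is shown below: the user supplies only map and reduce functions, and a toy driver performs the grouping that the real library's shuffle phase handles across a cluster. The max-temperature computation and all names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy local driver: apply map, group values by key (the 'shuffle'),
    then apply reduce to each key's list of values."""
    intermediate = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        intermediate[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# User-supplied functions: maximum recorded temperature per year.
def map_fn(record):
    year, temp = record.split(",")
    yield year, int(temp)

def reduce_fn(year, temps):
    return max(temps)

readings = ["1950,22", "1950,31", "1951,28", "1951,19"]
print(run_mapreduce(readings, map_fn, reduce_fn))   # {'1950': 31, '1951': 28}
```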
Hadoop is an open source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a special type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop uses the Hadoop Distributed File System (HDFS). Like most distributed file systems, HDFS faces a familiar problem of data sharing and availability among compute nodes, which often leads to a decrease in performance. This paper is an experimental evaluation of Hadoop's computing performance, carried out by designing a rack-aware cluster that uses Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit the scenario. Experiments conducted on a rack-aware cluster setup significantly reduced task completion time, but once the volume of data being processed increases there is a considerable cutback in computational speed due to the update cost. Further, the threshold level for the balance between update cost and replication factor is identified and presented graphically.
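The paper's exact prediction formulas are not given here, but a rough Python sketch of the underlying idea follows: fit a Lagrange interpolating polynomial to a block's recent access counts and extrapolate the next period, triggering extra replication when the predicted count exceeds a threshold. The sample counts and the threshold value are invented for illustration.

```python
def lagrange_predict(xs, ys, x_next):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x_next."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_next - xj) / (xi - xj)
        total += term
    return total

# Access counts for one HDFS block over the last four periods (made-up numbers).
periods = [1, 2, 3, 4]
accesses = [10, 14, 21, 30]
predicted = lagrange_predict(periods, accesses, 5)

REPLICATION_THRESHOLD = 35   # illustrative value only
extra_replica = predicted > REPLICATION_THRESHOLD
print(round(predicted, 1), extra_replica)   # 40.0 True
```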
Big Data Analysis and Its Scheduling Policy – Hadoop (IOSR Journals)
This document discusses scheduling policies in Hadoop for big data analysis. It describes the default FIFO scheduler in Hadoop as well as alternative schedulers like the Fair Scheduler and Capacity Scheduler. The Fair Scheduler was developed by Facebook to allocate resources fairly between jobs by assigning them to pools with minimum guaranteed capacities. The Capacity Scheduler allows multiple tenants to securely share a large cluster while giving each organization capacity guarantees. It also supports hierarchical queues to prioritize sharing unused resources within an organization.
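As a simplified illustration of the fair-sharing idea behind the Fair Scheduler, not Hadoop's actual algorithm, the Python sketch below grants each pool its minimum guaranteed slots first and then levels the remaining slots across pools that still have demand; pool names and numbers are made up.

```python
def fair_allocate(total_slots, pools):
    """pools: {name: {"min": guaranteed_slots, "demand": runnable_tasks}}.
    Grant minimum guarantees first, then level the rest across pools."""
    alloc = {name: min(p["min"], p["demand"]) for name, p in pools.items()}
    remaining = total_slots - sum(alloc.values())
    # Hand out remaining slots one at a time to the least-served pool
    # that still has unmet demand (a crude form of fair sharing).
    while remaining > 0:
        candidates = [n for n, p in pools.items() if alloc[n] < p["demand"]]
        if not candidates:
            break
        neediest = min(candidates, key=lambda n: alloc[n])
        alloc[neediest] += 1
        remaining -= 1
    return alloc

pools = {
    "analytics": {"min": 4, "demand": 10},
    "ad_hoc":    {"min": 2, "demand": 3},
    "etl":       {"min": 2, "demand": 8},
}
print(fair_allocate(12, pools))   # {'analytics': 5, 'ad_hoc': 3, 'etl': 4}
```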
This document discusses geo-distributed parallelization of MapReduce jobs across multiple datacenters. It introduces GEO-PACT, a Hadoop-based framework that can efficiently process sequences of parallelization contracts jobs on geo-distributed input data. GEO-PACT uses a group manager to determine optimal execution paths and job managers at each datacenter to execute tasks locally using Hadoop. It employs copy managers to transfer data between datacenters and aggregation managers to combine results. The goal is to optimize execution time by leveraging data locality across geographically distributed data sources.
Survey on Performance of Hadoop Map reduce Optimization Methods (paperpublications3)
Abstract: Hadoop is an open source software framework for storing and processing large-scale datasets on clusters of commodity hardware. Hadoop provides a reliable shared storage and analysis system, where storage is provided by HDFS and analysis by MapReduce. MapReduce frameworks are foraying into the domain of high performance computing, with stringent non-functional requirements, namely execution times and throughput. MapReduce provides simple programming interfaces with two functions, map and reduce, which can be executed automatically in parallel on a cluster without requiring any intervention from the programmer. Moreover, MapReduce offers other benefits, including load balancing, high scalability, and fault tolerance. The challenge arises when data is produced dynamically and continuously from different geographical locations. For such dynamically generated, geo-dispersed data sets, an efficient algorithm is needed to guide the timely transfer of data into the cloud and to select the best data center on which to aggregate all the data, since a MapReduce-like framework is most efficient when the data to be processed is in one place rather than spread across data centers, owing to the enormous overhead of inter-data-center data movement during the shuffle and reduce stages. Recently, many researchers have tended to implement and deploy data-intensive and/or computation-intensive algorithms on the MapReduce parallel computing framework for high processing efficiency.
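As a small illustration of the data-center selection problem described above, the following Python sketch picks the aggregation site that minimises total transfer cost, given the volume generated at each location and per-GB inter-site costs; all sites, volumes, and costs are invented.

```python
def best_aggregation_site(data_gb, cost_per_gb):
    """data_gb: {site: GB generated there}.
    cost_per_gb: {(src, dst): transfer cost per GB}; moving data within a site is free.
    Returns the site minimising the total cost of moving everything there."""
    def total_cost(dst):
        return sum(gb * cost_per_gb.get((src, dst), 0.0)
                   for src, gb in data_gb.items() if src != dst)
    return min(data_gb, key=total_cost)

data_gb = {"us-east": 400, "eu-west": 250, "ap-south": 120}
cost_per_gb = {
    ("us-east", "eu-west"): 0.02, ("eu-west", "us-east"): 0.02,
    ("us-east", "ap-south"): 0.05, ("ap-south", "us-east"): 0.05,
    ("eu-west", "ap-south"): 0.04, ("ap-south", "eu-west"): 0.04,
}
print(best_aggregation_site(data_gb, cost_per_gb))   # us-east
```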
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work (IRJET Journal)
This document discusses frameworks for processing big data that is distributed across geographic locations. It begins by introducing the challenges of geo-distributed big data processing and then describes several MapReduce-based frameworks like G-Hadoop and G-MR that can process pre-located geo-distributed data. It also covers Spark-based systems like Iridium and frameworks that partition data across geographic locations, such as KOALA grid-based systems. The document analyzes key aspects of geo-distributed big data processing systems like data distribution, task scheduling, and fault tolerance.
Programming Modes and Performance of Raspberry-Pi Clusters (AM Publications)
In present times, updated information and knowledge have become readily accessible to researchers, enthusiasts, developers, and academics through the Internet on many different subjects and for a wide range of applications. The underlying framework facilitating such possibilities is the networking of servers, nodes, and personal computers. However, such setups, comprising mainframes, servers and networking devices, are inaccessible to many, costly, and not portable. In addition, students and lab-level enthusiasts do not have the requisite access to modify the functionality to suit specific purposes. The Raspberry-Pi (R-Pi) is a small device capable of many functionalities akin to super-computing while being portable, economical and flexible. It runs on open source Linux, making it a preferred choice for lab-level research and studies. Users have started using its embedded networking capability to design portable clusters that replace costlier machines. This paper introduces new users to the most commonly used frameworks and some recent developments that best exploit the capabilities of the R-Pi when used in clusters. It also introduces some of the tools and measures that rate the efficiency of clusters, to help users assess the quality of a cluster design, and aims to make users aware of the various parameters in a cluster environment.
A DHT Chord-like mannered overlay-network to store and retrieve data (Andrea Tino)
The document describes a distributed data backup and recovery system implemented as a distributed hash table on a ring topology overlay network. Servers in the network store and retrieve data units which are addressed and routed through the network based on their content hashes. The system was designed for scalability, flexibility, and extensibility through a library of reusable "proto-services" that can connect to each other to enable various functionality and behaviors.
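The document's proto-service design is not reproduced here, but a minimal Python sketch of the ring-based addressing it relies on might look like this (simplified; a real Chord-style overlay also maintains finger tables for O(log N) routing): each data unit is hashed onto the identifier ring and stored on the first server clockwise from its key.

```python
import hashlib
from bisect import bisect_right

RING_BITS = 16                       # tiny identifier space for the example
RING_SIZE = 2 ** RING_BITS

def ring_id(value: str) -> int:
    """Hash a node name or data unit onto the identifier ring."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

class Ring:
    def __init__(self, node_names):
        # Sorted (id, name) pairs define the ring order.
        self.nodes = sorted((ring_id(n), n) for n in node_names)

    def successor(self, key: str) -> str:
        """First node clockwise from the key's position (wrapping around)."""
        kid = ring_id(key)
        ids = [nid for nid, _ in self.nodes]
        idx = bisect_right(ids, kid) % len(self.nodes)
        return self.nodes[idx][1]

ring = Ring(["server-a", "server-b", "server-c", "server-d"])
print(ring.successor("backup/unit-42"))   # the server responsible for this data unit
```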
This document provides an overview of Hadoop, including its core components HDFS, MapReduce, and YARN. It describes how HDFS stores and replicates data across nodes for reliability. MapReduce is used for distributed processing of large datasets by mapping data to key-value pairs, shuffling, and reducing results. YARN was introduced to improve scalability by separating job scheduling and resource management from MapReduce. The document also gives examples of using MapReduce on a movie ratings dataset to demonstrate Hadoop functionality and running simple MapReduce jobs via the command line.
IRJET- Big Data-A Review Study with Comparative Analysis of Hadoop (IRJET Journal)
This document provides an overview of Hadoop and compares it to Spark. It discusses the key components of Hadoop including HDFS for storage, MapReduce for processing, and YARN for resource management. HDFS stores large datasets across clusters in a fault-tolerant manner. MapReduce allows parallel processing of large datasets using a map and reduce model. YARN was later added to improve resource management. The document also summarizes Spark, which can perform both batch and stream processing more efficiently than Hadoop for many workloads. A comparison of Hadoop and Spark highlights their different processing models.
Modern Computing: Cloud, Distributed, & High Performance (inside-BigData.com)
In this video, Dr. Umit Catalyurek from Georgia Institute of Technology presents: Modern Computing: Cloud, Distributed, & High Performance.
Ümit V. Çatalyürek is a Professor in the School of Computational Science and Engineering in the College of Computing at the Georgia Institute of Technology. He received his Ph.D. in 2000 from Bilkent University. He is a recipient of an NSF CAREER award and is the principal investigator of several awards from the Department of Energy, the National Institutes of Health, and the National Science Foundation. He currently serves as an Associate Editor for Parallel Computing, and as an editorial board member for IEEE Transactions on Parallel and Distributed Computing, and the Journal of Parallel and Distributed Computing.
Learn more: http://www.bigdatau.org/data-science-seminars
Watch the video presentation: http://wp.me/p3RLHQ-ghU
Bitmap Indexes for Relational XML Twig Query Processing (Kyong-Ha Lee)
This document proposes bitmap indexes to improve the processing of XML twig queries on shredded XML data stored in relational database tables. It introduces several bitmap indexes built on the tag, path, and tag+level domains that provide quick access to relevant XML elements. A hybrid "bitTwig" index is also presented that can find all instances matching a twig query using only a pair of cursor-enabled bitmap indexes, without accessing the actual XML data stored in the tables. Experiments show this bitTwig index outperforms existing indexing approaches in most cases.
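The bitTwig structure itself is more sophisticated, but a small Python sketch of the general mechanism it builds on, per-tag and per-path bitmaps over element positions that are ANDed together to narrow candidates without touching the underlying tables, is given below; the sample elements and query are invented.

```python
def build_bitmap(elements, key_fn):
    """Map each key (e.g. a tag or path) to an integer used as a bitmap:
    bit i is set when element i has that key."""
    bitmaps = {}
    for i, elem in enumerate(elements):
        key = key_fn(elem)
        bitmaps[key] = bitmaps.get(key, 0) | (1 << i)
    return bitmaps

# Shredded elements as (tag, path) rows, positions 0..4.
elements = [
    ("book",  "/lib/book"),
    ("title", "/lib/book/title"),
    ("name",  "/lib/author/name"),
    ("title", "/lib/journal/title"),
    ("title", "/lib/book/title"),
]

tag_index  = build_bitmap(elements, lambda e: e[0])
path_index = build_bitmap(elements, lambda e: e[1])

# Candidates for a query asking for <title> elements under /lib/book:
candidates = tag_index["title"] & path_index["/lib/book/title"]
matches = [i for i in range(len(elements)) if candidates >> i & 1]
print(matches)   # [1, 4]
```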
This document discusses a proposed data-aware caching framework called Dache that could be used with big data applications built on MapReduce. Dache aims to cache intermediate data generated during MapReduce jobs to avoid duplicate computations. When tasks run, they would first check the cache for existing results before running the actual computations. The goal is to improve efficiency by reducing redundant work. The document outlines the objectives and scope of extending MapReduce with Dache, provides background on MapReduce and Hadoop, and concludes that initial experiments show Dache can eliminate duplicate tasks in incremental jobs.
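The document gives no implementation details, but a minimal Python sketch of the cache-before-compute idea might look as follows: task results are keyed by a hash of the operation and its input partition, so a repeated (incremental) job can skip work whose inputs have not changed. The in-memory dict standing in for Dache's cache service, and all names, are assumptions.

```python
import hashlib

cache = {}   # stands in for a shared cache service

def cache_key(operation: str, partition: bytes) -> str:
    """Identify a unit of work by its operation name and input contents."""
    return operation + ":" + hashlib.sha256(partition).hexdigest()

def run_task(operation, partition, compute):
    """Return a cached result when the same operation has already been
    applied to identical input; otherwise compute and remember it."""
    key = cache_key(operation, partition)
    if key in cache:
        return cache[key], True           # cache hit: duplicate work avoided
    result = compute(partition)
    cache[key] = result
    return result, False

def word_count(partition: bytes):
    return len(partition.split())

print(run_task("wordcount", b"to be or not to be", word_count))  # (6, False)
print(run_task("wordcount", b"to be or not to be", word_count))  # (6, True)
```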
The document discusses two papers about MapReduce. The first paper describes the Google-style MapReduce implementation, as realised in Hadoop, which uses a master-slave model. The second paper proposes a peer-to-peer MapReduce architecture to handle dynamic node failures, including master failures. It compares the two approaches, noting that the P2P model provides better fault tolerance against master failures.
The study expands upon previous work in format extension. The initial research repurposed extra space provided by an unrefined format to store metadata about the file in question. This process does not negatively impact the original intent of the format and allows for the creation of new derivative file types with both backwards compatibility and new features. The file format extension algorithm has been rewritten entirely in C++ and is now being distributed as an open source C/C++ static library, roughdraftlib. The files from our previous research are essentially binary compatible, though a few extra fields have been added for developer convenience. The new data represents the current and oldest compatible versions of the binary and values representing the scaling ratio of the image. These new fields are statically included in every file and take only a few bytes to encode, so they have a trivial effect on the overall encoding density.
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT (ijwscjournal)
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the increasing demand to handle vast quantities of data are daunting and on the rise. One of the recent processing models offering a more efficient and intuitive solution for rapidly processing large amounts of data in parallel is called MapReduce. It is a framework defining a template approach to programming for performing large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation across several processors and hides the complexity of writing parallel and distributed programming code. This paper provides a comprehensive systematic review and analysis of large-scale dataset processing and dataset handling challenges and requirements in a cloud computing environment using the MapReduce framework and its open-source implementation Hadoop. We define requirements for MapReduce systems to perform large-scale data processing, propose the MapReduce framework and one implementation of it on Amazon Web Services, and present an experiment running a MapReduce system in a cloud environment. The paper concludes that MapReduce is one of the best techniques for processing large datasets and that it can help developers perform parallel and distributed computation in a cloud environment.
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
Characterization of hadoop jobs using unsupervised learning (João Gabriel Lima)
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
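A toy sketch of the kind of clustering the study performs is shown below using scikit-learn's KMeans; the metric names, values, and the number of clusters are invented, and the real work clustered thousands of production jobs into eight groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Per-job metrics: [map tasks, reduce tasks, input GB, duration minutes] (made up).
jobs = np.array([
    [2000,  50, 900, 45],
    [1800,  40, 850, 40],
    [  10,   1,   1,  2],
    [  12,   2,   2,  3],
    [ 300, 200, 120, 60],
    [ 320, 180, 110, 55],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(jobs)
print(kmeans.labels_)           # which profile each job belongs to
print(kmeans.cluster_centers_)  # the "characteristic job" of each cluster
```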
Design Issues and Challenges of Peer-to-Peer Video on Demand System (cscpconf)
P2P media streaming and file downloading are among the most popular applications on the Internet. These systems reduce the server load and provide scalable content distribution. P2P networking is a new paradigm for building distributed applications. This paper describes the design requirements for P2P media streaming and compares live and video-on-demand systems based on their system architecture. We also study the traditional approaches to P2P streaming systems, their design issues and challenges, and current approaches for providing P2P VoD services.
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi... (Cognizant)
This document discusses big data processing options for optimizing analytical workloads using Hadoop. It provides an overview of Hadoop and its core components HDFS and MapReduce. It also discusses the Hadoop ecosystem including tools like Pig, Hive, HBase, and ecosystem projects. The document compares building Hadoop clusters to using appliances or Hadoop-as-a-Service offerings. It also briefly mentions some Hadoop competitors for real-time processing use cases.
This document provides an introduction and overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses MapReduce and HDFS to parallelize workloads and store data redundantly across nodes to solve issues around hardware failure and combining results. Key aspects covered include how HDFS distributes and replicates data, how MapReduce isolates processing into mapping and reducing functions to abstract communication, and how Hadoop moves computation to the data to improve performance.
Introduction to Big Data and Hadoop using Local Standalone Mode (inventionjournals)
Big Data is a term for data sets so large and complex that traditional data processing applications are inadequate to deal with them. The term often refers simply to the use of predictive and other analytic methods that extract value from data. Big Data is generally understood as a collection of large datasets that cannot be processed using traditional computing techniques. It is not purely data; rather, it is a complete subject involving various tools, techniques and frameworks. Big Data can be any structured collection that exceeds the capabilities of conventional data management methods. Hadoop is a distributed framework used to handle this large amount of data, covering not only its storage but also its processing. Hadoop is an open-source software framework for distributed storage and processing of big data sets on computer clusters built from commodity hardware. HDFS was built to support high-throughput, streaming reads and writes of extremely large files. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data. The wordcount example reads text files and counts how often words occur. The input is text files and the result is a wordcount file, each line of which contains a word and the count of how often it occurred, separated by a tab.
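The wordcount example mentioned above is normally written in Java, but an equivalent job can be expressed through Hadoop Streaming with a small Python script that reads stdin and writes tab-separated word/count lines, matching the output format described; the script and file names below are illustrative.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: run as `wordcount.py map` for the mapper
and `wordcount.py reduce` for the reducer (script name is illustrative)."""
import sys
from itertools import groupby

def mapper(stream):
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stream):
    # Streaming delivers mapper output sorted by key, so equal words are adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)

# Submitted roughly like (jar path and directories are placeholders):
#   hadoop jar hadoop-streaming.jar -files wordcount.py \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -input books/ -output counts/
```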
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
Map Reduce based on Cloak DHT Data Replication Evaluation (ijdms)
Distributed databases and data replication are effective ways to increase the accessibility and reliability of un-structured, semi-structured and structured data to extract new knowledge. Replications offer better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging.
This document provides an introduction to Hadoop, including its programming model, distributed file system (HDFS), and MapReduce framework. It discusses how Hadoop uses a distributed storage model across clusters, racks, and data nodes to store large datasets in a fault-tolerant manner. It also summarizes the core components of Hadoop, including HDFS for storage, MapReduce for processing, and YARN for resource management. Hadoop provides a scalable platform for distributed processing and analysis of big data.
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER (ijdpsjournal)
With the interconnection of many computers online, the data shared by users is multiplying daily. As a result, the amount of data to be processed by dedicated servers rises very quickly. However, this instantaneous increase in the volume of data to be processed by the server runs up against latency during processing, which calls for a model to manage the distribution of tasks across several machines. This article presents a study of load balancing for large data sets on a cluster of Hadoop nodes. We use MapReduce to implement parallel programming and YARN to monitor task submission and execution in the node cluster.
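The article's own balancing mechanism is not detailed here; as a rough illustration of distributing work across machines, the Python sketch below greedily assigns input splits to the least-loaded node (an LPT-style heuristic, not YARN's actual scheduler); the split sizes and node names are made up.

```python
import heapq

def distribute_splits(split_sizes_mb, node_names):
    """Greedy longest-processing-time assignment: always hand the next
    largest split to the node with the least data so far."""
    heap = [(0, name, []) for name in node_names]   # (load, node, splits)
    heapq.heapify(heap)
    for size in sorted(split_sizes_mb, reverse=True):
        load, name, splits = heapq.heappop(heap)
        splits.append(size)
        heapq.heappush(heap, (load + size, name, splits))
    return {name: (load, splits) for load, name, splits in heap}

splits = [128, 128, 64, 256, 128, 512, 64, 128]
print(distribute_splits(splits, ["node-1", "node-2", "node-3"]))
```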
This document provides an overview of Hadoop, including:
- Prerequisites for getting the most out of Hadoop include programming skills in languages like Java and Python, SQL knowledge, and basic Linux skills.
- Hadoop is a software framework for distributed processing of large datasets across computer clusters using MapReduce and HDFS.
- Core Hadoop components include HDFS for storage, MapReduce for distributed processing, and YARN for resource management.
- The Hadoop ecosystem also includes components like HBase, Pig, Hive, Mahout, Sqoop and others that provide additional functionality.
This document provides an overview of Apache Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It discusses big data and the need for solutions like Hadoop, describes the key components of Hadoop including HDFS for storage and MapReduce for processing, and outlines some applications and pros and cons of the Hadoop framework.
This document discusses Hadoop and its core components HDFS and MapReduce. It provides an overview of how Hadoop addresses the challenges of big data by allowing distributed processing of large datasets across clusters of computers. Key points include: Hadoop uses HDFS for distributed storage and MapReduce for distributed processing; HDFS works on a master-slave model with a Namenode and Datanodes; MapReduce utilizes a map and reduce programming model to parallelize tasks. Fault tolerance is built into Hadoop to prevent single points of failure.
This document provides an overview of Hadoop and how it addresses the challenges of big data. It discusses how Hadoop uses a distributed file system (HDFS) and MapReduce programming model to allow processing of large datasets across clusters of computers. Key aspects summarized include how HDFS works using namenodes and datanodes, how MapReduce leverages mappers and reducers to parallelize processing, and how Hadoop provides fault tolerance.
Big Data Storage System Based on a Distributed Hash Tables System (ijdms)
Big Data is unavoidable, given that digital media is the predominant form of communication in consumers' daily lives. Controlling its stakes and the quality of its data must be a priority, so as not to distort the strategies derived from its processing with the aim of deriving profit. To achieve this, a great deal of research work has been carried out by companies and several platforms have been created. MapReduce, one of the enabling technologies, has proven to be applicable to a wide range of fields. However, despite its importance, recent work has shown its limitations, and to remedy this, Distributed Hash Tables (DHTs) have been used. Thus, this document not only analyses DHT and MapReduce implementations and Top-Level Domains (TLDs) in general, but also provides a description of a DHT model as well as some guidelines for planning future research.
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...cscpconf
Cloud has been a computational and storage solution for many data centric organizations. The
problem today those organizations are facing from the cloud is in data searching in an efficient
manner. A framework is required to distribute the work of searching and fetching from
thousands of computers. The data in HDFS is scattered and needs lots of time to retrieve. The
major idea is to design a web server in the map phase using the jetty web server which will give
a fast and efficient way of searching data in MapReduce paradigm. For real time processing on
Hadoop, a searchable mechanism is implemented in HDFS by creating a multilevel index in
web server with multi-level index keys. The web server uses to handle traffic throughput. By web
clustering technology we can improve the application performance. To keep the work down, the
load balancer should automatically be able to distribute load to the newly added nodes in the
server.
This document describes a proposed architecture for improving data retrieval performance in a Hadoop Distributed File System (HDFS) deployed in a cloud environment. The key aspects are:
1) A web server would replace the map phase of MapReduce to provide faster searching of data. The web server uses multi-level indexing for real-time processing on HDFS.
2) An Apache load balancer distributes requests across backend application servers to improve throughput and scalability.
3) The NameNode is divided into master and slave servers, with the master containing the multi-level index and slaves storing data and lower-level indexes. This allows distributed data retrieval.
Integrating dbm ss as a read only execution layer into hadoopJoão Gabriel Lima
The document discusses integrating database management systems (DBMSs) as a read-only execution layer into Hadoop. It proposes a new system architecture that incorporates modified DBMS engines augmented with a customized storage engine capable of directly accessing data from the Hadoop Distributed File System (HDFS) and using global indexes. This allows DBMS to efficiently provide read-only operators while not being responsible for data management. The system is designed to address limitations of HadoopDB by improving performance, fault tolerance, and data loading speed.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for data storage, which partitions data into blocks and replicates them across nodes for fault tolerance. The master node tracks where data blocks are stored and worker nodes execute tasks like mapping and reducing data. Hadoop provides scalability and fault tolerance but is slower for iterative jobs compared to Spark, which keeps data in memory. The Lambda architecture also informs Hadoop's ability to handle batch and speed layers separately for scalability.
One of the challenges in storing and processing the data and using the latest internet technologies has resulted in large volumes of data. The technique to manage this massive amount of data and to pull out the value, out of this volume is collectively called Big data. Over the recent years, there has been a rising interest in big data for social media analysis. Online social media have become the important platform across the world to share information. Facebook, one of the largest social media site receives posts in millions every day. One of the efficient technologies that deal with the Big Data is Hadoop. Hadoop, for processing large data volume jobs uses MapReduce programming model. This paper provides a survey on Hadoop and its role in facebook and a brief introduction to HIVE.
International Journal of Distributed and Parallel Systems (IJDPS), Vol. 13, No. 1, January 2022
DOI: 10.5121/ijdps.2022.13102
PERFORMANCE EVALUATION OF BIG DATA
PROCESSING OF CLOAK-REDUCE
Mamadou Diarra and Telesphore B. Tiendrebeogo
Department of Mathematics and Computer Science, Nazi Boni University,
Bobo-Dioulasso, Burkina Faso
ABSTRACT
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and
videos). Centralised exploitation of massive data on a single node is no longer viable, which has led to the
emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel
processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our
hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we
first performed a theoretical analysis of the architecture and operation of some DHT-MapReduce
frameworks. Then, we compared the data collected on their load balancing and task allocation strategies
by simulation.
Finally, the simulation results show that CLOAK-Reduce C5R5 replication provides better load balancing
efficiency for MapReduce job submission, with 10% churn or no churn.
KEYWORDS
Big Data, DHT, MapReduce, CLOAK-Reduce.
1. INTRODUCTION
Big Data [1] showed the limits of traditional storage and processing frameworks. To solve its
problems, many projects have been developed. Hadoop [2] and MapReduce [3] are pioneering
software frameworks dedicated to this field. The parallel programming model of Hadoop
MapReduce designed for scalability and fault tolerance assumes that calculations occur in a
centralized environment with a master/slave architecture.
The lack of storage for incoming queries that could be processed at a later time, together with the
problems associated with churn in early versions of Hadoop, led Hadoop's designers to move towards new
resource management policies [4]. Yet, with the heterogeneity of computing frameworks, Apache Hadoop
still faces problems of load balancing and data replication [2, 5].
Thus to overcome the storage problem, Distributed Hash Tables (DHTs) were used. However,
with the proliferation of data and the heterogeneity of storage and processing nodes, this
approach was faced with the same problem of load balancing and replication [6, 7].
While many studies have been conducted on the dynamic load balancing strategy [8, 9] and data
replication mechanisms [10, 11] of Hadoop and DHTs, new and innovative hybrid storage and
processing frameworks have emerged. Our objective is to show the efficiency of our Big Data
framework, CLOAK-Reduce. We will evaluate its performance by comparing its load balancing
strategy and replication mechanism to some existing solutions. This paper is structured as
follows: Section 2 reviews different Hadoop approaches and some DHT-MapReduce frameworks used for
efficient management of massive data; Section 3 presents our performance analysis and experimental
results; finally, conclusions are drawn and future work is described in Section 4.
2. RELATED WORKS
2.1. Hadoop’s Different Approaches
Hadoop's performance is greatly influenced by its scheduler. Originally, the main purpose of
Hadoop was to perform large batch jobs such as web indexing and log mining. Since its first
deployment in 2006, four versions of Hadoop frameworks have been released, namely Hadoop
0.x, Hadoop 1.x, Hadoop 2.x and in 2017 Hadoop 3.x. In Hadoop 0.x and Hadoop 1.x, slot-based
resource management was used, while Hadoop 2.x and Hadoop 3.x use a resource management
system known as Yet Another Resource Negotiator (YARN) [12].
How does Hadoop 3.x add value over earlier Apache Hadoop releases? First, let us introduce MapReduce,
its dedicated processing framework.
2.2. MapReduce
MapReduce is a framework that enables automatic parallelization and distribution of large-scale
computations, while abstracting away the details of parallelization, fault tolerance, data
distribution and load balancing [3]. Its principle consists in splitting the initial volume of data V into
smaller volumes v_i of (key, value) pairs, which are handled separately on several machines [3]. To
simplify, a user can take a large problem, divide it into smaller equivalent tasks and send these tasks to
other processors for computation. The results are returned to the user and combined into a single
answer [2].
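
As an illustration of this (key, value) decomposition, the minimal single-machine sketch below (in Python; it is ours, not the paper's) mimics the MapReduce word-count flow: the Map step emits a (word, 1) pair for each word of an input split v_i, and the Reduce step combines the values gathered for each key.

from collections import defaultdict

def map_phase(split):
    # Emit a (key, value) pair for every word of one input split v_i.
    for word in split.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Combine all values collected for one key into a single result.
    return (key, sum(values))

def mapreduce(splits):
    # Shuffle: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for split in splits:
        for key, value in map_phase(split):
            groups[key].append(value)
    return [reduce_phase(key, values) for key, values in groups.items()]

# Example: the initial volume V is divided into smaller splits v_i.
print(mapreduce(["big data storage", "big data processing"]))
# [('big', 2), ('data', 2), ('storage', 1), ('processing', 1)]
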
2.3. Hadoop 1
Hadoop 1.x has limited scalability of 4,000 nodes per cluster and only supports the MapReduce
processing model [5]. Its "master/slave" architecture, built around the NameNode and JobTracker
masters, is its point of failure. A failure of the NameNode daemon eliminates all access to the Hadoop
Distributed File System (HDFS) file manager. A JobTracker failure likewise cancels the running
MapReduce jobs and prevents new jobs from being submitted [12, 13]. In short, the Single Point Of
Failure (SPOF) in a Hadoop 1 cluster is the NameNode: churn of any other node does not cause loss of
the cluster's data or results. Hence the need, in this configuration, for an additional mechanism to save
the NameNode metadata.
2.4. Hadoop 2
Hadoop 2.x (Figure 1) has resolved most of the restrictions of Hadoop 1.x. Its resource and node
management system, Yet Another Resource Negotiator (YARN), serves as a framework for a
wide range of data analytics such as event management, streaming and real-time operations [5].
Hadoop 2.x has therefore combined Hadoop 1.x with other distributed computing models such as
Spark, Hama, Giraph, the Message Passing Interface (MPI), coprocessors and HBase [14]. Hadoop 2.x
scales up to 10,000 nodes per cluster and adds a Secondary NameNode with functionality to
overcome the SPOF [15].
However, Hadoop 2.x has some limits. Its fault tolerance is managed by replication: by default, HDFS
replicates each block three times. A single data node manages many disks, and these disks fill up during
normal write operations; adding or replacing disks can cause significant problems within a data node.
For data management, Hadoop 2.x has an HDFS balancer that distributes data across the disks of a data
node. Replication, the only available fault-tolerance mechanism, is not space-optimised: if HDFS has six
blocks to store, 18 blocks are occupied, which is 200% overhead in storage space. In addition, Hadoop
2.x only supports one active and one Secondary NameNode for the entire namespace.
Figure 1. Hadoop 1 vs Hadoop 2
2.5. Hadoop 3
Hadoop version 3.x (Figure 2) was released in December 2017 [15] incorporating a number of
significant improvements over the previous major release, Hadoop 2.x. Apache Hadoop 3.x with
its Erasure Coding (EC) in place of data replication provides the same level of fault tolerance
with much less storage space. In typical erasure coding configurations, the storage overhead is no
more than 50%. Integrating EC with HDFS improves storage efficiency while providing data
durability similar to that of traditional replication-based HDFS deployments. As an illustration, a
file of six blocks replicated three times will consume 6*3 = 18 blocks of disk space, but with an EC
deployment (six data blocks, three parity blocks) it will only consume nine blocks of disk space [5].
Hadoop 3.x also improved scalability, usability (by introducing streams and aggregation), and the
reliability of the Timeline service [10].
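
To make the storage arithmetic above explicit, the following small sketch (illustrative Python, not part of the framework) computes the occupied blocks and the relative overhead for 3-way replication versus a (6 data, 3 parity) erasure coding layout.

def replication_blocks(data_blocks, replicas=3):
    # Every block is stored 'replicas' times.
    return data_blocks * replicas

def erasure_coding_blocks(data_blocks, stripe_data=6, stripe_parity=3):
    # Each full stripe of 'stripe_data' blocks adds 'stripe_parity' parity blocks.
    stripes = -(-data_blocks // stripe_data)  # ceiling division
    return data_blocks + stripes * stripe_parity

data = 6
rep = replication_blocks(data)      # 6 * 3 = 18 blocks
ec = erasure_coding_blocks(data)    # 6 + 3 = 9 blocks
print(rep, ec)                                    # 18 9
print((rep - data) / data, (ec - data) / data)    # 2.0 (200%) and 0.5 (50%) overhead
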
Hadoop 3 nevertheless has some disadvantages. Node heterogeneity has a negative impact on its
performance and limits the overall throughput of the system, raising the issues of load balancing and
data replication. To address these challenges, Hadoop also introduces intra-node data balancing.
Figure 2. Some Hadoop History [25]
2.6. DHT - MapReduce
Although Hadoop 3.x supports more than two (2) NameNodes, the initial HDFS high-availability
implementation provided for a single active NameNode and a single secondary NameNode.
By replicating edits across a quorum of three JournalNodes, this architecture is able to tolerate
the failure of any single node in the system. For the deployment of infrastructures that require higher
levels of fault tolerance, JournalNodes can be used to allow users to run multiple backup
NameNodes. For example, by configuring three NameNodes and five JournalNodes, the cluster is
able to tolerate the failure of two nodes rather than just one. In parallel with Hadoop, other
projects to design distributed storage and parallelized processing frameworks to develop high-
performance computing using heterogeneous resources have emerged. We present some
frameworks dedicated to this type of computation exploiting DHTs as organisational
mechanisms. ChordMR[16] and ChordReduce[17] have focused their work on the failure of the
master. The use of Chord [18] allows a set of backup nodes to be available for the resumption of
outstanding work. When the active node fails, the backup nodes run an algorithm to elect a new
node. Most DHTs focus their efforts on preventing record loss due to churn, with scalability of more
than 10^6 nodes and strategies to increase their storage performance [19].
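
Returning to the JournalNode quorum sizing mentioned at the start of this subsection: a majority-based quorum of J JournalNodes tolerates floor((J - 1) / 2) failures. The small sketch below (illustrative Python) reproduces the three-versus-five JournalNode example.

def tolerated_journalnode_failures(journal_nodes):
    # A majority quorum of J JournalNodes survives floor((J - 1) / 2) failures.
    return (journal_nodes - 1) // 2

for j in (3, 5):
    print(j, "JournalNodes ->", tolerated_journalnode_failures(j), "tolerated failure(s)")
# 3 JournalNodes -> 1 tolerated failure(s)
# 5 JournalNodes -> 2 tolerated failure(s)
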
2.7. EclipseMR
EclipseMR (Figure 3) consists of double-layered consistent hash rings: a decentralized DHT-based file
system and an in-memory key-value store that employs consistent hashing. The in-memory key-value
store in EclipseMR is designed to cache not only local data but also remote data, so that globally popular
data can be distributed across cluster servers and found by consistent hashing [20]. The signature of the
EclipseMR framework is its use of a double ring to store data and metadata: the upper layer is a
distributed in-memory key-value cache implemented using consistent hashing, and the lower layer is a
DHT-type distributed file system.
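
The following minimal consistent-hashing sketch (illustrative Python; the class and method names are ours, not EclipseMR's) shows the idea underlying both rings: servers and keys are hashed onto the same identifier space, and a key is served by the first server encountered clockwise from its position, so adding or removing a server only affects keys in its neighbourhood.

import bisect
import hashlib

class HashRing:
    def __init__(self, servers):
        # Place each server at a ring position derived from its name.
        self._ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def lookup(self, key):
        # The key is owned by the first server clockwise from hash(key).
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect(positions, self._hash(key)) % len(self._ring)
        return self._ring[index][1]

ring = HashRing(["server-1", "server-2", "server-3"])
print(ring.lookup("block-42"))  # deterministic owner of this key
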
Figure 3. EclipseMR Architecture [19]
In order to leverage large distributed memories and increase the cache-hit ratio, EclipseMR
proposes a Locality-Aware Fair (LAF) job scheduler that works as the load balancer for the
distributed in-memory caches [15]. An experimental study shows that each design and
implementation of EclipseMR components contributes to improving the performance of
distributed MapReduce processing. Simulation shows EclipseMR outperforms Hadoop and Spark
for several representative benchmark applications including iterative applications [20].
2.8. P2P-Cloud
P2P-Cloud offers innovative solutions to some of these concerns. Its architecture has four layers
(Figure 4): a physical layer consisting of volunteer nodes connected via the Internet; above it, an overlay
layer that logically connects the nodes; a third layer that performs the Map and Reduce functions on the
data set; and an outermost layer that serves as an interface to receive user jobs.
Figure 4. P2P-Cloud Layered Architecture [21]
P2P-Cloud uses Chord as an overlay protocol to organise the nodes. The bootstrapper acts as an
interface for joining nodes and submitting tasks: a node that wishes to join the network contacts the
bootstrapper, and each new node announces its success rate from previous runs. The success rate values
belong to the same domain as the nodeId. For each new task, the bootstrapper assigns a new identifier
equal to the current number of tasks + 1. A new task is composed of input data and the Map and Reduce
functions. Figure 5 illustrates the P2P-Cloud model. The slave nodes download the input chunks and run
the Map task on them (step 2 of Figure 5). The intermediate results are replicated on the successor node
(step 3 of Figure 5) and transferred to the reducer nodes (step 4 of Figure 5). The reducer performs the
Reduce function on the intermediate data and transfers the final output to the primary master (step 5 of
Figure 5). The primary master then copies the result to the bootstrapper so that the user can download the
results.
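
The toy sketch below (hypothetical Python; the class and attribute names are ours) summarises the bootstrapper's bookkeeping as described above: joining nodes announce a success rate, and each submitted task receives the identifier "current number of tasks + 1" together with its input data and its Map and Reduce functions.

class Bootstrapper:
    def __init__(self):
        self.nodes = {}   # nodeId -> announced success rate
        self.tasks = {}   # task identifier -> task description

    def join(self, node_id, success_rate):
        # Each new node announces its success rate from previous runs.
        self.nodes[node_id] = success_rate

    def submit(self, input_data, map_fn, reduce_fn):
        # A new task gets the identifier: current number of tasks + 1.
        task_id = len(self.tasks) + 1
        self.tasks[task_id] = {"input": input_data, "map": map_fn, "reduce": reduce_fn}
        return task_id

boot = Bootstrapper()
boot.join("node-a", success_rate=0.92)
print(boot.submit(["chunk-1", "chunk-2"], map_fn=str.upper, reduce_fn="".join))  # 1
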
Figure 5. P2P-Cloud model [21]
3. PERFORMANCE ANALYSIS
In this section, we show that CLOAK-Reduce [19, 22, 23] can hold its own in a setting where the main
requirement of computing frameworks is the efficient exploitation of storage and computing resources.
We also take into account MapReduce job management through a load balancing strategy, and support
for peer failure management to improve fault tolerance in an Internet environment.
3.1. Theoretical Review
Although the primary concern of distributed storage and parallelized computing is addressed in
EclipseMR, P2P-Cloud and CLOAK-Reduce, there are significant differences in their
implementation. Challenges include managing node heterogeneity, task governance, churn and
load balancing.
In our analysis, the first approach is a comparative study of mathematical models based on the
search performance of P2P-Cloud (Equation 1) and EclipseMR (Equation 2).
Assuming that an arbitrary node is equally likely to be at any location in P2P-Cloud or EclipseMR at
scale N = 2^m, i.e. that node positions follow a uniform distribution, the probability of any given
location is p = 1/2^m.
Let hop(n, key) be the number of hops it takes for node n to find the identifier key, and let n(key) be the
node where the resource key is located; hop(n, key) is thus the number of hops node n takes to reach
n(key).
Let E_hop be the expected number of hops a node takes to discover a random key. The node discovery
mechanism is similar to an m-ary tree: a node at level i of the tree is reached in i steps, and the number of
nodes at level i is written Level(i). The probability that a random node is found in i steps is therefore
p_i = Level(i) · p.
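
With these definitions, the expectation of the lookup hop count can be written in the usual way; since the closed forms of Equations 1-3 are not recoverable from the extracted text, only the generic expression is reproduced here:

\[
E_{hop} \;=\; \sum_{i} i \, p_i \;=\; \sum_{i} i \,\frac{\mathrm{Level}(i)}{2^{m}},
\qquad \text{with } p = \frac{1}{2^{m}}.
\]
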
This quotient (Equation 3) means that the expected number of hops in the EclipseMR lookup process
could be decreased by about 1/(m-1), or equivalently that its efficiency could be improved by about
1/(m-1). In sum, the dual-ring architecture of EclipseMR reduces the average number of hops between
peers in a search query and reduces query latency when peers in the same group are topologically close,
even though the reliability of P2P-Cloud nodes is estimated using the Exponentially Weighted Moving
Average (EWMA) algorithm [21].
However, we focused our comparative study on P2P-Cloud and CLOAK-Reduce, because
despite the very promising future of EclipseMR, adding trivial features had become almost
impossible [24].
3.2. Simulation
In this second approach, we compare the performance of CLOAK-Reduce with that of P2P-Cloud in
terms of fault tolerance, to justify the robustness of CLOAK-Reduce. Due to hardware constraints, we set
the churn rate to 10% during the execution period of the Map and Reduce tasks and choose file sizes of
128, 256, 512 and 1024 MB. All experiments were performed on a PC with a 2.60 GHz Intel Core i5
CPU and 8 GB of memory running Windows 10 Professional. Churn events are independent and occur
randomly. The comparison of the two frameworks focuses on the following performance indicators:
1. The total elapsed time, with and without churn, during the Map and Reduce phases.
2. The message exchanges, with and without churn, between the Resource Manager (RM) and the slave
executing nodes for P2P-Cloud, and between the Schedulers, JobManagers and JobBuilders for
CLOAK-Reduce.
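
For readability, the configuration described above can be summarised as follows (illustrative Python; the parameter names are ours, not those of the simulator).

# Simulation settings as described in Section 3.2 (names chosen here for illustration only).
SIMULATION = {
    "churn_rate": 0.10,                      # 10% of nodes fail during Map/Reduce execution
    "file_sizes_mb": [128, 256, 512, 1024],
    "frameworks": ["P2P-Cloud", "CLOAK-Reduce C1R5", "CLOAK-Reduce C5R1", "CLOAK-Reduce C5R5"],
    "indicators": [
        "total elapsed time with and without churn (Map and Reduce phases)",
        "message exchanges between managers and executing nodes, with and without churn",
    ],
    "host": "Intel Core i5 2.60 GHz, 8 GB RAM, Windows 10 Professional",
}
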
3.3. Results of the Map and Reduce Submissions
3.3.1. Total Elapsed Time Without Churn
The CLOAK-Reduce C5R5 replication scheme and P2P-Cloud follow almost the same trend when there
are no churns. However, the simulation results give a slight advantage to CLOAK-Reduce C5R5
replication, which has an average time of about 24.805 seconds compared to 27.540 seconds for
P2P-Cloud when the data volume increases from 128 MB to 1024 MB. Under the same simulation
conditions, the mean processing times for the same data volumes with the C1R5 and C5R1 replications
of CLOAK-Reduce are 31.513 and 30.251 seconds respectively. Figure 6 illustrates this behaviour.
Figure 6. Total elapsed time in case of no churn
3.3.2. Map Phase with 10% Churn
Figure 7 compares the total elapsed time during a Map submission phase with a 10% churn rate.
The results of the different simulations give the best performance to CLOAK-Reduce with C5R5
replication, with an average job submission time of 45.114 seconds when the data volume varies from
128 MB to 1024 MB. Then come 55.687 seconds for P2P-Cloud, and 57.801 and 66.988 seconds for
CLOAK-Reduce with C1R5 and C5R1 replication respectively. Nevertheless, the time difference between
P2P-Cloud and CLOAK-Reduce with C1R5 replication is only 2.115 seconds.
Figure 7. Map phase with 10% churn
3.3.3. Reduce Phase with 10% Churn
Figure 8 compares the total elapsed time during a Reduce submission phase with a 10% churn rate.
The results of the different simulations give the best performance to CLOAK-Reduce with C5R5
replication, with an average job submission time of 50.944 seconds when the data volume varies from
128 MB to 1024 MB. Then come 67.200 seconds for P2P-Cloud, and 116.086 and 68.522 seconds for
CLOAK-Reduce with C1R5 and C5R1 replication respectively. There is only a small variation in
submission time between the Map and Reduce phases of CLOAK-Reduce with C5R1 replication:
66.988 seconds for the Map phase and 68.522 seconds for the Reduce phase. In contrast, C1R5
replication often results in a much higher time, with almost 50% of Reduce submissions failing.
Figure 8. Reduce phase with 10% churn
4. CONCLUSIONS AND PERSPECTIVES
This study shows the flexibility of the address space and the efficiency of the CLOAK-Reduce
load balancing strategy. The proposed hierarchical architecture uses multiple management nodes
to achieve scalability and availability without significant downtime.
The performance of the proposed approach is verified using the two phases of MapReduce; for each
phase, we performed experiments to observe the behaviour of our framework. Compared to existing
approaches that only consider load-based block placement or block redistribution, our global approach
achieves good performance in terms of global load submission, load balancing and distributed task
restitution. These issues are addressed by our mechanism of radial replication for intermediate jobs,
circular replication for job submission, and storage of unsubmitted jobs at the root level.
In the future, we will focus on optimising the cost of communication between idle and active nodes,
that is, on reducing the number of exchanges between candidate JobManagers, inactive JobBuilders and
their elected JobManager.
REFERENCES
[1] Thomas Bourany, 2018, "Les 5V du big data", Regards croisés sur l'économie, (2):27-31. Accessed
05 Sept. 2019.
[2] Than Than Htay et Sabai Phyu, 2020, "Improving the performance of Hadoop MapReduce
Applications via Optimization of concurrent containers per Node", In : 2020 IEEE Conference on
Computer Applications (ICCA). IEEE, p. 1-5.
[3] Jeffrey Dean and Sanjay Ghemawat, "Mapreduce simplified data processing on large clusters".
Communications of the ACM, 51(1), pp:107–113, 2008
[4] Awaysheh, F. M., Alazab, M., Garg, S., Niyato, D., & Verikoukis, C., 2021, “Big data resource
management & networks: Taxonomy, survey, and future directions”, IEEE Communications Surveys
& Tutorials.
[5] Xue, R., Guan, Z., Dong, Z., & Su, W., 2018, “Dual-Scheme Block Management to Trade Off
Storage Overhead, Performance and Reliability”, In International Conference of Pioneering
Computer Scientists, Engineers and Educators (pp. 462-476). Springer, Singapore, September
[6] Ping, Y., 2020, “Load balancing algorithms for big data flow classification based on heterogeneous
computing in software definition networks”. Journal of Grid Computing, 1-17.
[7] Mansouri, N., Javidi, M. M., 2020, “A review of data replication based on metaheuristics approach in
cloud computing and data grid”. Soft computing, 1-28.
[8] Bhushan, K., 2020, “Load Balancing in Cloud Through Task Scheduling”, In Recent Trends in
Communication and Intelligent Systems (pp. 195-204). Springer, Singapore.
[9] Gao, X., Liu, R., Kaushik, A., 2020, “Hierarchical multi-agent optimization for resource allocation in
cloud computing”. IEEE Transactions on Parallel and Distributed Systems, 32(3), 692-707.
[10] Yahya Hassanzadeh-Nazarabadi, Alptekin Küpçü, et Öznur Özkasap, 2018, “Decentralized and
locality aware replication method for DHT-based P2P storage systems”, Future Generation Computer
Systems, vol. 84, p. 32-46.
[11] Monnet, S., 2015, Contributions à la réplication de données dans les systèmes distribués à grande
échelle (Doctoral dissertation).
[12] Pandey, Vaibhav, and Poonam Saini, 2018, "How heterogeneity affects the design of Hadoop
MapReduce schedulers : a state-of-the-art survey and challenges." Big data 6.2 : 72-95.
[13] Prabhu, Yogesh, and Sachin Deshpande, 2018, "Transformation of Hadoop: A Survey", IJSTE -
International Journal of Science Technology & Engineering, Vol. 4, Issue 8, February, ISSN (online):
2349-784X.
[14] Pathak, A. R., Pandey, M., & Rautaray, S. S., 2020, “Approaches of enhancing interoperations among
high performance computing and big data analytics via augmentation”. Cluster Computing, 23(2),
953-988.
[15] Sewal, P., & Singh, H., 2021, "A Critical Analysis of Apache Hadoop and Spark for Big Data
Processing", In 2021 6th International Conference on Signal Processing, Computing and Control
(ISPCC) (pp. 308-313). IEEE, October.
[16] Jiagao Wu, Hang Yuan, Ying He, and Zhiqiang Zou, 2014, “Chordmr : A p2p-based job management
scheme in cloud”. Journal of Networks, 9(3) :541.
[17] Andrew Rosen, Brendan Benshoof, Robert W Harrison, and Anu G Bourgeois, 2016, "Mapreduce on
a chord distributed hash table", In 2nd International IBM Cloud Academy Conference, volume 1,
page 1.
[18] Ion Stoica, Robert Morris, David Liben-Nowell, David R Karger, M Frans Kaashoek, Frank Dabek,
and Hari Balakrishnan. “Chord : a scalable peer-to-peer lookup protocol for internet applications”.
IEEE/ACM Transactions on Networking (TON), 11(1) :17–32, 2003.
[19] Tiendrebeogo, T., Diarra, M., 2020, "Big Data Storage System based on a Distributed Hash Tables
System", International Journal of Database Management Systems (IJDMS), Vol. 12, No. 4/5, October.
[20] Sanchez, Vicente AB, et al., 2017, "EclipseMR: distributed and parallel task processing with
consistent hashing", In 2017 IEEE International Conference on Cluster Computing (CLUSTER) (pp.
322-332). IEEE
[21] Banerjea, Shashwati, Mayank Pandey and Manoj Madhava Gore, 2016, "Towards a Peer-to-Peer
Solution for Utilization of Volunteer Resources to Provide Computation-as-a-Service", In 2016 IEEE
30th International Conference on Advanced Information Networking and Applications (AINA)
(pp. 1146-1153). IEEE, March.
[22] Mamadou Diarra and Telesphore Tiendrebeogo, 2021, "MapReduce based on CLOAK DHT Data
Replication Evaluation", International Journal of Database Management Systems (IJDMS), Vol. 13,
No. 4, August.
[23] Mamadou Diarra and Telesphore Tiendrebeogo, 2021, "CLOAK-Reduce Load Balancing Strategy for
MapReduce", International Journal of Computer Science and Information Technology (IJCSIT),
Vol. 13, No. 4, August.
[24] Bolea Sanchez, V. A, 2019, "VELOXDFS : Elastic blocks in distributed file systems for big data
frameworks".
[25] Shachi Marathe, April 29, 2017, "An Introduction to Hadoop",
https://www.mindtory.com/an-introduction-to-hadoop/