Distributed databases and data replication are effective ways to increase the accessibility and reliability of unstructured, semi-structured and structured data in order to extract new knowledge. Replication offers better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging.
To meet these challenges, Hadoop and DHTs compete in the storage domain, while MapReduce and other frameworks compete in distributed processing, each with its strengths and weaknesses.
We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT. We evaluate their performance through a comparative study of simulation data. The results show that radial replication performs better for storage, whereas circular replication gives better search results.
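Since the abstract contrasts circular and radial replica placement on a DHT ring, a toy sketch may help fix intuitions. The rules below are illustrative assumptions only, not the CLOAK DHT's actual mechanisms: circular replication is modelled as copies on k consecutive successors of the key, radial replication as copies at evenly spaced offsets around the identifier space.

```python
# Illustrative sketch only: the CLOAK DHT's real placement rules are not
# reproduced here. "Circular" is assumed to mean k consecutive successors,
# "radial" to mean k evenly spaced offsets around the identifier space.
import hashlib
from bisect import bisect_left

ID_SPACE = 2 ** 32

def node_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

def successor(ring, key):
    """Index of the first node clockwise from key on the sorted ring."""
    return bisect_left(ring, key) % len(ring)

def circular_replicas(ring, key, k=3):
    """k copies on consecutive successors along the ring."""
    start = successor(ring, key)
    return [ring[(start + j) % len(ring)] for j in range(k)]

def radial_replicas(ring, key, k=3):
    """k copies at evenly spaced offsets around the identifier space."""
    return [ring[successor(ring, (key + j * ID_SPACE // k) % ID_SPACE)]
            for j in range(k)]

if __name__ == "__main__":
    ring = sorted(node_id(f"node{i}") for i in range(20))
    key = node_id("some-data-item")
    print("circular:", circular_replicas(ring, key))
    print("radial:  ", radial_replicas(ring, key))
```

Under these assumptions, circular copies cluster on neighbouring nodes (cheap to keep consistent, easy to find during a lookup pass), while radial copies spread load across distant regions of the ring, which matches the storage-versus-search trade-off reported above.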
This document discusses geo-distributed parallelization of MapReduce jobs across multiple datacenters. It introduces GEO-PACT, a Hadoop-based framework that can efficiently process sequences of parallelization contract jobs on geo-distributed input data. GEO-PACT uses a group manager to determine optimal execution paths and job managers at each datacenter to execute tasks locally using Hadoop. It employs copy managers to transfer data between datacenters and aggregation managers to combine results. The goal is to optimize execution time by leveraging data locality across geographically distributed data sources.
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE (ijdpsjournal)
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node is no longer viable, leading to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first carried out a theoretical study of the architecture and operation of some DHT-MapReduce systems. Then, we compared the data collected on their load balancing and task allocation strategies by simulation.
Finally, the simulation results show that CLOAK-Reduce C5R5 replication provides better load balancing efficiency and MapReduce job submission, with 10% churn or no churn.
This document compares approaches to large-scale data analysis using MapReduce and parallel database management systems (DBMSs). It presents results from running a benchmark of tasks on an open-source MapReduce system (Hadoop) and two parallel DBMSs using a cluster of 100 nodes. The parallel DBMSs showed significantly better performance than MapReduce for the tasks, but took much longer to load data and tune executions. The document discusses architectural differences between the approaches and their performance implications.
Hadoop is an open source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle data at massive scale, Hadoop relies on the Hadoop Distributed File System (HDFS). Like most distributed file systems, HDFS faces a familiar problem of data sharing and availability among compute nodes, which often leads to decreased performance. This paper is an experimental evaluation of Hadoop's computing performance, carried out by designing a rack-aware cluster that uses Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access-count prediction using Lagrange's interpolation is adapted to fit the scenario. To validate this, experiments were conducted on a rack-aware cluster setup, which significantly reduced task completion time; however, once the volume of data being processed increases, there is a considerable cutback in computational speed due to the update cost. Further, the threshold at which the update cost and the replication factor balance is identified and presented graphically.
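The abstract mentions access-count prediction via Lagrange's interpolation; a minimal sketch of that idea follows, assuming the predictor simply extrapolates a file's recent per-window access counts to the next window. The window scheme, threshold and names are hypothetical, not the paper's implementation.

```python
# Minimal sketch: predict a file's next access count by Lagrange interpolation
# over its recent per-window access counts, then decide whether to add a
# replica. Threshold and window scheme are illustrative assumptions.

def lagrange_predict(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def should_add_replica(access_history, threshold=50):
    """access_history: access counts for the last few time windows."""
    xs = list(range(len(access_history)))            # window indices 0..n-1
    predicted = lagrange_predict(xs, access_history, len(access_history))
    return predicted > threshold, predicted

if __name__ == "__main__":
    history = [12, 20, 35, 55]                       # accesses per window
    add, pred = should_add_replica(history)
    print(f"predicted next-window accesses: {pred:.1f}, add replica: {add}")
```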
Parallel Data Processing with MapReduce: A Survey (Kyong-Ha Lee)
This document summarizes a survey on parallel data processing with MapReduce. It provides an overview of the MapReduce framework, including its architecture, key concepts of Map and Reduce functions, and how it handles parallel processing. It also discusses some inherent pros and cons of MapReduce, such as its simplicity but also performance limitations. Finally, it outlines approaches studied in recent literature to improve and optimize the MapReduce framework.
Survey of Parallel Data Processing in Context with MapReduce (cscpconf)
MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation with two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues such as data distribution, load balancing and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallel computing on large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications. The author describes the workflow of the MapReduce process and studies some important issues, such as fault tolerance, in more detail, along with an illustration of how MapReduce works. The data locality issue in heterogeneous environments can noticeably reduce MapReduce performance. The author addresses the placement of data across nodes so that each node carries a balanced data-processing load stored in parallel. Given a data-intensive application running on a Hadoop MapReduce cluster, the author shows how data placement is done in the Hadoop architecture, what role MapReduce plays in it, and how much data to store on each node to achieve improved data-processing performance.
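To make the two user-supplied functions concrete, here is a minimal, single-process sketch of the Map/shuffle/Reduce flow (a toy word count in plain Python, not Hadoop's actual API).

```python
# Toy, single-process illustration of the MapReduce programming model:
# the user supplies map_fn() and reduce_fn(); the framework groups by key.
from collections import defaultdict

def map_fn(_, line):
    """Map: emit (word, 1) for every word in a line of input."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one word."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)                 # stands in for the shuffle phase
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```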
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES (ijdpsjournal)
By programming both the data plane and the control plane, network operators can adapt their networks to their needs. Thanks to research over the past decade, this concept has become more formalized and more technologically feasible. However, since control plane programmability came first, it has already been successfully implemented in real networks and is beginning to pay off. Today, data plane programmability is evolving very rapidly to reach this level, attracting the attention of researchers and developers: designing data plane languages, developing applications on them, formalizing software switches and architectures that can run data plane code and applications, improving software switch performance, and so on. As the control plane and data plane become more open, many new innovations and technologies are emerging, but some experts warn that consumers may be confused about which of the many technologies to choose. This is a testament to how much innovation is emerging in networking. This paper outlines some emerging applications on the data plane and identifies opportunities for further improvement and optimization. Our observations show that most of the implementations are done in a test environment and have not been tested well enough in terms of performance, but there are many interesting works; for example, previous control plane solutions are being implemented in the data plane.
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ... (Kyong-Ha Lee)
This document proposes HadoopXML, a system for efficiently processing massive XML data and multiple twig pattern queries in parallel using Hadoop. Key features of HadoopXML include:
1) It partitions large XML files and processes them in parallel across nodes while preserving structural information.
2) It simultaneously processes multiple twig pattern queries with a shared input scan, without needing separate MapReduce jobs for each query.
3) It enables query processing tasks to share input scans and intermediate results like path solutions, reducing redundant processing and I/O.
4) It provides load balancing to fairly distribute twig join operations across nodes.
The International Journal of Engineering and Science (The IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT (ijwscjournal)
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the increasing demand to handle vast quantities of data are daunting and on the rise. One of the recent processing models offering a more efficient and intuitive way to rapidly process large amounts of data in parallel is MapReduce. It is a framework defining a template approach to programming for performing large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation across several processors, hiding the complexity of writing parallel and distributed code. This paper provides a comprehensive systematic review and analysis of large-scale dataset processing and dataset handling challenges and requirements in a cloud computing environment using the MapReduce framework and its open-source implementation, Hadoop. We define requirements for MapReduce systems to perform large-scale data processing, present the MapReduce framework and one implementation of it on Amazon Web Services, and conclude with an experiment running a MapReduce system in a cloud environment. The paper argues that MapReduce is one of the best techniques for processing large datasets and can help developers perform parallel and distributed computation in a cloud environment.
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners (ijcsit)
This document summarizes a research paper that proposes using in-node combiners to improve the performance of Hadoop MapReduce jobs. It discusses how MapReduce jobs are I/O intensive and describes two common bottlenecks: during the map phase when data is loaded from disks, and during the shuffle phase when intermediate results are transferred over the network. The paper introduces an in-node combiner approach to optimize I/O by locally aggregating intermediate results within nodes to reduce network traffic between mappers and reducers. It evaluates this approach through an experiment counting word occurrences in Twitter messages.
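The in-node combining idea can be illustrated with a small in-mapper aggregation sketch: partial counts are accumulated in a local dictionary and emitted once at the end, instead of emitting one record per occurrence. This is a simplified Python model of the technique, not the paper's Hadoop code.

```python
# Simplified model of local (in-node) combining for word count: instead of
# emitting ("word", 1) per occurrence, each node aggregates locally and emits
# one ("word", partial_count) pair, cutting shuffle traffic to the reducers.
from collections import Counter

def mapper_with_local_combine(lines):
    local = Counter()
    for line in lines:
        local.update(line.lower().split())
    return list(local.items())        # one record per distinct word on this node

def reducer(emitted_from_all_nodes):
    total = Counter()
    for partials in emitted_from_all_nodes:
        for word, count in partials:
            total[word] += count
    return dict(total)

if __name__ == "__main__":
    node1 = mapper_with_local_combine(["to be or not to be"])
    node2 = mapper_with_local_combine(["to see or not to see"])
    print(reducer([node1, node2]))
```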
Large amounts of data are produced daily in various fields such as science, economics, engineering and health. The main challenge of pervasive computing is to store and analyze these large amounts of data. This has led to the need for usable and scalable data applications and storage clusters. In this article, we examine the Hadoop architecture developed to deal with these problems. The Hadoop architecture consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which enable storage and computation on a set of commodity computers. In this study, a Hadoop cluster consisting of four nodes was created. The Pi and Grep MapReduce applications, which show the effect of different data sizes and numbers of nodes in the cluster, were run and their results examined.
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al... (IRJET Journal)
This document presents two variations of a job-driven scheduling scheme called JOSS for efficiently executing MapReduce jobs on remote outsourced data across multiple data centers. The goal of JOSS is to improve data locality for map and reduce tasks, avoid job starvation, and improve job performance. Extensive experiments show that the two JOSS variations, called JOSS-T and JOSS-J, outperform other scheduling algorithms in terms of data locality and network overhead without significant overhead. JOSS-T performs best for workloads of small jobs, while JOSS-J provides the shortest workload time for jobs of varying sizes distributed across data centers.
A New Architecture for Group Replication in Data Grid (Editor IJCATR)
Nowadays, grid systems are a vital technology for running programs with high performance and solving large-scale problems in science, engineering and business. In grid systems, heterogeneous computational resources and data must be shared between independent organizations that are geographically scattered. A data grid is a type of grid that brings together computational and storage resources. Data replication is an efficient way for a data grid to obtain high performance and high availability by storing numerous replicas in different locations, e.g. grid sites. In this research, we propose a new architecture for dynamic group data replication. In our architecture, we added two components to the OptorSim architecture: a Group Replication Management (GRM) component and a Management of Popular Files Group (MPFG) component. OptorSim was developed by the European DataGrid project to evaluate replication algorithms. Using this architecture, the group of popular files is replicated to grid sites at the end of each predefined time interval, as sketched below.
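As a rough illustration of interval-driven group replication: a manager could track access counts per file group and replicate the most popular group at the end of every interval. The component names come from the abstract, but the logic, thresholds and data structures below are assumptions, not the paper's design.

```python
# Rough sketch (assumed logic): track per-group access counts during a time
# interval and, when the interval ends, replicate the most popular file group
# to the sites that do not yet hold it.
from collections import defaultdict

class GroupReplicationManager:
    def __init__(self, sites):
        self.sites = sites                      # site -> set of group names held
        self.access_counts = defaultdict(int)   # group -> accesses this interval

    def record_access(self, group):
        self.access_counts[group] += 1

    def end_of_interval(self):
        if not self.access_counts:
            return None
        popular = max(self.access_counts, key=self.access_counts.get)
        for groups in self.sites.values():
            groups.add(popular)                 # stand-in for an actual transfer
        self.access_counts.clear()
        return popular

if __name__ == "__main__":
    grm = GroupReplicationManager({"siteA": {"g1"}, "siteB": set()})
    for _ in range(5):
        grm.record_access("g1")
    grm.record_access("g2")
    print("replicated group:", grm.end_of_interval())
    print(grm.sites)
```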
A New Multi-Dimensional Hyperbolic Structure for Cloud Service Indexing (ijdms)
Nowadays, cloud structures use simple key-value data processing, which cannot support similarity search effectively because of the lack of effective index structures; moreover, as the number of dimensions grows, existing tree-like index structures run into the "curse of dimensionality". In this paper, we define a new cloud computing architecture in which the global index over storage nodes is based on a Distributed Hash Table (DHT). To reduce the computational cost, the different services are stored on a hyperbolic tree using virtual coordinates, so that greedy routing based on the properties of hyperbolic space improves query storage and retrieval performance. We then implement and evaluate our cloud indexing structure based on a hyperbolic tree using virtual coordinates taken in the hyperbolic plane. Our experimental results, compared with other cloud systems, show that our solution ensures consistency and scalability for the Cloud platform.
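Greedy routing over virtual hyperbolic coordinates can be sketched as follows: each node forwards a query to the neighbour whose coordinates are closest, in hyperbolic distance on the Poincare disk, to the destination. The toy topology and coordinates are illustrative assumptions, not the paper's index.

```python
# Sketch of greedy routing over virtual coordinates in the Poincare disk:
# each hop forwards to the neighbour closest (in hyperbolic distance) to the
# target. The toy topology and coordinates are illustrative assumptions.
import math

def hyperbolic_distance(p, q):
    """Distance between two points of the open unit (Poincare) disk."""
    (x1, y1), (x2, y2) = p, q
    diff2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    denom = (1 - x1 ** 2 - y1 ** 2) * (1 - x2 ** 2 - y2 ** 2)
    return math.acosh(1 + 2 * diff2 / denom)

def greedy_route(coords, neighbours, src, dst):
    path, current = [src], src
    while current != dst:
        nxt = min(neighbours[current],
                  key=lambda n: hyperbolic_distance(coords[n], coords[dst]))
        if hyperbolic_distance(coords[nxt], coords[dst]) >= \
           hyperbolic_distance(coords[current], coords[dst]):
            break                     # stuck in a local minimum: greedy failure
        path.append(nxt)
        current = nxt
    return path

if __name__ == "__main__":
    coords = {"a": (0.0, 0.0), "b": (0.4, 0.1), "c": (0.7, 0.2)}
    neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(greedy_route(coords, neighbours, "a", "c"))   # -> ['a', 'b', 'c']
```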
An efficient data mining framework on hadoop using java persistence api (João Gabriel Lima)
This document describes a proposed framework for efficient data mining on Hadoop using Java Persistence API (JPA) and MySQL Cluster. The framework indexes data using ZSCORE binning and inverted indexing to enable efficient querying. It then stores mining models and results persistently in MySQL Cluster using JPA for object-relational mapping. The document evaluates the performance of the framework by implementing a decision tree algorithm on Hadoop and comparing the JPA approach to a traditional JDBC implementation.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
Implementation of p-PIC algorithm in MapReduce to handle big data (eSAT Publishing House)
This document presents an implementation of the p-PIC clustering algorithm using the MapReduce framework to handle big data. P-PIC is a parallel version of the Power Iteration Clustering (PIC) algorithm that is able to cluster large datasets in a distributed environment. The document first provides background on PIC and challenges with scaling to big data. It then describes how p-PIC addresses these challenges using MPI for parallelization. The design of implementing p-PIC within MapReduce is presented, including the map and reduce functions. Experimental results on synthetic datasets up to 100,000 records show that p-PIC using MapReduce has increased performance and scalability compared to the original p-PIC implementation using MPI.
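The core of PIC, which p-PIC parallelizes, is a truncated power iteration on the row-normalized affinity matrix followed by clustering of the resulting one-dimensional embedding. A compact numpy sketch follows; the synthetic data, stopping rule and simple 1-D k-means are illustrative assumptions, not the paper's implementation.

```python
# Compact sketch of Power Iteration Clustering (the algorithm p-PIC
# parallelizes): power iteration on the row-normalized affinity matrix,
# then 1-D k-means on the resulting embedding.
import numpy as np

def pic_embedding(affinity, iterations=50):
    w = affinity / affinity.sum(axis=1, keepdims=True)    # W = D^-1 A
    v = np.random.rand(affinity.shape[0])
    v /= np.abs(v).sum()
    for _ in range(iterations):
        v = w @ v
        v /= np.abs(v).sum()                              # keep the vector scaled
    return v

def kmeans_1d(values, k, iterations=20):
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iterations):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    affinity = np.exp(-dists ** 2)                         # Gaussian similarity
    print(kmeans_1d(pic_embedding(affinity), k=2))
```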
Big Data Analysis and Its Scheduling Policy – Hadoop (IOSR Journals)
This document discusses scheduling policies in Hadoop for big data analysis. It describes the default FIFO scheduler in Hadoop as well as alternative schedulers like the Fair Scheduler and Capacity Scheduler. The Fair Scheduler was developed by Facebook to allocate resources fairly between jobs by assigning them to pools with minimum guaranteed capacities. The Capacity Scheduler allows multiple tenants to securely share a large cluster while giving each organization capacity guarantees. It also supports hierarchical queues to prioritize sharing unused resources within an organization.
Bitmap Indexes for Relational XML Twig Query Processing (Kyong-Ha Lee)
This document proposes bitmap indexes to improve the processing of XML twig queries on shredded XML data stored in relational database tables. It introduces several bitmap indexes built on the tag, path, and tag+level domains that provide quick access to relevant XML elements. A hybrid "bitTwig" index is also presented that can find all instances matching a twig query using only a pair of cursor-enabled bitmap indexes, without accessing the actual XML data stored in the tables. Experiments show this bitTwig index outperforms existing indexing approaches in most cases.
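The use of bitmap indexes for quick element lookup can be illustrated with a small sketch: one bitmap per tag (or level, or path) marks which shredded element rows match, and candidate rows for a query step are found by bitwise AND. This is a generic illustration of the indexing idea, not the paper's bitTwig algorithm.

```python
# Generic illustration of bitmap indexes over shredded XML rows: one bitmap
# per key marks matching rows, and bitwise AND intersects conditions.
from collections import defaultdict

class BitmapIndex:
    def __init__(self):
        self.bitmaps = defaultdict(int)       # key -> integer used as a bitset

    def add(self, key, row_id):
        self.bitmaps[key] |= 1 << row_id

if __name__ == "__main__":
    tag_index, level_index = BitmapIndex(), BitmapIndex()
    rows = [("book", 1), ("title", 2), ("author", 2), ("title", 3)]
    for row_id, (tag, level) in enumerate(rows):
        tag_index.add(tag, row_id)
        level_index.add(level, row_id)
    # Rows whose tag is "title" AND whose level is 2: intersect the two bitmaps.
    combined = tag_index.bitmaps["title"] & level_index.bitmaps[2]
    print([i for i in range(len(rows)) if combined >> i & 1])   # -> [1]
```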
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for data storage, which partitions data into blocks and replicates them across nodes for fault tolerance. The master node tracks where data blocks are stored and worker nodes execute tasks like mapping and reducing data. Hadoop provides scalability and fault tolerance but is slower for iterative jobs compared to Spark, which keeps data in memory. The Lambda architecture also informs Hadoop's ability to handle batch and speed layers separately for scalability.
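HDFS's commonly described default replica placement, one copy on the writer's node, a second on a node in a different rack, and a third on another node of that second rack, can be sketched as follows. The cluster model and the fallback rule are simplified assumptions for illustration.

```python
# Simplified sketch of HDFS-style replica placement for one block.
import random

def place_block(cluster, writer_node):
    """cluster: dict rack -> list of nodes. Return three target nodes."""
    local_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_racks = [r for r in cluster if r != local_rack and cluster[r]]
    remote_rack = random.choice(remote_racks)
    first = writer_node                                   # 1st copy: writer's node
    second = random.choice(cluster[remote_rack])          # 2nd copy: other rack
    third_candidates = [n for n in cluster[remote_rack] if n != second]
    fallback = [n for n in cluster[local_rack] if n != first]
    third = random.choice(third_candidates or fallback)   # 3rd copy: same remote rack
    return [first, second, third]

if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
    print(place_block(cluster, "n1"))
```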
International Refereed Journal of Engineering and Science (IRJES) (irjes)
A leading international journal for the publication of new ideas, state-of-the-art research results and fundamental advances in all aspects of engineering and science. IRJES is an open access, peer-reviewed international journal whose primary objective is to provide the academic community and industry with a venue for the submission of original research and applications.
This document is a curriculum seminar report on Hadoop submitted by a computer science student to their professor. It includes sections on the need for new technologies to handle large and diverse datasets, the history and origin of Hadoop, descriptions of the key Hadoop components like HDFS and MapReduce, and comparisons of Hadoop to RDBMS systems and discussions of its disadvantages. The report provides an overview of Hadoop for educational purposes.
Big Data Storage System Based on a Distributed Hash Tables System (ijdms)
Big Data is unavoidable, given that digital media is the predominant form of communication in consumers' daily lives. Controlling what is at stake and the quality of the data must be a priority so as not to distort the strategies derived from their processing with the aim of generating profit. To achieve this, a great deal of research has been carried out by companies and several platforms have been created. MapReduce, one of the enabling technologies, has proven to be applicable to a wide range of fields. However, despite its importance, recent work has shown its limitations, and Distributed Hash Tables (DHTs) have been used to remedy them. Thus, this document not only analyses DHT and MapReduce implementations in general, but also provides a description of a DHT model as well as some guidelines for planning future research.
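The basic put/get behaviour that such a DHT-based storage system builds on can be sketched with consistent hashing: keys and nodes share one identifier space, and each key is stored on the first node clockwise from its hash. This is a generic illustration, not the system described in the paper.

```python
# Generic consistent-hashing DHT sketch: keys and nodes hash into one
# identifier space; each key lives on the first node clockwise from its id.
import hashlib
from bisect import bisect, insort

class SimpleDHT:
    def __init__(self):
        self.ring = []          # sorted node identifiers
        self.nodes = {}         # node id -> local key/value store

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

    def join(self, node_name: str):
        nid = self._hash(node_name)
        insort(self.ring, nid)
        self.nodes[nid] = {}

    def _owner(self, key: str) -> int:
        idx = bisect(self.ring, self._hash(key)) % len(self.ring)
        return self.ring[idx]                    # first node clockwise

    def put(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)

if __name__ == "__main__":
    dht = SimpleDHT()
    for name in ("node-a", "node-b", "node-c"):
        dht.join(name)
    dht.put("user:42", {"name": "Ada"})
    print(dht.get("user:42"))
```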
This document provides an introduction and overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses MapReduce and HDFS to parallelize workloads and store data redundantly across nodes to solve issues around hardware failure and combining results. Key aspects covered include how HDFS distributes and replicates data, how MapReduce isolates processing into mapping and reducing functions to abstract communication, and how Hadoop moves computation to the data to improve performance.
International Journal of Distributed and Parallel systems (IJDPS) (ijdpsjournal)
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Design Issues and Challenges of Peer-to-Peer Video on Demand System (cscpconf)
P2P media streaming and file downloading are among the most popular applications on the Internet. These systems reduce server load and provide scalable content distribution. P2P networking is a new paradigm for building distributed applications. This paper describes the design requirements for P2P media streaming and compares live and video-on-demand systems based on their architecture. We also describe and study the traditional approaches to P2P streaming systems, their design issues and challenges, and current approaches for providing P2P VoD services.
CLOAK-REDUCE LOAD BALANCING STRATEGY FOR MAPREDUCE (ijcsit)
The advent of Big Data has seen the emergence of new processing and storage challenges. These challenges are often solved by distributed processing. Distributed systems are inherently dynamic and unstable, so it is realistic to expect that some resources will fail during use. Load balancing and task scheduling are an important step in determining the performance of parallel applications, hence the need to design load balancing algorithms adapted to grid computing.
In this paper, we propose a dynamic and hierarchical load balancing strategy at two levels: intra-scheduler load balancing, to avoid using the large-scale communication network, and inter-scheduler load balancing, to regulate the load of the whole system. The strategy improves the average response time of CLOAK-Reduce application tasks with minimal communication. We first focus on three performance indicators, namely response time, process latency and running time of MapReduce tasks.
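The two-level idea can be sketched roughly as follows: each scheduler first rebalances among its own workers (intra-scheduler), and only residual imbalance is handed off between schedulers (inter-scheduler). The thresholds, data structures and policy below are assumptions for illustration, not the CLOAK-Reduce implementation.

```python
# Hedged sketch of a two-level (hierarchical) load-balancing policy.
def rebalance_intra(workers, threshold=2):
    """workers: dict worker -> list of tasks. Move tasks from the most to the
    least loaded worker while their gap exceeds the threshold."""
    while True:
        busiest = max(workers, key=lambda w: len(workers[w]))
        idlest = min(workers, key=lambda w: len(workers[w]))
        if len(workers[busiest]) - len(workers[idlest]) <= threshold:
            return
        workers[idlest].append(workers[busiest].pop())

def rebalance_inter(schedulers, threshold=4):
    """schedulers: dict scheduler -> workers dict. After local balancing,
    move tasks from overloaded schedulers to underloaded ones."""
    for workers in schedulers.values():
        rebalance_intra(workers)
    loads = {s: sum(len(t) for t in w.values()) for s, w in schedulers.items()}
    hot, cold = max(loads, key=loads.get), min(loads, key=loads.get)
    while loads[hot] - loads[cold] > threshold:
        donor = max(schedulers[hot], key=lambda w: len(schedulers[hot][w]))
        receiver = min(schedulers[cold], key=lambda w: len(schedulers[cold][w]))
        schedulers[cold][receiver].append(schedulers[hot][donor].pop())
        loads[hot] -= 1
        loads[cold] += 1

if __name__ == "__main__":
    schedulers = {"s1": {"w1": list(range(9)), "w2": []},
                  "s2": {"w3": [1], "w4": []}}
    rebalance_inter(schedulers)
    print(schedulers)
```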
This document discusses a proposed data-aware caching framework called Dache that could be used with big data applications built on MapReduce. Dache aims to cache intermediate data generated during MapReduce jobs to avoid duplicate computations. When tasks run, they would first check the cache for existing results before running the actual computations. The goal is to improve efficiency by reducing redundant work. The document outlines the objectives and scope of extending MapReduce with Dache, provides background on MapReduce and Hadoop, and concludes that initial experiments show Dache can eliminate duplicate tasks in incremental jobs.
There is a growing trend of applications that have to handle huge volumes of data. However, analysing such data is a very challenging problem today. Several techniques can be considered for such data: technologies like Grid Computing, Volunteer Computing and RDBMSs are potential options, and Hadoop, a tool still in its growth phase, can handle such data as well. We survey all of these techniques to identify a suitable approach for managing and working with Big Data.
Introduction to Big Data and Hadoop using Local Standalone Mode (inventionjournals)
Big Data is a term for data sets so large and complex that traditional data processing applications are inadequate to deal with them. The term often refers simply to the use of predictive analytics and other methods that extract value from data. Big data is generally a collection of large datasets that cannot be processed using traditional computing techniques; it is not purely data, but rather a complete subject involving various tools, techniques and frameworks, and it can be any collection whose size exceeds the capability of conventional data management methods. Hadoop is a distributed framework used to handle such large amounts of data, covering both storage and processing: an open-source software framework for distributed storage and processing of big data sets on computer clusters built from commodity hardware. HDFS was built to support high-throughput, streaming reads and writes of extremely large files, and Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data. The word count example reads text files and counts how often words occur: the input is a set of text files and the result is a word count file, each line of which contains a word and the count of how often it occurred, separated by a tab.
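The word count example described above is often written as a mapper/reducer pair in the streaming style: the mapper emits tab-separated "word 1" lines, the framework sorts them by key, and the reducer sums consecutive counts. The sketch below drives both halves locally to stay self-contained; with Hadoop Streaming they would run as separate scripts.

```python
# Word count in the streaming style: the mapper emits "word<TAB>1" lines,
# the sort stage groups them by key, and the reducer sums consecutive counts.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    def key(line):
        return line.split("\t", 1)[0]
    for word, group in groupby(sorted_lines, key=key):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    intermediate = sorted(mapper(text))          # stands in for the shuffle/sort
    sys.stdout.write("\n".join(reducer(intermediate)) + "\n")
```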
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Simple Load Rebalancing For Distributed Hash Tables In Cloud (IOSR Journals)
This document summarizes a research paper on simple load rebalancing for distributed hash tables in cloud computing environments. It discusses how distributed file systems partition files into chunks stored across nodes to enable parallel processing, but this can lead to unbalanced loads. The paper proposes a fully distributed load rebalancing algorithm where underloaded nodes identify and "steal" work from overloaded nodes to improve load balancing without relying on a centralized coordinator. Simulation results show the distributed algorithm outperforms prior centralized and distributed approaches in terms of load imbalance, movement costs, and overhead.
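The "underloaded nodes steal work from overloaded nodes" idea can be illustrated with a small sketch in which each light node takes chunks from the currently heaviest node until both sides are near the average load. The thresholds and the global view used here are simplifying assumptions; the paper's algorithm works without a central coordinator.

```python
# Small sketch of chunk rebalancing: each underloaded node takes file chunks
# from the currently heaviest node until both are close to the average load.
def rebalance(nodes, slack=1):
    """nodes: dict node -> list of chunk ids. Returns the moves performed."""
    average = sum(len(c) for c in nodes.values()) / len(nodes)
    moves = []
    for light in sorted(nodes, key=lambda n: len(nodes[n])):
        while len(nodes[light]) < average - slack:
            heavy = max(nodes, key=lambda n: len(nodes[n]))
            if len(nodes[heavy]) <= average + slack:
                break
            chunk = nodes[heavy].pop()
            nodes[light].append(chunk)
            moves.append((chunk, heavy, light))
    return moves

if __name__ == "__main__":
    nodes = {"n1": list(range(10)), "n2": [10], "n3": []}
    print(rebalance(nodes))
    print({n: len(c) for n, c in nodes.items()})
```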
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER (ijdpsjournal)
With more and more computers interconnected online, the data shared by users is multiplying daily. As a result, the amount of data to be processed by dedicated servers rises very quickly. However, this instantaneous increase in the volume of data to be processed runs into latency during processing, which calls for a model to manage the distribution of tasks across several machines. This article presents a study of load balancing for large data sets on a cluster of Hadoop nodes. In this paper, we use MapReduce to implement parallel programming and YARN to monitor task execution and submission in a node cluster.
Data-Intensive Technologies for Cloud Computing (huda2018)
This document provides an overview of data-intensive computing technologies for cloud computing. It discusses key concepts like data-parallelism and MapReduce architectures. It also summarizes several data-intensive computing systems including Google MapReduce, Hadoop, and LexisNexis HPCC. Hadoop is an open source implementation of MapReduce while HPCC provides distinct processing environments for batch and online query processing using its proprietary ECL programming language.
This document provides an introduction to Hadoop, including its programming model, distributed file system (HDFS), and MapReduce framework. It discusses how Hadoop uses a distributed storage model across clusters, racks, and data nodes to store large datasets in a fault-tolerant manner. It also summarizes the core components of Hadoop, including HDFS for storage, MapReduce for processing, and YARN for resource management. Hadoop provides a scalable platform for distributed processing and analysis of big data.
Performance evaluation and estimation model using regression method for hadoo...redpel dot com
Performance evaluation and estimation model using regression method for hadoop word count.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
This document discusses Hadoop and its core components HDFS and MapReduce. It provides an overview of how Hadoop addresses the challenges of big data by allowing distributed processing of large datasets across clusters of computers. Key points include: Hadoop uses HDFS for distributed storage and MapReduce for distributed processing; HDFS works on a master-slave model with a Namenode and Datanodes; MapReduce utilizes a map and reduce programming model to parallelize tasks. Fault tolerance is built into Hadoop to prevent single points of failure.
This document provides an overview of Hadoop and how it addresses the challenges of big data. It discusses how Hadoop uses a distributed file system (HDFS) and MapReduce programming model to allow processing of large datasets across clusters of computers. Key aspects summarized include how HDFS works using namenodes and datanodes, how MapReduce leverages mappers and reducers to parallelize processing, and how Hadoop provides fault tolerance.
Similar a MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION (20)
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Software Engineering and Project Management - Introduction, Modeling Concepts...Prakhyath Rai
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
International Journal of Database Management Systems (IJDMS) Vol.13, No.4, August 2021
DOI: 10.5121/ijdms.2021.13402
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
Mamadou Diarra and Telesphore Tiendrebeogo
Department of Mathematics and Computer Science, Nazi Boni University, Bobo-Dioulasso, Burkina Faso
ABSTRACT
Distributed databases and data replication are effective ways to increase the accessibility and reliability of
un-structured, semi-structured and structured data to extract new knowledge. Replications offer better
performance and greater availability of data. With the advent of Big Data, new storage and processing
challenges are emerging.
To meet these challenges, Hadoop and DHTs compete in the storage domain and MapReduce and others in
distributed processing, with their strengths and weaknesses.
We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT. We
evaluate their performance through a comparative study of data from simulations. The results show that
radial replication is better in storage, unlike circular replication, which gives better search results.
KEYWORDS
Replication, Big Data, MapReduce, CLOAK DHT, Load balancing
1. INTRODUCTION
As the volume of information has exploded, new areas of technology have emerged that offer an
alternative to traditional database and analysis solutions. Relying on a few servers with very large
storage capacities and processing power is no longer feasible at reasonable cost.
This study focuses on a distributed processing approach to overcome the limitations of Hadoop
MapReduce, and to escape the cost constraints imposed by the innovation needs driven by Big Data.
Our objective is to design a distributed data processing model inspired by the CLOAK DHT [1]
[2] [3] and MapReduce [4] [5] [6].
We will evaluate its performance through simulations by comparing it to existing solutions.
Our study focuses on the following aspects:
1. Our first contribution is a methodology to evaluate the replication mechanisms of the
CLOAK DHT. Our approach is based on a study of some performance indicators, a
simulator and an analysis of the collected measurements.
2. Our second contribution consists in comparing the circular and radial replication
mechanisms offered by the CLOAK DHT.
3. The third contribution consists in studying the quality of the results produced by the
replication analysis and proposing an adequate mechanism for our model.
Our article is organized as follows:
1. First, we discuss related work;
2. Second, we describe our model of the distributed processing platform;
3. Third, we analyse the results of the different simulations.
2. RELATED WORKS
2.1. MapReduce Platform
Popularised by Jeffrey Dean and Sanjay Ghemawat, MapReduce is a massively parallel
programming model adapted to the processing of very large quantities of data, based on two
main stages, Map and Reduce. Its working principle consists of splitting the initial volume of
data V into smaller volumes vi of (key, value) pairs, which are handled separately on several
machines [4] [5] [6].
Map (InputKey k, InputValue v) → (OutputKey * IntermediateValue list)
Reduce (OutputKey, IntermediateValue list) → OutputValue list
Figure 1. MapReduce Execution Flow [7]
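To make the two stages above concrete, here is a minimal, self-contained word-count sketch of the MapReduce model in Python; the function names and the in-memory shuffle are illustrative only and are not part of Hadoop or of CLOAK-Reduce.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    # Map(InputKey k, InputValue v) -> (OutputKey, IntermediateValue) pairs
    for word in input_value.split():
        yield word, 1

def reduce_fn(output_key, intermediate_values):
    # Reduce(OutputKey, IntermediateValue list) -> OutputValue list
    return [sum(intermediate_values)]

def run_mapreduce(documents):
    # Shuffle phase: group intermediate values by key. A real framework
    # partitions the keys across machines instead of using one dictionary.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

if __name__ == "__main__":
    docs = {"d1": "big data map reduce", "d2": "map reduce on a dht"}
    print(run_mapreduce(docs))  # {'big': [1], 'data': [1], 'map': [2], ...}
```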
2.2. Typology of a Hadoop cluster
Apache Hadoop is an ecosystem of the Apache Software Foundation [4]. The Hadoop Distributed
File System (HDFS) is the distributed file storage system of Hadoop. It has a master/slave
architecture, allowing native file storage (image, video, text files, etc.) by spreading the load over
several servers. An HDFS cluster architecture is based on two essential groups of nodes [4], [5]:
1. Masters: NameNode, Job Tracker and SecondaryNode:
• The NameNode: manages the distributed blocks of data.
• The Job Tracker: assigns MapReduce functions to the various “Task Trackers”.
• The SecondaryNode: periodically makes a copy of the NameNode's data and replaces it in case of failure.
2. Slaves: DataNodes and Task Trackers:
• The Task Tracker: executes the MapReduce tasks assigned to it.
• The DataNode: stores the data blocks.
2.3. DHT – MapReduce
2.4. Distributed Hash Tables
A DHT is a technology for constructing a hash table in a distributed system, where each piece of
data is associated with a key and is distributed over the network. DHTs provide a consistent hash
function and efficient algorithms for storing, STORE(key, value), and locating, LOOKUP(key),
the node responsible for a given (key, value) pair.
It is therefore important to specify that only a reference of the declared object (Object IDentifier
(OID)) is stored in the DHT. Concerning the search for an object, only by routing the lookup key
can the set of associated values be found [1], [2], [7].
In practice, DHTs are implemented as “overlay” networks on top of the physical networks, thus
reducing the size of the table at each node while considerably increasing the efficiency of the
lookup algorithm.
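A minimal sketch of the STORE/LOOKUP interface described above, collapsing the whole overlay into a single in-memory table; the class and method names are illustrative, and only a reference (OID) of the object is stored, as in the CLOAK DHT.

```python
import hashlib

class ToyDHT:
    """Single-table stand-in for a DHT: only object references (OIDs) are
    stored under consistently hashed keys; routing is omitted."""

    def __init__(self):
        self.table = {}  # key -> set of object references (OIDs)

    @staticmethod
    def make_key(name: str) -> str:
        # Consistent hash of the object name (CLOAK uses SHA-512, see 2.7.2).
        return hashlib.sha512(name.encode()).hexdigest()

    def store(self, key: str, oid: str) -> None:
        # STORE(key, value): register a reference under the key.
        self.table.setdefault(key, set()).add(oid)

    def lookup(self, key: str) -> set:
        # LOOKUP(key): return the set of values associated with the key.
        return self.table.get(key, set())

# Usage: store an object identifier and retrieve it again by its key.
dht = ToyDHT()
key = ToyDHT.make_key("my-big-data-object")
dht.store(key, "OID-42")
assert dht.lookup(key) == {"OID-42"}
```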
DHTs have been widely studied because of their attractive properties: efficiency and simplicity
with Chord [8], controlled data placement with SkipNet [9], efficient routing and object location
with Pastry [10], and better consistency and reliable performance with Kademlia [11].
2.5. Architecture and Components of DHT-MapReduce
Although the SecondaryNode is a solution in the second version of Hadoop to take over in case
of NameNode failure, other solutions were proposed such as:
• Distributed MapReduce using DHT: [12] This is a framework inspired by the P2P-based
MapReduce architecture built on JXTA [13]. The basic idea is to provide another master
node as a backup node. When the master node crashes, another master node (the backup
node) can take over the remaining tasks of the job. In this distributed framework
approach, the multiple master nodes are not only backup nodes, but also act as individual
master nodes and work collaboratively.
• ChordMR: [14] ChordMR is a MapReduce system designed to use a Chord DHT to
manage master node churn and failure in a decentralized way. The architecture of
ChordMR includes three basic roles: User, Master and Slave. The User nodes are
responsible for submitting jobs. The Master nodes, organised in a Chord DHT, are
responsible for the allocation and execution of tasks. The Slave nodes, which are
responsible for executing the Map/Reduce jobs, keep their traditional function as on
the Hadoop platform. ChordMR has no scheduler or central management node, unlike
HDFS and Apache Cassandra [15].
• ChordReduce: [16] ChordReduce is a simple, distributed and very robust architecture for
MapReduce. It uses Chord DHT to act as a fully distributed topology for MapReduce,
eliminating the need to assign explicit roles to nodes or to have a scheduler or coordinator.
ChordReduce does not need to assign specific nodes to the backup job; nodes back up their
jobs using the same process as for all other data sent around the ring. Finally, intermediate
data returns to a specified hash address, rather than to a specific hash node, eliminating any
single point of failure in the network.
• Chord-based MapReduce: [17] The present invention relates to a distributed file system and
more particularly to a Chord DHT based MapReduce system and method capable of
achieving load balancing and increasing a cache hit rate by managing data in a double-
layered ring structure having a file system layer and an in-memory cache layer based on a
Chord DHT.
4. International Journal of Database Management Systems (IJDMS) Vol.13, No.4, August 2021
12
2.6. Data replication
The objective of data replication is to increase data availability, reduce data access costs and
provide greater fault tolerance [18].
By replicating datasets to an off-site location, organisations can more easily recover data after
catastrophic events, such as outages, natural disasters, human errors or cyberattacks.
2.7. CLOAK DHT
The CLOAK DHT, the basis of our work, relies on a hyperbolic tree constructed in hyperbolic
space (a hyperboloid) and projected stereographically onto a Poincaré disk (Euclidean plane) of
radius 1 centred at the origin, where the tree nodes use a virtual coordinate system [1].
2.7.1. Anatomy of Writes and Reads
The CLOAK DHT uses local addressing and greedy routing, carried out using virtual hyperbolic
distances. When a node η seeks to join the network, it calculates the distance between its
correspondent µ and each of its own neighbours, and finally selects the neighbour v with the
shortest hyperbolic distance.
The distance between any two points u and v in the hyperbolic space H is determined by
equation 1:
Equation 1. Distance between any two points u and v
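The body of Equation 1 does not appear in this copy of the text; for reference, the standard hyperbolic distance between two points u and v of the Poincaré disk model, which we assume is the form intended here, is

$$d(u, v) = \operatorname{arcosh}\!\left(1 + \frac{2\,\lVert u - v \rVert^{2}}{\left(1 - \lVert u \rVert^{2}\right)\left(1 - \lVert v \rVert^{2}\right)}\right)$$

where ‖·‖ is the Euclidean norm on the unit disk.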
DHTs store information in pairs (key, value) in a balanced way on all nodes of the overlay
network.
However, in the CLOAK DHT, not all nodes in the overlay network necessarily store pairs. Only
the elected nodes store pairs.
In the CLOAK DHT, storage requests are named STORAGE and search requests are named
LOOKUP. Figure 3 shows how and where a given pair is stored in the overlay.
A node's depth in the addressing tree is defined as the number of ascending nodes to be traversed
before reaching the root of the tree (including the root itself). When the overlay network is created,
a maximum depth pmax is chosen. This value is defined as the maximum depth that any storing
node in the tree can have. All nodes with a depth less than or equal to pmax can store (key, value)
pairs and thus act as storage nodes. All nodes with a depth greater than pmax do not store pairs.
Figure 1. DHT Based on Hyperbolic Geometry
2.7.2. Storage Mechanism
When a node wants to send a storage request, the first server it is connected to treats the request
as a Big Data object and therefore generates an OID from its name with the SHA-512 algorithm.
It divides the 512-bit key into 16 equal-sized 32-bit subkeys for redundant storage. The division
of the key into 16 subkeys is arbitrary and could be increased or decreased depending on the
redundancy required. The node then selects the first subkey and calculates an angle by a linear
transformation given by equation 3. From this angle, the node determines a virtual point v on the
Poincaré disk according to equation 4. Figure 3 shows how and where a node is stored in the
overlay network [2].
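Since equations 3 and 4 are not reproduced in this copy, the following sketch only illustrates the mechanism: the SHA-512 digest is split into sixteen 32-bit subkeys, and each subkey is mapped to an angle and a virtual point of the Poincaré disk. The specific linear transformation and the fixed radius used below are assumptions for illustration, not the paper's exact formulas.

```python
import hashlib
import math

def subkeys_from_name(name: str, parts: int = 16):
    # Hash the object name with SHA-512 (512 bits = 64 bytes) and split the
    # digest into `parts` equal-sized subkeys (16 x 32 bits by default).
    digest = hashlib.sha512(name.encode()).digest()
    width = len(digest) // parts  # 4 bytes per subkey
    return [int.from_bytes(digest[i * width:(i + 1) * width], "big")
            for i in range(parts)]

def storage_point(subkey: int, radius: float = 0.9):
    # Map a 32-bit subkey to an angle (assumed linear transformation, standing
    # in for equation 3) and then to a virtual point of the Poincaré disk
    # (standing in for equation 4); the radius 0.9 is arbitrary.
    angle = (subkey / 2 ** 32) * 2 * math.pi
    return (radius * math.cos(angle), radius * math.sin(angle))

# Usage: the first subkey gives the primary storage point of the object,
# and the remaining fifteen give the circular replicas (see 2.7.3).
points = [storage_point(k) for k in subkeys_from_name("my-big-data-object")]
print(points[0])
```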
2.7.3. Replication Mechanisms
CLOAK DHT has two data replication mechanisms: circular and radial:
• Circular Replication: To have replicas, the OID is split into keys by the SHA-512
algorithm, so the 512-bit key is split into 16 subkeys of 32 bits. Thus, 16 different storage
angles are used and this improves the search success rate as well as the homogeneity of the
distribution of peers on the elected nodes. The division of the key into rc = 16 subkeys is
arbitrary and can be increased or decreased depending on the need for redundancy. We
refer to this redundancy mechanism as circular replication. In general, from a pair named P
and the hash of its key with the SHA-512 algorithm, we obtain a 512-bit key which we split
into an arbitrary number rc of subkeys.
• Radial replication: For redundancy reasons, a pair can be stored on more than one storage
device within the storage radius. A storage device can store a pair and then redirect the
storage request to its ascendant, which stores it in turn. The number of copies of a pair along
the storage radius can be an arbitrary value rr, defined at the creation of the overlay network.
We use the term radial replication to designate the replication of the pair P along the
storage radius. In effect, starting from the first existing storage device closest to the circle, we
replicate the pair rr times on the ascending nodes towards the root of the addressing tree.
We stop either when we reach a number of radial replications equal to rr = ⌊log(n)/log(q)⌋
(where q is the degree of the hyperbolic addressing tree and n the size of the overlay
network), or when we reach the root of the addressing tree (see the sketch after this list).
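A minimal sketch of the radial replication walk described above, assuming only that each storage device knows its parent in the addressing tree; the node structure and the example values of n and q are illustrative.

```python
import math

def radial_replica_count(n: int, q: int) -> int:
    # rr = floor(log(n) / log(q)), with q the degree of the hyperbolic
    # addressing tree and n the size of the overlay network.
    return int(math.log(n) / math.log(q))

def radial_replicate(storer, pair, rr: int) -> None:
    # Starting from the first storer, copy the pair onto the ascending nodes
    # towards the root, stopping after rr copies or at the root itself.
    node, copies = storer, 0
    while node is not None and copies < rr:
        node.store(pair)           # assumed method on the storage device
        node, copies = node.parent, copies + 1

# Example: with a 1000-node overlay and an addressing tree of degree 4,
# each pair would be replicated floor(log(1000)/log(4)) = 4 times.
print(radial_replica_count(1000, 4))
```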
3. OUR CONTRIBUTIONS
3.1. Summary of the Invention
CLOAK-Reduce is built by implementing, on top of the CLOAK DHT, a module to create and execute
MapReduce operations. It is a distributed model that exploits the advantages of the CLOAK DHT
and of MapReduce: the CLOAK DHT is used to submit Map() and Reduce() jobs in a balanced way,
thanks to the replication mechanisms it offers in addition to the task scheduling strategy we provide.
The tree structure of the CLOAK DHT allowed us to define a hierarchical and distributed load
balancing strategy at two levels: intra-scheduler (inter-JobManager) and inter-scheduler.
Our model consists of a root node, three schedulers, several candidate nodes (JobManager
candidates) and builder nodes (JobBuilders); a minimal sketch of this hierarchy follows the list:
• Root node: The root is the node with coordinates (0,0) of the Poincaré disk, at the minimal
tree depth (D0). It allows to:
- Keep tasks that cannot be executed immediately in the root task spool;
- Maintain the schedulers' load information;
- Decide to submit a task to a scheduler.
• Schedulers: The schedulers, of depth (D1), have the function of:
- Maintaining the load information of all their JobManagers;
- Deciding the global balancing of their JobManagers;
- Synchronising their load information with their replicas, to handle their failure, and with
the other schedulers, for load balancing.
• JobManagers: JobManager candidates have a minimum depth (≥ D2). They are elected to:
- Supervise running MapReduce jobs;
- Manage the load information related to their JobBuilders;
- Maintain their own load state;
- Decide local load balancing with their circular replica JobManager candidates of the
same scheduler;
- Inform JobBuilders of the load balancing decided for the backup of intermediate jobs.
• JobBuilders: JobBuilders have a minimum depth (≥ D3). Their function is to:
- Execute MapReduce jobs;
- Synchronise the images of their jobs in progress on their radial replicas of depth > D3;
- Keep their load state information up to date;
- Report this load information to their JobManager;
- Perform the balancing operations ordered by their JobManagers.
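The following is a minimal sketch of this depth-based role hierarchy and of the bottom-up load reporting it implies; class names, fields and the exact depth-to-role mapping are illustrative simplifications, not the CLOAK-Reduce implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CloakNode:
    depth: int                   # depth in the hyperbolic addressing tree
    load: float = 0.0            # own load (for leaves, i.e. JobBuilders)
    children: List["CloakNode"] = field(default_factory=list)

    def role(self) -> str:
        # Simplified mapping: D0 root, D1 scheduler, D2 JobManager, deeper nodes JobBuilders.
        return {0: "Root", 1: "Scheduler", 2: "JobManager"}.get(self.depth, "JobBuilder")

    def report_load(self) -> float:
        # Each level maintains the load information of the level below it,
        # mirroring the hierarchical balancing strategy described above.
        if self.children:
            self.load = sum(child.report_load() for child in self.children)
        return self.load

# Usage: a root, one scheduler, one JobManager and two JobBuilders.
root = CloakNode(0, children=[
    CloakNode(1, children=[
        CloakNode(2, children=[CloakNode(3, load=2.0), CloakNode(3, load=3.0)])])])
print(root.role(), root.report_load())  # Root 5.0
```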
Figure 4. CLOAK-Reduce Architecture
3.2. Performance Analysis
3.2.1. Experimental setup
For data collection, the circular and radial replication mechanisms were run in a dynamic network
with 10% churn where applicable. In order to study the scalability of the system, we considered
networks of 100, 300, 500 and 1000 nodes.
Each simulation phase lasted 2 hours with an observation of the results every 10 minutes, i.e. a
total of 12 observations per phase.
The data processed in this paper is the result of ten (10) independent runs of each simulation
phase in order to extract an arithmetic mean.
This collection allowed us to conduct a comparative study on the storage and lookup performance
of radial replication, circular replication and both replication mechanisms. Next, we compared the
mean number of storage and lookup hops as a function of scalability.
This limitation to 1000 nodes is due to the limited resources available for executing the simulations.
• The first phase: we set the circular replication to one (01) and varied the radial replication
from two (02) to five (05).
• The second phase: we set the radial replication to one (01) and varied the circular replication
from two (02) to five (05).
To achieve our simulation objectives, we opted to use the PeerSim simulator.
3.2.2. Simulator
• PeerSim [13] is a simulator whose main objective is to provide high scalability, with
network sizes up to 1000000 nodes, which characterises its dynamics and extensibility. Its
modularity facilitates the coding of new applications. The PeerSim configuration file has
three types of components: protocols, dynamics and observers.
• Simulation is a widely used method for performance evaluation. It consists of observing the
behaviour of a simplified model of the real system with the help of an appropriate
simulation program which will result in graphs that are easy to analyse and interpret. This
method is closer to the real system than analytical methods, and is used when direct measurement
becomes very expensive [17].
3.2.3. Some performance indicators
We are interested in routing efficiency, studied through the success rate, the latency and the hop
counts of storage and lookup operations on Big Data objects. The following performance metrics
have been analysed in order to measure the performance.
• Routing Efficiency (RE): Also known as the lookup success rate, it is calculated as the ratio
of the number of storage or lookup messages that succeed before timeout to the total
number of messages broadcast by the system:
RE = (successful storage or lookup messages / total messages) × 100
• Hop count (HC): Also viewed as the workload per node, it is calculated as the ratio of the
total number of broadcast complex query messages to the total number of nodes required to
successfully forward the messages to the destination node:
HC = total broadcast messages / total nodes required to forward the messages
• Success Latency: The success latency is defined as the time taken for a packet of data to
get from one designated point to another and sometimes measured as the time required for
a packet to be returned to its sender.
• Variance: The variance of a series, noted V, is the average of the squares of the deviations
of each value from the mean m of the series.
V = (n1 × (x1 − m)² + n2 × (x2 − m)² + ... + np × (xp − m)²) / N
• Standard deviation: The standard deviation of a series is the number, noted σ,
with σ =√V, where V is the variance of the series.
• Max / Min: Respectively, the maximum and minimum of the means of the results of the
ten (10) independent runs of each simulation phase (these indicators are illustrated in the
sketch after this list).
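A small sketch, under illustrative names, of how these indicators can be computed from raw simulation counts; it is not taken from the PeerSim scripts used for the paper, and the usage numbers are arbitrary.

```python
import math

def routing_efficiency(successful: int, total: int) -> float:
    # RE = (successful storage or lookup messages / total messages) x 100
    return successful / total * 100

def hop_count(total_broadcast_msgs: int, total_forwarding_nodes: int) -> float:
    # HC = total broadcast messages / total nodes required to forward them
    return total_broadcast_msgs / total_forwarding_nodes

def variance(values, counts=None):
    # V = (n1*(x1 - m)^2 + ... + np*(xp - m)^2) / N, with weights ni (all 1 by default)
    counts = counts or [1] * len(values)
    n_total = sum(counts)
    mean = sum(n * x for n, x in zip(counts, values)) / n_total
    return sum(n * (x - mean) ** 2 for n, x in zip(counts, values)) / n_total

def std_dev(values, counts=None):
    # sigma = sqrt(V)
    return math.sqrt(variance(values, counts))

# Usage with arbitrary illustrative numbers (not results from the paper):
print(routing_efficiency(80, 100))               # 80.0
print(round(std_dev([1.0, 2.0, 3.0, 4.0]), 3))   # 1.118
```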
3.2.4. Storage performance evaluation
3.2.4.1. Scalability analysis
Figure 5 shows the random generation of 10,000 nodes in the hyperbolic plane. It shows an
almost uniform distribution of nodes around the root. This scalability of the nodes implies that
our system builds a well-balanced tree which will make it easier to manage the MapReduce jobs.
Figure 5. Nodes distribution on Poincaré disk
3.2.4.2. Circular storage analysis
In Table 1, circular replication 5 shows the best result, with a mean storage success of 67.38% and
a standard deviation of 11.66%.
Table 1. Circular storage success ratio (%)
Replicas   Max     Min     Mean    σ
2          82.04   32.76   54.37   14.73
3          84.74   40.12   58.04   14.26
4          89.75   49.25   62.48   12.34
5          90.27   54.78   67.38   11.66
Figure 6 shows the evolution of the storage success ratio during circular replication as a
function of time. These curves are the arithmetic means of the ten (10) independent runs of each
simulation phase. The curves are globally decreasing.
We can conclude that the storage success ratio increases with the number of replications.
Figure 6. Circular storage efficiency
3.2.4.3. Radial storage analysis
In Table 2, radial replication 5 presents the best result, with a mean storage success ratio of 69.35%
and a standard deviation of 10.86%. We can conclude that the higher the number of replications,
the higher the storage success ratio.
Table 2. Radial storage success ratio (%)
Replicas   Max     Min     Mean    σ
2          82.91   40.05   57.41   13.42
3          84.41   42.64   59.67   13.37
4          87.83   50.94   64.75   11.69
5          91.27   56.07   69.35   10.86
Figure 7 shows the variation of the storage success ratio during radial replication as a function of
time. The curves are globally decreasing.
Figure 7. Radial storage efficiency
3.2.4.4. Circular and Radial storage analysis
Figure 8, the curves show broadly the same trend. The two replication mechanisms have a
standard deviation difference of 0.8%.
However, radial replication 5 is slightly better than circular replication 5 with an average of
1.97% more storage success.
Table 3. Storage success ratio (%)
Replicas   Max     Min     Mean    σ
C5R1       99.25   56.43   74.68   13.63
C1R5       91.27   56.07   69.35   10.86
Furthermore, increasing the number of replications generally increases the storage success ratio.
From the above analysis, we can say that radial replication 5 has the better storage ratio. Also, from
the previous comparisons, we can confirm that as the number of replications increases, the success
ratio improves.
Figure 8. Storage efficiency
3.2.5. Lookup performance evaluation
3.2.5.1. Circular replication
In Table 4, circular replication 5 (illustrated in Figure 9) shows the best result, with a mean lookup
success of 77.73% and a standard deviation of 12.68%. We see that the higher the number of
replications, the higher the lookup success ratio.
Table 4. Circular lookup success ratio (%)
Replicas   Max     Min     Mean    σ
2          97.58   42.85   67.76   18.96
3          97.62   48.68   67.79   15.02
4          98.31   53.91   70.49   13.29
5          100     60.15   77.73   12.68
Figure 9 shows the evolution of the lookup success ratio during a circular replication as a
function of time. The curves are globally decreasing.
Figure 9. Circular lookup efficiency
3.2.5.2. Radial replication
In Table 5, radial replication 5 shows the best result, with a mean lookup success ratio of 74.68%
and a standard deviation of 13.63%.
Table 5. Radial lookup success ratio (%)
Replicas   Max     Min     Mean    σ
2          94.96   44.26   64.25   16.08
3          97.36   48.28   68.33   15.23
4          98.24   51.21   70.14   15.18
5          99.25   56.43   74.68   13.63
Figure 10 shows that the higher the number of replications, the higher the lookup success ratio.
Figure 10. Radial lookup efficiency
3.2.5.3. Lookup Analysis
The comparative study in Figure 11 shows that circular replication 5 has a lookup success ratio of
77.73%, the highest of the results obtained, and a standard deviation of 12.68%, the lowest of all
replications. It is followed by radial replication 5, with a lookup success ratio of 74.68% and a
standard deviation of 13.63%. Both replication mechanisms have curves that show broadly the
same trend, with a slight advantage for circular replication 5.
Furthermore, increasing the number of replications generally increases the lookup success ratio.
In Figure 12, the radial replication hop count mean increases from 3.31 to 5.70, with a standard
deviation of 1.10, when the number of nodes increases from 100 to 1000, while the circular
replication hop count mean increases from 3.54 to 5.73, with a standard deviation of 1.20, over the
same variation of nodes. Radial replication therefore has slightly better storage hop performance
than circular replication.
Table 6. Lookup success ratio (%)
Replicas   Max     Min     Mean    σ
C1R5       99.25   56.43   74.68   13.63
C5R1       100     60.15   77.73   12.68
Figure 11. Lookup efficiency
3.2.6. Hops mean by system size performance
This part of our paper highlights a comparative study of the storage and lookup hops mean,
according to the size of the network. The network size limitation of 1000 nodes is due to the
resources we have at our disposal.
3.2.6.1. Storage hops success
• Circular replication: the mean hop count increases by 1.40 between 100 and 300 nodes;
the increase drops to 0.62 between 300 and 500 nodes and to 0.37 between 500 and
1000 nodes.
• Radial replication: the mean hop count increases by 1.04 between 100 and 300 nodes;
the increase drops to 0.89 between 300 and 500 nodes and to 0.26 between 500 and
1000 nodes.
In Table 7, the radial replication hop count mean increases from 3.31 to 5.70, with a standard
deviation of 1.10, when the number of nodes increases from 100 to 1000, while the circular
replication hop count mean increases from 3.54 to 5.73, with a standard deviation of 1.20, over the
same variation of nodes.
Table 7. Storage hop count mean by network size
Nodes   C5R1   C1R5   Mean   σ
100     3.54   3.31   3.43   0.12
300     4.57   4.71   4.64   0.07
500     5.47   5.33   5.40   0.07
1000    5.73   5.70   5.72   0.02
Figure 12 shows that radial replication has a slightly better storage hop performance than circular replication.
Figure 12. Storage hops count mean
3.2.7. Lookup hops success
• Radial replication: the lookup hop count mean increases from 3.90 to 5.84, with a standard
deviation of 0.97, when the number of nodes increases from 100 to 1000. We see a small
increase in the mean number of hops: 1.10 between 100 and 300 nodes, 0.58 between 300
and 500 nodes and 0.27 between 500 and 1000 nodes.
• Circular replication: the lookup hop count mean increases from 3.47 to 5.35, with a standard
deviation of 0.94, when the number of nodes increases from 100 to 1000. We find a small
increase in the mean number of hops: 1.05 between 100 and 300 nodes, 0.59 between 300
and 500 nodes and 0.25 between 500 and 1000 nodes.
In Table 8, the circular replication hop mean rises from 3.47 to 5.35, with a standard deviation
of 0.94, while the radial replication hop mean rises from 3.90 to 5.84, with a standard deviation
of 0.97, as the number of nodes increases from 100 to 1000.
Table 8. Lookup hop count mean by network size
Nodes   C5R1   C1R5   Mean   σ
100     3.47   3.90   3.60   0.22
300     4.51   4.99   4.75   0.24
500     5.10   5.57   5.37   0.24
1000    5.33   5.84   5.60   0.25
From Figure 13, we conclude that circular replication has a slightly lower mean lookup hop count
than radial replication.
Figure 13. Lookup hops count mean
3.2.8. Hops Success Ratio According to the Type of Replication
In Figure 14, the circular replication lookup hop mean ranges from 3.47 to 5.35, with a standard
deviation of 0.94, as the number of nodes increases from 100 to 1000. It is the lower of the two
replication mechanisms.
For the mean number of storage hops, the radial replication mechanism grows from 3.31 to 5.70,
with a standard deviation of 1.2, as the number of nodes increases from 100 to 1000. It has a
slightly better mean number of storage hops than circular replication.
Also, from the above, we can say that the standard deviation of the mean hop count decreases as the
number of nodes increases.
Figure 14. Hops efficiency
4. CONCLUSIONS AND PERSPECTIVES
Our simulations have shown that replication allows a mean storage and lookup success ratio of
over 60%. The low mean number of lookup hops of circular replication is consistent with its high
lookup success ratio. Likewise, the high storage success ratio of radial replication and its low mean
number of storage hops are supported by our simulation results.
The main contribution of our work was to find an ideal storage and lookup mechanism for
CLOAK-Reduce. This comparative study of the replication mechanisms of the CLOAK DHT
allowed us to highlight the efficiency of radial replication for storage and of circular replication for
data lookup.
For future work, we will compare the results of this paper with a replication mechanism combining
five circular replications and five radial replications. After this comparison, we will evaluate the
dynamic load balancing mechanisms of CLOAK-Reduce.
Then we will implement a fully functional version of CLOAK-Reduce. Finally, we will perform
simulations to demonstrate its performance against the architectures mentioned in the related
works.
REFERENCES
[1] Telesphore Tiendrebeogo, Daouda Ahmat, and Damien Magoni, (2014) “Évaluation de la
fiabilité d'une table de hachage distribuée construite dans un plan hyperbolique”, Technique et Science
Informatique, TSI, Volume 33, n° 4/2014, Lavoisier, pages 311–341.
[2] Telesphore Tiendrebeogo and Damien Magoni, (2015) “Virtual and consistent hyperbolic tree: A new
structure for distributed database management”, In International Conference on Networked Systems,
pages 411–425. Springer, 2015.
[3] Tiendrebeogo, Telesphore, (2015) “A New Spatial Database Management System Using an
Hyperbolic Tree”, DBKDA 2015: 53.
[4] Jeffrey Dean and Sanjay Ghemawat, (2008) “MapReduce simplified data processing on large
clusters”, Communications of the ACM, 51(1) :107–113, 2008.
[5] Neha Verma, Dheeraj Malhotra, and Jatinder Singh, (2020) “Big data analytics for retail industry using
MapReduce-Apriori framework”, Journal of Management Analytics, p. 1-19.
[6] Than Than Htay and Sabai Phyu, (2020) “Improving the performance of Hadoop MapReduce
Applications via Optimization of concurrent containers per Node”, In: 2020 IEEE Conference on
Computer Applications (ICCA). IEEE, p. 1-5.
[7] Telesphore Tiendrebeogo, Mamadou Diarra, (2020) “Big Data Storage System Based on a Distributed
Hash Tables system”, International Journal of Database Management Systems (IJDMS) Vol.12,
No.4/5.
[8] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek,
and Hari Balakrishnan, (2003) “Chord: a scalable peer-to-peer lookup protocol for internet
applications”, IEEE/ACM Transactions on Networking (TON), 11(1):17–32.
[9] Nicholas JA Harvey, John Dunagan, Mike Jones, Stefan Saroiu, Marvin Theimer, and Alec Wolman,
(2002) “Skipnet: A scalable overlay network with practical locality properties”
[10] Antony Rowstron and Peter Druschel, (2001) “Pastry: Scalable, decentralized object location, and
routing for large-scale peer-to-peer systems”, In IFIP/ACM International Conference on Distributed
Systems Platforms and Open Distributed Processing, pages 329–350. Springer.
[11] Petar Maymounkov and David Mazieres, (2002) “Kademlia: A peer-to-peer information system
based on the xor metric”, In International Workshop on Peer-to-Peer Systems, pages 53–65. Springer.
[12] Chiu, Chuan-Feng, Steen J. Hsu, and Sen-Ren Jan. (2013) "Distributed mapreduce framework using
distributed hash table." International Joint Conference on Awareness Science and Technology, Ubi-
Media Computing (iCAST 2013, UMEDIA 2013). IEEE.
[13] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio, (2012) “P2P-MapReduce: Parallel data
processing in dynamic cloud environments”, Journal of Computer and System Sciences, 78(5):1382–1402.
[14] Jiagao Wu, Hang Yuan, Ying He, and Zhiqiang Zou, (2014) “ChordMR: A P2P-based job
management scheme in cloud”, Journal of Networks, 9(3):541.
[15] Carpenter, Jeff, and Eben Hewitt, (2020) Cassandra: the definitive guide: distributed data at web
scale. O'Reilly Media.
[16] Andrew Rosen, Brendan Benshoof, Robert W Harrison, and Anu G Bourgeois, (2016) “Mapreduce
on a chord distributed hash table”. In 2nd International IBM Cloud Academy Conference, volume 1,
page 1.
[17] Nam, Beomseok, (2019) “Chord distributed hash table-based map-reduce system and method” U.S.
Patent No. 10,394,782. 27 Aug.
[18] Yahya Hassanzadeh-Nazarabadi, Alptekin Küpçü, and Öznur Özkasap, (2018) “Decentralized and
locality aware replication method for DHT-based P2P storage systems”, Future Generation Computer
Systems, vol. 84, p. 32-46.
[19] Harold Scott Macdonald Coxeter and GJ Whitrow. (1950) “World-structure and non-euclidean
honeycombs”. Proceedings of the Royal Society of London. Series A. Mathematical and Physical
Sciences, 201(1066) :417–437.
[20] Harold Scott MacDonald Coxeter, (1954) “Regular honeycombs in hyperbolic space”, In Proceedings
of the International Congress of Mathematicians, volume 3, pages 155–169. Citeseer.
AUTHOR
Mamadou Diarra holds a master's degree in information systems and decision support systems
from Nazi Boni University, Burkina Faso. His research interests are image watermarking for
decision support and multimedia systems.
Telesphore Tiendrebeogo holds a PhD in overlay networks and is an assistant professor at Nazi Boni
University. He has a master's degree in multimedia and real-time systems. His current research is on
big data and image watermarking.