A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
BY: KUMARI SURABHI
INDEX TERMS
• OpenStack Cloud
• Hadoop
• Distributed systems
• Virtualization
• Big data
INTRODUCTION
• Computers are revolutionizing the human era, especially the IT field.
• With new technologies and ideas, smart, efficient and faster computers
and frameworks are introduced in the market every day.
• Advances in intelligent machines and ideas make the computation,
storage and transfer of data faster and more accurate, which eventually
helps institutions, companies and individuals solve their problems with
ease.
• Among these many advances in computation, cloud computing and
distributed systems are the main focus of this seminar.
CLOUD COMPUTING
• Cloud computing is an emerging technology that is being exploited in
every aspect of IT.
• Cloud computing is an abstract term describing the use of resources
that do not belong to the user to perform a required task, then
disconnecting from those resources when they are not in use.
• The most obvious examples are Gmail, Google Docs, Amazon EC2 and
storage, social networks such as Facebook, and many more.
WHY CLOUD COMPUTING?
BIG DATA
• Describes the exponential growth, availability and use of information, both
structured and unstructured.
• The concept of Big Data has three basic dimensions: volume, variety and
velocity; further dimensions are veracity and complexity.
PROBLEM STATEMENTS AND PRELIMINARIES
The purpose of this seminar is to answer the following questions:
● Is the performance of the OpenStack cloud better than a real system on a
Hadoop cluster?
● Is it feasible to run an image-processing MapReduce job in a Hadoop
cluster on the OpenStack cloud?
● What are the technical difficulties of converting image files to PDF using
the MapReduce framework in a distributed system?
CLOUD COMPUTING & OPENSTACK CLOUD
• Cloud computing is the use of computing resources (hardware and
software) that are delivered as a service over a network (typically the
Internet).
• Cloud services include the delivery of software, infrastructure, and
storage over the Internet.
• Based on deployment model, cloud computing can be subcategorized into:
Public, Private, Community and Hybrid clouds.
CLOUD COMPUTING continued...
Cloud computing can be broadly divided into:
● Software as a Service (SaaS),
● Platform as a Service (PaaS),
● Infrastructure as a Service (IaaS).
CLOUD COMPUTING continued...
Based on deployment model, cloud computing can be subcategorized into:
• Public,
• Private,
• Community and
• Hybrid clouds.
CLOUD COMPUTING continued...
There are many platforms available to set up a cloud, such as:
• CloudStack (Apache Foundation),
• DevStack,
• OpenStack,
• Eucalyptus,
• Nebula and many more.
Note: OpenStack is chosen as the framework for implementing the
cloud in this article.
OPENSTACK CLOUD
• It is open source: users are open to pick and mix any hardware they
need.
• Open to design their own networks.
• Open to use any virtualization technology.
• Open to other needed features, and so on.
OPENSTACK CLOUD
• OpenStack is an open source cloud framework, originally launched by
Rackspace and NASA, with the aim of promoting cloud standards and
providing a solid foundation for cloud development.
• It is the most widely used tool for setting up private and public clouds.
• Big companies like Dell, AMD, Cisco, HP and Rackspace are using it.
• Linux heavyweights like Red Hat and Ubuntu are implementing it.
• OpenStack exposes an API compatible with Amazon's (EC2/S3).
OPENSTACK CLOUD continued...
• OpenStack is an Infrastructure as a Service (IaaS) cloud computing project
that is free open source software.
• It is revolutionizing the cloud computing world.
• It aims to create a system where storage, resources and performance
scale up quickly and efficiently.
OPENSTACK CLOUD continued...
The OpenStack cloud currently consists of six projects:
• Nova,
• Swift,
• Glance,
• Keystone,
• Quantum,
• Horizon.
OPENSTACK CLOUD continued...
● Nova:
Nova is the computing fabric controller for the OpenStack cloud.
● Swift:
Swift is the storage system for OpenStack, analogous to Amazon
Web Services' Simple Storage Service (S3).
● Glance:
Glance is an image service for OpenStack, responsible for discovery,
registration and delivery services for disk and server images.
OPENSTACK CLOUD continued...
● Keystone:
Keystone is the OpenStack identity service, which provides authentication
and authorization for all components of OpenStack.
● Horizon:
Horizon is a web-based dashboard that provides administrators and users a
graphical interface to access, provision and automate cloud-based
resources.
OPENSTACK CLOUD ARCHITECTURE
HADOOP
• There are many distributed systems available to tackle the big data
problems faced by big companies.
• Hadoop is one of the available frameworks.
• Hadoop makes data mining, analytics, and processing of big data cheap
and fast.
• Hadoop is an open source project and is made to deal with terabytes of
data in minutes.
• Hadoop stores and processes any kind of data.
• Hadoop is natively written in Java but can be accessed using other
languages such as a SQL-inspired language (Hive), C/C++, Python and many
more.
HADOOP continued...
• Hadoop grew out of an open source web search engine (Nutch), which was
based on Google's MapReduce.
• Hadoop works on commodity hardware.
HDFS(Hadoop Distributed File System)
● The Hadoop Distributed File System provides unrestricted, high-speed
access to application data.
● A scalable, fault-tolerant, high-performance distributed file system.
● The namenode holds the filesystem metadata.
● Files are broken up and spread over the datanodes.
● Data is divided into blocks of 64 MB (default) or 128 MB, and each block
is replicated 3 times (default).
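The two HDFS defaults above (block size and replication factor) are normally
set in hdfs-site.xml. As a minimal illustration (not from the seminar itself),
the same settings can also be overridden programmatically through Hadoop's
Configuration API; the property keys below are the Hadoop 0.20.x-era names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HdfsSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the defaults mentioned above, purely for illustration.
            conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB blocks (the default)
            conf.setInt("dfs.replication", 3);                 // each block kept on 3 datanodes
            FileSystem fs = FileSystem.get(conf);              // filesystem handle using these settings
            System.out.println("Connected to: " + fs.getUri());
        }
    }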
ARCHITECTURE OF HDFS
MAPREDUCE
● MapReduce programs are executed in two main phases,
called mapping and reducing.
● Each phase is defined by a data processing function, and
these functions are called mapper and reducer, respectively.
● In the mapping phase, MapReduce takes the input data and
feeds each data element to the mapper.
● In the reducing phase, the reducer processes all the outputs
from the mapper and arrives at the final result.
● In simple terms, the mapper is meant to filter and transform
the input into something that the reducer can aggregate over.
MAPREDUCE continued...
PERFORMANCE ANALYSIS MODEL
● In the performance analysis model, we discuss the use of
two basic applications:
1. WordCount Application
2. ImagetoPdf Conversion Application
● WordCount is a common MapReduce program that is used
to count the total number of each word found in a document.
● The ImagetoPdf Conversion program is used for converting images
into PDF.
PERFORMANCE ANALYSIS MODEL continued...
● These two programs are executed on a commodity computer
cluster as well as on an OpenStack Cloud virtual instance cluster.
● The performance is analysed by changing the number of nodes
and the size of the data.
● The performance analysis has been done for both
applications.
WORD COUNT APPLICATION
● WordCount is a simple application that counts the number of
occurrences of each word in a given input set.
● The purpose of this program is to calculate the total number of
repetitions of each word in a particular document.
● The pseudocode for the Mapper and Reducer of the WordCount program is
outlined in Algorithm 1 and Algorithm 2, respectively.
Mapper function for WordCount Program
● Input: String filename, String document
● Output: (String token, 1)

Map(String filename, String document)
{
    List<String> T = tokenize(document);
    for each token in T
    {
        emit((String) token, (Integer) 1);
    }
}
Reducer function for WordCount Program
● Input: (String token, 1)
● Output: (String token, sum)

Reduce(String token, List<Integer> values)
{
    Integer sum = 0;
    for each value in values
    {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}
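Algorithms 1 and 2 map directly onto Hadoop's Java API. As a hedged
illustration (the slides give only pseudocode, not the actual source), a
self-contained WordCount in the org.apache.hadoop.mapreduce API, including
the driver that wires the mapper and reducer into a job, might look like:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Algorithm 1: emit (token, 1) for every token in the document.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);      // emit((String) token, (Integer) 1)
                }
            }
        }

        // Algorithm 2: sum the 1s for each token and emit (token, sum).
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                result.set(sum);
                context.write(key, result);        // emit((String) token, (Integer) sum)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }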
Image to Pdf Conversion Application
● Hadoop is popular for processing textual big data, so there are plenty of
materials available if an application related to text is to be developed.
● But little work has been done on image data processing in Hadoop.
● So there were a lot of challenges while developing the application.
● Some of the difficulties faced were serialization issues with images,
splitting of images by Hadoop into its default blocks, image-to-PDF
conversion, text-to-PDF conversion and many more; a sketch of one
possible workaround for the serialization issue follows below.
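The slides name a custom KUPDF type but do not show its implementation. A
common way to work around the serialization and block-splitting issues listed
above is a custom Writable that carries an entire image as one length-prefixed
byte array, so the framework never cuts a record in half; the class below is a
hypothetical sketch of that idea (its name and layout are assumptions, not the
seminar's actual code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical stand-in for a type like KUPDF: serializes a whole
    // image (or PDF) as a single length-prefixed byte array.
    public class ImageBytesWritable implements Writable {
        private byte[] data = new byte[0];

        public void set(byte[] bytes) { this.data = bytes; }
        public byte[] get()           { return data; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(data.length);  // length prefix tells readFields how much to read back
            out.write(data);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            int length = in.readInt();
            data = new byte[length];
            in.readFully(data);         // read exactly the bytes written above
        }
    }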
Work flow of the application
● Under the MapReduce model, data processing primitives are called mappers
and reducers.
Mapper function for ImagetoPdf Program
● Input: (String key, KUPDF value)
● Output: (filename, KUPDF value (pdf file))

Map(String key, KUPDF value)
{
    for each bufferList in value
    {
        write(filename, value);
    }
}
Reducer function for ImagetoPdf Program
● Input: (String key, List<KUPDF> values)
● Output: (filename, KUPDF value (pdf file))

Reduce(String key, List<KUPDF> values)
{
    for each value in values
    {
        concat value as a separate page of the pdf;
    }
    write(key, final pdf);
}
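The reducer's "concat value as a separate page" step requires a PDF library,
which the slides do not name. A minimal sketch of the page-appending idea,
assuming the iText 2.x library (com.lowagie.text) and a hypothetical helper
name, could be:

    import java.io.ByteArrayOutputStream;
    import java.util.List;
    import com.lowagie.text.Document;
    import com.lowagie.text.Image;
    import com.lowagie.text.pdf.PdfWriter;

    // Hypothetical helper mirroring the reducer loop: each input image
    // becomes one page of a single output PDF.
    public class PdfConcatenator {
        public static byte[] imagesToPdf(List<byte[]> images) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Document doc = new Document();
            PdfWriter.getInstance(doc, out); // direct the PDF bytes into the buffer
            doc.open();
            for (byte[] imageBytes : images) {
                doc.newPage();               // "concat value as a separate page of the pdf"
                doc.add(Image.getInstance(imageBytes));
            }
            doc.close();
            return out.toByteArray();        // the reducer then emits these bytes as the final pdf
        }
    }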
MAPPER AND REDUCER
● In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper.
● In the reducing phase, the reducer processes all the outputs from the
mapper and arrives at the final result.
● The mapper is meant to filter and transform the input into something
that the reducer can aggregate over.
● The PDFMapper and PDFReducer classes perform the above-mentioned
jobs in the application developed, working with image files and PDF
files.
METHODS OF PERFORMANCE EVALUATIONS
Cloud Cluster Setup
• Quad-core Intel® Xeon(R) 64-bit CPU
• 16 GB RAM
• 1 TB ATA disk
• 500 GB ATA disk as storage
• 32-bit dual gigabit network interfaces were used.
• Ubuntu 14.04 server was installed as the operating system.
Cloud Cluster Setup continued...
● Installation and configuration of the OpenStack Essex cloud was done by
following the OpenStack tutorial.
● Appropriate images were created with a virtualization tool such as
QEMU, supporting KVM or Xen, and by using terminal commands.
● Virtual systems for the cloud cluster setup were created from these
images using terminal commands and the OpenStack web interface, after
successfully configuring the network (fixed and floating IPs) and other
security parameters.
Commodity Computer Cluster Setup
The four-node cluster of commodity computers is set up on:
● Intel i5 quad-core 64-bit CPUs with 2 GB RAM
● One with a 160 GB ATA disk
● The other three with 80 GB ATA disks
● 32-bit gigabit network interfaces.
Commodity Computer Cluster Setup continued...
● Passwordless secure shell was configured,
● Java 7 was installed, and
● Hadoop 0.20.2 was configured on all four machines.
Cloud Cluster Setup continued...
● A four-node cluster, one node acting as master/slave and the remaining
three as slaves, was created by using the Ubuntu 14.04 image.
● Passwordless secure shell was configured, Java 7 was installed and
Hadoop 0.20.2 was configured on all four instances.
Configuration of Experiments
Commodity Computer vs Cloud Server | Commodity Computer Details       | Cloud Server Details
Master vs cse-dcg                  | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave1 vs user1                    | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave2 vs user2                    | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Slave3 vs user3                    | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Experimental Results
After the successful configuration of the clusters, three jobs were run on
both systems:
● two jobs to convert image files to PDF files, and
● one word count job.
● The first two jobs were based on image and PDF files being serialized in
the MapReduce framework.
● The last job was implemented based on the standard WordCount
program available in the Hadoop package.
● The algorithms were run first on a two-node cluster with a master and two
slaves, and then scaled up to a four-node cluster with a master and four
slaves (the master running a slave machine as well).
1) Directory-wise Image to PDF:
The results of the first job are summarized in Table II and Table III:

TABLE II: SUMMARY OF FIRST JOB ON TWO NODES

                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 5 minutes 20 seconds
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB   | 3 minutes 43 seconds
Directory-wise Image to PDF continued...
● The first job's algorithm is designed to search images directory-wise and
convert each image file to a PDF file with the same directory tree as the
input image files.

TABLE III: SUMMARY OF FIRST JOB ON FOUR NODES

                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 3 minutes 8 seconds
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB   | 1 minute 31 seconds
GRAPHICAL REPRESENTATION
Time taken for First Job
EXPLANATION
The input to the job contained:
● 23 folders,
● 94 files, and
● 169 MB in total.
The output was the conversion of each image file to PDF with the same
directory and file names, generating 90.1 MB in size in both runs.
● The processing was repeated three times to get the average.
2) Multiple Images to Single PDF:
● A modified version of the first job explained above.
● All images are converted into a single final PDF output file.
● This processing is also done first on two nodes and later scaled up to a
four-node cluster, as done for the first algorithm.
● The processing was repeated three times to get the average.
● The experiments are summarized in Table IV and Table V.
Multiple Images to Single PDF continued...
● The input contained 476 image files in one directory and its size was 926
MB.
● The output was a single PDF file of 200.1 MB in size.

TABLE IV: SUMMARY OF SECOND JOB ON TWO NODES

                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 11 minutes 29 seconds
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 12 minutes 28 seconds
Multiple Images to Single PDF continued...
TABLE V: SUMMARY OF SECOND JOB ON FOUR NODES

                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 7 minutes 51 seconds
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 9 minutes 22 seconds
GRAPHICAL REPRESENTATION
This shows that the commodity computer cluster is more efficient than the
virtual node cluster in the OpenStack cloud.
Time taken for Second Job
EXPLANATION
● The first two jobs were processed by mapping small image files, which is
not so effective in the Hadoop system, as Hadoop performs really well with
large data sets as input.
● So in order to test the real performance of Hadoop on big data, the
default word count job of the Hadoop system was also run.
● Hadoop was designed for text processing rather than image processing,
so textual processing was also chosen to analyse the Hadoop clusters.
TABLE VI: SUMMARY OF THIRD JOB ON TWO NODES
The input for the job was a text file of 1.1 GB and the output was a file
containing the list of words, which was 364.6 KB in size.

                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 7 minutes 51 seconds
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 9 minutes 22 seconds
TABLE VII: SUMMARY OF THIRD JOB ON FOUR NODES

                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 4 minutes 0 seconds
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 5 minutes 1 second
GRAPHICAL REPRESENTATION
It shows that the Hadoop cluster on commodity computers performed better
than the Hadoop cluster in the cloud.
Time taken for Third Job
PERFORMANCE ANALYSIS
● The Hadoop distributed system set up on physical computers turns out
to be more efficient and faster than the cloud system.
● The first reason is that Hadoop was developed with commodity machines in
mind.
● The second obvious reason is that the processing is done on physical
hardware without any resource sharing, as compared to cloud systems.
CONTRADICTION
● The first job contradicts the points discussed above and the
other two jobs.
REASON:
● The job has to recursively read and write files, and thus has to cache all
the bytes read and to be written, which is faster in the cloud, as the nodes
are on one server and there is no wire communication between nodes.
CONCLUSION
● An analysis of running a Hadoop cluster in the cloud and on a real system,
identifying the best solution by running simple Hadoop jobs in the
configured clusters.
● It concludes that running a Hadoop cluster in the cloud for data storage
and analysis is more flexible and more easily scalable than the real
system cluster.
● The two-node to four-node experiments proved the easy scalability:
the cloud cluster scaled up by creating an instance from an already
configured image.
● The case was not the same on the real system, where we needed to get the
machine, download the software, and adjust the configuration to join the
new machine to the cluster.
● Failed nodes in the cloud cluster could be terminated and replaced
with a new instance in seconds, but the same is not possible on a real
system.
● The cluster on real system computers is faster than the cloud clusters.
● But due to the advantageous features of the cloud computing system,
such as quick termination of servers (nodes) if problems arise, creation of
a node from the same state in which the machine was terminated,
automatic networking, instant creation of nodes and clusters, and many
more such features, a cloud Hadoop cluster would be more favourable.
● Despite the difficulties of writing image-related algorithms in the
MapReduce framework and the serialization errors of images, and despite
the popularity of text processing in Hadoop, it is still possible to perform
image processing in a distributed framework such as Hadoop.
FUTURE SCOPE
To perform the same algorithms using different cloud frameworks,
comparing the commodity cluster performance versus a new cloud virtual
cluster, or an analysis and comparison of the OpenStack cloud virtual
cluster versus a new cloud virtual cluster.
REFERENCES
[1] Jinesh Varia, Sajee Mathew, "Overview of Amazon Web Services,"
Amazon Web Services, 2014.
[2] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg,
Ivona Brandic, "Cloud Computing and emerging IT platforms: Vision, hype
and reality for delivering computing as the 5th utility," Future Generation
Computer Systems, 2009, Elsevier.
[3] Dai Yuefa, Wu Bo, Gu Yaqiang, Zhang Quan, Tang Chaojin, "Data
Security Model for Cloud Computing," Proceedings of the 2009
International Workshop on Information Security and Application
(IWISA 2009), (pp. 21-22), China.
[4] Qiao Lian, Wei Chen, Zheng Zhang, "On the impact of replica
placement to the reliability of distributed brick storage systems,"
Proceedings of the 25th IEEE International Conference on Distributed
Computing Systems, (pp. 187-196), 2005, IEEE.
[5] Daniel Ford et al., "Availability in globally distributed storage systems,"
Google Inc.
[6] HDFS Architecture Guide,
http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html.
THANK YOU