CS6703 GRID AND CLOUD COMPUTING
Unit 4
Dr Gnanasekaran Thangavel
Professor and Head
Faculty of Information Technology
R M K College of Engineering and
Technology
UNIT IV PROGRAMMING MODEL
Open source grid middleware packages – Globus Toolkit (GT4) Architecture, Configuration – Usage of Globus – Main components and Programming model – Introduction to Hadoop Framework – MapReduce, Input splitting, map and reduce functions, specifying input and output parameters, configuring and running a job – Design of Hadoop file system, HDFS concepts, command line and Java interface, dataflow of File read & File write.
Open source grid middleware packages
 The Open Grid Forum and the Object Management Group are two well-formed organizations behind the standards.
 Middleware is the software layer that connects software components. It lies between the operating system and the applications.
 Grid middleware is a layer specially designed to sit between hardware and software; it enables the sharing of heterogeneous resources and the management of virtual organizations created around the grid.
 The popular grid middleware are
1. BOINC -Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing
community.
3. Globus (GT4) - A middleware library jointly developed by Argonne National
Lab., Univ. of Chicago, and USC Information Science Institute, funded by
DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a grid middleware package developed by leading Chinese universities to support the ChinaGrid project.
Open source grid middleware packages (continued)
5. Condor-G - Originally developed at the Univ. of Wisconsin for
general distributed computing, and later extended to Condor-G for
grid job management.
6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for
business grid applications. Applied to private grids and local clusters
within enterprises or campuses.
7. gLite - Born from the collaborative efforts of more than 80 people in
12 different academic and industrial research centers as part of the
EGEE Project, gLite provided a framework for building grid
applications tapping into the power of distributed computing and
storage resources across the Internet.
The Globus Toolkit Architecture (GT4)
 The Globus Toolkit is an open middleware library for the grid computing
communities. These open source software libraries support many operational
grids and their applications on an international basis.
 The toolkit addresses common problems and issues related to grid resource
discovery, management, communication, security, fault detection, and
portability. The software itself provides a variety of components and capabilities.
 The library includes a rich set of service implementations. The implemented software supports grid infrastructure management, provides tools for building new web services in Java, C, and Python, builds a powerful standards-based security infrastructure and client APIs (in different languages), and offers comprehensive command-line programs for accessing various grid services.
 The Globus Toolkit was initially motivated by a desire to remove obstacles that prevent seamless collaboration, and thus sharing of resources and services, in scientific and engineering applications. The shared resources can be computers, storage, data, services, networks, science instruments (e.g., sensors), and so on. The Globus library version GT4 is conceptually shown in the following figure.
The Globus Toolkit (figure of the GT4 architecture)
The Globus Toolkit
 GT4 offers the middle-level core services in grid
applications.
 The high-level services and tools, such as MPI, Condor-G, and Nimrod/G, are developed by third parties for general-purpose distributed computing applications.
 The local services, such as LSF, TCP, Linux, and Condor,
are at the bottom level and are fundamental tools supplied
by other developers.
 As a de facto standard in grid middleware, GT4 is based on
industry-standard web service technologies.
Functionalities of GT4
 Global Resource Allocation Manager (GRAM) - Grid Resource Access and Management (HTTP-based)
 Communication (Nexus) - Unicast and multicast communication
 Grid Security Infrastructure (GSI) - Authentication and related security services
 Monitoring and Discovery Service (MDS) - Distributed access to structure and state information
 Health and Status (HBM) - Heartbeat monitoring of system components
 Global Access of Secondary Storage (GASS) - Grid access of data in remote secondary storage
 Grid File Transfer (GridFTP) - Inter-node fast file transfer
Globus Job Workflow
 A typical job execution sequence proceeds as follows: The user delegates his credentials to a delegation service.
 The user submits a job request to GRAM with the delegation identifier as a parameter.
 GRAM parses the request, retrieves the user proxy certificate from the delegation service, and then acts on behalf of the user.
 GRAM sends a transfer request to the RFT (Reliable File Transfer) service, which applies GridFTP to bring in the necessary files.
 GRAM invokes a local scheduler via a GRAM adapter, and the SEG (Scheduler Event Generator) initiates a set of user jobs.
 The local scheduler reports the job state to the SEG. Once the job is complete, GRAM uses RFT and GridFTP to stage out the resultant files. The grid monitors the progress of these operations and sends the user a notification.
Client-Globus Interactions
 There are strong interactions between provider programs and user code. GT4 makes heavy use of industry-standard web service protocols and mechanisms for service description, discovery, access, authentication, and authorization.
 GT4 makes extensive use of Java, C, and Python for writing user code. Web service mechanisms define specific interfaces for grid computing.
 Web services provide flexible, extensible, and widely adopted XML-based
interfaces.
Data Management Using GT4
 Grid applications often need to provide access to and/or integrate large quantities of data at multiple sites. The GT4 tools can be used individually or in conjunction with other tools to develop interesting solutions for efficient data access. The following list briefly introduces these GT4 tools:
 1. GridFTP supports reliable, secure, and fast memory-to-memory and disk-to-disk data movement over high-bandwidth WANs. Based on the popular FTP protocol for internet file transfer, GridFTP adds additional features such as parallel data transfer, third-party data transfer, and striped data transfer. In addition, GridFTP benefits from using the strong Globus Security Infrastructure for securing data channels with authentication and reusability. It has been reported that the grid has achieved 27 Gbit/second end-to-end transfer speeds over some WANs.
 2. RFT provides reliable management of multiple GridFTP transfers. It has been used to orchestrate the transfer of millions of files among many sites simultaneously.
 3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to information about the location of replicated files and data sets.
 Apache top level project, open-source implementation of frameworks for reliable,
scalable, distributed computing and data storage.
 It is a flexible and highly-available architecture for large scale computation and
data processing on a network of commodity hardware.
 Hadoop offers a software platform that was originally developed by a Yahoo!
group. The package enables users to write and run applications over vast
amounts of distributed data.
 Users can easily scale Hadoop to store and process petabytes of data in the
web space.
 Hadoop is economical in that it comes with an open source version of
MapReduce that minimizes overhead in task spawning and massive data
communication.
 It is efficient, as it processes data with a high degree of parallelism across a large number of commodity nodes, and it is reliable in that it automatically keeps multiple data copies to facilitate redeployment of computing tasks upon unexpected failures.
Hadoop
• Hadoop:
• an open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data
sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
Hadoop Framework Tools
Hadoop’s Architecture
 Distributed, with some centralization
 Main nodes of cluster are where most of the computational power and
storage of the system lies
 Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to store needed blocks as closely as possible
 Central control node runs NameNode to keep track of HDFS directories &
files, and JobTracker to dispatch compute tasks to TaskTracker
 Written in Java, also supports Python and Ruby
Hadoop’s Architecture
 Hadoop Distributed Filesystem
 Tailored to needs of MapReduce
 Targeted towards many reads of filestreams
 Writes are more costly
 High degree of data replication (3x by default)
 No need for RAID on normal nodes
 Large blocksize (64MB)
Hadoop’s Architecture
NameNode:
 Stores metadata for the files, like the directory structure of a typical FS.
 The server holding the NameNode instance is quite crucial, as there is
only one.
 Transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file-streams, only metadata.
 Handles creation of more replica blocks when necessary after a DataNode
failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
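To make the NameNode/DataNode split concrete, the short Java sketch below (not from the slides; the file path comes from the command line and the cluster configuration is assumed to be on the classpath) asks the NameNode for the block locations of a file through the standard FileSystem API and prints which DataNodes hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // The block-to-DataNode mapping is metadata served by the NameNode.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + java.util.Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}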
Hadoop’s Architecture
MapReduce Engine:
 JobTracker & TaskTracker
 JobTracker splits up data into smaller tasks (“Map”) and sends it to the TaskTracker process in each node
 TaskTracker reports back to the JobTracker node and reports on job
progress, sends data (“Reduce”) or requests new jobs
 None of these components are necessarily limited to using HDFS
 Many other distributed file-systems with quite different architectures
work
 Many other software packages besides Hadoop's MapReduce
platform make use of HDFS
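As an illustration of the map and reduce functions named in the syllabus, here is a minimal word-count sketch written against the old org.apache.hadoop.mapred API that these slides describe; the class names (WordCountFunctions, Map, Reduce) are illustrative, not part of the slides.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountFunctions {

  // Map: each record of an input split arrives as a (byte offset, line) pair.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, one);               // emit (word, 1)
      }
    }
  }

  // Reduce: all counts emitted for one word are summed.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}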
Hadoop in the Wild
Hadoop is in use at most organizations that handle big data:
 Yahoo!
 Facebook
 Amazon
 Netflix
 Etc…
Some examples of scale:
 Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
 FB’s Hadoop cluster hosts 100+ PB of data (July, 2012) & growing at ½
PB/day (Nov, 2012)
Three main applications of Hadoop
 Advertisement (Mining user behavior to generate recommendations)
 Searches (group related documents)
 Security (search for uncommon patterns)
Hadoop Highlights
 Distributed File System
 Fault Tolerance
 Open Data Format
 Flexible Schema
 Queryable Database
Why use Hadoop?
 Need to process Multi Petabyte Datasets
 Data may not have strict schema
 Expensive to build reliability in each application
 Nodes fail every day
 Need common infrastructure
 Very Large Distributed File System
 Assumes Commodity Hardware
 Optimized for Batch Processing
 Runs on heterogeneous OS
DataNode
 A Block Server
 Stores data in local file system
 Stores meta-data of a block - checksum
 Serves data and meta-data to clients
 Block Report
 Periodically sends a report of all existing
blocks to NameNode
 Facilitate Pipelining of Data
 Forwards data to other specified
DataNodes
Block Placement
 Replication Strategy
 One replica on local node
 Second replica on a remote rack
 Third replica on same remote rack
 Additional replicas are randomly placed
 Clients read from nearest replica
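The replication factor that drives this placement policy can also be changed per file through the FileSystem API; the sketch below is illustrative (the path and factor are assumptions), and the placement of any new replicas still follows the policy above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Request 3 replicas for an (assumed) existing file; HDFS re-replicates in the background.
    boolean accepted = fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);
    System.out.println("replication change accepted: " + accepted);
    fs.close();
  }
}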
Data Correctness
 Use Checksums to validate data – CRC32
 File Creation
 Client computes a checksum per 512 bytes
 DataNode stores the checksum
 File Access
 Client retrieves the data and checksum from DataNode
 If validation fails, client tries other replicas
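The sketch below is a simplified illustration (not HDFS's own code) of the idea above: compute a CRC32 checksum for every 512-byte chunk of a local file using the standard java.util.zip.CRC32 class.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class ChunkChecksums {
  public static void main(String[] args) throws IOException {
    try (InputStream in = new FileInputStream(args[0])) {
      byte[] chunk = new byte[512];                 // checksum unit of 512 bytes, as described above
      int read, chunkNo = 0;
      while ((read = in.read(chunk)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, read);
        System.out.println("chunk " + (chunkNo++) + " crc32=0x" + Long.toHexString(crc.getValue()));
      }
    }
  }
}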
Data Pipelining
 Client retrieves a list of DataNodes on which to place
replicas of a block
 Client writes block to the first DataNode
 The first DataNode forwards the data to the next
DataNode in the Pipeline
 When all replicas are written, the client moves on to
write the next block in file
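The Java interface to these write and read dataflows is the FileSystem API; the short sketch below is a minimal example (the path is an assumption) in which the write is pipelined block by block to the replica DataNodes and the read fetches data from the DataNodes after the NameNode supplies the block locations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/hello.txt");    // illustrative path

    // File write: the client streams bytes; HDFS pipelines each block to its replicas.
    FSDataOutputStream out = fs.create(path, true);
    out.writeBytes("hello hdfs\n");
    out.close();

    // File read: the NameNode returns block locations; data is read from the DataNodes.
    FSDataInputStream in = fs.open(path);
    IOUtils.copyBytes(in, System.out, 4096, false);
    in.close();
    fs.close();
  }
}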
MapReduce Usage
 Log processing
 Web search indexing
 Ad-hoc queries
MapReduce Process (org.apache.hadoop.mapred)
 JobClient
 Submit job
 JobTracker
 Manage and schedule job, split job into tasks
 TaskTracker
 Start and monitor the task execution
 Child
 The process that actually executes the task
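Configuring and running a job through JobClient, with the input and output parameters specified on the job configuration, looks roughly like the driver sketch below (old mapred API; WordCountFunctions refers to the illustrative map/reduce classes sketched earlier).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);              // reduce output key type
    conf.setOutputValueClass(IntWritable.class);     // reduce output value type

    conf.setMapperClass(WordCountFunctions.Map.class);
    conf.setCombinerClass(WordCountFunctions.Reduce.class);
    conf.setReducerClass(WordCountFunctions.Reduce.class);

    conf.setInputFormat(TextInputFormat.class);      // the InputFormat also computes the input splits
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input parameter
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output parameter

    JobClient.runJob(conf);    // submit via JobClient and wait for completion
  }
}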
Inter Process Communication
 Protocol
 JobClient <--- JobSubmissionProtocol ---> JobTracker
 TaskTracker <--- InterTrackerProtocol ---> JobTracker
 TaskTracker <--- TaskUmbilicalProtocol ---> Child
 JobTracker implements both the JobSubmissionProtocol and the InterTrackerProtocol, and works as the server in both IPCs
 TaskTracker implements the TaskUmbilicalProtocol; Child gets task information and reports task status through it.
JobClient.submitJob - 1
 Check input and output, e.g. check whether the output directory already exists
 job.getInputFormat().validateInput(job);
 job.getOutputFormat().checkOutputSpecs(fs, job);
 Get InputSplits, sort, and write output to HDFS
 InputSplit[] splits = job.getInputFormat().
getSplits(job, job.getNumMapTasks());
 writeSplitsFile(splits, out); // out is
$SYSTEMDIR/$JOBID/job.split
JobClient.submitJob - 2
 The jar file and configuration file will be uploaded to
HDFS system directory
 job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
 JobStatus status = jobSubmitClient.submitJob(jobId);
 This is an RPC invocation, jobSubmitClient is a proxy
created in the initialization
Job initialization on JobTracker - 1
 JobTracker.submitJob(jobID) <-- receive RPC invocation request
 JobInProgress job = new JobInProgress(jobId, this, this.conf)
 Add the job into Job Queue
 jobs.put(job.getProfile().getJobId(), job);
 jobsByPriority.add(job);
 jobInitQueue.add(job);
Job initialization on JobTracker - 2
 Sort by priority
 resortPriority();
 compare the JobPriority first, then compare the JobSubmissionTime
 Wake JobInitThread
 jobInitQueue.notifyAll();
 job = jobInitQueue.remove(0);
 job.initTasks();
JobTracker Task Scheduling - 1
 Task getNewTaskForTaskTracker(String taskTracker)
 Compute the maximum tasks that can be running on
taskTracker
 int maxCurrentMapTasks = tts.getMaxMapTasks();
 int maxMapLoad = Math.min(maxCurrentMapTasks, (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
JobTracker Task Scheduling - 2
 int numMaps = tts.countMapTasks(); // running tasks
number
 If numMaps < maxMapLoad, then more tasks can be
allocated, then based on priority, pick the first job
from the jobsByPriority Queue, create a task, and
return to TaskTracker
 Task t = job.obtainNewMapTask(tts, numTaskTrackers);
Start TaskTracker - 1
 initialize()
 Remove original local directory
 RPC initialization
 TaskReportServer = RPC.getServer(this, bindAddress, tmpPort,
max, false, this, fConf);
 InterTrackerProtocol jobClient = (InterTrackerProtocol)
RPC.waitForProxy(InterTrackerProtocol.class,
InterTrackerProtocol.versionID, jobTrackAddr, this.fConf);
Run Task on TaskTracker - 1
 TaskTracker.localizeJob(TaskInProgress tip);
 launchTasksForJob(tip, new JobConf(rjob.jobFile));
 tip.launchTask(); // TaskTracker.TaskInProgress
 tip.localizeTask(task); // create folder, symbol link
 runner = task.createRunner(TaskTracker.this);
 runner.start(); // start TaskRunner thread
Run Task on TaskTracker - 2
 TaskRunner.run();
 Configure the child process’s JVM parameters, i.e. classpath, taskid, taskReportServer’s address & port
 Start Child Process
 runChild(wrappedCommand, workDir, taskid);
Child.main()
 Create RPC Proxy, and execute RPC invocation
 TaskUmbilicalProtocol umbilical =
(TaskUmbilicalProtocol)
RPC.getProxy(TaskUmbilicalProtocol.class,
TaskUmbilicalProtocol.versionID, address,
defaultConf);
 Task task = umbilical.getTask(taskid);
 task.run(); // mapTask / reduceTask.run
Finish Job - 1
 Child
 task.done(umbilical);
 RPC call: umbilical.done(taskId, shouldBePromoted)
 TaskTracker
 done(taskId, shouldPromote)
 TaskInProgress tip = tasks.get(taskid);
 tip.reportDone(shouldPromote);
 taskStatus.setRunState(TaskStatus.State.SUCCEEDED)
Finish Job - 2
 JobTracker
 TaskStatus report: status.getTaskReports();
 TaskInProgress tip = taskidToTIPMap.get(taskId);
 JobInProgress update JobStatus
 tip.getJob().updateTaskStatus(tip, report, myMetrics);
 One task of current job is finished
 completedTask(tip, taskStatus, metrics);
 If (this.status.getRunState() == JobStatus.RUNNING &&
allDone) {this.status.setRunState(JobStatus.SUCCEEDED)}
Demo
 Word Count
 hadoop jar hadoop-0.20.2-examples.jar wordcount
<input dir> <output dir>
 Hive
 hive -f pagerank.hive
References
1. Kai Hwang, Geoffery C. Fox and Jack J. Dongarra, “Distributed and Cloud
Computing: Clusters, Grids, Clouds and the Future of Internet”, First Edition,
Morgan Kaufmann Publishers, an imprint of Elsevier, 2012.
2. www.csee.usf.edu/~anda/CIS6930-S11/notes/hadoop.ppt
3. www.ics.uci.edu/~cs237/lectures/cloudvirtualization/Hadoop.pptx
Other presentations
http://www.slideshare.net/drgst/presentations
Thank You
Questions and Comments?