15 Condor – A Distributed Job Scheduler
Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny
Beowulf Cluster Computing with Linux, Thomas Sterling, editor, Oct. 2001.
Summarized by Simon Kim
Contents
• Introduction to Condor
• Using Condor
• Condor Architecture
• Installing Condor under Linux
• Configuring Condor
• Administration Tools
• Cluster Setup Scenarios
Introduction to Condor
• Distributed Job Scheduler
• Condor Research Project at the University of Wisconsin-Madison Department of Computer Sciences
• Changed name to HTCondor in 2012
– http://research.cs.wisc.edu/htcondor
Introduction to Condor
[Figure: a User submits a Job to Condor; Condor places it in the Queue, runs it on Idle Nodes according to Policy, monitors progress, and reports when the Job is complete]
Introduction to Condor
• Workload Management System
• Job Queuing Mechanism
• Scheduling Policy
• Priority Scheme
• Resource Monitoring and Management
Condor Features
• Distributed Submission
• User/Job Priorities
• Job Dependency - DAG
• Multiple Job Models – Serial/Parallel Jobs
• ClassAds – Job : Machine Matchmaking
• Job Checkpoint and Migration
• Remote System Calls – Seamless I/O Redirection
• Grid Computing – Interaction with Globus Resources
ClassAds and Matchmaking
• Job ClassAd
– Looking for Machine
– Requirements: Intel, Linux, Disk Space, …
– Rank: Memory, Kflops, …
• Machine ClassAd
– Looking for Job
– Requirements
– Rank
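The two-sided matching above can be sketched in a few lines of Python. This is only an illustration of the idea, not Condor's ClassAd language or its negotiator: plain dicts and lambdas stand in for ClassAds, and every attribute name and value below (owner, image_size_mb, node1, and so on) is a hypothetical placeholder.

```python
# Toy two-sided matchmaking. Plain dicts and lambdas stand in for ClassAds;
# all attribute names and values here are hypothetical.

def matches(job, machine):
    """A pair matches only if each side's Requirements accepts the other."""
    return job["requirements"](machine) and machine["requirements"](job)

def best_machine(job, machines):
    """Among matching machines, pick the one the job ranks highest."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])

job = {
    "owner": "alice",
    "image_size_mb": 64,
    "requirements": lambda m: (m["arch"] == "INTEL" and m["opsys"] == "LINUX"
                               and m["disk_mb"] >= 500),
    "rank": lambda m: m["memory_mb"],   # prefer machines with more memory
}

machines = [
    {"name": "node1", "arch": "INTEL", "opsys": "LINUX", "disk_mb": 2000,
     "memory_mb": 256, "requirements": lambda j: j["image_size_mb"] <= 128},
    {"name": "node2", "arch": "INTEL", "opsys": "LINUX", "disk_mb": 2000,
     "memory_mb": 512, "requirements": lambda j: j["owner"] == "bob"},
    {"name": "node3", "arch": "SPARC", "opsys": "SOLARIS", "disk_mb": 9000,
     "memory_mb": 1024, "requirements": lambda j: True},
]

chosen = best_machine(job, machines)
print(chosen["name"] if chosen else "no match")
```

Here node2 has more memory but only accepts jobs from another owner, and node3 fails the job's own Requirements, so the job lands on node1. Rank only orders the machines that already satisfy both sets of Requirements.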
Using Condor
• Roadmap to Using Condor
• Submitting a Job
• User Commands
• Universes
• Standard Universe
– Process Checkpointing
– Remote System Calls
– Relinking
– Limitations
• Data File Access
• DAGMan Scheduler
Using Condor
• Prepare a Job
– Batch Job: STDIN, STDOUT, STDERR
• Universes: Runtime Environment
– Serial Job: Standard, Vanilla
– Parallel Job: PVM, MPI
– Grid
– Meta Scheduler: Scheduler
• Submit Description
universe = vanilla
executable = foo
log = foo.log
input = input.data
output = output.data
queue
• Submit
– $ condor_submit
Status of Submitted Jobs
• $ condor_status -submitters
All jobs in the Queue
• $ condor_q
• Removing Job
– $ condor_rm 350.0
• Changing Job Priority: -20 to 20 (high), default: 0
– $ condor_prio -p -15 350.1
Universes
• Execution Environment - Universe
• Vanilla
– Serial Jobs
– Binary Executable and Scripts
• MPI Universe
– MPI Programs
– Parallel Jobs
– Only on Dedicated Resources
# Submit Description
Universe = mpi
…
machine_count = 8
queue
Universes
• PVM Universe
– Master-worker Style Parallel Programs
• Written for Parallel Virtual Machine Interface
– Both Dedicated and Non-dedicated (Workstations)
– Condor Acts as Resource Manager for PVM Daemon
– Dynamic Node Allocation
[Figure: PVM Daemon and Condor, linked by pvm_addhosts()]
# Submit Description
Universe = pvm
…
machine_count = 1..75
queue
Universes
• Scheduler Universe
– Meta-Scheduler
– DAGMan Scheduler
• Complex Interdependencies Between Jobs
A
B C
D
* B and C are executed in parallel
Job Sequence: A -> B and C -> D
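The ordering in this diamond can be worked out with a topological sort. The sketch below uses Python's standard graphlib module to group the four jobs into waves that may run in parallel. It illustrates the dependency logic only and is not how DAGMan itself is implemented; the function name execution_waves is made up for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each job to the set of jobs it must wait for (the diamond from the slide).
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

def execution_waves(deps):
    """Return lists of jobs; jobs in the same list may run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()                        # raises CycleError if not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark finished, unblocking children
    return waves

waves = execution_waves(dependencies)
print(waves)  # [['A'], ['B', 'C'], ['D']]
```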
Universes
• Standard Universe
– Serial Job
– Process Checkpoint, Restart, and Migration
– Remote System Calls
Process Checkpointing
• Checkpoint
– Snapshot of the Program’s Current State
– Preemptive Resume Scheduling
– Periodic Checkpoints – Fault Tolerance
– No Program Source Code Change
• Relinking with Condor System Call Library
– Signal Handler
• Process State Written to a Local/Network File
• Stack/Data Segments, CPU State, Open Files, Signal Handlers, and Pending Signals
– Optional Checkpoint Server
• Checkpoint Repository
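The resume-after-preemption idea can be shown with a small application-level sketch. Note the difference from Condor: the Standard Universe checkpoints the whole process transparently after relinking, with no source changes, whereas this toy saves its own state explicitly with pickle. The function name, the stop_after parameter, and the job.ckpt file name are assumptions made for the illustration.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(n_steps, ckpt_path, every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, saving state every `every` steps.

    `stop_after` simulates preemption: the function returns None after that
    many steps in this invocation, leaving the last checkpoint on disk.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    done_now = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_now >= stop_after:
            return None                  # preempted; work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_now += 1
        if state["step"] % every == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)    # periodic checkpoint

    return state["total"]

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_with_checkpoints(1000, ckpt, stop_after=250)  # preempted after 250 steps
second = run_with_checkpoints(1000, ckpt)                 # resumes from step 200
print(first, second)
```

The second run redoes only steps 200 to 249 rather than starting over, which is the fault-tolerance benefit of periodic checkpoints.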
Remote System Calls
• Redirects File I/O
– open(), read(), write() -> Network Socket I/O
– Sent to ‘condor_shadow’ Process on Submit Machine
• Handles Actual File I/O
• Note that Job Runs on Remote Machine
• Relinking Condor Remote System Call Library
– $ condor_compile cc myprog.o -o myprog
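The redirection can be pictured as a stub on the job side that encodes each file operation as a message, and a shadow on the submit side that performs the real I/O. The sketch below is a toy under loud assumptions: the class names Shadow and RemoteFile, the JSON message format, and the direct function call standing in for the network socket are all invented for illustration and do not reflect Condor's actual wire protocol or library.

```python
import json
import os
import tempfile

class Shadow:
    """Stands in for the submit-side process that performs the real file I/O."""

    def __init__(self):
        self._files = {}
        self._next_fd = 0

    def handle(self, request_bytes):
        req = json.loads(request_bytes)
        op = req["op"]
        if op == "open":
            self._files[self._next_fd] = open(req["path"], req["mode"])
            result = self._next_fd
            self._next_fd += 1
        elif op == "read":
            result = self._files[req["fd"]].read(req["size"])
        elif op == "write":
            result = self._files[req["fd"]].write(req["data"])
        elif op == "close":
            self._files.pop(req["fd"]).close()
            result = 0
        else:
            raise ValueError(f"unknown op {op!r}")
        return json.dumps({"result": result}).encode()

class RemoteFile:
    """Job-side stub: each call is encoded and sent to the shadow, not the local OS."""

    def __init__(self, send, path, mode):
        self._send = send   # callable standing in for the network socket
        self._fd = self._call(op="open", path=path, mode=mode)

    def _call(self, **request):
        reply = self._send(json.dumps(request).encode())
        return json.loads(reply)["result"]

    def read(self, size=-1):
        return self._call(op="read", fd=self._fd, size=size)

    def write(self, data):
        return self._call(op="write", fd=self._fd, data=data)

    def close(self):
        return self._call(op="close", fd=self._fd)

shadow = Shadow()
path = os.path.join(tempfile.mkdtemp(), "output.data")

out = RemoteFile(shadow.handle, path, "w")
out.write("hello from the execute machine\n")
out.close()

inp = RemoteFile(shadow.handle, path, "r")
text = inp.read()
inp.close()
print(text, end="")
```

The job code only ever touches RemoteFile, so the file lives wherever the shadow runs. That is the sense in which the I/O redirection is seamless.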
Standard Universe Limitations
• No Multi-Process Jobs
– fork(), exec(), system()
• No IPC
– Pipes, Semaphores, and Shared Memory
• Brief Network Communication
– Long Connection -> Delay Checkpoints and Migration
• No Kernel-level Threads
– User-level Threads Are Allowed
• File Access: Read-only or Write-only
– Read-Write: Hard to Roll Back to Old Checkpoint
• On Linux, Must be Statically Linked
Data Access from a Job
• Remote System Call – Standard Universe
• Shared Network File System
• What About Non-dedicated Machines (Desktops)?
– Condor File Transfer
– Before Run, Input Files Transferred to Remote Machine
– On Completion, Output Files Transferred Back to Submit Machine
– Requested in Submit Description File
• transfer_input_files = <…>, transfer_output_files = <…>
• transfer_files = <ONEXIT | ALWAYS | NEVER>
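The stage-in, run, stage-out sequence can be sketched as follows. This is a local simulation only: a temporary directory stands in for the remote machine, and the helper names run_with_file_transfer and upper_case_job are hypothetical, not part of Condor.

```python
import os
import shutil
import tempfile

def run_with_file_transfer(submit_dir, input_files, output_files, job):
    """Stage inputs into a scratch dir, run `job` there, copy outputs back."""
    scratch = tempfile.mkdtemp(prefix="scratch_")  # stands in for the remote machine
    try:
        for name in input_files:                   # before the run
            shutil.copy(os.path.join(submit_dir, name), scratch)
        job(scratch)                               # the job sees only the scratch dir
        for name in output_files:                  # on completion
            shutil.copy(os.path.join(scratch, name), submit_dir)
    finally:
        shutil.rmtree(scratch)                     # scratch space is reclaimed

def upper_case_job(workdir):
    """Hypothetical job: read input.data and write an upper-cased output.data."""
    with open(os.path.join(workdir, "input.data")) as f:
        text = f.read()
    with open(os.path.join(workdir, "output.data"), "w") as f:
        f.write(text.upper())

submit_dir = tempfile.mkdtemp(prefix="submit_")
with open(os.path.join(submit_dir, "input.data"), "w") as f:
    f.write("condor\n")

run_with_file_transfer(submit_dir, ["input.data"], ["output.data"], upper_case_job)
with open(os.path.join(submit_dir, "output.data")) as f:
    result = f.read()
print(result, end="")
```

Because the job never reads outside its scratch directory, the same pattern works on a desktop that shares no file system with the submit machine.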
Condor Architecture
• Central Manager Machine: Negotiator, Collector, Startd, Schedd
• Machine 1: Startd, Schedd
• Machine 2: Startd, Schedd
• Machine N: Startd, Schedd
Condor Architecture
• Central Manager Machine: Negotiator, Collector, Startd, Schedd
• Machine 1 (Submit): Startd, Schedd, Shadow
• Machine N (Execute): Startd, Schedd, Starter, Job
• Job and Shadow Communicate via Condor Remote System Call
Cluster Setup Scenarios
• Uniformly Owned Dedicated Cluster
– MPI Jobs on Dedicated Nodes
• Cluster of Multi-Processor Nodes
– 1 VM per Processor
• Cluster of Distributively Owned Nodes
– Jobs from Owner Preferred
• Desktop Submission to Cluster
– Submit-only Node Setup
• Non-Dedicated Computing Resources
– Opportunistic Scheduling and Matchmaking with Process Checkpointing, Migration, Suspend and Resume
Conclusion
• Distinct Features
– Matchmaking with Job and Machine ClassAds
– Preemptive Scheduling and Migration with Checkpointing
– Condor Remote System Call
• Powerful Tool for Distributed Job Scheduling
– Within and Beyond Beowulf Clusters
• Unique Combination of Dedicated and Opportunistic Scheduling
