This presentation explains why Condor is not suitable for use on user-owned machines, and why RemoteCondor is the best available solution to the problem.
Wedding convenience and control with RemoteCondor
1. UCSD HEP Group Trainings
Wedding convenience and control with RemoteCondor
by Igor Sfiligoi
RemoteCondor co-developed with J. Dost
UC San Diego
Apr 2012 Remote Condor 1
2. The Condor Batch System
● Condor is a Workload Management System
● i.e. a batch system
● Strong points
● Fault tolerant
● Robust feature set
● Flexible
● Large community base
● Both commercial and scientific
http://research.cs.wisc.edu/condor/
3. Condor Architecture
● Clearly separates
● Resource providers: machines (aka worker nodes) with CPUs, memory, IO, ...
from
● Resource consumers: job queues (aka submit nodes) holding jobs submitted by users
● Each has a daemon process to represent it
● Startd for resource providers
● Schedd for resource consumers
● A central service connects them all
● Managed by a Collector/Negotiator pair
5. The truth about submit nodes
● Corollary
● The submit node is a server!
● There is no real “Condor client”
● The cmdline tools are just a convenience to talk to the daemon process
[Diagram: submit node running condor_submit, condor_q and the Schedd; Collector and Negotiator; Startd]
6. Implications
● Being a server has several implications
● Security implications
● Will have incoming connectivity
● All security configuration on the submit node
● Submit node controls user authentication and authorization
● Unfriendly to non-dedicated hardware
● Requires always-on operation
● Must be on a public, static IP address
7. Implications
● What these implications mean in practice
● Security implications: high exploit risk
● Submit node controls user authentication and authorization: requires high trust between all nodes in the cluster
● Requires always-on operation: impossible to use on a laptop
8. Implications
● Not suitable for an unmanaged user machine
9. What are the alternatives?
● Out of the box, Condor provides
● Remote submission
● Condor-C
● In the contrib sections, you can find
● RemoteCondor
10. What are the alternatives?
● RemoteCondor: this presentation argues that this is the best solution
11. What are the alternatives?
● Remote submission and Condor-C: so what is wrong with these?
12. Remote submission
● Essentially, connecting to a remote Schedd
● condor_submit -remote … + condor_transfer_data
and
● condor_q -name ..., condor_rm -name ..., …
● So no daemon processes on the submit node
● A true client solution!
[Diagram: submit node running only condor_submit, condor_q and condor_transfer_data; Auth; Schedd node running the Schedd; Collector and Negotiator; Startd]
http://research.cs.wisc.edu/condor/manual/v7.6/condor_submit.html
http://research.cs.wisc.edu/condor/manual/v7.6/condor_transfer_data.html
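The remote-submission workflow above can be sketched as a handful of command lines. This is a minimal illustration in Python, not part of the original slides: the Schedd name, submit file name and job id are hypothetical, and the sketch only builds the argument lists (using the -remote and -name options named on the slide) without running anything.

```python
import shlex

# Hypothetical name of the remote Schedd; not taken from the slides.
SCHEDD_NAME = "schedd.example.edu"


def remote_submit_cmd(submit_file, schedd=SCHEDD_NAME):
    """Command line that submits a job description to a remote Schedd."""
    return ["condor_submit", "-remote", schedd, submit_file]


def remote_queue_cmd(schedd=SCHEDD_NAME):
    """Command line that lists the queue of the remote Schedd."""
    return ["condor_q", "-name", schedd]


def remote_rm_cmd(job_id, schedd=SCHEDD_NAME):
    """Command line that removes a job from the remote Schedd."""
    return ["condor_rm", "-name", schedd, job_id]


def fetch_output_cmd(job_id, schedd=SCHEDD_NAME):
    """Command line that pulls the output of a finished job back to the client."""
    return ["condor_transfer_data", "-name", schedd, job_id]


if __name__ == "__main__":
    # "job.sub" and "42.0" are placeholder values for illustration only.
    for cmd in (
        remote_submit_cmd("job.sub"),
        remote_queue_cmd(),
        fetch_output_cmd("42.0"),
        remote_rm_cmd("42.0"),
    ):
        print(shlex.join(cmd))
```

Every call has to name the Schedd explicitly, and output only comes back when condor_transfer_data is run, which is part of why this mode feels different from a local submit node.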
13. So, what's the problem?
● No local user log file
● Annoying at best
● Must use condor_q to monitor progress
● High monitoring load
● And it does not work with DAGMan
● Fully Condor-based user authentication
● While rich, not what users expect (e.g. no user/password)
● Hard to tie into campus-wide auth
● Staged input data not shared
● Could be a problem with large datasets
14. Condor-C
● Based on the Grid paradigm
● Submit locally, then delegate to remote Schedd
● Still running a daemon process
● But requires no incoming connections
● Secure
● Laptop friendly
[Diagram: submit node running condor_submit, condor_q and a local Schedd; Auth; Schedd node running the Schedd; Collector and Negotiator; Startd]
http://research.cs.wisc.edu/condor/manual/v7.6/5_3Grid_Universe.html#sec:Condor-C
15. What are the drawbacks?
● Awkward syntax
● At least compared to Vanilla universe
● See the Condor manual for examples
● Can be mitigated with Job Router (but adds another layer of complexity)
● Has scalability problems
● Could likely be improved, but this is the current state-of-the-art
● Fully Condor-based user authentication (same as remote submissions)
● Staged input data not shared (same as remote submissions)
17. What's the big idea?
● Let the users log in to a remote machine
● And run the cmdline tools there
● True client approach
18. What's the big idea?
● Let the users log in to a remote machine
● And run the cmdline tools there
Advantages:
● True local Condor experience (no exceptions)
● Minimize security risk
● Central handling of authentication and authorization (standard system, familiar to users)
● No admin privileges for the users
● Trust based on “central” Schedd admin skills
● Can regulate and transform Condor submissions
19. What's the big idea?
● Big deal! Where's the news?
20. What's the big idea?
● Let the users log in to a remote machine
● And run the cmdline tools there
● … while preserving the local look-and-feel
● RemoteCondor provides
● Wrappers around major Condor cmdline tools
● Integration with sshfs
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=RemoteCondor
21. RemoteCondor wrappers
● Provide wrappers that use ssh under the hood
● Users (almost) unaware of the trick
● But may be prompted for a password
● Works best with public key authentication
[Diagram: condor_submit and condor_q wrappers on the submit node; Auth; sshd on the Schedd node, where the real condor_submit and condor_q talk to the Schedd; Collector and Negotiator; Startd]
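The wrapper idea can be sketched in a few lines of Python. This is an illustration of the technique only, not the actual RemoteCondor code: the user name, host name and function names are hypothetical, and the real wrappers may work differently.

```python
import shlex
import subprocess

# Hypothetical connection settings; not taken from the slides.
REMOTE_USER = "alice"
REMOTE_HOST = "schedd.example.edu"


def build_ssh_command(tool, args, user=REMOTE_USER, host=REMOTE_HOST):
    """Build the ssh command line that runs a Condor tool on the Schedd node.

    Each argument is shell-quoted because ssh hands the remote command to a
    shell on the other side as a single string.
    """
    remote_cmd = " ".join(shlex.quote(part) for part in [tool, *args])
    return ["ssh", f"{user}@{host}", remote_cmd]


def run_remote(tool, args):
    """Run the tool remotely and return its exit code.

    ssh may prompt for a password here unless public key authentication
    is set up.
    """
    return subprocess.call(build_ssh_command(tool, args))


if __name__ == "__main__":
    # Only show what would be executed; nothing is run against a real host.
    print(build_ssh_command("condor_q", ["-constraint", "JobStatus == 2"]))
```

A local script named condor_q that calls run_remote("condor_q", sys.argv[1:]) would then look like the real tool to the user, which is the trick the slide refers to.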
22. RemoteCondor and sshfs
● But being able to talk to Condor is not enough
● Users must be able to create and read data!
● Using sshfs solves the problem
● Schedd-local disk mounted on submit node
● Disk local to Schedd for maximum performance
● Using ssh as a tunnel
● All in user space (FUSE)
● RemoteCondor will properly convert paths (within certain limits)
http://fuse.sourceforge.net/sshfs.html
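Path conversion can be illustrated with a short Python sketch. It assumes a hypothetical local mount point and a hypothetical remote directory exported through sshfs; the actual RemoteCondor rules are not shown on the slides and may differ.

```python
from pathlib import PurePosixPath

# Hypothetical locations; not taken from the slides.
LOCAL_MOUNT = PurePosixPath("/home/alice/condor_mnt")  # where sshfs is mounted locally
REMOTE_ROOT = PurePosixPath("/home/alice")             # directory exported by the Schedd node


def to_remote_path(local_path, mount=LOCAL_MOUNT, remote_root=REMOTE_ROOT):
    """Translate a path under the local sshfs mount into the Schedd-side path.

    Paths outside the mount cannot be seen by the Schedd node, so they are
    rejected rather than silently passed through.
    """
    path = PurePosixPath(local_path)
    try:
        relative = path.relative_to(mount)
    except ValueError:
        raise ValueError(f"{local_path} is not under the sshfs mount {mount}") from None
    return str(remote_root / relative)


if __name__ == "__main__":
    print(to_remote_path("/home/alice/condor_mnt/run1/job.sub"))
```

The rejection of paths outside the mount is one plausible reading of "within certain limits": a file that lives only on the laptop has no Schedd-side equivalent.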
23. RemoteCondor and sshfs
[Diagram: sshfs on the submit node; Auth; sshd on the Schedd node, giving access to the real disk next to the Schedd; Collector and Negotiator; Startd]
24. Using RemoteCondor
● Distributed in the Condor src tarball
● In the Contrib section
● Requires a “make install”
● To put the proper files in place
● Plus minimal configuration
● Where is the remote Schedd node?
● What username to use?
● Where to mount the sshfs partition?
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=RemoteCondor
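The three configuration questions map naturally onto three settings. The Python sketch below is an assumption-laden illustration: the class and field names are hypothetical and do not reflect the real RemoteCondor configuration file, which is described on the wiki page above. It only shows how the answers would feed a standard sshfs mount command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteCondorSettings:
    """Hypothetical container for the three answers asked for on the slide."""

    schedd_host: str      # Where is the remote Schedd node?
    username: str         # What username to use?
    mount_point: str      # Where to mount the sshfs partition?
    remote_dir: str = ""  # Empty means the remote home directory for sshfs

    def sshfs_mount_cmd(self):
        """Command line that mounts the Schedd-local disk on the submit node."""
        source = f"{self.username}@{self.schedd_host}:{self.remote_dir}"
        return ["sshfs", source, self.mount_point]

    def unmount_cmd(self):
        """Command line that releases the FUSE mount again (Linux)."""
        return ["fusermount", "-u", self.mount_point]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    settings = RemoteCondorSettings(
        schedd_host="schedd.example.edu",
        username="alice",
        mount_point="/home/alice/condor_mnt",
    )
    print(" ".join(settings.sshfs_mount_cmd()))
    print(" ".join(settings.unmount_cmd()))
```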
25. Summary
● Traditional Condor is not suitable for user machines
● Keeping Schedd nodes professionally maintained is highly desirable
● To minimize security risks and control job flow
● RemoteCondor allows this operation mode while preserving the local look-and-feel
● Requires minimal local install
26. Acknowledgements
This work is partially sponsored by
● the US National Science Foundation under Grants No. OCI-0943725 (STCI) and PHY-0612805 (CMS Maintenance & Operations), and
● the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG).