This document provides a high level overview of how glideinWMS-based instances do matchmaking in CMS (a High Energy Physics experiment). The information is accurate as of early Dec 2012.
1. glideinWMS for users
Matchmaking in glideinWMS in CMS
by Igor Sfiligoi (UCSD)
CERN, Dec 2012 glideinWMS matchmaking 1
2. Scope of this talk
This talk provides a high level description of how glideinWMS matchmaking works in CMS.
The reader is expected to be familiar with the CMS experiment environment
http://cms.web.cern.ch/
3. glideinWMS architecture
● A reminder
[Figure: glideinWMS architecture diagram. Labels: VO FE (VO Frontend), G.F. (Glidein Factory), Grid, and a Condor pool with a Central manager (Negotiator), Submit nodes (Schedd) and Execute nodes.]
4. Two levels of matchmaking
● First in the VO Frontend
● To decide where to provision resources
● i.e. where to send glideins
● Then in the HTCondor Negotiator
● To decide which Job gets the glidein Slot
The two must have compatible policies
[Figure: the glideinWMS architecture diagram, repeated. Labels: VO FE, G.F., Grid, Central manager (Negotiator), Submit nodes (Schedd), Execute nodes, Condor.]
5. Defining the policy
● The VO FE configures the glideins
● So it can define the Slot Requirements
● The preferred strategy is to leave all policy decisions in the VO FE hands, i.e. both
● VO FE matchmaking policy
● HTCondor matchmaking policy
It is easier to keep them in sync this way
● This implies
● Users should not define Job Requirements
● Instead, publish attributes describing requirements
http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands
6. CMS Production @ CERN
Policies
7. Description
● The VO FE @ CERN serves the production needs
● i.e. Reconstruction and MC production
● Job submission is regulated by a service managed by a dedicated team, so jobs are
● Targeted
● Well behaved (at least by and large)
8. Matchmaking policy
● Two dimensions
● Grid Site
● Single CPU vs HTPC
● The actual policy is the AND of both
● Both VO FE policy and HTCondor policy
defined in the VO FE instance configuration
9. Matching on Grid site name
● User Jobs are expected to publish the attribute
DESIRED_Sites (a string list)
● e.g. +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
● The G.F. and the glideins advertise
GLIDEIN_CMSSite
● The matchmaking policy is
GLIDEIN_CMSSite ∈ DESIRED_Sites
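The membership test above can be sketched in Python as follows. This is only an illustration of the logic, not the expression used in the Frontend configuration, which is not shown here; in HTCondor ClassAd expressions a test of this kind is commonly written with the built-in stringListMember function. The helper names parse_string_list and site_matches are hypothetical.

```python
import re


def parse_string_list(value):
    """Split a comma (or space) separated string list into its items."""
    return [item for item in re.split(r"[,\s]+", value) if item]


def site_matches(glidein_cmssite, desired_sites):
    """True when the glidein's site is one of the job's desired sites."""
    # Whole-name comparison: "T2_US" must not match "T2_US_UCSD".
    return glidein_cmssite in parse_string_list(desired_sites)


if __name__ == "__main__":
    desired = "T2_DE_DESY,T2_US_UCSD"
    for site in ("T2_US_UCSD", "T1_US_FNAL"):
        print(site, site_matches(site, desired))
```

Comparing whole list items, rather than searching for a substring, avoids accidental matches between sites whose names share a prefix.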
10. Matching on Job Type
● User Jobs can publish the attribute
DESIRES_HTPC (integer representation of a Boolean value)
● e.g. +DESIRES_HTPC = 1
● If not defined, defaults to 0
● The G.F. and the glideins may advertise
GLIDEIN_Is_HTPC (Boolean value)
● If not defined, defaults to False
● The matchmaking policy is
(GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1)
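A minimal Python sketch of this rule, including the two defaults, is given below. The attribute names come from the slide; representing the Job and glidein ads as plain dictionaries, and the function name htpc_matches, are assumptions made for illustration.

```python
def htpc_matches(job_ad, glidein_ad):
    """HTPC jobs go to HTPC glideins, single CPU jobs to the others."""
    desires_htpc = job_ad.get("DESIRES_HTPC", 0)        # default: 0
    is_htpc = glidein_ad.get("GLIDEIN_Is_HTPC", False)  # default: False
    return (is_htpc is True) == (desires_htpc == 1)


if __name__ == "__main__":
    print(htpc_matches({}, {}))
    print(htpc_matches({"DESIRES_HTPC": 1}, {}))
    print(htpc_matches({"DESIRES_HTPC": 1}, {"GLIDEIN_Is_HTPC": True}))
```

The equality of the two Boolean terms means the rule works in both directions: an HTPC glidein will not pick up a single CPU job either.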
13. Description
● VO FE @ UCSD serves CMS analysis users
● User Jobs much more chaotic
● Most users don't really understand their needs
● Must protect from accidental errors
● Yet keep the system flexible
● Net result
● More complex policy
14. Two different policies
● The AnaOps FE actually has two policies
● The Regular policy
● The Overflow policy
● The Regular policy tries to match resources
● Based on User desires
● The Overflow policy “outsmarts” the Users
● Will violate User desires without breaking the Jobs
● The aim is to finish user jobs sooner
● Users can opt out, if they wish
15. The Regular M.M. policy
● Four+one dimensions
● Grid Site
● Single CPU vs HTPC
● Memory usage
● Job duration
Due to preemption
● Number of Job Starts
● The actual policy is the AND of all the dimensions
● Both VO FE policy and HTCondor policy
defined in the VO FE instance configuration
16. Grid site selection
● This is both similar and different compared to
the Production FE @CERN
● Serves the same purpose, but supports three
different ways to select a site
– Due to historical evolution
● The three options are
● GLIDEIN_CMSSite ∈ DESIRED_Sites
● GLIDEIN_SEs ∈ DESIRED_SEs
(planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅)
● GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers
● The actual policy is the OR of the three
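The OR of the three options can be sketched in Python as below. The attribute names are those listed above; the dictionary representation of the ads, the helper names and the example gatekeeper host names are hypothetical. The sketch models the current membership test only, not the planned intersection test for SEs.

```python
# (glidein attribute, job attribute) pairs, one per way to select a site.
SITE_ATTRIBUTE_PAIRS = [
    ("GLIDEIN_CMSSite", "DESIRED_Sites"),
    ("GLIDEIN_SEs", "DESIRED_SEs"),
    ("GLIDEIN_Gatekeeper", "DESIRED_Gatekeepers"),
]


def split_list(value):
    """Split a comma separated string list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def grid_site_matches(job_ad, glidein_ad):
    """True if any one of the three membership tests succeeds."""
    for glidein_attr, job_attr in SITE_ATTRIBUTE_PAIRS:
        glidein_value = glidein_ad.get(glidein_attr)
        desired = job_ad.get(job_attr)
        if glidein_value is None or desired is None:
            continue  # this way of selecting a site is not in use
        if glidein_value in split_list(desired):
            return True
    return False


if __name__ == "__main__":
    glidein = {"GLIDEIN_CMSSite": "T2_US_UCSD",
               "GLIDEIN_Gatekeeper": "gk2.example.org"}
    print(grid_site_matches({"DESIRED_Sites": "T2_DE_DESY,T2_US_UCSD"}, glidein))
    print(grid_site_matches({"DESIRED_Gatekeepers": "gk1.example.org"}, glidein))
```

A Job that defines none of the three DESIRED_ attributes never matches in this sketch; whether the real policy treats such Jobs the same way is not stated here.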
17. Job type selection
● Just like @ CERN
18. Memory Usage
● Most Grid sites put strict limits on the amount of
memory that can be used
● Will kill glideins if they exceed the limit
● G.F. and glideins advertise the Entry-specific limit
GLIDEIN_MaxMemMBs
● Jobs can explicitly declare the needed memory
request_memory (native Condor attribute, no + needed)
● Condor will also measure it at run time
– ImageSize – Virtual memory used
– ResidentSetSize – True memory usage
● Use a combination of these to calculate the actual JobMemory
● Policy: JobMemory <= GLIDEIN_MaxMemMBs
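The exact combination used to compute JobMemory is not spelled out above. The Python sketch below shows one possible choice, purely as an assumption: take the larger of the declared request_memory (in MB) and the measured ResidentSetSize (which HTCondor reports in KiB), and fall back to a default when neither is known. The default value and the function names are hypothetical.

```python
DEFAULT_JOB_MEMORY_MB = 2000  # hypothetical default, not from the slides


def job_memory_mb(job_ad):
    """One possible JobMemory estimate, in MB (an assumption)."""
    candidates = []
    if "request_memory" in job_ad:
        candidates.append(job_ad["request_memory"])          # already in MB
    if "ResidentSetSize" in job_ad:
        candidates.append(job_ad["ResidentSetSize"] / 1024)  # KiB -> MB
    return max(candidates) if candidates else DEFAULT_JOB_MEMORY_MB


def memory_matches(job_ad, glidein_ad):
    """Policy from the slide: JobMemory <= GLIDEIN_MaxMemMBs."""
    return job_memory_mb(job_ad) <= glidein_ad["GLIDEIN_MaxMemMBs"]


if __name__ == "__main__":
    glidein = {"GLIDEIN_MaxMemMBs": 2500}
    print(memory_matches({"request_memory": 1500}, glidein))
    print(memory_matches({"request_memory": 1500,
                          "ResidentSetSize": 3 * 1024 * 1024}, glidein))
```

Taking the maximum is the conservative option: a Job that has already been observed to use more memory than it declared is matched on the observed value.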
19. Job Duration 1/2
● Glideins have a limited lifetime
● Must fit within the limits of the Grid site's queue
● Glideins publish the deadline
GLIDEIN_ToDie
– Jobs must finish before reaching the deadline
● Final user job lifetime unpredictable
● Depends on the type of computing done
● User should indicate the expected job lifetime
– Else we have to assume reasonable defaults
Not many users set these values right now
20. Job Duration 2/2
● The same type of computation may take different amounts of time
● e.g. based on the type of input
● Jobs can declare two attributes
● NormMaxWallTimeMins – Expected limit
● MaxWallTimeMins – Absolute max limit
● The matchmaking logic is
● Use NormMaxWallTimeMins for the first job startup
● Use MaxWallTimeMins for all others
(based on the simple assumption that the job was killed for hitting the deadline)
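The duration logic can be sketched in Python as follows. Several details are assumptions made for illustration: that GLIDEIN_ToDie is an absolute time in seconds since the epoch, that the first startup is recognised through the HTCondor job attribute NumJobStarts being 0, and the default used when the Job declares nothing.

```python
DEFAULT_WALLTIME_MINS = 8 * 60  # hypothetical default, not from the slides


def walltime_for_matching(job_ad):
    """Pick the wall time limit (in minutes) to use for this match."""
    max_mins = job_ad.get("MaxWallTimeMins", DEFAULT_WALLTIME_MINS)
    if job_ad.get("NumJobStarts", 0) == 0:
        # First startup: trust the expected limit, if declared.
        return job_ad.get("NormMaxWallTimeMins", max_mins)
    # Restarted job: assume it was killed at the deadline, use the max.
    return max_mins


def duration_matches(job_ad, glidein_ad, now):
    """True if the job is expected to finish before the glidein dies."""
    finish_by = now + walltime_for_matching(job_ad) * 60
    return finish_by < glidein_ad["GLIDEIN_ToDie"]


if __name__ == "__main__":
    now = 1_000_000
    glidein = {"GLIDEIN_ToDie": now + 4 * 3600}
    job = {"NormMaxWallTimeMins": 120, "MaxWallTimeMins": 600}
    print(duration_matches(job, glidein, now))
    print(duration_matches(dict(job, NumJobStarts=1), glidein, now))
```

With this rule a restarted Job only matches glideins with enough remaining lifetime for its absolute maximum, which should prevent a second kill at the deadline.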
21. Cut on number of re-starts
● Not really a user configurable property
● More an emergency brake
● In a properly configured system,
should never be triggered
● But unexpected problems happen
● So better limit the damage
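A cut of this kind could look like the Python sketch below. NumJobStarts is a standard HTCondor job attribute, but both its use here and the limit value are assumptions; the actual threshold is not given above.

```python
MAX_JOB_STARTS = 10  # hypothetical limit, not from the slides


def restart_cut_passes(job_ad, max_starts=MAX_JOB_STARTS):
    """Stop matching a job that has already been started too many times."""
    return job_ad.get("NumJobStarts", 0) < max_starts


if __name__ == "__main__":
    print(restart_cut_passes({"NumJobStarts": 2}))
    print(restart_cut_passes({"NumJobStarts": 10}))
```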
22. The Overflow Use case
● User Jobs specify a list of sites,
because the data they need is there
● With recent versions of CMSSW, jobs can
access the data remotely
● With a small performance penalty
● We can thus schedule jobs “anywhere”
● As long as the needed data is
at a Site that has joined the xrootd federation
● But only if no CPU is available “close to the data”
– And not too far, either
http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557
http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557
23. The Overflow M.M. policy
● Violate only the “Site selection” rule
● Keep all the others
● Plus, add one+one more:
● An opt-out mechanism
● Delayed matching
24. New Site M.M. policy
● The user specified attribute is used
to flag the job as “Overflowable”
● i.e. the job will match if and only if
(DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠∅
Still support all 3 types of site identification
● Matching jobs can then run on any glidein
● Additional limits can be put in place by the FE,
but mostly invisible to the user
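The "Overflowable" test is a set intersection, which can be sketched in Python as below for the site-name variant. The list of supported sites is assumed to be maintained on the FE side (for instance, the Sites that have joined the xrootd federation); the function names and example values are hypothetical.

```python
def split_list(value):
    """Split a comma separated string list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_overflowable(desired_sites, supported_sites):
    """True if at least one desired site is in the supported set."""
    return bool(set(split_list(desired_sites)) & set(supported_sites))


if __name__ == "__main__":
    supported = ["T2_US_UCSD", "T2_US_Nebraska"]  # hypothetical FE-side list
    print(is_overflowable("T2_DE_DESY,T2_US_UCSD", supported))
    print(is_overflowable("T2_DE_DESY", supported))
```

Note that the glidein's own site does not appear in this test at all: once a Job is flagged as Overflowable, it can run on any glidein.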
25. The opt-out mechanism
● The Overflow policy
considers all jobs by default
● But Users may want to opt out some of their Jobs
– Sometimes it is a real need
(to get deterministic results, e.g. for testing a site)
● To opt out, the user defines
+CMS_ALLOW_OVERFLOW = False
● The FE will not consider such jobs for Overflowing
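In Python terms, the opt-out amounts to an attribute that defaults to True. The sketch below assumes the Job ad is available as a dictionary; the function name is hypothetical.

```python
def overflow_allowed(job_ad):
    """Jobs are considered for Overflow unless the user opted out."""
    return bool(job_ad.get("CMS_ALLOW_OVERFLOW", True))


if __name__ == "__main__":
    print(overflow_allowed({}))
    print(overflow_allowed({"CMS_ALLOW_OVERFLOW": False}))
```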
26. Delayed matching
● As said initially,
Jobs should preferentially run close to the data
● Overflow should only consider jobs
“that cannot find resources close to the data”
● We implemented it based on time
● Jobs are matched only
if they have been waiting in the queue for more than 6 hours
Users cannot influence this threshold
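The time-based rule can be sketched in Python as below. The 6 hour threshold is the one stated above; measuring the wait from the HTCondor job attribute QDate (the submission time, in seconds since the epoch) is an assumption, as is the function name.

```python
OVERFLOW_DELAY_SECS = 6 * 3600  # 6 hours, as stated on the slide


def waited_long_enough(job_ad, now):
    """True once the job has been queued for more than the delay."""
    return now - job_ad["QDate"] > OVERFLOW_DELAY_SECS


if __name__ == "__main__":
    submitted = 1_000_000
    print(waited_long_enough({"QDate": submitted}, submitted + 3600))
    print(waited_long_enough({"QDate": submitted}, submitted + 7 * 3600))
```

Because the threshold is a constant on the FE side rather than a Job attribute, Users have no way to shorten or lengthen it.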
29. Pointers
● glideinWMS Home Page
http://tinyurl.com/glideinWMS
● HTCondor Home Page
http://research.cs.wisc.edu/htcondor/
● HTCondor support
htcondor-users@cs.wisc.edu
htcondor-admin@cs.wisc.edu
● glideinWMS support
glideinwms-support@fnal.gov
30. Acknowledgments
● The creation of this document was sponsored
by grants from the US NSF and US DOE,
and by the University of California system