glideinWMS for users



    Matchmaking in glideinWMS
             in CMS
                     by Igor Sfiligoi (UCSD)




CERN, Dec 2012          glideinWMS matchmaking   1
Scope of this talk



                      This talk provides a
                 high level description of how
                  glideinWMS matchmaking
                        works in CMS.



                 Reader is expected to be familiar with the CMS experiment environment
                                        http://cms.web.cern.ch/


glideinWMS architecture
 ●   A reminder
      [Diagram: the VO FE requests glideins (+3, +1) from two Glidein Factories (G.F.),
       which submit them to the Grid; the resulting Execute nodes join the Condor pool
       made of Submit nodes (Schedd) and the Central manager (Negotiator)]




Two levels of matchmaking
 ●   First in the VO Frontend
      ●   To decide where to provision resources
      ●   i.e. where to send glideins
 ●   Then in the HTCondor Negotiator
      ●   To decide which Job gets the glidein Slot
 ●   The two must have compatible policies
      [Diagram: glideinWMS architecture, as in the reminder slide]


Defining the policy
 ●    The VO FE configures the glideins
       ●   So it can define the Slot Requirements
 ●    The preferred strategy is to leave all policy
      decisions in the VO FE hands, i.e. both
       ●   VO FE matchmaking policy
       ●   HTCondor matchmaking policy
      (It is easier to keep them in sync this way)
 ●    This implies
       ●   Users should not define Job Requirements
       ●   Instead, they publish attributes describing requirements
      http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands


CMS Production @ CERN
                Policies




Description
 ●   The VO FE @ CERN serves
     the production needs
      ●   i.e. Reconstruction and MC production
 ●   Job submission is regulated by a service managed
     by a dedicated team,
     so jobs are
      ●   Targeted
      ●   Well behaved (at least by and large)



Matchmaking policy
 ●   Two dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
 ●   The actual policy is the AND of both
 ●   Both VO FE policy and HTCondor policy
     are defined in the VO FE instance configuration




Matching on Grid site name
 ●   User Jobs are expected to publish the attribute
     DESIRED_Sites (a string list)
      ●   e.g. +DESIRED_Sites   = "T2_DE_DESY,T2_US_UCSD"
 ●   The G.F. and the glideins advertise
     GLIDEIN_CMSSite
 ●   The matchmaking policy is
     GLIDEIN_CMSSite ∈ DESIRED_Sites
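
The membership test above can be sketched in Python as follows. This is only an illustration of the logic: the real policy is an expression in the VO FE configuration, and the helper name site_matches is hypothetical.

```python
def site_matches(glidein_cms_site: str, desired_sites: str) -> bool:
    """Return True if the glidein's site is one of the job's desired sites."""
    # DESIRED_Sites is published as one comma-separated string,
    # so split it into individual site names before comparing.
    wanted = [s.strip() for s in desired_sites.split(",") if s.strip()]
    # Compare whole names; a substring such as "T2_US" must not match.
    return glidein_cms_site in wanted


desired = "T2_DE_DESY,T2_US_UCSD"
print(site_matches("T2_US_UCSD", desired))
print(site_matches("T2_US", desired))
```

Splitting the list first matters: a plain substring search on the raw string would wrongly accept partial site names.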




Matching on Job Type
 ●   User Jobs can publish the attribute
     DESIRES_HTPC (integer representation of Boolean values)
      ●   e.g. +DESIRES_HTPC   = 1
      ●   If not defined, defaults to 0
 ●   The G.F. and the glideins may advertise
     GLIDEIN_Is_HTPC (Boolean value)
      ●   If not defined, defaults to False
 ●   The matchmaking policy is
     (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1)
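
A minimal Python sketch of this rule, including the two defaults, might look like the following. The function name htpc_matches is hypothetical, and None stands in for an undefined attribute.

```python
from typing import Optional


def htpc_matches(glidein_is_htpc: Optional[bool],
                 desires_htpc: Optional[int]) -> bool:
    """Mirror (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1) with the stated defaults."""
    # An undefined GLIDEIN_Is_HTPC defaults to False.
    is_htpc = False if glidein_is_htpc is None else glidein_is_htpc
    # An undefined DESIRES_HTPC defaults to 0.
    wants_htpc = 0 if desires_htpc is None else desires_htpc
    # Both sides must agree: HTPC jobs go to HTPC glideins,
    # single-CPU jobs go to single-CPU glideins.
    return (is_htpc is True) == (wants_htpc == 1)


print(htpc_matches(None, None))  # both defaults: single CPU on both sides
print(htpc_matches(True, 0))     # HTPC glidein, single-CPU job
```

Because the rule is an equality rather than an implication, a single-CPU job never lands on an HTPC glidein, and vice versa.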


Example submit file


         Universe   = vanilla
         Executable = mcgen
         Arguments  = -k 1543.3
         Output     = mcgen.out
         Error      = mcgen.err
         Log        = mcgen.log
         +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
         +DESIRES_HTPC = 0
         Requirements = True
         Queue 1




CMS AnaOps @ UCSD
                      Policies




Description
 ●   VO FE @ UCSD serves CMS analysis users
 ●   User Jobs much more chaotic
      ●   Most users don't really understand their needs
      ●   Must protect from accidental errors
      ●   Yet keep the system flexible
 ●   Net result
      ●   More complex policy



Two different policies
 ●   The AnaOps FE actually has two policies
      ●   The Regular policy
      ●   The Overflow policy
 ●   The Regular policy tries to match resources
      ●   Based on User desires
 ●   The Overflow policy “outsmarts” the Users
      ●   Will violate User desires without breaking the Jobs
      ●   The aim is to finish user jobs sooner
      ●   Users can opt out if they wish
The Regular M.M. policy
 ●   Four+one dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
      ●   Memory usage
      ●   Job duration
      ●   Number of Job Starts (due to preemption)
 ●   The actual policy is the AND of all of them
 ●   Both VO FE policy and HTCondor policy
     are defined in the VO FE instance configuration
Grid site selection
 ●   This is both similar to and different from
     the Production FE @CERN
      ●   Serves the same purpose, but supports three
          different ways to select a site
           –     Due to historical evolution
 ●   The three options are
      ●   GLIDEIN_CMSSite ∈ DESIRED_Sites
      ●   GLIDEIN_SEs ∈ DESIRED_SEs
           –     Planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅
      ●   GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers
 ●   The actual policy is the OR of the three
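
The OR of the three options can be sketched as below. The attribute names come from the slide; the dictionaries standing in for the glidein and job ads, the helper names, and the choice to treat an undefined attribute as "no match for that option" are assumptions made for illustration.

```python
from typing import Mapping

# (glidein attribute, job attribute) pairs, as named on the slide.
SITE_ATTRIBUTE_PAIRS = (
    ("GLIDEIN_CMSSite", "DESIRED_Sites"),
    ("GLIDEIN_SEs", "DESIRED_SEs"),
    ("GLIDEIN_Gatekeeper", "DESIRED_Gatekeepers"),
)


def in_string_list(value: str, string_list: str) -> bool:
    """Test membership of value in a comma-separated string list."""
    return value in [s.strip() for s in string_list.split(",") if s.strip()]


def regular_site_match(glidein: Mapping[str, str], job: Mapping[str, str]) -> bool:
    """Return True if any one of the three site-selection options matches."""
    for glidein_attr, job_attr in SITE_ATTRIBUTE_PAIRS:
        # Assumption: an option only counts when both sides define the attribute.
        if glidein_attr in glidein and job_attr in job:
            if in_string_list(glidein[glidein_attr], job[job_attr]):
                return True
    return False


job = {"DESIRED_SEs": "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"}
glidein = {"GLIDEIN_CMSSite": "T2_US_UCSD", "GLIDEIN_SEs": "stormfe1.pi.infn.it"}
print(regular_site_match(glidein, job))
```

A job therefore needs to publish only one of the three DESIRED_ attributes to be matched.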

Job type selection
 ●   Just like @ CERN




Memory Usage
●   Most Grid sites put strict limits on the amount of
    memory that can be used
    ●   Will kill glideins if they exceed the limit
●   G.F. and glideins advertise the Entry-specific limit
    GLIDEIN_MaxMemMBs
●   Jobs can explicitly declare the needed memory
    request_memory (native Condor attribute, no + needed)
    ●   Condor will also measure it at run time
         –   ImageSize – Virtual memory used
         –   ResidentSetSize – True memory usage
    ●   A combination of these is used to calculate
        the actual JobMemory
●   Policy: JobMemory <= GLIDEIN_MaxMemMBs
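
The slide does not say how the declared and measured values are combined into JobMemory. The sketch below assumes, purely for illustration, that it is the larger of request_memory (in MB) and the measured ResidentSetSize (which HTCondor reports in KiB); the function names are hypothetical.

```python
from typing import Optional


def job_memory_mb(request_memory_mb: Optional[int],
                  resident_set_size_kb: Optional[int]) -> int:
    """Hypothetical JobMemory: max of declared and measured memory, in MB."""
    declared = request_memory_mb or 0
    # Convert the run-time measurement from KiB to MB (integer division).
    measured = (resident_set_size_kb or 0) // 1024
    return max(declared, measured)


def memory_fits(job_mem_mb: int, glidein_max_mem_mbs: int) -> bool:
    """Policy from the slide: JobMemory <= GLIDEIN_MaxMemMBs."""
    return job_mem_mb <= glidein_max_mem_mbs


# A job that declared 1500 MB but was measured at 2 GiB resident.
needed = job_memory_mb(1500, 2 * 1024 * 1024)
print(needed, memory_fits(needed, 2000))
```

Taking the maximum means a job that under-declares its memory is still steered away from entries whose limit it has already exceeded once.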
Job Duration                  1/2




 ●   Glideins have a limited lifetime
      ●   Must fit within the limits of the Grid site's queue
      ●   Glideins publish the deadline
          GLIDEIN_ToDie
           –     Jobs must finish before reaching the deadline
 ●   Final user job lifetime is unpredictable
      ●   Depends on the type of computing done
      ●   Users should indicate the expected job lifetime
           –     Else we have to assume reasonable defaults
           –     Not many users set these values right now
Job Duration                  2/2




 ●   The same type of computation may take
     different amounts of time
      ●   e.g. Based on the type of input
 ●   Jobs can declare two attributes
      ●   NormMaxWallTimeMins – Expected limit
      ●   MaxWallTimeMins – Absolute max limit
 ●   The matchmaking logic is
      ●   Use NormMaxWallTimeMins for
          the first job startup
      ●   Use MaxWallTimeMins for all others
           –     Based on the simple assumption
                 that the job was killed for
                 hitting the deadline.
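
The selection logic can be sketched as follows. The sketch assumes that the number of previous starts is available as a counter (HTCondor keeps one in the job attribute NumJobStarts), that GLIDEIN_ToDie is an absolute time in seconds, and that a job fits when its expected end falls before that deadline; the function names are hypothetical.

```python
def expected_walltime_mins(num_job_starts: int,
                           norm_max_walltime_mins: int,
                           max_walltime_mins: int) -> int:
    """Pick the walltime estimate to use for matchmaking."""
    if num_job_starts == 0:
        # First startup: trust the expected (normal) limit.
        return norm_max_walltime_mins
    # Any restart: assume the job was killed for hitting the deadline,
    # so plan for the absolute maximum.
    return max_walltime_mins


def fits_before_deadline(now_secs: int, glidein_to_die_secs: int,
                         walltime_mins: int) -> bool:
    """Assumed check: the job must finish before the glidein deadline."""
    return now_secs + walltime_mins * 60 < glidein_to_die_secs


# Values from the example submit file: 7200 and 14400 minutes.
print(expected_walltime_mins(0, 7200, 14400))
print(expected_walltime_mins(2, 7200, 14400))
```

Switching to the larger value after the first start trades some matching opportunities for a lower chance of being killed again.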

Cut on number of re-starts
 ●   Not really a user configurable property
      ●   More an emergency brake
 ●   In a properly configured system,
     should never be triggered
      ●   But unexpected problems happen
      ●   So better limit the damage
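
Such a cut could be sketched as a simple threshold on the job's start counter. The limit of 10 below is a made-up placeholder, since the slides do not state the value used in production, and the function name is hypothetical.

```python
# Hypothetical limit; the actual value is not given in the slides.
MAX_JOB_STARTS = 10


def below_restart_limit(num_job_starts: int, limit: int = MAX_JOB_STARTS) -> bool:
    """Return True while the job may still be matched to a glidein."""
    # Once a job has been started `limit` times, stop matching it,
    # so a broken job cannot keep consuming glidein slots.
    return num_job_starts < limit


print(below_restart_limit(0))
print(below_restart_limit(10))
```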




The Overflow Use case
 ●   User Jobs specify a list of sites,
     because the data they need is there
 ●   With recent versions of CMSSW, jobs can
     access the data remotely
      ●   With a small performance penalty
 ●   We can thus schedule jobs “anywhere”
      ●   As long as the needed data is
          at a Site that has joined the xrootd federation
      ●   But only if no CPU available “close to the data”
           –     And not too far, either
                      http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557
                      http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557

The Overflow M.M. policy
 ●   Violate only the “Site selection” rule
      ●   Keep all the others
 ●   Plus, add one+one more:
      ●   An opt-out mechanism
      ●   Delayed matching




New Site M.M. policy
 ●   The user specified attribute is used
     to flag the job as “Overflowable”
      ●   i.e. the job will match if and only if
          (DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠ ∅
           –     Still support all 3 types of site identification
 ●   Matching jobs can then run on any glidein
      ●   Additional limits can be put in place by the FE,
          but mostly invisible to the user
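
The "Overflowable" test is a non-empty set intersection, which can be sketched as below. Both lists are assumed to be comma-separated strings, like the other site attributes; the function names and the example supported list are hypothetical.

```python
def to_set(string_list: str) -> set:
    """Turn a comma-separated string list into a set of names."""
    return {s.strip() for s in string_list.split(",") if s.strip()}


def is_overflowable(desired: str, supported: str) -> bool:
    """True if at least one desired site is also a supported one."""
    # A non-empty intersection means the data is available at a site
    # the Overflow policy can read from.
    return bool(to_set(desired) & to_set(supported))


# The supported list here is made up for the example.
print(is_overflowable("T2_DE_DESY,T2_US_UCSD", "T2_US_UCSD,T2_US_Nebraska"))
```

Note that the desired list no longer restricts where the job runs; it only decides whether the job qualifies for Overflow at all.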




The opt-out mechanism
 ●   The Overflow policy
     considers all jobs by default
      ●   But Users may want to opt out some of the Jobs
           –     Sometimes it is a real need
                 (e.g. to get deterministic results when testing a site)
 ●   To opt-out, the user defines
     +CMS_ALLOW_OVERFLOW = False
      ●   The FE will not consider such jobs for Overflowing




Delayed matching
 ●   As said initially,
     Jobs should preferentially run close to the data
      ●   Overflow should only consider jobs
          “that cannot find resources close to the data”
 ●   We implemented it based on time
      ●   Jobs are matched only
          if waiting in the queue for more than 6 hours
      ●   Users cannot influence it
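
Combined with the opt-out flag, the delay rule might be sketched as follows. The sketch assumes the waiting time is measured from the job's submission time (HTCondor records it in the job attribute QDate, in seconds since the epoch) and uses a dictionary as a stand-in for the job ad; the function name is hypothetical.

```python
from typing import Any, Mapping

OVERFLOW_DELAY_SECS = 6 * 3600  # the 6 hours quoted on the slide


def overflow_eligible(job: Mapping[str, Any], now_secs: int) -> bool:
    """True if the Overflow policy may consider this job."""
    # Jobs are considered by default; only an explicit False opts out.
    if job.get("CMS_ALLOW_OVERFLOW", True) is False:
        return False
    # Assumption: queue waiting time is counted from submission (QDate).
    return now_secs - job["QDate"] > OVERFLOW_DELAY_SECS


now = 1_000_000
print(overflow_eligible({"QDate": now - 7 * 3600}, now))
print(overflow_eligible({"QDate": now - 7 * 3600, "CMS_ALLOW_OVERFLOW": False}, now))
```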




Example submit file

 Universe   = vanilla
 Executable = myana
 Arguments  = -k 1543.3
 Output     = myana.out
 Error      = myana.err
 Log        = myana.log
 request_memory = 1500
 +DESIRED_SEs = "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"
 +NormMaxWallTimeMins = 7200
 +MaxWallTimeMins = 14400
 +DESIRES_HTPC = 0
 +CMS_ALLOW_OVERFLOW = True
 Requirements = True
 Queue 1


The End




Pointers
 ●   glideinWMS Home Page
     http://tinyurl.com/glideinWMS
 ●   HTCondor Home Page
     http://research.cs.wisc.edu/htcondor/
 ●   HTCondor support
     htcondor-users@cs.wisc.edu
     htcondor-admin@cs.wisc.edu
 ●   glideinWMS support
     glideinwms-support@fnal.gov

Acknowledgments
 ●   The creation of this document was sponsored
     by grants from the US NSF and US DOE,
     and by the University of California system




CERN, Dec 2012       glideinWMS matchmaking        30

Más contenido relacionado

Destacado

glideinWMS Training 2014 - HTCondor Internals
glideinWMS Training 2014 - HTCondor InternalsglideinWMS Training 2014 - HTCondor Internals
glideinWMS Training 2014 - HTCondor InternalsIgor Sfiligoi
 
Using ssh as portal - The CMS CRAB over glideinWMS experience
Using ssh as portal - The CMS CRAB over glideinWMS experienceUsing ssh as portal - The CMS CRAB over glideinWMS experience
Using ssh as portal - The CMS CRAB over glideinWMS experienceIgor Sfiligoi
 
VMworld 2013: Performance and Capacity Management of DRS Clusters
VMworld 2013: Performance and Capacity Management of DRS Clusters VMworld 2013: Performance and Capacity Management of DRS Clusters
VMworld 2013: Performance and Capacity Management of DRS Clusters VMworld
 
Understanding priorities in HTCondor
Understanding priorities in HTCondorUnderstanding priorities in HTCondor
Understanding priorities in HTCondorIgor Sfiligoi
 
Presentation 15 condor-v1
Presentation 15 condor-v1Presentation 15 condor-v1
Presentation 15 condor-v1Simon Kim
 
Where to find DHTC resources - OSG School 2014
Where to find DHTC resources - OSG School 2014Where to find DHTC resources - OSG School 2014
Where to find DHTC resources - OSG School 2014Igor Sfiligoi
 
Augmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaAugmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaIgor Sfiligoi
 
Known HTCondor break points
Known HTCondor break pointsKnown HTCondor break points
Known HTCondor break pointsIgor Sfiligoi
 

Destacado (9)

glideinWMS Training 2014 - HTCondor Internals
glideinWMS Training 2014 - HTCondor InternalsglideinWMS Training 2014 - HTCondor Internals
glideinWMS Training 2014 - HTCondor Internals
 
Using ssh as portal - The CMS CRAB over glideinWMS experience
Using ssh as portal - The CMS CRAB over glideinWMS experienceUsing ssh as portal - The CMS CRAB over glideinWMS experience
Using ssh as portal - The CMS CRAB over glideinWMS experience
 
distcom-short-20140112-1600
distcom-short-20140112-1600distcom-short-20140112-1600
distcom-short-20140112-1600
 
VMworld 2013: Performance and Capacity Management of DRS Clusters
VMworld 2013: Performance and Capacity Management of DRS Clusters VMworld 2013: Performance and Capacity Management of DRS Clusters
VMworld 2013: Performance and Capacity Management of DRS Clusters
 
Understanding priorities in HTCondor
Understanding priorities in HTCondorUnderstanding priorities in HTCondor
Understanding priorities in HTCondor
 
Presentation 15 condor-v1
Presentation 15 condor-v1Presentation 15 condor-v1
Presentation 15 condor-v1
 
Where to find DHTC resources - OSG School 2014
Where to find DHTC resources - OSG School 2014Where to find DHTC resources - OSG School 2014
Where to find DHTC resources - OSG School 2014
 
Augmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaAugmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with Nirvana
 
Known HTCondor break points
Known HTCondor break pointsKnown HTCondor break points
Known HTCondor break points
 

Similar a Matchmaking in glideinWMS in CMS

glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...Igor Sfiligoi
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolIgor Sfiligoi
 
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012Igor Sfiligoi
 
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Igor Sfiligoi
 
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM... glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...Igor Sfiligoi
 
Introduction to glideinWMS
Introduction to glideinWMSIntroduction to glideinWMS
Introduction to glideinWMSIgor Sfiligoi
 
glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012Igor Sfiligoi
 
Condor from the user point of view - glideinWMS Training Jan 2012
Condor from the user point of view - glideinWMS Training Jan 2012Condor from the user point of view - glideinWMS Training Jan 2012
Condor from the user point of view - glideinWMS Training Jan 2012Igor Sfiligoi
 
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012Igor Sfiligoi
 
Drools planner - 2012-10-23 IntelliFest 2012
Drools planner - 2012-10-23 IntelliFest 2012Drools planner - 2012-10-23 IntelliFest 2012
Drools planner - 2012-10-23 IntelliFest 2012Geoffrey De Smet
 
An argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceAn argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceIgor Sfiligoi
 
Wedding convenience and control with RemoteCondor
Wedding convenience and control with RemoteCondorWedding convenience and control with RemoteCondor
Wedding convenience and control with RemoteCondorIgor Sfiligoi
 
Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Igor Sfiligoi
 
glideinWMS Training Jan 2012 - Condor tuning
glideinWMS Training Jan 2012 - Condor tuningglideinWMS Training Jan 2012 - Condor tuning
glideinWMS Training Jan 2012 - Condor tuningIgor Sfiligoi
 
Consolidated shared indexes in real time
Consolidated shared indexes in real timeConsolidated shared indexes in real time
Consolidated shared indexes in real timeJeff Mace
 
Moby is killing your devops efforts
Moby is killing your devops effortsMoby is killing your devops efforts
Moby is killing your devops effortsKris Buytaert
 

Similar a Matchmaking in glideinWMS in CMS (20)

glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
 
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
 
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
 
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM... glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 
Glidein internals
Glidein internalsGlidein internals
Glidein internals
 
Introduction to glideinWMS
Introduction to glideinWMSIntroduction to glideinWMS
Introduction to glideinWMS
 
glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012
 
Condor from the user point of view - glideinWMS Training Jan 2012
Condor from the user point of view - glideinWMS Training Jan 2012Condor from the user point of view - glideinWMS Training Jan 2012
Condor from the user point of view - glideinWMS Training Jan 2012
 
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
 
Drools planner - 2012-10-23 IntelliFest 2012
Drools planner - 2012-10-23 IntelliFest 2012Drools planner - 2012-10-23 IntelliFest 2012
Drools planner - 2012-10-23 IntelliFest 2012
 
An argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceAn argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS Experience
 
Pilot Factory
Pilot FactoryPilot Factory
Pilot Factory
 
Wedding convenience and control with RemoteCondor
Wedding convenience and control with RemoteCondorWedding convenience and control with RemoteCondor
Wedding convenience and control with RemoteCondor
 
Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012
 
GIT Introduction
GIT IntroductionGIT Introduction
GIT Introduction
 
glideinWMS Training Jan 2012 - Condor tuning
glideinWMS Training Jan 2012 - Condor tuningglideinWMS Training Jan 2012 - Condor tuning
glideinWMS Training Jan 2012 - Condor tuning
 
Consolidated shared indexes in real time
Consolidated shared indexes in real timeConsolidated shared indexes in real time
Consolidated shared indexes in real time
 
Moby is killing your devops efforts
Moby is killing your devops effortsMoby is killing your devops efforts
Moby is killing your devops efforts
 
SWT Tech Sharing: Node.js + Redis
SWT Tech Sharing: Node.js + RedisSWT Tech Sharing: Node.js + Redis
SWT Tech Sharing: Node.js + Redis
 

Más de Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Igor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputIgor Sfiligoi
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 

Matchmaking in glideinWMS in CMS

  • 1. glideinWMS for users: Matchmaking in glideinWMS in CMS, by Igor Sfiligoi (UCSD). CERN, Dec 2012.
  • 2. Scope of this talk: This talk provides a high-level description of how glideinWMS matchmaking works in CMS. The reader is expected to be familiar with the CMS experiment environment (http://cms.web.cern.ch/).
  • 3. glideinWMS architecture (a reminder). [Diagram labels: VO FE, two G.F. boxes (+3, +1), Grid, Submit nodes (Schedd), Central manager (Negotiator), Execute nodes (Condor).]
  • 4. Two levels of matchmaking: First in the VO Frontend, to decide where to provision resources, i.e. where to send glideins. Then in the HTCondor Negotiator, to decide which Job gets the glidein Slot. The two must have compatible policies.
  • 5. Defining the policy: The VO FE configures the glideins, so it can define the Slot Requirements. The preferred strategy is to leave all policy decisions in the VO FE's hands, i.e. both the VO FE matchmaking policy and the HTCondor matchmaking policy; it is easier to keep them in sync this way. This implies that users should not define Job Requirements; instead, they publish attributes describing their requirements. http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands
  • 6. CMS Production @ CERN Policies
  • 7. Description: The VO FE @ CERN serves the production needs, i.e. Reconstruction and MC production. Job submission is regulated by a service managed by a dedicated team, so jobs are targeted and well behaved (at least by and large).
  • 8. Matchmaking policy: Two dimensions: Grid Site, and Single CPU vs HTPC. The actual policy is the AND of both. Both the VO FE policy and the HTCondor policy are defined in the VO FE instance configuration.
  • 9. Matching on Grid site name: User Jobs are expected to publish the attribute DESIRED_Sites (a string list), e.g. +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD". The G.F. and the glideins advertise GLIDEIN_CMSSite. The matchmaking policy is GLIDEIN_CMSSite ∈ DESIRED_Sites.
  • 10. Matching on Job Type: User Jobs can publish the attribute DESIRES_HTPC (an integer representation of a Boolean value), e.g. +DESIRES_HTPC = 1. If not defined, it defaults to 0. The G.F. and the glideins may advertise GLIDEIN_Is_HTPC (a Boolean value). If not defined, it defaults to False. The matchmaking policy is (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1).
  • 11. Example submit file:
        Universe   = vanilla
        Executable = mcgen
        Arguments  = -k 1543.3
        Output     = mcgen.out
        Error      = mcgen.err
        Log        = mcgen.log
        +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
        +DESIRES_HTPC = 0
        Requirements = True
        Queue 1
  • 12. CMS AnaOps @ UCSD Policies
  • 13. Description: The VO FE @ UCSD serves CMS analysis users. User Jobs are much more chaotic: most users don't really understand their needs, so the system must protect against accidental errors yet remain flexible. The net result is a more complex policy.
  • 14. Two different policies: The AnaOps FE actually has two policies, the Regular policy and the Overflow policy. The Regular policy tries to match resources based on User desires. The Overflow policy "outsmarts" the Users: it will violate User desires without breaking the Jobs. The aim is to finish user jobs sooner. Users can opt out if they wish.
  • 15. The Regular M.M. policy: Four+one dimensions: Grid Site; Single CPU vs HTPC; Memory usage; Job duration (due to preemption); Number of Job Starts. The actual policy is the AND of all of them. Both the VO FE policy and the HTCondor policy are defined in the VO FE instance configuration.
  • 16. Grid site selection: This is both similar to and different from the Production FE @ CERN. It serves the same purpose, but supports three different ways to select a site, due to historical evolution. The three options are: GLIDEIN_CMSSite ∈ DESIRED_Sites; GLIDEIN_SEs ∈ DESIRED_SEs (planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅); GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers. The actual policy is the OR of the three.
  • 17. Job type selection: Just like @ CERN.
  • 18. Memory Usage: Most Grid sites put strict limits on the amount of memory that can be used, and will kill glideins if they exceed the limit. The G.F. and the glideins advertise the Entry-specific limit GLIDEIN_MaxMemMBs. Jobs can explicitly declare the needed memory with request_memory (a native Condor attribute, no + needed). Condor will also measure it at run time: ImageSize (virtual memory used) and ResidentSetSize (true memory usage). A combination of these is used to calculate the actual JobMemory. Policy: JobMemory <= GLIDEIN_MaxMemMBs.
  • 19. Job Duration 1/2: Glideins have a limited lifetime, which must fit within the limits of the Grid site's queue. Glideins publish the deadline GLIDEIN_ToDie; Jobs must finish before reaching the deadline. The final user job lifetime is unpredictable and depends on the type of computing done. Users should indicate the expected job lifetime, else we have to assume reasonable defaults. Not many users set these values right now.
  • 20. Job Duration 2/2: The same type of computation may take different amounts of time, e.g. based on the type of input. Jobs can declare two attributes: NormMaxWallTimeMins (expected limit) and MaxWallTimeMins (absolute max limit). The matchmaking logic is: use NormMaxWallTimeMins for the first job startup, and use MaxWallTimeMins for all others (based on the simple assumption that the job was killed for hitting the deadline).
  • 21. Cut on number of re-starts: Not really a user-configurable property; more of an emergency brake. In a properly configured system, it should never be triggered. But unexpected problems happen, so it is better to limit the damage.
  • 22. The Overflow Use case: User Jobs specify a list of sites, because the data they need is there. With recent versions of CMSSW, jobs can access the data remotely, with a small performance penalty. We can thus schedule jobs "anywhere", as long as the needed data is at a Site that has joined the xrootd federation, but only if no CPU is available "close to the data" (and not too far, either). http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557 http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557
  • 23. The Overflow M.M. policy: Violate only the "Site selection" rule and keep all the others. Plus, add one+one more: an opt-out mechanism, and delayed matching.
  • 24. New Site M.M. policy: The user-specified attribute is used to flag the job as "Overflowable", i.e. the job will match if and only if (DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠ ∅. All 3 types of site identification are still supported. Matching jobs can then run on any glidein. Additional limits can be put in place by the FE, but they are mostly invisible to the user.
  • 25. The opt-out mechanism: The Overflow policy considers all jobs by default, but Users may want to opt out some of the Jobs. Sometimes it is just a need (to get deterministic results, e.g. for testing a site). To opt out, the user defines +CMS_ALLOW_OVERFLOW = False. The FE will not consider such jobs for Overflowing.
  • 26. Delayed matching: As said initially, Jobs should preferentially run close to the data. Overflow should only consider jobs "that cannot find resources close to the data". We implemented it based on time: Jobs are matched only if they have been waiting in the queue for more than 6 hours. Users cannot influence it.
  • 27. Example submit file:
        Universe   = vanilla
        Executable = myana
        Arguments  = -k 1543.3
        Output     = myana.out
        Error      = myana.err
        Log        = myana.log
        request_memory = 1500
        +DESIRED_SEs = "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"
        +NormMaxWallTimeMins = 7200
        +MaxWallTimeMins = 14400
        +DESIRES_HTPC = 0
        +CMS_ALLOW_OVERFLOW = True
        Requirements = True
        Queue 1
  • 28. The End
  • 29. Pointers: glideinWMS Home Page: http://tinyurl.com/glideinWMS. HTCondor Home Page: http://research.cs.wisc.edu/htcondor/. HTCondor support: htcondor-users@cs.wisc.edu, htcondor-admin@cs.wisc.edu. glideinWMS support: glideinwms-support@fnal.gov.
  • 30. Acknowledgments: The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system.
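
The site and job-type rules from slides 9 and 10, and the Overflow rules from slides 24 to 26, can be illustrated with a small Python sketch. This is not how glideinWMS or HTCondor evaluate the policy (they use ClassAd expressions set in the VO FE configuration); it only mimics the logic on plain dictionaries. The function names, the `supported_sites` parameter, and the `queued_hours` parameter are hypothetical, and the memory, duration, and restart rules are left out for brevity.

```python
# Illustrative sketch only: job and glidein ads are modelled as plain dicts.
# All function and parameter names below are hypothetical.

OVERFLOW_DELAY_HOURS = 6  # slide 26: jobs must wait more than 6 hours


def split_list(value):
    """Parse a comma-separated string list into a set of names."""
    return {item.strip() for item in (value or "").split(",") if item.strip()}


def site_matches(job, glidein):
    """Slide 9: GLIDEIN_CMSSite must be a member of DESIRED_Sites."""
    return glidein.get("GLIDEIN_CMSSite") in split_list(job.get("DESIRED_Sites"))


def htpc_matches(job, glidein):
    """Slide 10: (GLIDEIN_Is_HTPC == True) == (DESIRES_HTPC == 1).

    Missing attributes default to False and 0 respectively.
    """
    glidein_is_htpc = glidein.get("GLIDEIN_Is_HTPC", False) is True
    job_wants_htpc = job.get("DESIRES_HTPC", 0) == 1
    return glidein_is_htpc == job_wants_htpc


def production_match(job, glidein):
    """Slide 8: the policy is the AND of the two dimensions."""
    return site_matches(job, glidein) and htpc_matches(job, glidein)


def overflow_match(job, glidein, supported_sites, queued_hours):
    """Slides 23 to 26: drop the site rule, keep the others, add two checks.

    supported_sites stands in for the SUPPORTED_<site>s list of slide 24.
    """
    if job.get("CMS_ALLOW_OVERFLOW", True) is False:
        return False  # the user opted out
    if queued_hours <= OVERFLOW_DELAY_HOURS:
        return False  # still give "close to the data" resources a chance
    if not (split_list(job.get("DESIRED_Sites")) & set(supported_sites)):
        return False  # data is not at a site supporting remote access
    return htpc_matches(job, glidein)


job = {"DESIRED_Sites": "T2_DE_DESY,T2_US_UCSD", "DESIRES_HTPC": 0}
at_ucsd = {"GLIDEIN_CMSSite": "T2_US_UCSD"}
at_pisa = {"GLIDEIN_CMSSite": "T2_IT_Pisa"}

print(production_match(job, at_ucsd))
print(production_match(job, at_pisa))
print(overflow_match(job, at_pisa, {"T2_US_UCSD"}, queued_hours=7))
```

With these inputs the job matches the UCSD glidein under the regular rule, does not match the Pisa glidein, and becomes eligible for the Pisa glidein under the Overflow rule once it has waited more than 6 hours.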