Seminar "Using modern information technologies to solve modern problems of particle physics", Yandex Moscow office, 3 July 2012
Marco Cattaneo, CERN
4. Data reduction in real time (trigger)
❍ pp collisions: 15 MHz
❍ Level-0 (custom hardware)
❏ partial detector information
❏ output: 1 MHz
❍ HLT (CPU farm -> software trigger)
❏ full detector information
❏ reconstruction in real time (~20k jobs in parallel)
❏ ~200 independent “lines”
❏ output: ~5 kHz, 300 MB/s
❍ Offline Storage
❏ 1.3 PB in 2012
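As a back-of-envelope check of these rates, the Python sketch below derives the implied RAW event size and data-taking live time from the quoted 5 kHz, 300 MB/s and 1.3 PB; the derived numbers are estimates, not figures from the slide.

# Back-of-envelope check of the trigger output numbers quoted above
# (5 kHz to storage, 300 MB/s, 1.3 PB in 2012). The implied event size
# and live time are derived quantities, not numbers from the slide.
output_rate_hz = 5e3          # HLT output rate (~5 kHz)
bandwidth_mb_s = 300.0        # bandwidth to offline storage (MB/s)
yearly_volume_pb = 1.3        # RAW volume expected in 2012 (PB)

event_size_kb = bandwidth_mb_s * 1e3 / output_rate_hz   # ~60 kB/event
live_seconds = yearly_volume_pb * 1e9 / bandwidth_mb_s  # ~4.3e6 s
print(f"implied RAW event size: ~{event_size_kb:.0f} kB")
print(f"implied data-taking live time: ~{live_seconds/86400:.0f} days")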
5. Event Reconstruction
❍ On full event sample (~2 billion events expected in 2012):
❏ Pattern recognition to measure particle trajectories
✰ Identify vertices, measure momentum
❏ Particle ID to measure particle types
❍ ~2 sec/event, ~5k concurrent jobs (run on the grid)
❍ Reconstructed data stored in “FULL DST”
❏ DST event size ~100 kB -> 2 PB in 2012 (includes RAW)
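A rough Python estimate of what one reconstruction pass costs, using only the numbers above (~2 billion events, ~2 s/event, ~5k concurrent jobs) and assuming the grid slots are fully occupied:

# Rough estimate of the wall-clock time for one full reconstruction pass,
# using the numbers quoted above. A sketch, not a production plan.
n_events = 2e9
cpu_seconds_per_event = 2.0
concurrent_jobs = 5e3

total_cpu_seconds = n_events * cpu_seconds_per_event
wall_days = total_cpu_seconds / concurrent_jobs / 86400
print(f"total CPU work: ~{total_cpu_seconds/86400/365:.0f} CPU-years")
print(f"wall-clock time at ~5k jobs: ~{wall_days:.0f} days")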
6. Data reduction offline (stripping)
❍ Any given physics analysis selects 0.1-1.0% of events
❏ Inefficient to allow individual physicists to run selection jobs over the FULL DST
❍ Group all selections in a single selection pass, executed by a central production team
❏ Runs over the FULL DST
❏ Executes ~800 independent stripping “lines”
✰ ~0.5 sec/event in total
✰ Writes out only events selected by one or more of these lines
❏ Output events are grouped into ~15 streams
❏ Each stream selects 1-10% of events
❍ Overall data volume reduction: x50 – x500 depending on the stream
❏ Few TB (<50) per stream, replicated at several places
❏ Accessible to physicists for data analysis
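The sketch below is a schematic Python model of this stripping step: each “line” is a boolean predicate on an event, lines are grouped into streams, and an event is written to every stream in which at least one of its lines fires. The line and stream names are invented for illustration and are not real LHCb stripping lines.

# Schematic model of the stripping step: selection "lines" grouped into
# output streams. All names and cuts below are illustrative only.
def has_dimuon(event):            # hypothetical line predicate
    return event.get("n_muons", 0) >= 2

def has_displaced_vertex(event):  # hypothetical line predicate
    return event.get("max_vertex_displacement_mm", 0.0) > 1.0

STREAMS = {
    "Dimuon":  [has_dimuon],
    "Bhadron": [has_displaced_vertex],
}

def strip(event):
    """Return the streams that select this event (empty list = discard)."""
    return [stream for stream, lines in STREAMS.items()
            if any(line(event) for line in lines)]

# An event may end up in several streams, or in none (then it is not written out).
print(strip({"n_muons": 2, "max_vertex_displacement_mm": 2.5}))  # ['Dimuon', 'Bhadron']
print(strip({"n_muons": 0}))                                     # []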
7. Simulation
LHCb Monte Carlo simulation software:
– Simulation of physics event
– Detailed detector and material description (GEANT)
– Pattern recognition, trigger simulation and offline event selection
– Implements detector inefficiencies, noise hits, effects of multiple collisions
8. Resources for simulation
❍ Simulation jobs require 1-2 minutes per event
❏ Several billion events per year required
❏ Runs on ~20k CPUs worldwide
Yandex contribution ~25%
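A quick Python capacity check of these figures, taking 1-2 minutes per event and ~20k CPUs from the slide (the result is an order-of-magnitude estimate only):

# Order-of-magnitude check: how many simulated events ~20k CPUs can
# deliver per year at 1-2 minutes per event.
cpus = 20e3
seconds_per_year = 3.15e7
for minutes_per_event in (1.0, 2.0):
    events_per_year = cpus * seconds_per_year / (minutes_per_event * 60)
    print(f"{minutes_per_event:.0f} min/event -> ~{events_per_year/1e9:.1f} billion events/year")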
9. Physics Applications Software Organization
❍ Applications: High Level Triggers, Reconstruction, Simulation, Analysis
❏ Built on top of frameworks, implementing the required algorithms
❍ Frameworks
❏ One framework for basic services + various specialized frameworks: detector description, visualization, persistency, interactivity, simulation, etc.
❍ Toolkits
❍ Foundation Libraries
❏ A series of widely used basic libraries: Boost, GSL, Root, etc.
10. Gaudi Framework (Object Diagram)
[Object diagram: the Application Manager steers the event loop via the Event Selector and runs the Algorithms; Algorithms exchange data through the Transient Event Store, Transient Detector Store and Transient Histogram Store, accessed via the Event Data, Detector Data and Histogram Services; Persistency Services and Converters connect each store to data files; common services include the JobOptions Service, Message Service, Particle Properties Service and other services.]
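The toy Python sketch below models the pattern in this diagram (the Application Manager driving Algorithms, which exchange data only through a transient store); it is an illustration of the architecture, not the actual Gaudi C++/Python API, and the algorithm names and store paths are invented.

# Schematic model of the Gaudi pattern: Application Manager runs Algorithms
# over events; algorithms communicate only via a Transient Event Store.
class TransientEventStore(dict):
    """Per-event blackboard: algorithms put/get data objects by path."""

class Algorithm:
    def execute(self, event_store):             # overridden by concrete algorithms
        raise NotImplementedError

class TrackFinder(Algorithm):                   # hypothetical algorithm
    def execute(self, event_store):
        hits = event_store["Raw/Hits"]
        event_store["Rec/Tracks"] = [h for h in hits if h > 0]   # dummy pattern recognition

class VertexFitter(Algorithm):                  # hypothetical algorithm
    def execute(self, event_store):
        tracks = event_store["Rec/Tracks"]
        event_store["Rec/Vertices"] = [sum(tracks)] if tracks else []

class ApplicationMgr:
    def __init__(self, algorithms):
        self.algorithms = algorithms
    def run(self, events):
        for raw_hits in events:                 # the Event Selector role
            store = TransientEventStore({"Raw/Hits": raw_hits})
            for alg in self.algorithms:         # event loop over the algorithm list
                alg.execute(store)
            yield store

for store in ApplicationMgr([TrackFinder(), VertexFitter()]).run([[1, -2, 3], [0]]):
    print(store["Rec/Vertices"])                # -> [4], then []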
12. LHCb Workload Management System: DIRAC
❍ DIRAC forms an overlay network over the underlying grids (e.g. Grid A (WLCG), Grid B (NDG)) on behalf of the user community
❏ A way to achieve grid interoperability for a given community
❏ Needs a specific Agent Director per resource type
❍ From the user perspective all the resources are seen as a single large “batch system”
13. DIRAC WMS
❍ Jobs are submitted to the DIRAC Central Task Queue
❍ VRC policies are applied here by prioritizing jobs in the Queue
❍ Pilot Jobs are submitted by specific Directors to various grids or computing clusters
❍ This allows various types of computing resources to be aggregated transparently for the users
❍ The Pilot Job gets the most appropriate user job
❍ Jobs run in a verified environment with high efficiency
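A toy Python sketch of the pilot-job matching described above: user jobs wait in a central prioritized task queue, and a pilot that starts on some resource pulls the highest-priority job it can run. The job fields and matching rule are illustrative assumptions, not the real DIRAC WMS code.

# Toy central task queue with priority-ordered matching of jobs to pilots.
import heapq

class TaskQueue:
    def __init__(self):
        self._heap = []              # (negative priority, insertion order, job)
        self._counter = 0

    def submit(self, job, priority):
        heapq.heappush(self._heap, (-priority, self._counter, job))
        self._counter += 1

    def match(self, pilot_capabilities):
        """Return the highest-priority job the pilot can run, if any."""
        kept, matched = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            job = entry[2]
            if matched is None and job["platform"] in pilot_capabilities["platforms"]:
                matched = job
            else:
                kept.append(entry)
        for entry in kept:
            heapq.heappush(self._heap, entry)
        return matched

queue = TaskQueue()
queue.submit({"name": "user_analysis", "platform": "slc5"}, priority=1)
queue.submit({"name": "reconstruction", "platform": "slc5"}, priority=10)
pilot = {"platforms": ["slc5"]}        # what the pilot found on its worker node
print(queue.match(pilot)["name"])      # -> 'reconstruction' (higher priority first)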
15. Data replication
❍ Active data placement
❍ Split disk (for analysis) and archive
❏ No disk replica for RAW and FULL DST (read few times)
[Replication flow: RAW (CERN + 1 Tier1) -> Brunel (reconstruction) -> FULL DST (1 Tier1) -> DaVinci (stripping and streaming) -> unmerged DST1 (1 Tier1, scratch) -> Merge -> DST1 at CERN + 1 Tier1 and replicated at CERN + 3 Tier1s]
16. Datasets
❍ Granularity at the file level
❏ Data Management operations (replicate, remove replica, delete file)
❏ Workload Management: input/output files of jobs
❍ LHCbDirac perspective
❏ DMS and WMS use Logical File Names to reference files
❏ LFN namespace refers to the origin of the file
✰ Constructed by the jobs (uses production and job number); see the sketch after this list
✰ Hierarchical namespace for convenience
✰ Used to define file class (tape-sets) for RAW, FULL.DST, DST
✰ GUID used for internal navigation between files (Gaudi)
❍ User perspective
❏ File is part of a dataset (consistent for physics analysis)
❏ Dataset: specific conditions of data, processing version and processing level
✰ Files in a dataset should be exclusive and consistent in quality and content
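As a sketch of the hierarchical LFN namespace, the Python snippet below rebuilds an LFN from a production and job number, following the layout of the example LFNs that appear on the event-indexing slide (/lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/...); the exact convention shown here is an assumption for illustration.

# Sketch of building a hierarchical LFN from the production and job numbers.
def make_lfn(origin, file_type, production, job, step=1):
    prod = f"{production:08d}"
    jobn = f"{job:08d}"
    return (f"/lhcb/LHCb/{origin}/{file_type}/{prod}/{jobn[:4]}/"
            f"{prod}_{jobn}_{step}.{file_type.lower()}")

print(make_lfn("Collision11", "DIMUON.DST", production=13016, job=37))
# -> /lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst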
17. Replica Catalog (1)
❍ Logical namespace
❏ Reflects somewhat the origin of the file (run number for RAW, production number for output files of jobs)
❏ File type also explicit in the directory tree
❍ Storage Elements
❏ Essential component in the DIRAC DMS
❏ Logical SEs: several DIRAC SEs can physically use the same hardware SE (same instance, same SRM space)
❏ Described in the DIRAC configuration
✰ Protocol, endpoint, port, SAPath, Web Service URL
✰ Allows autonomous construction of the SURL
✰ SURL = srm:<endPoint>:<port><WSUrl><SAPath><LFN> (see the sketch after this list)
❏ SRM spaces at Tier1s
✰ Used to have as many SRM spaces as DIRAC SEs, now only 3
✰ LHCb-Tape (T1D0) custodial storage
✰ LHCb-Disk (T0D1) fast disk access
✰ LHCb-User (T0D1) fast disk access for user data
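A minimal Python transcription of the SURL rule above, with the storage-element description taken from a configuration dictionary; the SE name, endpoint, port, WSUrl and SAPath values below are invented placeholders, not a real site configuration.

# Sketch: build a SURL from an SE description, following the rule quoted above.
SE_CONFIG = {
    "CERN-DST": {                                  # hypothetical DIRAC SE name
        "protocol": "srm",
        "endpoint": "srm-lhcb.example.ch",
        "port": 8443,
        "wsurl": "/srm/managerv2?SFN=",
        "sapath": "/castor/example.ch/grid",
    },
}

def surl(se_name, lfn):
    cfg = SE_CONFIG[se_name]
    return f"{cfg['protocol']}:{cfg['endpoint']}:{cfg['port']}{cfg['wsurl']}{cfg['sapath']}{lfn}"

print(surl("CERN-DST",
           "/lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst"))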
18. Replica Catalog (2)
❍ Currently using the LFC
❏ Master write service at CERN
❏ Replication using Oracle streams to Tier1s
❏ Read-only instances at CERN and Tier1s
✰ Mostly for redundancy, no need for scaling
❍ LFC information:
❏ Metadata of the file
❏ Replicas
✰ Use “host name” field for the DIRAC SE name
✰ Store SURL of creation for convenience (not used)
❄ Allows lcg-util commands to work
❏ Quality flag
✰ One-character comment used to temporarily set a replica as unavailable
❍ Testing scalability of the DIRAC file catalog
❏ Built-in storage usage capabilities (per directory)
19. Bookkeeping Catalog (1)
❍ User selection criteria
❏ Origin of the data (real or MC, year of reference)
✰ LHCb/Collision12
❏ Conditions for data taking or simulation (energy, magnetic field, detector configuration…)
✰ Beam4000GeV-VeloClosed-MagDown
❏ Processing Pass is the level of processing (reconstruction, stripping…) including compatibility version
✰ Reco13/Stripping19
❏ Event Type is mostly useful for simulation; single value for real data
✰ 8 digit numeric code (12345678, 90000000)
❏ File Type defines which type of output files the user wants to get for a given processing pass (e.g. which stream)
✰ RAW, SDST, BHADRON.DST (for a streamed file)
❍ Bookkeeping search
❏ Using a path
✰ /<origin>/<conditions>/<processing pass>/<event type>/<file type>
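A one-line Python illustration of assembling such a bookkeeping path from the example values given on this slide:

# Sketch: build a bookkeeping query path from the selection criteria above.
def bk_path(origin, conditions, processing_pass, event_type, file_type):
    return f"/{origin}/{conditions}/{processing_pass}/{event_type}/{file_type}"

print(bk_path("LHCb/Collision12",
              "Beam4000GeV-VeloClosed-MagDown",
              "Reco13/Stripping19",
              "90000000",
              "BHADRON.DST"))
# -> /LHCb/Collision12/Beam4000GeV-VeloClosed-MagDown/Reco13/Stripping19/90000000/BHADRON.DST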
20. Bookkeeping Catalog (2)
❍ Much more than a dataset catalog!
❍ Full provenance of files and jobs
❏ Files are input of processing steps (“jobs”) that produce files
❏ All files ever created are recorded, and each processing step as well
✰ Full information on the “job” (location, CPU, wall clock time…)
❍ BK relational database
❏ Two main tables: “files” and “jobs” (see the sketch after this list)
❏ Jobs belong to a “production”
❏ “Productions” belong to a “processing pass”, with a given “origin” and “condition”
❏ Highly optimized search for files, as well as summaries
❍ Quality flags
❏ Files are immutable, but can have a mutable quality flag
❏ Files have a flag indicating whether they have a replica or not
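A minimal sqlite3 sketch of this files/jobs provenance model; the column names are illustrative simplifications, not the real BK schema.

# Toy provenance model: files are inputs of jobs, jobs produce files.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE jobs  (job_id INTEGER PRIMARY KEY, production INTEGER);
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, lfn TEXT,
                        produced_by INTEGER REFERENCES jobs(job_id),
                        quality TEXT DEFAULT 'OK', has_replica INTEGER DEFAULT 1);
    CREATE TABLE job_inputs (job_id INTEGER REFERENCES jobs(job_id),
                             file_id INTEGER REFERENCES files(file_id));
""")
db.execute("INSERT INTO jobs VALUES (1, 13016)")                          # a reconstruction job
db.execute("INSERT INTO files VALUES (10, '/lhcb/example/file.raw', NULL, 'OK', 1)")
db.execute("INSERT INTO files VALUES (11, '/lhcb/example/file.full.dst', 1, 'OK', 1)")
db.execute("INSERT INTO job_inputs VALUES (1, 10)")

# Provenance query: which file(s) was a given output produced from?
parents = db.execute("""
    SELECT p.lfn FROM files f
    JOIN job_inputs ji ON ji.job_id = f.produced_by
    JOIN files p       ON p.file_id = ji.file_id
    WHERE f.lfn = ?""", ("/lhcb/example/file.full.dst",)).fetchall()
print(parents)   # -> [('/lhcb/example/file.raw',)]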
21. Bookkeeping browsing
❍ Allows saving datasets
❏ Filter, selection
❏ Plain list of files
❏ Gaudi configuration file
❍ Can return only files with a replica at a given location
22. Event indexing
❍ Book-keeping has no information about individual events
❍ But it can be beneficial to select events based on global criteria:
❏ Number of tracks, clusters etc.
❏ Trigger or Stripping lines fired
❏ Global Event Activity counters
❍ Prototype implemented by Andrey Ustyuzhanin
[Screenshot: advanced-search result for event 104248:539058326 (event time Oct. 27, 2011, 9:11 p.m.), listing the files containing the event (/lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst and /lhcb/LHCb/Collision11/EW.DST/00013017/0000/00013017_00000033_1.ew.dst), the application tags (Brunel v41r1, DDDB head-20110914, DQFLAGS tt-20110126, LHCBCOND head-20111111, ONLINE HEAD), the stripping lines fired, and global event activity counters (nTracks 92, nPV 2, nVeloClusters 1389, nSPDhits 128, …).]
23. Staging: using files from tape
❍ If jobs use files that are not online (on disk)
❏ Before submitting the job
❏ Stage the file from tape, and pin it in the cache
❍ Stager agent
❏ Also performs cache management
❏ Throttles staging requests depending on the cache size and the amount of pinned data
❏ Requires fine tuning (pinning and cache size)
✰ Caching architecture is highly site dependent
✰ No publication of cache sizes (except Castor and StoRM)
❍ Jobs using staged files (see the sketch after this list)
❏ First check that the file is still staged
✰ If not, reschedule the job
❏ Copy the file locally onto the WN whenever possible
✰ Space is released faster
✰ More reliable access for very long jobs (reconstruction) or jobs using many files (merging)
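A Python sketch of the per-job logic just described: verify the input is still staged, otherwise signal that the job should be rescheduled, and copy the file to the worker node when possible. The helper names (is_staged, copy_to_worker_node) are hypothetical placeholders, not real DIRAC calls.

# Sketch of preparing a tape-resident input file on the worker node.
class NotStagedError(Exception):
    """Signal that the job should be rescheduled (file fell out of the cache)."""

def prepare_input(lfn, is_staged, copy_to_worker_node):
    if not is_staged(lfn):
        raise NotStagedError(lfn)          # -> reschedule the job
    try:
        return copy_to_worker_node(lfn)    # local copy: cache released sooner, more reliable
    except OSError:
        return lfn                         # fall back to remote access if the copy fails

# Example with stand-in implementations:
staged = {"/lhcb/example/file.full.dst"}
local = prepare_input("/lhcb/example/file.full.dst",
                      is_staged=lambda lfn: lfn in staged,
                      copy_to_worker_node=lambda lfn: "/tmp/" + lfn.rsplit("/", 1)[-1])
print(local)   # -> /tmp/file.full.dst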
24. Data Management Outlook
❍ Improvements on staging
❏ Improve the tuning of cache settings
✰ Depends on how caches are used by sites
❏ Pinning/unpinning
✰ Difficult if files are used by more than one job
❍ Popularity
❏ Record dataset usage
✰ Reported by jobs: number of files used in a given dataset
✰ Account number of files used per dataset per day/week
❏ Assess dataset popularity (see the sketch after this list)
✰ Relate usage to dataset size
❏ Take decisions on the number of online replicas
✰ Taking into account available space
✰ Taking into account expected need in the coming week
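A toy Python version of this popularity bookkeeping: jobs report how many files of a dataset they used per day, and relating that usage to the dataset size gives a score that can feed the decision on the number of disk replicas. The dataset name, normalization and numbers are invented for illustration.

# Toy dataset-popularity accounting based on per-day file-usage reports.
from collections import defaultdict

usage = defaultdict(int)                        # (dataset, day) -> files used

def report_usage(dataset, day, n_files):        # filled from the job reports
    usage[(dataset, day)] += n_files

def popularity(dataset, dataset_n_files, days):
    """Fraction of the dataset read per day, averaged over the period."""
    used = sum(usage[(dataset, d)] for d in days)
    return used / (dataset_n_files * len(days))

report_usage("BHADRON.DST/Stripping19", day=1, n_files=400)
report_usage("BHADRON.DST/Stripping19", day=2, n_files=100)
score = popularity("BHADRON.DST/Stripping19", dataset_n_files=1000, days=[1, 2])
print(f"popularity: {score:.2f}")               # -> 0.25; more replicas if it stays high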
25. Longer term challenges
❍ Computing model evolution
❏ Networking performance allows evolution from the static “Tier” model of data access to a much more dynamic model
❏ The current processing model of 1 sequential job per file per CPU breaks down in the multi-core era due to memory limitations
✰ parallelisation of event processing
❏ Whole node scheduling, virtualisation, clouds
❍ LHCb upgrade in 2018
❏ From computing point of view:
✰ x40 readout rate into HLT farm
✰ x4 event rate to storage (20kHz)
✰ x1.5 event size (100kB/RAW event)
❏ x10 increase in data rates implies
✰ scaling of Dirac WMS and DMS
✰ scaling of data selection catalogs
✰ new ideas for data mining?
❍ Data preservation and open access
❏ Preserve the ability to analyse old data many years in the future
❏ Make data available for analysis outside LHCb + to the general public