Presentation of the open source CFD code Code_Saturne
1. HPC and CFD at EDF with Code_Saturne
Yvan Fournier, Jérôme Bonelle
EDF R&D
Fluid Dynamics, Power Generation and Environment Department
Open Source CFD International Conference Barcelona 2009
2. Summary
1. General Elements on Code_Saturne
2. Real-world performance of Code_Saturne
3. Example applications: fuel assemblies
4. Parallel implementation of Code_Saturne
5. Ongoing work and future directions
3. General elements on Code_Saturne
4. Code_Saturne: main capabilities
Physical modelling
Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
Radiative heat transfer (DOM, P-1)
Combustion of coal, gas and heavy fuel oil (EBU, PDF, LWP)
Electric arc and Joule effect
Lagrangian module for dispersed particle tracking
Atmospheric flows (aka Mercure_Saturne)
Specific engineering module for cooling towers
ALE method for deformable meshes
Conjugate heat transfer (SYRTHES & 1D)
Common structure with NEPTUNE_CFD for Eulerian multiphase flows
Flexibility
Portability (UNIX, Linux and MacOS X)
Standalone GUI and integrated in SALOME platform
Parallel on distributed memory machines
Periodic boundaries (parallel, arbitrary interfaces)
Wide range of unstructured meshes with arbitrary interfaces
Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)
5. Code_Saturne: general features
Technology
Co-located finite volume, arbitrary unstructured meshes (polyhedral cells), predictor-corrector method
500 000 lines of code, 50% FORTRAN 90, 40% C, 10% Python
Development
1998: Prototype (long-time EDF in-house experience, ESTET-ASTRID, N3S, ...)
2000: version 1.0 (basic modelling, wide range of meshes)
2001: Qualification for single phase nuclear thermal-hydraulic applications
2004: Version 1.1 (complex physics, LES, parallel computing)
2006: Version 1.2 (state of the art turbulence models, GUI)
2008: Version 1.3 (massively parallel, ALE, code coupling, ...)
Released as open source (GPL licence)
2008: Development version 1.4 (parallel I/O, multigrid, atmospheric flows, cooling towers, ...)
2009: Development version 2.0-beta (parallel mesh joining, code coupling, easy install & packaging, extended GUI)
Industrial release scheduled for the beginning of 2010
Code_Saturne developed under Quality Assurance
7. Code_Saturne environment
Graphical User Interface
setting up of calculation parameters
parameters stored in an XML file
interactive launch of calculations
some specific physics not yet covered by GUI
advanced setup via Fortran user routines
Integration in the SALOME platform
extension of GUI capabilities
mouse selection of boundary zones
advanced user files management
from CAD to post-processing in one tool
8. Allowable mesh examples
Figures: example of a mesh with stretched cells and hanging nodes (PWR lower plenum); example of a composite mesh; 3D polyhedral cells.
9. Joining of non-conforming meshes
Arbitrary interfaces
Meshes may be contained in one single file or in several separate files, in any order
Arbitrary interfaces can be selected by mesh references
Caution must be exercised if arbitrary interfaces are used:
in critical regions, or with LES
with very different mesh refinements, or on curved CAD surfaces
Often used in ways detrimental to mesh quality, but a functionality we cannot do without as long as we do not have a proven alternative.
Joining of meshes built in several pieces may also be used to circumvent meshing tool memory limitations.
Periodicity is also constructed as an extension of mesh joining.
11. Code_Saturne features of note for HPC
Segregated solver
All variables are solved independently; coupling terms are explicit
Diagonal-preconditioned CG used for pressure equation, Jacobi (or bi-CGstab) used for other variables
More importantly, matrices have no block structure, and are very sparse
Typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops (see the sketch below)
Linear equation solvers usually amount to 80% of CPU cost (dominated by pressure), gradient reconstruction to about 20%
The larger the mesh, the higher the relative cost of the pressure step
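As an illustration of why these operations are memory-bound, below is a minimal sketch, with hypothetical array names, of a face-based matrix-vector product using one diagonal coefficient per cell and one extra-diagonal coefficient per interior face; it is not the actual Code_Saturne kernel, only the access pattern it implies.

```c
/* Hypothetical sketch of a face-based MatVec for a symmetric matrix:
 * y = A.x with one diagonal term per cell (da) and one extra-diagonal
 * term per interior face (xa). Array names are illustrative. */

#include <stddef.h>

void
matvec_face_based(size_t        n_cells,
                  size_t        n_faces,
                  const size_t  face_cell[][2],  /* the 2 cells adjacent to each face */
                  const double  da[],            /* diagonal coefficients, per cell   */
                  const double  xa[],            /* extra-diagonal, per interior face */
                  const double  x[],
                  double        y[])
{
  /* Diagonal contribution: a simple streaming pass over cells. */
  for (size_t i = 0; i < n_cells; i++)
    y[i] = da[i] * x[i];

  /* Extra-diagonal contribution: indirect addressing through faces.
   * Each face scatters into two cell locations, so memory bandwidth
   * and cache behaviour dominate over peak flops. */
  for (size_t f = 0; f < n_faces; f++) {
    size_t i = face_cell[f][0];
    size_t j = face_cell[f][1];
    y[i] += xa[f] * x[j];
    y[j] += xa[f] * x[i];
  }
}
```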
12. Current performance (1/3)
2 LES test cases (most I/O factored out)
1 M cells: (n_cells_min + n_cells_max)/2 = 880 at 1024 cores, 109 at 8192
10 M cells: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at 8192
Figures: elapsed time vs. number of cores for the FATHER (1 M hexahedra) and HYPI (10 M hexahedra) LES test cases, on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L systems.
13. Current performance (2/3)
RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
Polyhedra due to mesh joinings may lead to higher load imbalance in local MatVec for large core counts
96286/102242 min/max cells/core at 1024 cores
11344/12781 min/max cells/core at 8192 cores
Figure: elapsed time per iteration vs. number of cores for the FA grid RANS test case, on NovaScale and Blue Gene/L (CO and VN modes).
14. Current performance (3/3)
Efficiency often goes through an optimum (due to better cache hit rates) before dropping (due to latency induced by parallel synchronization)
Example shown here: HYPI (10 M cell LES test case)
Figure: parallel efficiency vs. number of MPI ranks, on the Chatou cluster, Tantale, Platine and Blue Gene systems.
15. High Performance Computing with Code_Saturne
Code_Saturne used extensively on HPC machines
in-house EDF clusters
CCRT computing centre (CEA-based)
EDF IBM BlueGene machines (8 000 and 32 000 cores)
Also run on MareNostrum (Barcelona Supercomputing Center), Cray XT, …
Code_Saturne used as reference in PRACE European project
reference code for CFD benchmarks at 6 large European HPC centres
Code_Saturne obtained “gold medal” status for scalability from Daresbury Laboratory (UK, HPCx machine)
16. Example HPC applications: fuel assemblies
17. Fuel Assembly Studies
Conflicting design goals
Good thermal mixing properties, requiring turbulent flow
Limit head loss
Limit vibrations
Fuel rods held by dimples and springs, and not welded, as they lengthen slightly over the years due to irradiation
Complex core geometry
Circa 150 to 250 fuel assemblies per core depending on reactor type, 8 to 10 grids per fuel assembly, 17x17 grid (mostly fuel rods, 24 guide tubes)
Geometry almost periodic, except for the mix of several fuel assembly types in a given core (reload by 1/3 or 1/4)
Inlet and wall conditions not periodic, heat production not uniform at fine scale
Why we study these flows
Deformation may lead to difficulties in core unload/reload
Turbulence-induced vibrations of fuel assemblies in PWR power plants are a potential cause of deformation and of fretting wear damage
These may lead to weeks or months of interruption of operations
18. Prototype FA calculation with Code_Saturne
PWR nuclear reactor mixing grid mock-up (5x5)
100 million cells
calculation run on 4 000 to 8 000 cores
Main issue is mesh generation
19. LES simulation of reduced FA domain
Particular features for LES
SIMPLEC algorithm with Rhie and Chow interpolation
2nd order in time (Crank-Nicolson and Adams-Bashforth)
2nd order in space (fully centered, with sub-iterations for non-orthogonal faces)
Fully hexahedral mesh, 8 million cells
Boundary Conditions
Implicit periodicity in x and y directions
Constant inlet conditions
Wall function when needed
Free outlet
Simulation
1 million time-steps: 40 flow passes, 20 flow passes for averaging (no homogeneous direction)
CFLmax = 0.8 (Δt = 5×10⁻⁶ s)
BlueGene/L system, 1024 processors
Per time-step: 5 s
For 100 000 time-steps: 1 week
21. Base parallel operations (1/4)
Distributed memory parallelism using domain partitioning
Use classical “ghost cell” method for both parallelism and periodicity (see the exchange sketch below)
Most operations require only ghost cells sharing faces
Extended neighborhoods for gradients also require ghost cells sharing vertices
Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
Periodicity uses the same mechanism
Vector and tensor rotation also required
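A minimal sketch of the ghost-cell update pattern described above, with hypothetical names and data layout (not the actual Code_Saturne halo structure): each rank receives its ghost-cell values from its neighbours and sends the local cell values those neighbours see as ghosts.

```c
#include <mpi.h>

/* Synchronize one cell-based variable across ghost cells.
 * var[] holds n_cells local values followed by the ghost values. */
void
halo_sync_var(int        n_neighbors,
              const int  neighbor_rank[],
              const int  send_index[],    /* n_neighbors + 1 bounds into send lists */
              const int  send_cell_id[],  /* local cells whose values are sent      */
              const int  recv_index[],    /* n_neighbors + 1 bounds into ghost zone */
              int        n_cells,
              double     var[],
              double     send_buf[],
              MPI_Comm   comm)
{
  MPI_Request requests[64];  /* assumes a modest number of neighbors */
  int n_requests = 0;

  /* Post receives directly into the ghost-cell section of var[]. */
  for (int n = 0; n < n_neighbors; n++)
    MPI_Irecv(var + n_cells + recv_index[n],
              recv_index[n + 1] - recv_index[n], MPI_DOUBLE,
              neighbor_rank[n], 0, comm, &requests[n_requests++]);

  /* Pack and send the local values each neighbor needs. */
  for (int n = 0; n < n_neighbors; n++) {
    for (int i = send_index[n]; i < send_index[n + 1]; i++)
      send_buf[i] = var[send_cell_id[i]];
    MPI_Isend(send_buf + send_index[n],
              send_index[n + 1] - send_index[n], MPI_DOUBLE,
              neighbor_rank[n], 0, comm, &requests[n_requests++]);
  }

  MPI_Waitall(n_requests, requests, MPI_STATUSES_IGNORE);
}
```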
22. Base parallel operations (2/4)
Use of global numbering
We associate a global number to each mesh entity
A specific C type (fvm_gnum_t) is used for this. Currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will be necessary (see the sketch below)
Face-cell connectivity for hexahedral cells: size 4·n_faces, with n_faces about 3·n_cells → size around 12·n_cells, so numbers require 64 bits beyond roughly 350 million cells
Currently equal to the initial (pre-partitioning) number
Allows for partition-independent single-image files
Essential for restart files, also used for postprocessor output
Also used for legacy coupling where matches can be saved
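A small, purely illustrative sketch of the point above: a global-number type in the spirit of fvm_gnum_t (currently an unsigned 32-bit integer) and the connectivity-size estimate that determines when 64-bit numbers become necessary; the helper name and the 12·n_cells estimate come from this slide, not from the code's API.

```c
#include <stdint.h>

typedef uint32_t gnum32_t;  /* current choice: unsigned, usually 32-bit */
typedef uint64_t gnum64_t;  /* needed for very large meshes             */

/* For hexahedral meshes, the face-cell connectivity holds about
 * 4 * n_faces entries, with n_faces ~ 3 * n_cells, i.e. ~ 12 * n_cells. */
static int
needs_64_bit_gnum(uint64_t n_cells)
{
  uint64_t connectivity_size = 12 * n_cells;
  return connectivity_size > (uint64_t)UINT32_MAX;  /* ~ 358 million cells */
}
```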
23. Base parallel operations (3/4)
Use of global numbering
Redistribution on n blocks
n blocks ≤ n cores
Minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force 1 block (for I/O with non-parallel libraries)
In the future, using at most 1 of every p processors may improve MPI I/O performance if we use a smaller communicator (to be tested)
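A minimal sketch, with hypothetical names, of the block-distribution rule described above: entities are assigned to equal-sized blocks by global number, with an optional minimum block size and a rank stride so that only 1 of every p ranks owns a block.

```c
typedef unsigned long long gnum_t;  /* stands in for a global number type */

typedef struct {
  gnum_t block_size;  /* global numbers per block        */
  int    rank_step;   /* use 1 of every rank_step ranks  */
} block_dist_t;

static block_dist_t
block_dist_create(gnum_t n_global, int n_ranks,
                  gnum_t min_block_size, int rank_step)
{
  block_dist_t bd;
  int n_blocks = n_ranks / rank_step;

  if (n_blocks < 1)
    n_blocks = 1;

  /* Equal-sized blocks, enlarged if below the minimum block size. */
  bd.block_size = (n_global + n_blocks - 1) / n_blocks;
  if (bd.block_size < min_block_size)
    bd.block_size = min_block_size;
  bd.rank_step = rank_step;

  return bd;
}

/* Owning rank of a 1-based global number. */
static int
block_owner_rank(const block_dist_t *bd, gnum_t gnum)
{
  return (int)((gnum - 1) / bd->block_size) * bd->rank_step;
}
```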
24. Base parallel operations (4/4)
Conversely, simply using global numbers allows reconstructing neighbor-partition entity equivalences
Used for parallel ghost cell construction from an initially partitioned mesh with no ghost data
Arbitrary distribution, inefficient for halo exchange, but allows for simpler data-structure-related algorithms with deterministic performance bounds
Owning processor determined simply by global number, messages are aggregated
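The aggregation mentioned above can be sketched as follows (hypothetical names, entities assumed pre-sorted by owning rank): each rank counts how many entities go to each owner, exchanges the counts, then performs a single MPI_Alltoallv instead of many small messages.

```c
#include <mpi.h>
#include <stdlib.h>

void
exchange_by_owner(int        n_ranks,
                  int        n_entities,
                  const int  owner_rank[],  /* owner of each entity (sorted)  */
                  double     send_values[], /* values, ordered by owner       */
                  double     recv_values[], /* output, sized from recv counts */
                  MPI_Comm   comm)
{
  int *send_count = calloc(n_ranks, sizeof(int));
  int *recv_count = malloc(n_ranks * sizeof(int));
  int *send_displ = malloc(n_ranks * sizeof(int));
  int *recv_displ = malloc(n_ranks * sizeof(int));

  /* Count entities per destination rank: messages are aggregated. */
  for (int i = 0; i < n_entities; i++)
    send_count[owner_rank[i]] += 1;

  /* Let every rank know how much it will receive. */
  MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

  send_displ[0] = 0;
  recv_displ[0] = 0;
  for (int r = 1; r < n_ranks; r++) {
    send_displ[r] = send_displ[r - 1] + send_count[r - 1];
    recv_displ[r] = recv_displ[r - 1] + recv_count[r - 1];
  }

  /* One aggregated exchange for all destinations. */
  MPI_Alltoallv(send_values, send_count, send_displ, MPI_DOUBLE,
                recv_values, recv_count, recv_displ, MPI_DOUBLE, comm);

  free(send_count); free(recv_count);
  free(send_displ); free(recv_displ);
}
```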
25. Parallel IO (1/2)
We prefer using single (partition independent) files
Easily run different stages or restarts of a calculation on different machines or queues
Avoids having thousands or tens of thousands of files in a directory
Better transparency of parallelism for the user
Use MPI I/O when available
Uses block to partition exchange when reading, partition to block when writing
Use of indexed datatypes may be tested in the future, but will not be possible everywhere
Used for reading of preprocessor and partitioner output, as well as for restart files
These files use a unified binary format, consisting of a simple header and a succession of sections
The MPI I/O pattern is thus a succession of global reads (or local read + broadcast) for section headers and collective reads of data, with a different portion for each rank (sketched below)
We could switch to HDF5, but preferred a lighter model, and also avoid an extra dependency or dependency conflicts
Infrastructure in progress for postprocessor output
Layered approach as we allow for multiple formats
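The collective-read pattern mentioned above can be sketched as follows (this is not the FVM file API; the file name, header size and data type are illustrative assumptions): each rank reads its own contiguous portion of a data section with a single collective call.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "restart_section.bin",
                MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

  /* Assume a section of n_global double values following a
   * (hypothetical) 64-byte header; divide it into one block per rank. */
  const MPI_Offset header_size = 64;
  const MPI_Offset n_global = 1000000;
  MPI_Offset block_size = (n_global + size - 1) / size;
  MPI_Offset start = (MPI_Offset)rank * block_size;
  MPI_Offset count = block_size;
  if (start > n_global) start = n_global;
  if (start + count > n_global) count = n_global - start;

  double *buf = malloc((size_t)count * sizeof(double));

  /* Collective read: every rank participates, each with its own offset. */
  MPI_File_read_at_all(fh, header_size + start * (MPI_Offset)sizeof(double),
                       buf, (int)count, MPI_DOUBLE, MPI_STATUS_IGNORE);

  free(buf);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}
```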
26. Parallel IO (2/2)
Parallel I/O only of benefit with parallel filesystems
Use of MPI I/O may be disabled either at build time, or for a given file using specific hints
Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file API
Not much feedback yet, but initial results disappointing
Similar performance with and without MPI IO on at least 2 systems
Whether using MPI_File_read/write_at_all or MPI_File_read/write_all
Need to retest this, forcing fewer processors into the MPI I/O communicator
Bugs encountered in several MPI I/O implementations
27. Ongoing work and future directions
28. Parallelization of mesh joining (2008-2009)
Parallelizing this algorithm requires the same main steps as the serial algorithm:
Detect intersections (within a given tolerance) between edges of overlapping faces
Uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required); the overlap test is sketched below
Subdivide edges according to inserted intersection vertices
Merge coincident or nearly-coincident vertices/intersections
This is the most complex step
Must be synchronized in parallel
Choice of merging criteria has a profound impact on the quality of the resulting mesh
Re-build sub-faces
With parallel mesh joining, the most memory-intensive serial preprocessing step is removed
We will add parallel mesh “append” within a few months (for version 2.1); this will allow generation of huge meshes even with serial meshing tools
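As an illustration of the intersection-detection step, here is a sketch of a tolerance-based bounding-box overlap test of the kind applied to face bounding boxes stored in the octree; the structure and tolerance handling are assumptions, not the actual joining code.

```c
#include <stdbool.h>

typedef struct {
  double min[3];
  double max[3];
} bbox_t;

/* True if two axis-aligned boxes overlap once each is enlarged by the
 * joining tolerance (a length, e.g. a fraction of a face's edge length). */
static bool
bbox_overlap(const bbox_t *a, const bbox_t *b, double tolerance)
{
  for (int dim = 0; dim < 3; dim++) {
    if (a->max[dim] + tolerance < b->min[dim] - tolerance)
      return false;
    if (b->max[dim] + tolerance < a->min[dim] - tolerance)
      return false;
  }
  return true;
}
```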
29. Coupling of Code_Saturne with itself
Objective
coupling of different models (RANS/LES)
fluid-structure interaction with large displacements
rotating machines
Two kinds of communications
data exchange at boundaries for interface coupling
volume forcing for overlapping regions
Still under development, but ...
data exchange already implemented in FVM library
optimised localisation algorithm
compliance with parallel/parallel coupling
prototype versions with promising results
more work needed on conservativity at the exchange
first version adapted to pump modelling implemented in version 2.0
rotor/stator coupling
compares favourably with CFX
30. Multigrid
Currently, multigrid coarsening does not cross processor boundaries
This implies that on p processors, the coarsest matrix may not contain fewer than p cells
With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file
Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small (sketched below)
The communication pattern is not expected to change too much, as partitioning is of a recursive nature, and should already exhibit a “multigrid” nature
This may be less optimal than repartitioning at each level, but setup time should also remain much cheaper
Important, as grids may be rebuilt each time step
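A minimal sketch of the planned rank-grouping rule (the function name, threshold and stride are illustrative assumptions):

```c
/* When the mean local grid size drops below a threshold, move the local
 * coarse grid to the nearest lower rank multiple of merge_stride (4 or 8). */
static int
coarse_grid_target_rank(int rank, long mean_local_rows,
                        long min_rows_threshold, int merge_stride)
{
  if (mean_local_rows >= min_rows_threshold)
    return rank;  /* grid stays in place */

  return (rank / merge_stride) * merge_stride;
}
```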
31. Partitioning
We currently use METIS or SCOTCH, but should move to ParMETIS or PT-Scotch within a few months
The current infrastructure makes this quite easy
We have recently added a “backup” partitioning based on space-filling curves
We currently use the Z curve (from our octree construction for parallel joining), but appropriate changes in the coordinate comparison rules should allow switching to a Hilbert curve (reputed to lead to better partitioning); the key computation is sketched below
This is fully parallel and deterministic
Performance on initial tests:
about 20% worse on a single 10-million cell case on 256 processes
reasonable compared to unoptimized partitioning
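For illustration, here is a sketch of the kind of Morton (Z-order) key computation used for space-filling-curve partitioning: cell-centre coordinates are normalized to a fixed-depth integer grid and their bits interleaved; the depth and scaling are assumptions, not the exact implementation.

```c
#include <stdint.h>

#define SFC_DEPTH 21  /* 3 x 21 = 63 bits fit in a 64-bit key */

/* Coordinates are assumed to lie inside the given bounding box. */
static uint64_t
morton_key_3d(const double xyz[3],
              const double box_min[3],
              const double box_len[3])
{
  uint64_t key = 0, ijk[3];

  /* Map each coordinate to an integer in [0, 2^SFC_DEPTH - 1]. */
  for (int dim = 0; dim < 3; dim++)
    ijk[dim] = (uint64_t)(((xyz[dim] - box_min[dim]) / box_len[dim])
                          * (double)((1ULL << SFC_DEPTH) - 1));

  /* Interleave the bits of the 3 integer coordinates. */
  for (int level = SFC_DEPTH - 1; level >= 0; level--)
    key = (key << 3)
        | (((ijk[0] >> level) & 1ULL) << 2)
        | (((ijk[1] >> level) & 1ULL) << 1)
        |  ((ijk[2] >> level) & 1ULL);

  return key;  /* cells sorted by key follow the Z curve */
}
```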
32. Tool chain evolution
Code_Saturne V1.3 (current production version) added many HPC-oriented improvements compared to prior versions:
Post-processor output handled by FVM / Kernel
Ghost cell construction handled by FVM / Kernel
Up to 40% gain in preprocessor memory peak compared to V1.2
Parallelized and scales (manages 2 ghost cell sets and multiple periodicities)
Well adapted up to 150 million cells (with 64 Gb for preprocessing)
All fundamental limitations are pre-processing related
Tool chain: Meshes → Pre-Processor (serial run) → Kernel + FVM (distributed run) → Post-processing output
Version 2.0 separates partitioning from preprocessing
Also reduces their memory footprint a bit, moving newly parallelized operations to the kernel
Tool chain: Meshes → Pre-Processor (serial run) → Partitioner (serial run) → Kernel + FVM (distributed run) → Post-processing output
33. Future direction: Hybrid MPI / OpenMP (1/2)
Currently, a pure MPI model is used:
Everything is parallel, synchronization is explicit when required
On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives)
Parallel sections must be marked, and parallel loops must avoid modifying the same values
Specific numberings must be used, similar to those used for vectorization, but with different constraints:
Avoid false sharing, keep locality to limit cache misses
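A minimal sketch of the idea, with hypothetical names: faces are renumbered into groups such that no two faces in one group touch the same cell, so each group's loop can be distributed over OpenMP threads without write conflicts (group construction itself is omitted).

```c
/* Face-based gather/scatter with OpenMP, using pre-built face groups. */
void
face_scatter_grouped(int           n_groups,
                     const long    group_index[],  /* n_groups + 1 bounds */
                     const long    face_cell[][2],
                     const double  xa[],
                     const double  x[],
                     double        y[])
{
  for (int g = 0; g < n_groups; g++) {
    /* Faces within one group update disjoint cells: no race condition. */
    #pragma omp parallel for
    for (long f = group_index[g]; f < group_index[g + 1]; f++) {
      long i = face_cell[f][0];
      long j = face_cell[f][1];
      y[i] += xa[f] * x[j];
      y[j] += xa[f] * x[i];
    }
  }
}
```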
34. Future direction: Hybrid MPI / OpenMP (2/2)
Hybrid MPI / OpenMP is being tested
IBM is testing this on Blue Gene/P
Requires work on renumbering algorithms
OpenMP parallelism would ease packaging / installation on workstations
No dependency on MPI library choices (source-compatible but not binary-compatible), only on the compiler runtime
Good enough for current multicore workstations
Coupling the code with itself or with SYRTHES 4 will still require MPI
The main goal is to allow MPI communicators of “only” 10000’s of ranks on machines with 100000 cores
Performance benefits expected mainly at the very high end
Reduce risk of medium-term issues with MPI_Alltoallv, used in I/O and parallelism-related data redistribution
Though sparse collective algorithms are the long-term solution for this specific issue
35. Code_Saturne HPC roadmap
2003: Consecutive to the Civaux thermal fatigue event. Computations enable a better understanding of the wall thermal loading in an injection; knowing the root causes of the event ⇒ define a new design to avoid this problem.
10⁶ cells, 3×10¹³ operations; Fujitsu VPP 5000, 1 of 4 vector processors; 2-month computation; ≈ 1 GB of storage, 2 GB of memory. Limiting factor: power of the computer.
2006: Part of a fuel assembly. Computation with an LES approach for turbulence modelling; refined mesh near the wall.
10⁷ cells, 6×10¹⁴ operations; IBM Power5 cluster, 400 processors; 9 days; ≈ 15 GB of storage, 25 GB of memory. Limiting factor: pre-processing not parallelized.
2007: 3 grid assemblies.
10⁸ cells, 10¹⁶ operations; IBM Blue Gene/L “Frontier”, 8 000 processors; ≈ 1 month; ≈ 200 GB of storage, 250 GB of memory. Limiting factors: idem, plus mesh generation.
2010: 9 fuel assemblies. No experimental approach up to now; will enable the study of side effects implied by the flow around neighbouring fuel assemblies.
10⁹ cells, 3×10¹⁷ operations; 30 times the power of IBM Blue Gene/L “Frontier”; ≈ 1 month; ≈ 1 TB of storage, 2.5 TB of memory. Limiting factors: idem, plus scalability of the solver.
2015: The whole reactor vessel. Better understanding of vibration phenomena and wear-out of the rods.
10¹⁰ cells, 5×10¹⁸ operations; 500 times the power of IBM Blue Gene/L “Frontier”; ≈ 1 month; ≈ 10 TB of storage, 25 TB of memory. Limiting factors: idem, plus visualisation.
36. Thank you for your attention!
38. Load imbalance (1/3)
In this example, using 8 partitions (with METIS), we have the following local minima and maxima:
Cells:
416 / 440 (6% imbalance)
Cells + ghost cells:
469/519 (11% imbalance)
Interior faces:
852/946 (11% imbalance)
Most loops are on cells, but some are on cells + ghosts, and MatVec is on cells + faces
39. Load imbalance (2/3)
If load imbalance increases with processor count, scalability decreases
If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, though some processor power is wasted
Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops
PCG (preconditioned conjugate gradient) uses MatVec and dot products
Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces) (sketched below)
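A minimal sketch of this weighting idea (the scaling function f and the array layout are assumptions): compute one integer weight per cell from its face count, then hand the array to METIS or SCOTCH as vertex weights when partitioning the cell graph.

```c
#include <stddef.h>

static void
compute_partition_weights(size_t      n_cells,
                          const int   n_cell_faces[],  /* faces per cell */
                          int         cell_weight[])   /* output weights */
{
  for (size_t i = 0; i < n_cells; i++) {
    /* weight = 1 + f(n_faces); a simple linear f, for illustration only */
    cell_weight[i] = 1 + n_cell_faces[i] / 2;
  }
  /* cell_weight[] would then be passed as the vertex-weight array of the
   * graph partitioner, so cells with more faces count as heavier. */
}
```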
40. Load imbalance (3/3)
Another possible source of load imbalance is different cache miss rates on different ranks
Difficult to estimate a priori
With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%