Atsushi Hori
Researcher, System Software Development Team, RIKEN
Arm HPC Workshop@Akihabara 2017

A new portable and practical parallel execution model, Process in Process (PiP for short), will be presented. PiP tasks share the same virtual address space, as in the multi-thread model, yet have privatized variables, as in the multi-process model. PiP thus offers the best of both worlds: multi-process (MPI) and multi-thread (OpenMP).
Background
• The rise of many-core architectures
• The current parallel execution models were designed for multi-core architectures
• Shall we have a new parallel execution model?
What to share and what not to share
• Isolated address spaces
  • slow communication
• Shared variables
  • contention on shared variables

                         Address Space:
                         Isolated              Shared
Variables: Privatized    Multi-Process (MPI)   3rd Exec. Model
Variables: Shared        ??                    Multi-Thread (OpenMP)
Implementations of the 3rd Execution Model
• MPC (by CEA)
  • Multi-thread approach
  • The compiler converts all variables to thread-local storage
  • a.out and b.out cannot run simultaneously
• PVAS (by RIKEN)
  • Multi-process approach
  • Patched Linux: the OS kernel allows processes to share an address space
• MPC, PVAS, and SMARTMAP are not portable
Why does portability matter?
• On large supercomputers (e.g., the K computer), users are not allowed to install a modified OS kernel or a kernel module
• When I tried to port PVAS onto McKernel, the core developer rejected the modification
  • "DO NOT CONTAMINATE MY CODE !!"
PiP is very PORTABLE

Machine                  CPU        OS
Xeon and Xeon Phi        x86_64     Linux, McKernel
the K and FX10           SPARC64    XTCOS
ARM (Opteron A1170)      Aarch64    Linux

[Figure: Task Spawning Time. Time [s] vs. number of tasks (1 to 200) on Xeon, KNL, Aarch64, and the K, comparing PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, and Pthread]
Portability
• PiP can run on any machine where the following are supported:
  • pthread_create() (or the clone() system call)
  • PIE
  • dlmopen()
• PiP does not run on:
  • BG/Q (PIE is not supported)
  • Windows (PIE is not fully supported)
  • Mac OS X (dlmopen() is not supported)
• FACT: all machines listed in the Top500 (Nov. 2017) run a Linux-family OS !!
Why is address-space sharing better?
• Memory-mapping techniques in the multi-process model
  • POSIX shmem (SYS-V, mmap, ...)
  • XPMEM
• With PiP, the same page table is shared by all tasks
  • no page-table coherency overhead
  • less memory consumed by page tables
  • pointers can be used as they are

[Figure: Proc-0 and Proc-1 each map the shared region through their own page table; the kernel must keep the two page tables coherent over the shared physical memory pages, causing overhead (system calls, page faults, and page-table size)]
Memory Mapping vs. PiP
[Excerpt from the PiP paper, PPoPP 2018, February 24–28, 2018, Vienna, Austria; the surrounding column text and the platform H/W and S/W tables are clipped]
Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM                        Cycles
  xpmem_make()                1,585
  xpmem_get()                15,294
  xpmem_attach()              2,414
  xpmem_detach()             19,183
  xpmem_release()               693

POSIX shmem                  Cycles
  Sender:   shm_open()       22,294
            ftruncate()       4,080
            mmap()            5,553
            close()           6,017
  Receiver: shm_open()       13,522
            mmap()           16,232
            close()          16,746
6.2 Page Fault Overhead
Figure 4 shows the time series of each access, using the same microbenchmark program used in the preceding subsection. Element access was strided by 64 bytes so that each cache block was accessed only once, to eliminate the cache-block effect. In the XPMEM case, the mmap()ed region was attached by using the XPMEM functions. The upper-left graph in this figure shows the time series using POSIX shmem and XPMEM, and the lower-left graph shows the time series using PiP. Both graphs on the left-hand side show spikes at every 4 KiB. Because of space limitations, we do not show
[Figure (Xeon/Linux): access time [ticks] vs. array byte offset (0 to 16,384), for POSIX shmem and XPMEM with 4 KiB and 2 MiB pages (top) and for PiP:process and PiP:thread (bottom)]
PiP takes less than 100 clocks !!
Process in Process (PiP)
• dlmopen() (not a typo of dlopen())
  • loads a program into a new name space
  • the same variable "foo" can have multiple instances at different addresses
• Position Independent Executable (PIE)
  • a PIE program can be loaded at any location
• Combine dlmopen() and PIE
  • load a PIE program with dlmopen()
  • variables can be privatized within the same address space
Glibc Issue
• In the current Glibc, dlmopen() can create at most 16 name spaces
• Each PiP task requires one name space to hold its privatized variables
• A many-core architecture can run more than 16 PiP tasks, up to the number of CPU cores
• A Glibc patch is also provided to allow a larger number of name spaces, in case 16 is not enough
  • it changes the size of the name-space table
  • currently 260 PiP tasks can be created
• Some workaround code can be found in the PiP library
Research Collaboration
• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
• MPICH
• UT/ICL (Prof. Bosilca)
• Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
• MPC
• UIUC (Prof. Kale) — JLESC
• AMPI
• Intel (Dr. Dayal)
• In Situ
Summary
• Process in Process (PiP)
  • a new implementation of the 3rd execution model
  • better than memory-mapping techniques
• PiP is portable and practical because of its user-level implementation
  • it can run on the K and OFP supercomputers
• Showcases demonstrate that PiP can improve performance
Final Words
• The Glibc issues will be reported to Red Hat
• We are seeking PiP applications not only in HPC but also in enterprise computing