Mosix : automatic load balancing and migration
1. Nov 24,2000 r.innocente 1
Automatic load balancing and
transparent process migration
Roberto Innocente
rinnocente@hotmail.com
November 24, 2000
Download postscript from: mosix.ps
or gzipped postscript from: mosix.ps.gz
or pdf from: mosix.pdf

Introduction
The purpose of this presentation is to introduce
the most important features of MOSIX:
- automatic load balancing
- transparent process migration
giving an overview of related projects and of the
different possible implementations of these
mechanisms.

Scheme of the presentation
1. Overview of some Distributed Operating Systems
2. Load balancing mechanisms
3. Process migration mechanisms
4. Focus on Mosix

Distributed O.S.
- Sprite
- Charlotte
- Accent
- Amoeba
- V System
- MOSIX
- Condor (this is just a batch system, but it offers job migration)

Sprite (Douglis, Ousterhout* 1987)
Experimental network OS developed at UCB.
It ran on Suns and DECstations.
Process migration in this system introduced the
simplifying concept of the Home Node of a process (the
originating node where it should still appear to run).
Since then this concept has been used by many other
implementations.
*of Tcl/Tk fame

Charlotte (Artsy, Finkel 1989)
A distributed operating system developed
at the University of Wisconsin-Madison (UoWM), based on the message-passing
paradigm.
It provides a well-developed process migration
mechanism that leaves no residual dependencies
(the migrated process can survive the death of the
originating node), and is therefore also suitable for
fault-tolerant systems.

Accent (Zayas 1987)
- developed at CMU as a precursor of Mach
- a kernel, not Unix compatible, with IPC based on a “port” abstraction
- process migration was added by Zayas (1987)
- no load balancing

Amoeba (Tanenbaum et al. 1990)
- Tanenbaum, van Renesse, van Staveren, Sharp, Mullender, Jansen, van Rossum* (Experiences with the Amoeba Distributed Operating System, 1990)
- Zhu, Steketee (An experimental study of Load Balancing on Amoeba, 1994)
- doesn’t support virtual memory
- was based on the processor-pool model, which assumes that the number of processors is larger than the number of processes
* of Python fame

V System (Theimer 1987)
- developed at Stanford, microkernel based, runs on a set of diskless workstations
- V supports both remote execution and process migration; these facilities were added with the aim of using workstations during idle periods

MOSIX (Barak 1990, 1995)
Its roots go back to the mid ’80s:
- MOS multicomputer OS (Barak, HUJI 1985)
- set of enhancements to BSD/OS: MOSIX ’90
- port to Linux (Mosix 7th edition) around ’95: the system-call-based management was replaced by a /proc filesystem interface
- enhancements ’98 (memory ushering algorithm, ...)

Condor (Litzkow, Livny 1988)
Condor is not a distributed OS.
It is a batch system whose purpose is to use
the idle time of workstations and PCs.
It can migrate jobs from machine to machine,
but it imposes restrictions on the jobs that can be
migrated, and the programs have to be linked
with a special checkpoint library.
(More details at http://hpc.sissa.it/batch)

Load balancing methods taxonomy
- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some distributed OSs)
- job initiation + migration

Load balancing algorithms
Two main categories:
- Non pre-emptive: used only when a new job has to be started; they decide where to run the new job (aka job placement, many batch systems)
- Pre-emptive: jobs can be migrated during execution, so load balancing can be applied dynamically while jobs run (distributed OSs, Condor)

Non pre-emptive load balancing (aka job placement)
- centralized initiation: there is only one load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer that sends its load to the others whenever it detects a load change; the local balancer decides where to start new jobs according to the load info received
- random initiation: when a new job arrives, the local load balancer randomly selects the computer to which the job is assigned
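The three placement policies can be sketched in a few lines of Python. This is only an illustration under simple assumptions: loads are plain numbers per node, and the function and node names are hypothetical, not taken from any batch system.

```python
import random

def place_centralized(loads):
    """Centralized initiation: one balancer sees every load and picks the minimum."""
    return min(loads, key=loads.get)

def place_distributed(received_view, self_name, self_load):
    """Distributed initiation: a node decides from the load info it has received,
    which may be out of date, plus its own current load."""
    view = dict(received_view)
    view[self_name] = self_load
    return min(view, key=view.get)

def place_random(nodes, rng=random):
    """Random initiation: no load info is used at all."""
    return rng.choice(sorted(nodes))

loads = {"n1": 3, "n2": 0, "n3": 5}
print(place_centralized(loads))                         # n2
# n1 still believes n2 has load 4, so it keeps the job for itself.
print(place_distributed({"n2": 4, "n3": 5}, "n1", 3))   # n1
print(place_random(loads, random.Random(0)))
```

The second call shows the typical weakness of distributed initiation: the decision is only as good as the last load report received.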

Pre-emptive load balancing
- random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance it selects a computer with a low load as the target for a process migrated from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when a request is accepted, a pair is formed. If the two detect a load imbalance between them they migrate some processes from one to the other, after which the pair is broken
- broadcast: server based
- distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node
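As a rough sketch of the balancing step inside the neighbour scheme, once a pair has been formed: the busier node hands processes to the other until their queue lengths are close. The threshold and the choice of which process to move are assumptions made for the example, not details from Bryant and Finkel.

```python
def balance_pair(queues, a, b, threshold=2):
    """Move processes from the busier to the idler node of a pair until the
    queue lengths differ by less than `threshold`. Returns the moved processes."""
    src, dst = (a, b) if len(queues[a]) >= len(queues[b]) else (b, a)
    moved = []
    while len(queues[src]) - len(queues[dst]) >= threshold:
        proc = queues[src].pop()      # assumption: migrate the newest process
        queues[dst].append(proc)
        moved.append(proc)
    return moved

queues = {"A": ["p1", "p2", "p3", "p4", "p5"], "B": ["q1"]}
moved = balance_pair(queues, "A", "B")
print(moved)    # ['p5', 'p4']
print(queues)   # both nodes now hold three processes
```

After the call the pair would be dissolved and each node would be free to pair with another neighbour.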

Processor load measures
The metric used to evaluate the load is extremely important in a load
balancing system. More than 20 different metrics are cited in a single
paper (Mehra & Wah 1993), among them:
- std Unix: number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1-minute load average
- amount of CPU free time
Many studies nevertheless conclude that the simple run-queue length is a
better measure than complex load criteria in most situations.
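Two of these metrics are easy to read from user space. The sketch below uses Python's `os.getloadavg()` for the 1-minute load average and parses a line in the Linux `/proc/loadavg` format for the number of runnable tasks. The helper names are hypothetical and the sample line is made up.

```python
import os

def parse_proc_loadavg(text):
    """Parse a Linux /proc/loadavg line, e.g. '0.42 0.30 0.25 2/345 6789'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load_1min": float(fields[0]),
        "runnable": int(runnable),   # currently runnable scheduling entities
        "total": int(total),         # all scheduling entities on the system
    }

def current_load():
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None

sample = parse_proc_loadavg("0.42 0.30 0.25 2/345 6789")
print(sample["runnable"], sample["load_1min"])   # 2 0.42
```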

User/kernel level migration
- Kernel level process migration (Mosix):
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor, ...):
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on a signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors:
- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

System call interposing
In many cases transparency is obtained by interposing some code
at the system call level. There are different ways to do it:
- Very easy on message-based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX), and reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent
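The modified-library variant can be pictured with a toy model: the program keeps calling the same function, and the library decides whether to serve the call locally or forward it to the node where the process appears to run. All class and method names here are invented for the illustration; this is not the Condor or MOSIX interface.

```python
class HomeNode:
    """Stands in for the node where the process still appears to run."""
    def __init__(self, files):
        self.files = files
        self.calls = []          # record of forwarded calls

    def syscall(self, name, *args):
        self.calls.append(name)
        if name == "read_file":
            return self.files[args[0]]
        raise NotImplementedError(name)

class InterposedLibrary:
    """A 'modified system library': same call for the program, but the call
    is forwarded to the home node once the process has migrated."""
    def __init__(self, home, local_files):
        self.home = home
        self.local_files = local_files
        self.migrated = False

    def read_file(self, path):
        if self.migrated:
            return self.home.syscall("read_file", path)
        return self.local_files[path]

files = {"/etc/motd": "hello from home\n"}
home = HomeNode(files)
lib = InterposedLibrary(home, files)
before = lib.read_file("/etc/motd")   # served locally
lib.migrated = True
after = lib.read_file("/etc/motd")    # forwarded to the home node
print(before == after, home.calls)    # True ['read_file']
```

The program sees the same result in both cases, which is the transparency being described; the cost is the extra round trip for every forwarded call.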

Process migration parts:
- execution state migration
- memory space migration
- communication state migration

Execution state migration
It’s not that difficult. For the most
part the code can be the same as that used for
swapping. In that case the destination is a disk,
while for migration the destination is a
different node.

Communication state migration
It is a difficult task and is usually avoided by forcing the
migrated process to redirect networking operations to the
Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state,
which remembers the stage of the ongoing RPCs; these states are kept in
the kernel:
- during migration any message received for a suspended process is rejected and a “process migrating” response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client

Migratable sockets
Unfortunately TCP/IP has no direct support for this.
Research is ongoing on different solutions.
A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library
(Bubak, Zbik et al. 1999)
In this proposal communication takes place between virtual
addresses (there are servers that keep a correspondence table between
virtual and real addresses). When an endpoint migrates, a stub remains on
the original node to respond with a “process migrating” message. The other
endpoint should then consult the server for the new real address to contact.
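The virtual-address scheme can be modelled without any real networking. In the sketch below, which is a simplified illustration and not the library from the cited proposal, a sender contacts the real address it has cached, receives “process migrating” from the stub, and retries after asking the address server. All names are hypothetical.

```python
class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""
    def __init__(self):
        self.table = {}

    def register(self, virtual, real):
        self.table[virtual] = real

    def lookup(self, virtual):
        return self.table[virtual]

class Node:
    def __init__(self, name):
        self.name = name
        self.endpoints = set()   # virtual addresses living here
        self.stubs = set()       # virtual addresses that have moved away

    def deliver(self, virtual, msg):
        if virtual in self.endpoints:
            return "ok"
        if virtual in self.stubs:
            return "process migrating"
        return "unknown"

def migrate(virtual, src, dst, server):
    """Move an endpoint, leave a stub behind and update the server."""
    src.endpoints.discard(virtual)
    src.stubs.add(virtual)
    dst.endpoints.add(virtual)
    server.register(virtual, dst.name)

def send(virtual, msg, cached_real, nodes, server):
    """Try the cached real address first; on 'process migrating' ask the server."""
    reply = nodes[cached_real].deliver(virtual, msg)
    if reply == "process migrating":
        cached_real = server.lookup(virtual)
        reply = nodes[cached_real].deliver(virtual, msg)
    return reply, cached_real

nodes = {"a": Node("a"), "b": Node("b")}
server = AddressServer()
nodes["a"].endpoints.add("v1")
server.register("v1", "a")
migrate("v1", nodes["a"], nodes["b"], server)
result = send("v1", "hi", "a", nodes, server)
print(result)   # ('ok', 'b')
```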

Memory space migration
- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, moreover, can transfer memory that will never be used)
- pre-copying (V System: Theimer): allows the process to continue executing while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer
- lazy copying or demand paging (Accent: Zayas): pages are left on the source machine and are transferred only on page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it’s just a swap-out, swap-in)
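The trade-off between the first three strategies is mostly about how many pages move while the process is frozen. The page-counting model below is a deliberate simplification: page sets are supplied by hand and the function names are invented for the example.

```python
def total_copy(pages):
    """Freeze first, then move everything: all pages are sent while frozen."""
    return {"while_running": 0, "while_frozen": len(pages)}

def pre_copy(pages, dirtied_during_transfer):
    """Send all pages while the process runs, then freeze and re-send
    only the pages that were modified during the transfer."""
    dirty = set(dirtied_during_transfer) & set(pages)
    return {"while_running": len(pages), "while_frozen": len(dirty)}

def lazy_copy(pages, referenced_later):
    """Leave pages on the source; each one is sent on its first page fault,
    while the process is already running on the target."""
    faults = set(referenced_later) & set(pages)
    return {"while_running": len(faults), "while_frozen": 0}

pages = range(100)
print(total_copy(pages))              # {'while_running': 0, 'while_frozen': 100}
print(pre_copy(pages, {3, 7, 42}))    # {'while_running': 100, 'while_frozen': 3}
print(lazy_copy(pages, {1, 2, 3, 4})) # {'while_running': 4, 'while_frozen': 0}
```

Pre-copying sends more pages in total (103 here) but freezes the process only for the few re-copied ones; lazy copying sends the fewest but leaves the process dependent on the source node for the pages it has not yet touched.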

MOSIX important points
- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination
- each node at regular intervals sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
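A minimal simulation of this dissemination scheme, assuming a fixed fan-out of two peers per interval and a buffer of four entries (both numbers are arbitrary choices for the example, not MOSIX parameters):

```python
import random
from collections import deque

class GossipNode:
    def __init__(self, name, buffer_size=4):
        self.name = name
        self.load = 0.0
        # Bounded buffer: appending beyond maxlen silently drops the oldest entry.
        self.recent = deque(maxlen=buffer_size)

    def receive(self, sender, load):
        self.recent.append((sender, load))

    def tick(self, peers, fanout, rng):
        """Called at regular intervals: report our load to a random subset."""
        others = [p for p in peers if p is not self]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(self.name, self.load)

rng = random.Random(1)
nodes = [GossipNode(f"n{i}") for i in range(8)]
for i, node in enumerate(nodes):
    node.load = float(i)
for _ in range(10):                 # ten dissemination intervals
    for node in nodes:
        node.tick(nodes, fanout=2, rng=rng)
print([len(node.recent) for node in nodes])
```

No node ever holds a complete picture of the cluster; each one works from a small, recent and partly random sample, which keeps the traffic per node constant as the cluster grows.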

MOSIX load balancing algorithm
Mosix uses a pre-emptive and distributed load
balancing algorithm that is essentially based
on a Unix load average metric. This means that
each node tries to find a less loaded node to
which it can migrate one of its
processes.
When a node is low on memory a different algorithm
prevails: memory ushering.

Mosix info variables
These are the variables exchanged between Mosix nodes to
feed the distributed load balancing algorithm:
- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory
Look at http://hpc.sissa.it/walrus/mjm for a Java applet
showing these values over a Mosix cluster.

Memory ushering algorithm
Normally only the load balancing algorithm is active.
When free memory falls below a threshold (paging), the memory
ushering algorithm is started and prevails over the load balancing
algorithm.
This algorithm tries to migrate the smallest running
process to the node with the most free memory.
If necessary, processes are migrated between remote nodes
to create sufficient space on a single node.
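The core decision can be written down as a small function. This is a sketch of the rule as stated on the slide, not the MOSIX kernel code: the threshold value, the units and the function name are assumptions, and the final step of shuffling processes between remote nodes to make room is left out.

```python
def choose_migration(local_free_mb, threshold_mb, processes, remote_free_mb):
    """processes: {pid: resident size in MB}; remote_free_mb: {node: free MB}.
    Returns (pid, node), or None when memory ushering does not apply."""
    if local_free_mb >= threshold_mb or not processes or not remote_free_mb:
        return None
    pid = min(processes, key=processes.get)              # smallest running process
    node = max(remote_free_mb, key=remote_free_mb.get)   # node with most free memory
    if remote_free_mb[node] < processes[pid]:
        return None   # even the emptiest node cannot host the smallest process
    return pid, node

choice = choose_migration(
    local_free_mb=10,
    threshold_mb=64,
    processes={101: 300, 102: 40},
    remote_free_mb={"n2": 128, "n3": 512},
)
print(choice)   # (102, 'n3')
```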

MOSIX network protocols
These protocols are used to exchange load info among
the nodes and to negotiate and carry out process migrations
between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes
These communications are not encrypted, authenticated or
protected, therefore it’s better to use a private network for the
MNP.
(Unfortunately the sockets used by Mosix are not reported by
standard tools, e.g. netstat.)
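To make the UDP half concrete, here is a sketch of how a load report could be packed into a single datagram and sent. The wire layout is entirely hypothetical (the real MNP format is not described in these slides), as are the function names.

```python
import socket
import struct

# Hypothetical layout, NOT the real MNP format: load, number of CPUs, CPU speed,
# memory used and total memory as unsigned 32-bit integers in network byte order.
LOAD_INFO = struct.Struct("!5I")

def encode_load_info(load, cpus, speed, mem_used, mem_total):
    return LOAD_INFO.pack(load, cpus, speed, mem_used, mem_total)

def decode_load_info(datagram):
    return LOAD_INFO.unpack(datagram)

def send_load_info(info, host, port):
    """Fire-and-forget send of one report; UDP gives no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_load_info(*info), (host, port))

packet = encode_load_info(120, 2, 400, 96, 256)
print(len(packet), decode_load_info(packet))   # 20 (120, 2, 400, 96, 256)
```

Note that nothing in such a datagram proves who sent it, which illustrates why the slide recommends keeping this traffic on a private network.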

MOSIX process migration
- It’s a kernel level migration (it’s done without requiring any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn’t need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency
- Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).
- The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called the deputy in Mosix (what Condor calls the shadow).
- The migrated process is called the remote.

MOSIX deputy/remote mechanism
[Figure: the Deputy and its link layer on the UHN (Unique Home Node) communicate through the MNP (Mosix Network Protocol) with the Remote and its link layer on the running node.]

Mosix performance/1
- While a CPU-bound process obtains clear advantages from Mosix, being penalized very little by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.
- A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the instructions in the MExec tarball).

Mosix performance/2
A more general solution to the performance problem of I/O-bound processes
under migration is envisioned
to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
- global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed
directly from the running node without redirecting them to the
Unique Home Node. Mosix has recently incorporated the possibility to
declare that a file system supports Direct File System Access (DFSA).
For such file systems Mosix will perform I/O operations directly on the
running node.
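The effect of declaring a file system as DFSA can be illustrated with a simple path check: I/O on a path under a DFSA mount stays on the running node, everything else is redirected to the home node. The mount point `/mfs` and the function name are made up for the example; the actual MOSIX mechanism works inside the kernel and is not shown here.

```python
def io_location(path, dfsa_mounts):
    """Return 'running node' if `path` lies under a mount declared as DFSA,
    otherwise 'home node' (the call is redirected to the deputy)."""
    for mount in dfsa_mounts:
        root = mount.rstrip("/")
        if path == root or path.startswith(root + "/"):
            return "running node"
    return "home node"

dfsa_mounts = ["/mfs"]                               # hypothetical DFSA mount point
print(io_location("/mfs/data/a.dat", dfsa_mounts))   # running node
print(io_location("/tmp/scratch", dfsa_mounts))      # home node
```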

MOSIX dfsa (direct file system access)
[Figure: the Deputy and its link layer on the UHN (Unique Home Node) communicate through the MNP (Mosix Network Protocol) with the Remote and its link layer on the running node; DFSA access is performed directly from the running node.]

Cache coherent global distributed file systems
To be a MOSIX DFSA it’s not sufficient to be a cache-coherent
global file system. In fact some additional functions (not available in std
Linux super-blocks) must be available at the super-block level:
- identify, compare, reconstruct
A prototype DFSA that is just a small software layer over the std Linux
file system is MFS (Mosix File System), provided directly by the MOSIX
team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)
The integration of Mosix with GFS now seems stable.