What is Mesos? How does it work? The following slides review this open-source project for managing computer clusters.
6. What is Mesos?
Resource Manager
Mesos abstracts
computing resources
from nodes in the
datacenter.
“Program against your
datacenter like it’s a single
pool of resources”
Different workloads
Mesos is a platform for
sharing a cluster
between applications. It
can scale up to 10,000s
of nodes.
Uses containerization
Workloads are launched
in containers (Mesos
containerizer or Docker), providing
a level of isolation.
7. A Distributed Systems Kernel
Just as an OS manages resource utilization, allowing multiple applications to share
limited resources concurrently, Mesos applies this principle to a whole cluster
of machines to provide resource management and scheduling across the cluster.
10. Master Nodes
● Source of truth of the cluster status
(in memory - high memory usage)
● Send resource offers to the
applications.
● Host primary UI
● High availability with active-passive
replication using ZooKeeper for
leader election and Paxos for state
sharing.
Zookeeper
11. Agent Nodes
● Launch containers running
application tasks.
● Advertise their available resources
to the master.
● Host a UI for the launched
containers.
● Manage status updates from
running tasks and handle
communication with the master.
● Known as slaves until 0.28
12. MESOS IS NOT AN OS
The Kernel comparison can be confusing: each node
has an OS installed and Mesos runs as a service
daemon on it
13. 3. Resources and attributes
http://mesos.apache.org/documentation/latest/attributes-resources/
14. What is a Resource?
Types
● SCALAR (1024.0)
● RANGE ([1-10])
● SET ({elem1, elem2})
Predefined resources
● cpus
● mem
● disk
● ports
Everything an application task uses for doing its work
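The three value types above can be told apart by their syntax alone. Below is a minimal sketch of a parser for a simplified `name:value;name:value` string, modeled on the format of the agent's `--resources` flag. The function name `parse_resources` is hypothetical, and the sketch ignores roles and other details of the real flag.

```python
import re
from typing import Dict, List, Set, Tuple, Union

Value = Union[float, List[Tuple[int, int]], Set[str]]


def parse_resources(spec: str) -> Dict[str, Value]:
    """Parse a simplified 'name:value;name:value' resource string."""
    resources: Dict[str, Value] = {}
    for item in filter(None, (part.strip() for part in spec.split(";"))):
        name, _, raw = item.partition(":")
        name, raw = name.strip(), raw.strip()
        if raw.startswith("["):
            # RANGE: one or more 'low-high' pairs inside brackets.
            resources[name] = [
                (int(low), int(high))
                for low, high in re.findall(r"(\d+)\s*-\s*(\d+)", raw)
            ]
        elif raw.startswith("{"):
            # SET: comma-separated elements inside braces.
            resources[name] = {
                elem.strip() for elem in raw.strip("{}").split(",") if elem.strip()
            }
        else:
            # SCALAR: a plain number.
            resources[name] = float(raw)
    return resources
```

For example, `parse_resources("cpus:8;ports:[9000-9300];bugs:{bug1,bug2}")` yields a float, a list of `(low, high)` tuples, and a set.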
15. Resources are Defined by Agent
● Each Mesos agent is
configured with the
resources it has.
● The agent continuously
sends updates to the
master with its available
resources.
cpu 8.0
mem 4096.0
disk 1024.0
ports [9000-65535]
cpu 16.0
mem 8192.0
disk 512.0
ports [9000-10000]
16. CPUs Resource
Represents how many CPU cores are available.
● Can be specified in fractions (0.5
CPUs)
● By default, Mesos configures
each agent with the number of
cores in the processor.
● Mesos enforces it by using
CPU shares (CPU time per
second)
● It’s a guaranteed minimum (if
there’s more CPU time
available, it could be used)
Example
cpus=24
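To see why CPU shares give a guaranteed minimum rather than a hard cap, here is a small sketch of proportional sharing. It assumes a conversion of 1024 shares per CPU with a floor of 2, which is my understanding of what the cgroups isolator uses; treat both constants and the function names as assumptions.

```python
from typing import Dict

CPU_SHARES_PER_CPU = 1024  # assumed conversion factor
MIN_CPU_SHARES = 2         # assumed lower bound


def cpu_shares(cpus: float) -> int:
    """Convert a (possibly fractional) cpus value into CPU shares."""
    return max(MIN_CPU_SHARES, int(cpus * CPU_SHARES_PER_CPU))


def cpu_under_contention(requests: Dict[str, float], cores: float) -> Dict[str, float]:
    """CPU each busy container gets when all of them compete for `cores`."""
    total = sum(cpu_shares(cpus) for cpus in requests.values())
    return {
        name: cores * cpu_shares(cpus) / total
        for name, cpus in requests.items()
    }
```

With requests of 0.5 and 1.5 CPUs on a 2-core agent, each container gets exactly what it asked for. On a 4-core agent with nothing else running, the same containers get 1.0 and 3.0 cores: the spare CPU time is split in proportion to the shares.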
17. Memory Resource
Represents how many MB of memory are available.
● By default, Mesos leaves 1 GB
or 50% of detected memory,
whichever is smaller, for the
OS and offers the rest. (Leave
memory for the OS!)
● It’s a strictly preallocated
resource (you get what you
reserve)
● That makes it a critical resource
(you have to get the right amount
of memory for your tasks,
otherwise they could get killed if
they try to use too much)
Example
mem=1024.0
18. Disk Resource
Represents how many MB of disk space are available.
● By default, Mesos leaves 5 GB
or 50% of detected disk,
whichever is smaller, unallocated
and offers the rest.
● It affects the container’s
sandbox.
● Mesos, by default, doesn’t
enforce it (it’s not really
allocated, a task can use as
much space as it wants). Setting
--enforce_container_disk_quota
changes that behaviour.
Example
disk=2048.0
19. Ports Resource
Represents the ports available to listen on in the agent.
● It’s a RANGE.
● By default, Mesos configures
each agent to expose port range
31000–32000.
● Port usage is not enforced by
Mesos.
● However, it’s important to
reserve the ports a task must
listen on, to avoid
conflicts (only one process can
listen on a port at a time).
Example
ports=[9000-9300]
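Since Mesos does not enforce port usage, the bookkeeping is up to the scheduler. The sketch below shows one way to carve a single port out of an offered range so it is not handed to two tasks. The helper `take_port` is hypothetical and works on plain `(low, high)` tuples, not on Mesos messages.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def take_port(ranges: List[Range], port: int) -> List[Range]:
    """Return the ranges left after reserving `port`.

    Raises ValueError if the port is not inside any offered range.
    """
    remaining: List[Range] = []
    found = False
    for low, high in ranges:
        if low <= port <= high:
            found = True
            # Keep whatever is left on each side of the reserved port.
            if low < port:
                remaining.append((low, port - 1))
            if port < high:
                remaining.append((port + 1, high))
        else:
            remaining.append((low, high))
    if not found:
        raise ValueError(f"port {port} was not offered")
    return remaining
```

Reserving port 9100 from `[(9000, 9300)]` leaves `[(9000, 9099), (9101, 9300)]` for later tasks.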
20. Custom Resources
● Mesos allows you to define
any custom resource.
● Remember that a resource is
something which can be
exclusively reserved.
● There’s no need to enforce
the resource allocation (see
disk or ports).
Examples
● network_bandwidth=1000.0
● bugs={bug1, bug2}
● oranges=1500.0
These resources will be offered to
applications, which need to be
able to manage them if they want to
use them.
21. What is an Attribute?
Types
● SCALAR (1024.0)
● RANGE ([1-10])
● SET ({elem1, elem2})
● They are not allocated, only
passed along with the
resources to the applications
in offers.
● They are a helper for the
scheduling decisions.
Arbitrary key-value data that serves as metadata about the
machine running the agent.
Example
● rack_id=eu-1
● os=ubuntu
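Because attributes only travel along with offers, using them is entirely the scheduler's job. A minimal sketch of attribute-based placement follows. It assumes a hypothetical, simplified offer shape (a dict with a flat `attributes` mapping) rather than the real Mesos message format.

```python
from typing import Dict, List


def offers_matching(offers: List[dict], required: Dict[str, str]) -> List[dict]:
    """Keep only the offers whose agent attributes match every requirement."""
    return [
        offer
        for offer in offers
        if all(
            offer.get("attributes", {}).get(key) == value
            for key, value in required.items()
        )
    ]
```

A scheduler that must run on Ubuntu machines in rack `eu-1` would call `offers_matching(offers, {"rack_id": "eu-1", "os": "ubuntu"})` and decline everything else.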
24. What is a Framework? An application that runs
on Mesos.
● Based on the
master-worker design.
● It’s specific to the
application’s business
model
Two components:
● Scheduler
● Executors
25. Scheduler
● It’s the brain of the
framework.
● Registers with Mesos and
receives resource offers.
● Launches tasks for the
application when it has been
offered enough
resources, or according
to other scheduling logic.
● We could see it as an
intermediary between the
application logic and the
Mesos layer.
● It’s developed for each
application. Mesos provides
an API for doing it (HTTP and
native)
26. Executor
● Launched by the scheduler
when it has work to do
(worker).
● It will receive tasks to do
from the scheduler and will
send back status updates
(it’s connected with Mesos
too).
● Acts as a process container
that runs tasks.
● Mesos provides an executor
API also, but, given that it’s
more general purpose than
the scheduler, Mesos
provides a
CommandExecutor that
should be enough for most of
the workloads.
27. Task
● The unit of work in Mesos,
the workload that a
scheduler wants to run in the
cluster.
● Runs inside an executor.
● An Executor can run more
than one task (not common).
● A task has a definition of the
needed resources that will
be allocated.
● Mesos will allocate to the
container enough resources
for the bunch of tasks
launched plus the executor.
(and will resize it dynamically
if more tasks are added).
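The container sizing rule in the last bullet is plain addition: the executor's own resources plus those of every task it runs. A small sketch, using hypothetical dicts of scalar resources:

```python
from typing import Dict, List


def container_resources(
    executor: Dict[str, float], tasks: List[Dict[str, float]]
) -> Dict[str, float]:
    """Sum the executor's resources and those of all tasks it is running."""
    total = dict(executor)
    for task in tasks:
        for name, amount in task.items():
            total[name] = total.get(name, 0.0) + amount
    return total
```

An executor needing 0.25 CPUs and 32 MB that runs tasks of (1.0 CPU, 512 MB) and (0.5 CPU, 256 MB) ends up in a container sized at 1.75 CPUs and 800 MB. Adding a third task means calling the function again with the longer list, which mirrors the dynamic resize.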
29. What is an Offer?
● Used by Mesos to allocate resources to a
framework.
● The leading master sends offers to the
frameworks’ schedulers.
30. What’s Inside an Offer?
● Resources offered.
● Affected agent
(slaveId).
● Attributes of the
agent.
cpu 8.0
mem 4096.0
disk 1024.0
ports [9000-65535]
hostname agent-1
rack_id EU-I-1
slaveId asd1323...
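To make the offer contents concrete, here is a sketch that pulls resources and attributes out of an offer expressed as JSON-like data. The field layout is modeled on my understanding of the v1 scheduler HTTP API and should be checked against the Mesos documentation. The ids, the values in `EXAMPLE_OFFER`, and the function name are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical offer; the structure is an assumption, the values are invented.
EXAMPLE_OFFER = {
    "id": {"value": "offer-1"},
    "agent_id": {"value": "agent-id-1"},
    "hostname": "agent-1",
    "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 8.0}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 4096.0}},
        {"name": "disk", "type": "SCALAR", "scalar": {"value": 1024.0}},
        {"name": "ports", "type": "RANGES",
         "ranges": {"range": [{"begin": 9000, "end": 9300}]}},
    ],
    "attributes": [
        {"name": "rack_id", "type": "TEXT", "text": {"value": "EU-I-1"}},
    ],
}


def summarize_offer(
    offer: dict,
) -> Tuple[Dict[str, float], List[Tuple[int, int]], Dict[str, str]]:
    """Return (scalar resources, port ranges, text attributes) of an offer."""
    scalars: Dict[str, float] = {}
    ports: List[Tuple[int, int]] = []
    attributes: Dict[str, str] = {}
    for res in offer.get("resources", []):
        if res["type"] == "SCALAR":
            scalars[res["name"]] = res["scalar"]["value"]
        elif res["type"] == "RANGES" and res["name"] == "ports":
            ports = [(r["begin"], r["end"]) for r in res["ranges"]["range"]]
    for attr in offer.get("attributes", []):
        if attr["type"] == "TEXT":
            attributes[attr["name"]] = attr["text"]["value"]
    return scalars, ports, attributes
```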
31. How’re Offers Sent to Frameworks?
● Masters run the resource
allocator module.
● This module decides which
framework to send an
offer to, using an algorithm
called DRF (Dominant
Resource Fairness).
● The allocation module is
pluggable.
● The algorithm tries to
maximize the minimal
dominant share across
frameworks. (Considering
their dominant resource)
● DRF orders frameworks and
then the offer is sent to them
in order one at a time.
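The DRF idea can be sketched in a few lines. This is a simplified simulation that hands out resources in task-sized units, always to the framework with the lowest dominant share. It is not the Mesos allocator, which works with offers; the function names and the fixed per-task demands are assumptions for illustration.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, capacity: Resources) -> float:
    """Largest fraction of any single cluster resource held by a framework."""
    return max(allocated[r] / capacity[r] for r in capacity)


def drf_allocate(capacity: Resources, demands: Dict[str, Resources]) -> Dict[str, int]:
    """Return how many tasks each framework gets under a simplified DRF."""
    if any(not any(v > 0 for v in d.values()) for d in demands.values()):
        raise ValueError("every framework must demand a positive amount")
    used = {r: 0.0 for r in capacity}
    allocated = {f: {r: 0.0 for r in capacity} for f in demands}
    tasks = {f: 0 for f in demands}
    active = sorted(demands)  # frameworks that may still fit another task
    while active:
        # Serve the framework with the lowest dominant share (ties: by name).
        framework = min(active, key=lambda f: dominant_share(allocated[f], capacity))
        demand = demands[framework]
        if any(used[r] + demand.get(r, 0.0) > capacity[r] for r in capacity):
            active.remove(framework)  # its next task no longer fits
            continue
        for r in capacity:
            used[r] += demand.get(r, 0.0)
            allocated[framework][r] += demand.get(r, 0.0)
        tasks[framework] += 1
    return tasks
```

With 9 CPUs and 18 GB of memory, framework A (tasks of 1 CPU, 4 GB) and framework B (tasks of 3 CPUs, 1 GB) end up with 3 and 2 tasks. A's dominant resource is memory (12/18) and B's is CPU (6/9), so both dominant shares are equal at two thirds.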
32. What to Do with an Offer?
ACCEPT
● Launch a task with
resources of the offer (only
the needed, not all)
● Perform a reservation.
● Create a persistent volume.
REJECT
● Decline the offer without
using its resources.
● Why decline explicitly? While
an offer is outstanding, the
allocator counts its resources
as allocated to the framework
(so the framework is penalized
in DRF).
33. More About Offers
● Different offers of the same
agent can be grouped to get
more resources (when
accepting an offer).
● Several tasks can be
launched with the same
offer (as long as there are
enough resources)
● Mesos tries to send offers as
big as possible.
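The first two bullets can be sketched together: merge the offers that come from the same agent, then pack as many tasks as fit. Offers are represented here as hypothetical dicts with an `agent_id` and scalar `resources`, and the packing is a simple first-fit, which is only one of many possible strategies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Resources = Dict[str, float]


def merge_offers_by_agent(offers: List[dict]) -> Dict[str, Resources]:
    """Add up the scalar resources of all offers coming from the same agent."""
    merged: Dict[str, Resources] = defaultdict(lambda: defaultdict(float))
    for offer in offers:
        for name, amount in offer["resources"].items():
            merged[offer["agent_id"]][name] += amount
    return {agent: dict(res) for agent, res in merged.items()}


def pack_tasks(available: Resources, tasks: List[Tuple[str, Resources]]) -> List[str]:
    """First-fit: return the names of the tasks that fit into `available`."""
    free = dict(available)
    launched: List[str] = []
    for name, need in tasks:
        if all(free.get(res, 0.0) >= amount for res, amount in need.items()):
            for res, amount in need.items():
                free[res] -= amount
            launched.append(name)
    return launched
```

Two offers of (1 CPU, 512 MB) from the same agent merge into (2 CPUs, 1024 MB), enough to launch a 1.5-CPU task and a 0.5-CPU task in a single accept.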
34. Two Level Scheduling
Master manages cluster
resources and decides
which framework to send an
offer to.
Schedulers accept or
reject offers according to
the concrete application
needs.
37. What’s a Role?
● Like a group of frameworks.
● Used to ensure that certain resources are only offered
to certain frameworks (only resources allocated to a
role are offered to a framework, with an exception).
● Each framework registers with Mesos with a role (by
default, * )
38. * IS A ROLE, NOT ANY
The default role (*) doesn’t mean that any role is
accepted; it is a concrete role (bad name…)
39. More on Roles
Any role is allowed
Frameworks can register
with any role name,
unless the flag --roles is
set in the Mesos masters
with a concrete list.
Resources allocated to *
are available to all roles
By default, resources are
allocated to the default
role (*). All the
frameworks, no matter
their role, will receive
offers of resources
allocated to ‘*’.
Roles can use weights
Weights can be assigned
to roles to tell DRF
that a certain role should
get a larger share of
resources than others.
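One common way to fold weights into DRF is to divide each role's dominant share by its weight before sorting, so a role with weight 2 looks half as "full" as it really is. The sketch below assumes that scheme; the function name and the default weight of 1.0 are illustrative choices.

```python
from typing import Dict, List


def offer_order(
    dominant_shares: Dict[str, float], weights: Dict[str, float]
) -> List[str]:
    """Order roles by weighted dominant share, lowest (served first) to highest."""
    return sorted(
        dominant_shares,
        key=lambda role: (dominant_shares[role] / weights.get(role, 1.0), role),
    )
```

With dominant shares of 0.4 for `pro` and 0.3 for `dev`, `dev` is served first when weights are equal. Giving `pro` a weight of 2.0 turns its share into 0.2 for ordering purposes, so `pro` moves ahead.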
42. Static Reservation
While configuring the
exposed resources in an
agent, those resources
can be statically
reserved for specific
roles.
cpu 4.0
mem 2048.0
disk(*) 512.0
ports [9000-65535]
cpu(pro) 4.0
mem(pro) 2048.0
disk(pro) 512.0
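A layout like the one above is passed to the agent through its `--resources` flag, where a role in parentheses after the resource name marks a static reservation. The sketch below builds such a flag value from a list of tuples. The helper names are hypothetical, and unreserved resources are written without a role suffix.

```python
from typing import List, Tuple, Union

Value = Union[int, float, List[Tuple[int, int]]]
Entry = Tuple[str, str, Value]  # (resource name, role, value)


def format_value(value: Value) -> str:
    """Render a scalar as-is and a list of ranges as '[low-high,...]'."""
    if isinstance(value, list):
        return "[" + ",".join(f"{low}-{high}" for low, high in value) + "]"
    return str(value)


def resources_flag(entries: List[Entry]) -> str:
    """Build a 'name(role):value;...' string for the agent's --resources flag."""
    parts = []
    for name, role, value in entries:
        label = name if role == "*" else f"{name}({role})"
        parts.append(f"{label}:{format_value(value)}")
    return ";".join(parts)
```

For the split shown on the slide, the result is `cpus:4;mem:2048;disk:512;ports:[9000-65535];cpus(pro):4;mem(pro):2048;disk(pro):512`.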
43. Static Reservation
Not recommended
Static reservations are
only maintained for
backwards compatibility.
Restart needed
To change the amount of
reserved resources you
must modify the
agent configuration and
restart it.
By default, resources
are allocated to the
default role
44. Dynamic Reservation
Resources can be
reserved and
unreserved
At runtime, resources
can be reserved for a
role, and later they can
be unreserved when no
task is using those
resources.
Using an HTTP
endpoint
Dynamic reservation is
managed by operators
using HTTP endpoints
to reserve and
unreserve.
Using an acceptOffers
operation
Schedulers can
reserve/unreserve
resources when
accepting an offer by
using two special
operations.
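As an illustration of the scheduler path, the sketch below builds the body of a RESERVE operation as JSON-like data. The field layout follows my understanding of the older `Offer.Operation` format (a `role` plus a `reservation.principal` on each resource); newer Mesos versions describe reservations differently, so verify the shape against the documentation for your version. The function name and the example role and principal are made up.

```python
import json
from typing import Dict


def reserve_operation(role: str, principal: str, scalars: Dict[str, float]) -> dict:
    """Build a RESERVE operation for the given scalar resources (assumed layout)."""
    return {
        "type": "RESERVE",
        "reserve": {
            "resources": [
                {
                    "name": name,
                    "type": "SCALAR",
                    "scalar": {"value": value},
                    "role": role,
                    "reservation": {"principal": principal},
                }
                for name, value in scalars.items()
            ]
        },
    }


if __name__ == "__main__":
    # Example: reserve 2 CPUs and 1 GB for role "pro" (illustrative values).
    operation = reserve_operation("pro", "my-framework", {"cpus": 2.0, "mem": 1024.0})
    print(json.dumps(operation, indent=2))
```

An UNRESERVE operation would carry the same resource descriptions under a different operation type.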
46. Sandbox (Disk Resource)
Working directory
A Sandbox is a
temporary directory
given to each executor
and set as working
directory for it. It’s
accessible from outside
the container.
Stores logs and other
data
It contains the stdout
and stderr of the
executor. Besides that it
contains the fetched files
(URI) and files created by
the task.
Garbage collected
This directory is cleaned
from the agent system
once a configurable
period of time has
passed.
47. Persistent Volumes
● Created from disk resources, they live outside the
executor’s sandbox and will persist on the agent.
● When a task using them finishes, they are offered back
without losing data.
● Used for stateful services.
48. More on Persistent Volumes
● Created over previously
reserved disk resources.
● No more than one task can
have the volume at the same
time.
● To unreserve the disk
resources associated with a
persistent volume, the
volume must be destroyed
first.
● Created/destroyed using
HTTP endpoints or via
acceptOffers in the
Scheduler.
● Associated with a role (the volume
can be offered back to any
framework in the role).
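The rules above form a small lifecycle: reserve disk, create a volume on it, let one task at a time use it, destroy it, and only then unreserve. The class below is a toy model of those rules, useful for reasoning about ordering. It is not a Mesos API, and all names in it are hypothetical.

```python
from typing import Dict


class ReservedDisk:
    """Disk reserved for a role on one agent (illustrative model only)."""

    def __init__(self, role: str, size_mb: int) -> None:
        self.role = role
        self.size_mb = size_mb
        self.volumes: Dict[str, int] = {}  # volume id -> size in MB
        self.in_use: Dict[str, str] = {}   # volume id -> task holding it

    def create_volume(self, volume_id: str, size_mb: int) -> None:
        # Volumes can only be carved out of previously reserved disk.
        free = self.size_mb - sum(self.volumes.values())
        if size_mb > free:
            raise ValueError("not enough reserved disk for this volume")
        self.volumes[volume_id] = size_mb

    def attach(self, volume_id: str, task: str) -> None:
        if volume_id not in self.volumes:
            raise KeyError(volume_id)
        if volume_id in self.in_use:
            raise RuntimeError("volume already held by another task")
        self.in_use[volume_id] = task

    def release(self, volume_id: str) -> None:
        # Task finished: the volume, data included, is offered back to the role.
        self.in_use.pop(volume_id, None)

    def destroy_volume(self, volume_id: str) -> None:
        if volume_id in self.in_use:
            raise RuntimeError("volume is in use")
        del self.volumes[volume_id]

    def can_unreserve(self) -> bool:
        # The disk can only be unreserved once no volume lives on it.
        return not self.volumes
```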
49. Type of Disk Resources
ROOT
Maps to the main
operating system
storage drive. It’s the
default option.
MOUNT
Auxiliary disks provided
by operators, which map
to a mount point in the
host OS. When reserved,
the whole disk is reserved
(no matter the requested
size).
PATH
Auxiliary disk resource
created by operator,
which maps a directory
in the host OS to a disk
resource. Usually used
to carve up a mounted
disk into smaller chunks.
52. Docker Containerizer
● Works with Docker images
(task/executor).
● Uses docker-engine (docker
run….).
● Needs docker installed in
each agent. (external
dependency…)
The Mesos roadmap is to unify
the containerizers and drop
support for this one.
53. Mesos Containerizer
● Runs commands of the host
OS.
● Runs Docker/AppC Images
(Universal Containerizer).
● Uses Linux cgroups and namespaces.
● Based on pluggable
isolators, which are used for
isolating resources from
other containers.
● Examples: cgroups/cpu,
cgroups/mem,
docker/volume, disk/du,
docker/runtime, network/cni,
etc.
Tip:
sudo nsenter --mount --uts --ipc --net --pid --target <PID_CONTAINER>
54. Docker on Mesos Containerizer
● A Docker image
represents a filesystem.
● Mesos pulls the image
and extracts the
filesystem.
● Using pivot_root, the
container is launched
over that filesystem.
● Isolation is done by the
Mesos containerizer (no
docker-engine
dependency).
http://events.linuxfoundation.org/sites/events/files/slides/Mesos%20and%20Containers.pdf
55. Docker on Mesos Containerizer
BE CAREFUL WITH
PERMISSIONS
The user namespace is shared with
the agent (the only way to use a
user created in the Dockerfile is
to have a user on the agent with
the same name, uid and gid).
BRIDGE NETWORK IS NOT
SUPPORTED
When you bind to a port, by
default you do it on the agent’s
host network stack (unless you use
another isolator such as network/cni
for virtual networks and IP
per container).
57. External Volumes
● Uses dvdcli and a Docker
Volume plugin, for instance
REX-Ray or GlusterFS
(dependency).
● Mounts an external volume
from a storage provider to
the task container (Cinder,
Amazon EBS, etc).
● Instead of binding a task’s data
to an agent (persistent
volumes), it manages storage
outside the agents.
58. Oversubscription
● Frameworks can use
resources allocated to another
framework but temporarily
unused.
● These resources can be
revoked by Mesos at any
moment.
● A QoS module ensures that
the framework that owns these
resources suffers no
performance impact.
59. Checkpointing
For agent recovery, a
Framework can enable
checkpointing to write its
state to disk regularly.
If the Mesos Agent is stopped (a
failure or upgrade), tasks of
checkpointed frameworks
continue running (otherwise,
all running tasks are killed).
60. Hands On: Let’s make a framework
https://github.com/roberveral/mesos-gocd