Server Administration in Python with Fabric, Cuisine and Watchdog
1. Fabric, Cuisine &
Watchdog
Sébastien Pierre, ffunction inc.
@Montréal Python, February 2011
www.ffctn.com
ffunction
inc.
2. How to use Python for
Server Administration
Thanks to
Fabric
Cuisine*
& Watchdog*
*custom tools
3. The way we use
servers
has changed
4. The era of dedicated servers
Hosted in your server room or in colocation
WEB SERVER | DATABASE SERVER | EMAIL SERVER
5. The era of dedicated servers
Hosted in your server room or in colocation
WEB SERVER | DATABASE SERVER | EMAIL SERVER
Sysadmins typically SSH and configure the servers live
6. The era of dedicated servers
Hosted in your server room or in colocation
WEB SERVER | DATABASE SERVER | EMAIL SERVER
The servers are conservatively managed, updates are risky
7. The era of slices/VPS
Linode.com, Amazon EC2
[Diagram: numbered slices hosted at both providers]
We now have multiple small virtual servers (slices/VPS)
8. The era of slices/VPS
Linode.com, Amazon EC2
Often located in different data-centers
9. The era of slices/VPS
Linode.com, Amazon EC2
...and sometimes with different providers
10. The era of slices/VPS
Linode.com, Amazon EC2: slices
IWeb.com: DEDICATED SERVER 1, DEDICATED SERVER 2
We even sometimes still have physical, dedicated servers
13. The challenge
ORDER SERVER -> SETUP SERVER
Create users, groups
Customize config files
Install base packages
15. The challenge
ORDER SERVER -> SETUP SERVER -> DEPLOY APPLICATION
Install app-specific packages
Deploy application
Start services
16. The challenge
ORDER SERVER -> SETUP SERVER -> DEPLOY APPLICATION
MAKE THIS PROCESS AS FAST (AND SIMPLE) AS POSSIBLE
18. The challenge
Quickly integrate your new server in the existing architecture
19. The challenge
...and make sure it's running!
20. Today's menu
FABRIC: Interact with your remote machines as if they were local
21. Today's menu
FABRIC: Interact with your remote machines as if they were local
CUISINE: Takes care of users, groups, packages and configuration of your new machine
22. Today's menu
FABRIC: Interact with your remote machines as if they were local
CUISINE: Takes care of users, groups, packages and configuration of your new machine
WATCHDOG: Ensures that your servers and services are up and running
23. Today's menu
FABRIC: Interact with your remote machines as if they were local
CUISINE: Takes care of users, groups, packages and configuration of your new machine
WATCHDOG: Ensures that your servers and services are up and running
Made by
24. Part 1
Fabric - http://fabfile.org
application deployment & systems administration tasks
25. Fabric is a Python library
and command-line tool
for streamlining the use of SSH
for application deployment
or systems administration tasks.
26. Wait... what does that mean?
Fabric is a Python library
and command-line tool
for streamlining the use of SSH
for application deployment
or systems administration tasks.
27. Streamlining SSH
By hand:
version = os.popen("ssh myserver 'cat /proc/version'").read()
Using Fabric:
version = run("cat /proc/version")
28. Streamlining SSH
By hand:
version = os.popen("ssh myserver 'cat /proc/version'").read()
Using Fabric:
from fabric.api import *
env.hosts = ["myserver"]
version = run("cat /proc/version")
29. Streamlining SSH
You can specify multiple hosts and run the same commands across them
30. Streamlining SSH
Connections will be lazily created and pooled
31. Streamlining SSH
Failures ($STATUS) will be handled just like in Make
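For comparison, the "by hand" approach can also be written with the standard library's subprocess module, which exposes the exit status that Fabric checks for you. This is only a minimal sketch, not Fabric's implementation: ssh_argv and run_over_ssh are hypothetical helper names, and it assumes an ssh client on the local PATH and key-based access to the host.

```python
import subprocess


def ssh_argv(host, command):
    """Build the argument list that runs `command` on `host` through ssh."""
    # A list (rather than one shell string) avoids local shell quoting issues;
    # the remote command travels as a single argument.
    return ["ssh", host, command]


def run_over_ssh(host, command):
    """Run the command remotely and return (exit_status, output)."""
    result = subprocess.run(ssh_argv(host, command), capture_output=True, text=True)
    return result.returncode, result.stdout


# Hypothetical host name, as on the slides.
argv = ssh_argv("myserver", "cat /proc/version")
```

A non-zero exit status from run_over_ssh is what a Make-like tool would treat as a failure.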
33. Example: Installing packages
sudo("aptitude install nginx")
It's easy to take action depending on the result
if run("dpkg -s %s | grep 'Status:' ; true" % package).find("installed") == -1:
    sudo("aptitude install '%s'" % (package))
34. Example: Installing packages
Note that we add true so that the run() always succeeds*
* there are other ways...
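The substring test on "installed" also matches dpkg states such as not-installed or half-installed. A stricter check could look at the last word of the Status line instead. This is a sketch only: is_installed is a hypothetical helper operating on text you would obtain from run("dpkg -s ..."), and the sample strings are made up for illustration.

```python
def is_installed(dpkg_status_output):
    """Return True if `dpkg -s` output reports the package as fully installed."""
    for line in dpkg_status_output.splitlines():
        if line.startswith("Status:"):
            # The Status line holds three words: wanted action, error flag, current state.
            fields = line[len("Status:"):].split()
            return bool(fields) and fields[-1] == "installed"
    # No Status line at all: dpkg does not know the package.
    return False


installed = is_installed("Package: nginx\nStatus: install ok installed\n")
removed = is_installed("Package: nginx\nStatus: deinstall ok config-files\n")
```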
35. Example: retrieving system status
disk_usage = run("df -kP")
mem_usage = run("cat /proc/meminfo")
cpu_usage = run("cat /proc/stat")
print disk_usage, mem_usage, cpu_usage
36. Example: retrieving system status
Very useful for getting live information from many different servers
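The strings returned by run() still have to be parsed before they can be compared across servers. As an illustration, here is one way the df -kP output might be turned into a dictionary. parse_df is a hypothetical helper and SAMPLE is made-up output in the POSIX df format, not data from a real machine.

```python
def parse_df(output):
    """Parse `df -kP` output into {mount_point: fraction_used}."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # ignore rows without a percentage in the Capacity column
        usage[fields[5]] = int(fields[4].rstrip("%")) / 100.0
    return usage


SAMPLE = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 20642428 10321214 9272638 53% /
tmpfs 1024000 0 1024000 0% /dev/shm
"""

disk_usage = parse_df(SAMPLE)
```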
37. Fabfile.py
from fabric.api import *
from mysetup import *
env.hosts = ["server1.myapp.com"]
def setup():
    install_packages("...")
    update_configuration()
    create_users()
    start_daemons()
$ fab setup
38. Fabfile.py
Just like Make, you write rules that do something
39. Fabfile.py
...and you can specify on which servers the rules will run
41. Roles
env.roledefs = {
    'web': ['www1', 'www2', 'www3'],
    'dns': ['ns1', 'ns2']
}
$ fab -R web setup
42. Roles
Will run the setup rule only on hosts that are members of the web role.
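Conceptually, selecting a role just expands it into its list of hosts before the rule runs. The sketch below illustrates that expansion in plain Python. hosts_for_roles is a hypothetical helper written for this example and is not part of Fabric's API.

```python
def hosts_for_roles(roledefs, roles):
    """Expand role names into an ordered host list without duplicates."""
    hosts = []
    for role in roles:
        for host in roledefs[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


roledefs = {
    'web': ['www1', 'www2', 'www3'],
    'dns': ['ns1', 'ns2'],
}

web_hosts = hosts_for_roles(roledefs, ['web'])
```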
43. Some facts about Fabric
Fabric 1.0 just released!
On March 4th, 2011
3 years of development
First commit 1161 days ago (as of March 10th, 2011)
Related Projects
Opscode's Chef, and Puppet
44. What's good about Fabric?
Low-level
Basically an ssh() command that returns the result
Simple primitives
run(), sudo(), get(), put(), local(), prompt(), reboot()
No magic
No DSL, no abstraction, just a remote command API
45. What could be improved ?
Ease common admin tasks
User, group creation. Files, directory operations.
Abstract primitives
Like install package, so that it works with different OS
Templates
To make creating/updating configuration files easy
48. What is Opscode's Chef?
http://wiki.opscode.com/display/chef/Home
Recipes
Scripts/packages to install and configure services and
applications
API
A DSL-like Ruby API to interact with the OS (create
users, groups, install packages, etc)
Architecture
Client-server or “solo” mode to push and deploy your
new configurations
49. What I liked about Chef
Flexible
You can use the API or shell commands
Structured
Helped me have a clear decomposition of the services
installed per machine
Community
Lots of recipes already available from
http://cookbooks.opscode.com/
50. What I didn't like
Too many files and directories
Code is spread out, hard to get the big picture
Abstraction overload
API not very well documented, frequent fallbacks to
plain shell scripts within the recipe
No “smart” recipe
Recipes are applied all the time, even when it's not
necessary
51. The question that kept coming...
Django recipe: 5 files, 2 directories
What it does, in essence:
sudo aptitude install apache2 python python-django
52. The question that kept coming...
Is this really necessary for what I want to do?
53. What I loved about Fabric
Bare metal
ssh() function, simple and elegant set of primitives
No magic
No abstraction, no model, no compilation
Two-way communication
Easy to change the rule's behaviour according to the
output (ex: do not install something that's already
installed)
56. What I needed
File I/O | User/Group Management
Fabric
57. What I needed
File I/O | User/Group Management | Package Management
Fabric
58. What I needed
Text processing & Templates
File I/O | User/Group Management | Package Management
Fabric
59. How I wanted it
Simple “flat” API
[object]_[operation], where operation is one of “create”,
“read”, “update”, “write”, “remove”, “ensure”, etc.
Driven by need
Only implement a feature if I have a real need for it
No magic
Everything is implemented using sh-compatible commands
No unnecessary structure
Everything fits in one file, no imposed file layout
60. Cuisine: Example fabfile.py
from cuisine import *
env.hosts = ["server1.myapp.com"]
def setup():
    package_ensure("python", "apache2", "python-django")
    user_ensure("admin", uid=2000)
    upstart_ensure("django")
$ fab setup
61. Cuisine: Example fabfile.py
Fabric's core functions are already imported
62. Cuisine: Example fabfile.py
package_ensure(), user_ensure() and upstart_ensure() are Cuisine's API calls
64. Cuisine : File I/O
● file_exists  does the remote file exist?
● file_read  reads the remote file
● file_write  writes data to the remote file
● file_append  appends data to the remote file
● file_attribs  chmod & chown
● file_remove
65. Cuisine : File I/O
Supports owner/group and mode change
● file_exists  does the remote file exist?
● file_read  reads the remote file
● file_write  writes data to the remote file
● file_append  appends data to the remote file
● file_attribs  chmod & chown
● file_remove
66. Cuisine : File I/O (directories)
● dir_exists  does the remote directory exist?
● dir_ensure  ensures that a directory exists
● dir_attribs  chmod & chown
● dir_remove
67. Cuisine : File I/O +
● file_update(location, updater=lambda _:_)
package_ensure("mongodb-snapshot")
def update_configuration( text ):
    res = []
    for line in text.split("\n"):
        if line.strip().startswith("dbpath="):
            res.append("dbpath=/data/mongodb")
        elif line.strip().startswith("logpath="):
            res.append("logpath=/data/logs/mongodb.log")
        else:
            res.append(line)
    return "\n".join(res)
file_update("/etc/mongodb.conf", update_configuration)
68. Cuisine : File I/O +
This replaces the values for configuration entries dbpath and logpath
69. Cuisine : File I/O +
The remote file will only be changed if the content is different
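The "only changed if the content is different" behaviour can be illustrated locally, without any remote file. The sketch below assumes the updater is a pure text-to-text function, as on the slide. apply_update is a hypothetical helper, not Cuisine's file_update, and the sample configuration is made up.

```python
def update_configuration(text):
    """Rewrite the dbpath and logpath entries, leaving other lines untouched."""
    res = []
    for line in text.split("\n"):
        if line.strip().startswith("dbpath="):
            res.append("dbpath=/data/mongodb")
        elif line.strip().startswith("logpath="):
            res.append("logpath=/data/logs/mongodb.log")
        else:
            res.append(line)
    return "\n".join(res)


def apply_update(text, updater):
    """Return (new_text, changed); a caller would write the file only if changed."""
    new_text = updater(text)
    return new_text, new_text != text


original = "dbpath=/var/lib/mongodb\nlogpath=/var/log/mongodb.log\nport=27017"
updated, changed = apply_update(original, update_configuration)
# Applying the same updater again finds nothing left to change.
_, changed_again = apply_update(updated, update_configuration)
```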
71. Cuisine: User Management
● user_exists  does the user exist?
● user_create  creates the user
● user_ensure  creates the user if it doesn't exist
72. Cuisine: Group Management
● group_exists  does the group exist?
● group_create  creates the group
● group_ensure  creates the group if it doesn't exist
● group_user_exists  does the user belong to the group?
● group_user_add  adds the user to the group
● group_user_ensure
74. Cuisine: Package Management
● package_exists  is the package available?
● package_installed  is it installed?
● package_install  install the package
● package_ensure  ... only if it's not installed
● package_upgrade  upgrades the/all package(s)
76. Cuisine: Text transformation
text_ensure_line(text, lines)
file_update(
    "/home/user/.profile",
    lambda _:text_ensure_line(_,
        "PYTHONPATH=/opt/lib/python:${PYTHONPATH};"
        "export PYTHONPATH"
))
77. Cuisine: Text transformation
Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
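The "append only if missing" idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Cuisine's actual text_ensure_line: ensure_lines is a hypothetical name, and lines are compared after stripping surrounding whitespace.

```python
def ensure_lines(text, *lines):
    """Append each given line to text unless an identical (stripped) line exists."""
    result = text.split("\n") if text else []
    present = set(line.strip() for line in result)
    for line in lines:
        if line.strip() not in present:
            result.append(line)
            present.add(line.strip())
    return "\n".join(result)


profile = "set -o vi\nexport EDITOR=vim"
once = ensure_lines(profile, "export PYTHONPATH")
# A second call leaves the text untouched because the line is already there.
twice = ensure_lines(once, "export PYTHONPATH")
```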
78. Cuisine: Text transformation
text_replace_line(text, old, new, find=.., process=...)
configuration = local_read("server.conf")
for key, value in variables.items():
    configuration, replaced = text_replace_line(
        configuration,
        key + "=",
        key + "=" + repr(value),
        process=lambda text:text.split("=")[0].strip()
    )
79. Cuisine: Text transformation
Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.
80. Cuisine: Text transformation
The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
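To make the role of process concrete, here is a small stand-alone sketch of line replacement with a normalising function. It is an assumption-laden illustration rather than Cuisine's implementation: replace_line is a hypothetical name, process is applied to both the searched value and each line, and the function returns the new text together with the number of replaced lines.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line whose processed form equals the processed `old`."""
    target = process(old)
    result = []
    replaced = 0
    for line in text.split("\n"):
        if process(line) == target:
            result.append(new)
            replaced += 1
        else:
            result.append(line)
    return "\n".join(result), replaced


# Same normalisation as on the slide: keep only the key, without spaces.
key_only = lambda text: text.split("=")[0].strip()

configuration = "HOST = localhost\nPORT = 80"
configuration, replaced = replace_line(configuration, "PORT=", "PORT=8080", process=key_only)
```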
82. Cuisine: Text transformation
text_strip_margin(text)
Everything after the | separator will be output as content. It makes it easy to embed text templates within functions.
file_write(".profile", text_strip_margin(
    """
    |export PATH="$HOME/bin":$PATH
    |set -o vi
    """
))
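A margin-stripping helper of this kind is short enough to sketch. The version below is an illustration only and may differ from Cuisine's text_strip_margin: strip_margin is a hypothetical name, and lines without the separator are simply dropped.

```python
def strip_margin(text, margin="|"):
    """Keep, for each line containing the margin, only what follows it."""
    result = []
    for line in text.split("\n"):
        _, found, content = line.partition(margin)
        if found:
            result.append(content)
    return "\n".join(result)


profile = strip_margin(
    """
    |export PATH="$HOME/bin":$PATH
    |set -o vi
    """
)
```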
83. Cuisine: Text transformation
text_template(text, variables)
text_template(text_strip_margin(
    """
    |cd ${DAEMON_PATH}
    |exec ${DAEMON_EXEC_PATH}
    """
), dict(
    DAEMON_PATH="/opt/mongodb",
    DAEMON_EXEC_PATH="/opt/mongodb/mongod"
))
84. Cuisine: Text transformation
This is a simple wrapper around Python's (safe) string.Template class
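The standard library class in question can be tried directly. The sketch below uses string.Template.safe_substitute, which leaves unknown placeholders in place instead of raising. render is a hypothetical wrapper name, and whether Cuisine's text_template uses exactly this call is an assumption based on the "(safe)" remark.

```python
from string import Template


def render(text, variables):
    """Substitute ${NAME} placeholders, leaving unknown ones untouched."""
    return Template(text).safe_substitute(variables)


script = render(
    "cd ${DAEMON_PATH}\nexec ${DAEMON_EXEC_PATH}",
    dict(
        DAEMON_PATH="/opt/mongodb",
        DAEMON_EXEC_PATH="/opt/mongodb/mongod",
    ),
)
```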
85. Cuisine: Goodies
● ssh_keygen  generates DSA keys
● ssh_authorize  authorizes your key on the remote server
● mode_sudo  run() always uses sudo
● upstart_ensure  ensures the given daemon is running
& more!
87. Cuisine Tips: Structuring your rules
BOOTSTRAP
You just received your new VPS, and you want to set it up so that you have a base system that you can access without typing a password
89. Cuisine Tips: Structuring your rules
BOOTSTRAP SETUP
You install your users, groups, preferred packages and configuration. You also install your applications.
91. Cuisine Tips: Structuring your rules
BOOTSTRAP SETUP UPDATE
You want to deploy the new version of the application you just built
92. Cuisine Tips: Structuring your rules
BOOTSTRAP SETUP UPDATE
def bootstrap():
    # Secure SSH, create admin user
    # Authorize SSH public keys
    # Remove unwanted packages
93. Cuisine Tips: Structuring your rules
BOOTSTRAP SETUP UPDATE
def setup():
    # Create directories (ex: /opt/data, /opt/services, etc)
    # Create user/groups (ex: apps, services, etc)
    # Install base tools (ex: screen, fail2ban, zsh, etc)
    # Edit configuration (ex: profile, inputrc, etc)
    # Install and run your application
94. Cuisine Tips: Structuring your rules
BOOTSTRAP SETUP UPDATE
def update():
    # Download your application update
    # Freeze/stop the running application
    # Install the update
    # Reload/restart your application
    # Test that everything is OK
95. Why use Cuisine ?
● Simple API for remote-server manipulation
Files, users, groups, packages
● Shell commands for specific tasks only
Avoid problems with your shell commands by only using run() for very specific tasks
● Cuisine tasks are not stupid
*_ensure() commands won't do anything if it's not necessary
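The *_ensure() convention boils down to "check first, act only if needed". The sketch below shows that pattern against an in-memory set standing in for the remote system. ensure and the users set are hypothetical and only illustrate the idea; they are not Cuisine code.

```python
def ensure(exists, create, name):
    """Create `name` only if it is missing; return True when something was done."""
    if exists(name):
        return False
    create(name)
    return True


# In-memory stand-in for the users present on a server.
users = {"root"}

created = ensure(users.__contains__, users.add, "admin")
# Running the same rule again is a no-op.
created_again = ensure(users.__contains__, users.add, "admin")
```

Because the second call does nothing, a setup rule built from such calls can be re-run safely.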
96. Limitations
● Limited to sh-shells
Operations will not work under csh
● Only written/tested for Ubuntu Linux
Contributors could easily port commands
97. Get started !
On Github:
http://github.com/sebastien/cuisine
1 short Python file
Documented API
98. Part 3
Watchdog
Server and services monitoring
101. The problem
Archive files
Rotate logs
Purge cache
102. The problem
HTTP server has high latency
103. The problem
Restart HTTP server
104. The problem
System load is too high
105. The problem
re-nice important processes
106. We want to be notified
when problems occur
107. We want automatic actions to be taken
whenever possible
108. (Some of the) existing solutions
Monit, God, Supervisord, Upstart
Focus on starting/restarting daemons and
services
Munin, Cacti
Focus on visualization of RRDTool data
Collectd
Focus on collecting and publishing data
109. The ideal tool
Wide spectrum
Data collection, service monitoring, actions
Easy setup and deployment
No complex installation or configuration
Flexible server architecture
Can monitor local or remote processes
Customizable and extensible
From restarting daemons to monitoring whole
servers
112. Hello, Watchdog!
A service is a collection of RULES
SERVICE
RULE
113. Hello, Watchdog!
SERVICE
RULE: HTTP Request; CPU, Disk, Mem %; Process status; I/O Bandwidth
114. Hello, Watchdog!
Each rule retrieves data and processes it. Rules can SUCCEED or FAIL.
115. Hello, Watchdog!
SERVICE
RULE: HTTP Request; CPU, Disk, Mem %; Process status; I/O Bandwidth
ACTION
116. Hello, Watchdog!
SERVICE
RULE: HTTP Request; CPU, Disk, Mem %; Process status; I/O Bandwidth
ACTION: Logging; XMPP, Email notifications; Start/stop process; …
117. Hello, Watchdog!
Actions are bound to rules, triggered on rule SUCCESS or FAILURE
119. Execution Model
SERVICE DEFINITION
RULE
MONITOR
(frequency in ms)
120. Execution Model
Services are registered in the monitor
121. Execution Model
Rules defined in the service are executed every N ms (frequency)
122. Execution Model
SERVICE DEFINITION
RULE
MONITOR
(frequency in ms)
SUCCESS FAILURE
ACTION ACTION
ACTION
123. Execution Model
If the rule SUCCEEDS, actions will be sequentially executed
124. Execution Model
If the rule FAILS, failure actions will be sequentially executed
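The execution model can be summarised in a few lines of plain Python: run the rule, then run either the success or the failure actions in order. This is a sketch of the idea only, not Watchdog's code. execute_rule is a hypothetical name, and it assumes a rule signals failure by raising an exception.

```python
def execute_rule(rule, on_success=(), on_failure=()):
    """Run one rule, then run the matching actions sequentially."""
    try:
        result = rule()
        succeeded = True
    except Exception as error:
        result = error
        succeeded = False
    # Actions receive the rule's result (or the error) one after the other.
    for action in (on_success if succeeded else on_failure):
        action(result)
    return succeeded


log = []


def failing_rule():
    raise RuntimeError("timed out")


ok = execute_rule(lambda: 42, on_success=[log.append, lambda r: log.append(r + 1)])
failed = execute_rule(failing_rule, on_failure=[lambda e: log.append(str(e))])
```

A monitor would call something like this for every registered rule at its configured frequency.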
125. Monitoring a remote machine
#!/usr/bin/env python
from watchdog import *
Monitor(
    Service(
        name = "google-search-latency",
        monitor = (
            HTTP(
                GET="http://www.google.ca/search?q=watchdog",
                freq=Time.s(1),
                timeout=Time.ms(80),
                fail=[
                    Print("Google search query took more than 50ms")
                ]
            )
        )
    )
).run()
126. Monitoring a remote machine
A monitor is like the “main” for Watchdog. It actively monitors services.
127. Monitoring a remote machine
Don't forget to call run() on it
128. Monitoring a remote machine
The service monitors the rules
129. Monitoring a remote machine
The HTTP rule allows testing a URL
And we display a message in case of failure
130. Monitoring a remote machine
If there is a 4XX response or it times out, the rule will fail and display an error message
131. Monitoring a remote machine
$ python example-service-monitoring.py
2011-02-27T22:33:18 watchdog --- #0 (runners=1,threads=2,duration=0.57s)
2011-02-27T22:33:18 watchdog [!] Failure on HTTP(GET="www.google.ca:80/search?q=watchdog",timeout=0.08) : Socket error: timed out
Google search query took more than 50ms
2011-02-27T22:33:19 watchdog --- #1 (runners=1,threads=2,duration=0.73s)
2011-02-27T22:33:20 watchdog --- #2 (runners=1,threads=2,duration=0.54s)
2011-02-27T22:33:21 watchdog --- #3 (runners=1,threads=2,duration=0.69s)
2011-02-27T22:33:22 watchdog --- #4 (runners=1,threads=2,duration=0.77s)
2011-02-27T22:33:23 watchdog --- #5 (runners=1,threads=2,duration=0.70s)
133. Sending Email Notification
send_email = Email(
    "notifications@ffctn.com",
    "[Watchdog]Google Search Latency Error", "Latency was over 80ms",
    "smtp.gmail.com", "myusername", "mypassword"
)
[…]
HTTP(
    GET="http://www.google.ca/search?q=watchdog",
    freq=Time.s(1),
    timeout=Time.ms(80),
    fail=[
        send_email
    ]
)
The Email rule will send an email to notifications@ffctn.com when triggered
134. Sending Email Notification
This is how we bind the action to the rule failure
135. Sending Email+Jabber Notification
send_xmpp = XMPP(
    "notifications@jabber.org",
    "Watchdog: Google search latency over 80ms",
    "myuser@jabber.org", "mypassword"
)
[…]
HTTP(
    GET="http://www.google.ca/search?q=watchdog",
    freq=Time.s(1),
    timeout=Time.ms(80),
    fail=[
        send_email, send_xmpp
    ]
)
137. Monitoring incident: when something fails repeatedly during a given period of time
You don't want to be notified all the time, only when it really matters.
139. Detecting incidents
An incident is a “smart” action: it will only do something when the condition is met
HTTP(
    GET="http://www.google.ca/search?q=watchdog",
    freq=Time.s(1),
    timeout=Time.ms(80),
    fail=[
        Incident(
            errors = 5,
            during = Time.s(10),
            actions = [send_email,send_xmpp]
        )
    ]
)
140. Detecting incidents
When at least 5 errors...
141. Detecting incidents
...happen over a 10-second period
142. Detecting incidents
The Incident action will trigger the given actions
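The "N errors within a period" condition is a sliding-window count. The sketch below shows one way to express it with explicit timestamps. IncidentWindow is a hypothetical class written for illustration; how Watchdog's Incident tracks or resets its errors internally is not shown on the slides.

```python
from collections import deque


class IncidentWindow:
    """Report an incident once `errors` failures fall within `during` seconds."""

    def __init__(self, errors, during):
        self.errors = errors
        self.during = during
        self.timestamps = deque()

    def record_failure(self, now):
        """Register a failure at time `now`; return True if the threshold is met."""
        self.timestamps.append(now)
        # Forget failures that are older than the observation period.
        while self.timestamps and now - self.timestamps[0] > self.during:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.errors


window = IncidentWindow(errors=5, during=10.0)
# Five failures one second apart: only the fifth one crosses the threshold.
burst = [window.record_failure(t) for t in (0, 1, 2, 3, 4)]

sparse_window = IncidentWindow(errors=5, during=10.0)
# Failures twenty seconds apart never accumulate inside the window.
sparse = [sparse_window.record_failure(t) for t in (0, 20, 40, 60, 80)]
```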
143. Example: Ensuring a service is running
from watchdog import *
Monitor(
    Service(
        name="myservice-ensure-up",
        monitor=(
            HTTP(
                GET="http://localhost:8000/",
                freq=Time.ms(500),
                fail=[
                    Incident(
                        errors=5,
                        during=Time.s(5),
                        actions=[
                            Restart("myservice-start.py")
                        ]
                    )
                ]
            )
        )
    )
).run()
144. Example: Ensuring a service is running
We test if we can GET http://localhost:8000 within 500ms
145. Example: Ensuring a service is running
If we can't reach it during 5 seconds
146. Example: Ensuring a service is running
We kill and restart myservice-start.py
149. Monitoring system health
SystemInfo will retrieve system information and return it as a dictionary
from watchdog import *
Monitor (
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
150. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    # We log each result by extracting the given value from the
                    # result dictionary (memoryUsage, diskUsage, cpuUsage)
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
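The `extract` callbacks pick one number out of the dictionary returned by `SystemInfo`. The slide uses `reduce(max, ...)`, which is a builtin in Python 2; in Python 3 `reduce` lives in `functools`, and calling `max(...)` on the values gives the same result. Here is a small sketch with a made-up result dictionary. The sample values and the mount-point keys are assumptions; only `memoryUsage`, `diskUsage` and `cpuUsage` come from the slide.

```python
from functools import reduce  # builtin in Python 2, in functools in Python 3

# Hypothetical sample of what a SystemInfo result might look like.
sample = {
    "memoryUsage": 0.42,
    "cpuUsage": 0.10,
    "diskUsage": {"/": 0.55, "/var": 0.80},
}

# Same shape as the slide's extractors: two arguments, the second one ignored.
extract_mem = lambda r, _: r["memoryUsage"]
extract_disk = lambda r, _: reduce(max, r["diskUsage"].values())

print(extract_mem(sample, None))
print(extract_disk(sample, None))
# max() alone yields the same value as reduce(max, ...).
print(max(sample["diskUsage"].values()))
```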
151. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                # Bandwidth collects live traffic information for a network interface
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
152. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            # But we don't want the total amount, we just want the difference.
            # Delta does just that.
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
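`Bandwidth` reports cumulative totals, so the traffic per sample is the difference between two successive readings. Below is a minimal sketch of that idea, assuming the first reading yields nothing because there is no previous value to compare against. The `DeltaTracker` class is a hypothetical stand-in, not Watchdog's `Delta`.

```python
class DeltaTracker:
    """Turn a cumulative counter into per-sample differences."""

    def __init__(self):
        self.previous = None

    def update(self, value):
        """Return the change since the last reading, or None for the first one."""
        delta = None if self.previous is None else value - self.previous
        self.previous = value
        return delta


tracker = DeltaTracker()
# Cumulative bytes seen on an interface, sampled once per second (made-up numbers).
readings = [1000000, 1500000, 1500000, 4000000]
# Convert each reading to megabytes first, like the slide's /1000.0/1000.0.
deltas = [tracker.update(r / 1000.0 / 1000.0) for r in readings]
print(deltas)
```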
153. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
                # We print the result as before
                success = [LogResult("myserver.system.eth0.sent=")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
154. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent=")]
            ),
            # SystemHealth will fail whenever the usage is above the given thresholds
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
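The thresholds read as usage ratios, so `cpu=0.90` presumably means the check fails once CPU usage goes above 90%. Here is a sketch of such a comparison over a plain dictionary. The `exceeded` helper and the sample readings are assumptions for illustration, not part of Watchdog.

```python
def exceeded(usage, thresholds):
    """Return the sorted names of the metrics whose usage is above their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if usage.get(name, 0.0) > limit)


thresholds = {"cpu": 0.90, "disk": 0.90, "mem": 0.90}
usage = {"cpu": 0.35, "disk": 0.95, "mem": 0.91}  # made-up readings
print(exceeded(usage, thresholds))
```

A non-empty result would correspond to the `fail` branch, which in the slide writes to `watchdog-system-failures.log`.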
155. Monitoring system health
from watchdog import *
Monitor(
    Service(
        name = "system-health",
        monitor = (
            SystemInfo(freq=Time.s(1),
                success = (
                    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
                    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
                    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
                )
            ),
            Delta(
                Bandwidth("eth0", freq=Time.s(1)),
                extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
                success = [LogResult("myserver.system.eth0.sent=")]
            ),
            SystemHealth(
                cpu=0.90, disk=0.90, mem=0.90,
                freq=Time.s(60),
                # We'll log failures in a log file
                fail=[Log(path="watchdog-system-failures.log")]
            ),
        )
    )
).run()
157. Watchdog: Decentralized architecture
APP SERVER, STATIC FILE SERVER, DB SERVER
W
Ensures the App is running (pid & HTTP test)
158. Watchdog: Decentralized architecture
APP SERVER, STATIC FILE SERVER, DB SERVER
W W
Ensures the static file server is running and has low latency
159. Watchdog: Decentralized architecture
APP SERVER, STATIC FILE SERVER, DB SERVER
W W W
Ensures the DB is running and that queries are not too slow.
162. Watchdog: Centralized Architecture
APP SERVER, STATIC FILE SERVER, DB SERVER
PLATFORM SERVER
W
Does high-level (HTTP, SQL) queries on the servers and executes actions remotely when problems are detected
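In the centralized layout a single watchdog polls every server and decides which action to run remotely. Below is a rough sketch of that loop, with the probe and the remote action passed in as plain functions. All names and hosts here are hypothetical, and nothing below is Watchdog API.

```python
def poll_servers(servers, probe, on_failure):
    """Probe each server once; call `on_failure(name)` for every one that fails.

    Returns the list of server names that failed.
    """
    failed = []
    for name, url in servers.items():
        if not probe(url):
            failed.append(name)
            on_failure(name)
    return failed


# Hypothetical inventory and stand-in probe: only the DB is "down".
servers = {
    "app": "http://app.example/",
    "static": "http://static.example/",
    "db": "http://db.example/",
}
restarted = []
failed = poll_servers(
    servers,
    probe=lambda url: "db" not in url,
    on_failure=restarted.append,
)
print(failed, restarted)
```

Passing the probe and the action as arguments keeps the polling logic independent of how the servers are actually reached.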
164. Watchdog: Deploying on Ubuntu
# upstart - Watchdog Configuration File
# =====================================
# updated: 2011-02-28
description "Watchdog - service monitoring daemon"
author "Sebastien Pierre <sebastien@ffctn.com>"
start on (net-device-up and local-filesystems)
stop on runlevel [016]
respawn
script
# NOTE: Change this to wherever the watchdog is installed
WATCHDOG_HOME=/opt/services/watchdog
cd $WATCHDOG_HOME
# NOTE: Change this to wherever your custom watchdog script is installed
python watchdog.py
end script
console output
# EOF
165. Watchdog: Deploying on Ubuntu
# upstart - Watchdog Configuration File
# =====================================
# updated: 2011-02-28
description "Watchdog - service monitoring daemon"
author "Sebastien Pierre <sebastien@ffctn.com>"
# Save this file as /etc/init/watchdog.conf
start on (net-device-up and local-filesystems)
stop on runlevel [016]
respawn
script
# NOTE: Change this to wherever the watchdog is installed
WATCHDOG_HOME=/opt/services/watchdog
cd $WATCHDOG_HOME
# NOTE: Change this to wherever your custom watchdog script is installed
python watchdog.py
end script
console output
# EOF
166. Watchdog: Overview
Monitoring DSL
Declarative programming to define monitoring strategy
Wide spectrum
From data collection to incident detection
Flexible
Does not impose a specific architecture
167. Watchdog: Use cases
Ensure service availability
Test and stop/restart when problems occur
Collect system statistics
Log or send data through the network
Alert on system or service health
Take action when system stats are above a threshold
168. Watchdog: What's coming?
ZeroMQ channels
Data streaming and inter-watchdog comm.
Documentation
Only the basics, need more love!
Contributors?
Codebase is small and clear, start hacking!
169. Get started!
On Github:
http://github.com/sebastien/watchdog
1 Python file
Documented API
170. Thank you!
www.ffctn.com
sebastien@ffctn.com
github.com/sebastien