Monitoring tools record the result of what happened to your web application when a problem arises, but for some classes of problems, monitoring systems are only a starting point. Sometimes it is necessary to take more intrusive steps to plan for the unexpected by embedding mechanisms that will allow you to interact with a live deployed web application and extract even more detailed information.
3. What is debugging?
Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program or a piece of electronic hardware, thus making it behave as expected.
http://en.wikipedia.org/wiki/Debugging
5. Things we want to avoid.
• Crashing the whole web site.
• Corrupting all your customer data.
• Making your customer data visible to everyone.
• Losing your company lots of money.
• Losing your own job because you did something stupid.
• Causing all your workmates to lose their jobs as well.
• Getting what you did posted on Slashdot.
6. Managing risk.
• Use software to restrict what you can do.
• Script changes and procedures to avoid errors.
• Test what you are going to do on a separate system.
• Develop and document contingency plans.
7. Passive monitoring.
• Collection of log file information.
• Collection of details about Python exceptions.
• Collection of performance data for the server host.
• Collection of performance data for the web server.
• Collection of performance data for the web application.
9. Recording Python exceptions.
• Open Source
• Sentry (http://pypi.python.org/pypi/sentry) - Also as paid service.
• Commercial Services
• New Relic (http://newrelic.com) - Pro feature.
10. Server monitoring.
• Open Source
• Monit (http://mmonit.com)
• Munin (http://munin-monitoring.org)
• Cacti (http://www.cacti.net)
• Nagios (http://www.nagios.org)
• Commercial Services
• New Relic (http://newrelic.com) - Free feature.
30. Introducing ispyd.
• Download site.
• https://github.com/GrahamDumpleton/wsgi-shell
• Aims of the package.
• Provide a generic framework for implementing an interactive console.
• The commands you can run are targeted at a specific purpose.
• Plugin based, so you can control what is available and extend it.
• Remotely accessible and execution of commands scriptable.
32. Executing commands.
(ispyd:ll345) shell process
(process:ll345) help
Documented commands (type help <topic>):
========================================
cwd egid euid exit gid help pid prompt uid
(process:ll345) cwd
/Users/graham
33. Power users.
(ispyd:ll345) shell python
(python:ll345) console
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license"
for more information.
(EmbeddedConsole)
>>> import os
>>> os.getcwd()
'/Users/graham'
>>> exit()
42. Ideas for third party plugins.
• Memory.
• Process memory usage.
• Statistics on objects in use (heapy).
• State of the garbage collector.
• Profiling.
• Initiate sampled profiling for selected functions.
• Django.
• Current configuration.
• Details of loaded applications.
• Details of registered middleware.
• Details of template libraries.
• Testing URLs against URL resolver.
• Statistics on cache usage.
43. What am I trying to say?
• Use monitoring so you know when problems arise.
• One tool alone is not going to provide everything.
• Use complementary tools to get a full picture.
• Build in mechanisms that allow deeper debugging.
• Treat debugging like any other defined process.
44. New Relic
30 Day Free Pro Trial
http://newrelic.com/30
Graham.Dumpleton@gmail.com
@GrahamDumpleton
Editor's notes
So you have written what you believe is the most amazing web site in the world and deployed it to production. Real customers are using it, it is making money for you, but something is going wrong with it. You don't quite know what, and because it is a real live production web site, you can't necessarily just go in and start playing with it. What are you going to do? How are you going to debug the problems?
For some types of problems where you get a nice Python exception traceback the cause may be obvious, but the cause of other things such as data corruption, memory leaks, thread locking issues and general performance problems can be more elusive. Trying to duplicate issues in a development system may sometimes work, but more often than not things only show up once code is deployed to production.
As developers we would love to be able to just dive in and start poking around in the live web application, but operations staff aren't going to like that one bit. If we are going to try and do things with a live web application, it has to be things that aren't going to make things worse. The results of the things we do need to be predictable, with the effect of doing them able to be validated in advance.
Whatever we do, it is all about managing risk. We don't want a loose cannon that is going to cause more damage than good. There is no reason though why we can't do things which do have some level of risk. We just need to be controlled in what we do and make sure we understand the consequences. If making changes, script the actions you are going to take, test them beforehand and develop contingency plans to cope with when things do go pear-shaped.
The most benign thing you can do is passive monitoring. That is where you set up mechanisms in advance to collect data on a continual basis. In the event of a problem, you at least then have some forensic information to try and analyse what went wrong. Monitoring can take many forms. This can include collecting log files, details of application exceptions or quite specific performance data.
In the case of log files, they can come from many sources including the operating system, your web server, your web application, backend application services and databases. These can be spread all over the place. To make sense of them and make it easier to find and correlate information, various free and commercial products exist to help. These tools in simple terms are search engines for log information.
Log file analysis can only work though if an application actually logs something about an event. In web applications, exceptions often are translated to HTTP 500 errors and no details are logged. In this situation an extra step needs to be taken to configure the framework to record details of exceptions, or to add in additional tools which can intercept exceptions and report them back to a service for storage and later analysis.
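As a minimal sketch of the "intercept exceptions" idea, assuming a plain WSGI application, a small wrapper can log the traceback before the server turns the failure into an HTTP 500. The class name and logger name here are made up for illustration and are not taken from any particular framework or service.

```python
import logging

logger = logging.getLogger("wsgi.errors.sketch")  # hypothetical logger name


class ExceptionLoggingMiddleware:
    """Wrap a WSGI application and log any exception it raises."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        try:
            return self.application(environ, start_response)
        except Exception:
            # logger.exception() records the message plus the full traceback.
            logger.exception(
                "Unhandled exception for %s %s",
                environ.get("REQUEST_METHOD", "?"),
                environ.get("PATH_INFO", "?"),
            )
            raise  # let the server still produce its own error response
```

Note that this only sees exceptions raised while the application callable itself runs. Errors raised later, while the server iterates over the response, would need the returned iterable to be wrapped as well.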
When we move up to server monitoring there is a range of open source choices. What these monitor can be quite extensive, but they can also be quite hard to set up and manage depending on the product. For many users the simplicity of a preconfigured solution can be just as beneficial, if not easier to deal with, than a highly configurable and highly complex solution. Your mileage may therefore vary depending on the product chosen.
If you want to dive deeper into what is going on inside of your Python web application, then New Relic is definitely your friend. In addition to providing server monitoring, New Relic provides real user monitoring and application performance monitoring. For your web application it gives a deeper level of introspection into where time is being spent within your application code, as well as including time spent calling out to external databases and web services.
So you can easily bring together a set of monitoring tools. The question then is of what value they are in debugging an issue, as opposed to telling you there is a problem in the first place. The big ticket item with web sites is performance. A high level view which looks across end user time, application time and that of back end services allows you to quickly drill down to where the problem may lie.
End user monitoring can help you realise that the actual issue is with the page content you are generating rather than the mechanism of generating it. From there you can use various web page performance analysis tools. Keep in mind though that these operate not from the perspective of your actual users but where the online service is located, or your own browser if using a browser plugin.
In the future, advances like the browser resource timing specification coming out of the World Wide Web Consortium could make such analysis more representative of what the real users are seeing, as it would then be technically possible to report such information direct from users' browsers, giving you a much larger data set to work from.
What now if the problem is in your application? If using New Relic you can start to drill down and look at the performance of individual request handlers, seeing their throughput and response times. You can also get a more detailed view of individual sample slow transactions.
The performance breakdown in a slow transaction summary gives you a high level overview of where time is being spent for that specific slow transaction. The summary doesn't necessarily provide you with any context, though, of where in your code the time consuming operation was made.
Some level of context can be obtained by drilling down and looking at the details of slow transaction traces, but it is limited to those functions which have been deemed of interest. It needs to be limited in this way to ensure that the overhead of monitoring does not impact the performance of your web application. To do full profiling is just going to be too big of an overhead and affect application performance.
Because instrumentation is targeted only to areas such as time spent in middleware, view handlers, template rendering and template blocks, eventually you get situations where there are blocks of time for which you lack sufficient detail. This is where a monitoring tool may need a bit more help, through you indicating what else is of interest in your specific application.
You have a few choices of how you can do this. The first is to make changes to your actual code base. You can apply function decorators to existing functions, or you can use context manager objects to time blocks of code within a function. Such changes are obviously intrusive though, which could be an issue. Plus it also doesn't help when you want to measure time spent in third party code.
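A rough sketch of this first option follows, using only the standard library. The names `timed`, `timed_block` and `timings` are invented for the example and are not the API of New Relic or any other agent; a real agent would report the measurements rather than append them to a list.

```python
import functools
import time
from contextlib import contextmanager

timings = []  # (name, seconds) pairs collected by the helpers below


@contextmanager
def timed_block(name):
    """Time the block of code inside a `with` statement."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((name, time.perf_counter() - start))


def timed(func):
    """Function decorator that times every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with timed_block(func.__name__):
            return func(*args, **kwargs)
    return wrapper


@timed
def build_report(rows):
    # The context manager gives finer detail within the decorated function.
    with timed_block("build_report:sum"):
        total = sum(rows)
    return total
```

Both helpers require editing the function being measured, which is exactly the intrusiveness the notes warn about.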
A second approach is to nominate functions of interest by way of a configuration file. This avoids you needing to change code and so can be used with any Python code no matter the origin. It would usually be limited to simple function tracing though.
A final option is monkey patching. Here you specify a function to be called when a specific module is imported. That function would then go in and monkey patch the code. Whichever approach is used, the problem here is that to get added visibility you need to make a change of some sort, then redeploy and restart your application before you will see the additional instrumented functions. It does not provide you a here and now way of delving down any further.
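To illustrate just the patching step, here is a minimal sketch which assumes the target is a plain module level function. For simplicity it imports the module eagerly rather than registering a hook that fires on import, and `trace_function` and `call_times` are hypothetical names.

```python
import functools
import importlib
import time

call_times = {}  # "module:function" -> list of call durations in seconds


def trace_function(module_name, function_name):
    """Replace module.function with a wrapper that records call durations."""
    module = importlib.import_module(module_name)
    original = getattr(module, function_name)
    key = "%s:%s" % (module_name, function_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            call_times.setdefault(key, []).append(time.perf_counter() - start)

    setattr(module, function_name, wrapper)
    return original  # keep a reference so the patch can be undone
```

For example, `trace_function("json", "dumps")` would time later calls made as `json.dumps(...)`. Code that already did `from json import dumps` holds a reference to the original function and would not be affected, which is one reason patching at import time is preferred.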
A partial solution is thread sampling. This is where, when required, you start up a profiling session, taking a periodic snapshot of what each thread is doing at a specific point in time, and from that produce a call tree showing what percentage of time code at a specific point was executing. Unfortunately right now, New Relic at least doesn't do this for Python, although we have been looking at doing it for a while.
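The core of a thread sampler can be sketched with `sys._current_frames()`, which CPython provides for exactly this kind of inspection. This simplified version only counts the innermost frame of each thread rather than building a full call tree, and the function name and defaults are made up for the example.

```python
import collections
import sys
import threading
import time


def sample_threads(duration=1.0, interval=0.01):
    """Count the (file, function, line) each other thread is executing."""
    counts = collections.Counter()
    sampler_id = threading.get_ident()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == sampler_id:
                continue  # don't sample the sampler itself
            code = frame.f_code
            counts[(code.co_filename, code.co_name, frame.f_lineno)] += 1
        time.sleep(interval)
    return counts
```

Dividing each count by the total number of samples gives the estimated share of time spent at that point. Walking `frame.f_back` for each sample is what would turn this into a call tree.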
Separate thread sampling tools do exist though. Dropbox recently announced 'plop' along with a pretty visualisation tool to try and make sense of the data. Another is 'statprof', which advertises itself as being able to trace down to line level. The premise behind sampling at least is that the overhead is lower than traditional full profiling such as that provided by Python's profile modules.
Ultimately, thread sampling is still an estimate and not as accurate as full profiling. A middle ground though is not to run profiling all the time, but to collect samples there as well. That is, don't profile the whole program; target specific functions and only collect a full profile sample for a call every so often. We could for instance have the criteria be that we collect samples a minimum of 1 second apart and write out the aggregated results after 30 successive calls.
This can be achieved using the 'cProfile' module, a decorator and a bit of context manager magic. Add in a gating mechanism to control how often it is done and we can achieve full profiling for a function of interest, but where it is done infrequently enough that the overhead need not necessarily be a factor in the context of the overall web application.
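One way this could look is sketched below. It is an illustration under stated assumptions rather than the code from the talk: it uses a decorator and a lock instead of a context manager, the names `sampled_profile` and `print_stats` are invented, and only one profiled call is allowed at a time across the process.

```python
import cProfile
import functools
import pstats
import sys
import threading
import time

# Only one cProfile session runs at a time, across all decorated functions.
_profiling = threading.Lock()


def print_stats(name, profiler):
    """Default reporter: write the top entries of the aggregate to stderr."""
    print("profile for %s" % name, file=sys.stderr)
    stats = pstats.Stats(profiler, stream=sys.stderr)
    stats.sort_stats("cumulative").print_stats(10)


def sampled_profile(min_interval=1.0, calls_per_dump=30, dump=print_stats):
    """Fully profile a call at most once every min_interval seconds."""
    def decorator(func):
        state = {"profiler": cProfile.Profile(), "last": None, "samples": 0}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            last = state["last"]
            due = last is None or now - last >= min_interval
            if not due or not _profiling.acquire(blocking=False):
                return func(*args, **kwargs)  # the common, unprofiled path
            try:
                state["last"] = now
                # Successive runcall() calls accumulate into one profile.
                return state["profiler"].runcall(func, *args, **kwargs)
            finally:
                state["samples"] += 1
                if state["samples"] >= calls_per_dump:
                    dump(func.__name__, state["profiler"])
                    state["profiler"] = cProfile.Profile()
                    state["samples"] = 0
                _profiling.release()
        return wrapper
    return decorator
```

With the defaults, a decorated view handler is profiled no more than once a second and the aggregated statistics are written out after every 30 profiled calls, matching the criteria described above. All other calls go straight through to the original function.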
New Relic is by no means the only way of instrumenting web applications to collect metrics, although it arguably gives you the most value out of the box with immediate actionable data. Whatever the solution used, at this level we have the same problem. You still need to manually modify your code to add new instrumentation to further explore a problem and then redeploy your web application. Getting more in depth useful data can therefore be a long process.
What is lacking is the ability to prod your live web application to get it to start yielding the additional data you need while the problem is occurring. Some tools give you this interactivity, but they are only suitable for development environments as they display data back into the browser the request is made from. Sentry provides separate analysis of tracebacks and stack variables after the fact, but we still don't have a way of changing the way the application is running.
Application backdoors to effect change are not new. The logging module in Python even supplies such a backdoor. Enable this and it will listen on a socket for connections and allow you to pass the application a new configuration for the logging subsystem. Dangers do exist with such mechanisms. The logging module actually runs eval() on parts of the configuration file, meaning that you can actually inject arbitrary code into your application.
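For reference, the backdoor in question is `logging.config.listen()`. The sketch below shows one way to reduce the risk, assuming Python 3.4 or later where `listen()` accepts a `verify` callable that can reject a payload before it is processed. The shared secret, the signing scheme and the helper names are all assumptions made for the example.

```python
import hashlib
import hmac
import logging.config

SECRET = b"change-me"  # hypothetical shared secret, known to trusted senders
DIGEST_SIZE = hashlib.sha256().digest_size


def sign(config_bytes):
    """What a trusted sender would transmit: HMAC digest, then the config."""
    digest = hmac.new(SECRET, config_bytes, hashlib.sha256).digest()
    return digest + config_bytes


def verify(payload):
    """Return the config if its signature checks out, else None to drop it."""
    digest, config_bytes = payload[:DIGEST_SIZE], payload[DIGEST_SIZE:]
    expected = hmac.new(SECRET, config_bytes, hashlib.sha256).digest()
    return config_bytes if hmac.compare_digest(digest, expected) else None


def start_logging_listener(port=logging.config.DEFAULT_LOGGING_CONFIG_PORT):
    """Start the listener thread; it binds to localhost only."""
    listener = logging.config.listen(port, verify=verify)
    listener.daemon = True
    listener.start()
    return listener  # stop it later with logging.config.stopListening()
```

Without a `verify` callable, any local process able to connect to the port can send a configuration, which is where the eval() concern bites hardest.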
If you are not concerned about execution of arbitrary code, you could instead elect to expose a full embedded Python interpreter prompt. Go a step further again and you have the rather scary concept of pyrasite, which uses gdb to perform code injection into an arbitrary unmodified Python process. We want something that allows realtime interaction, but we also want that access to be more controlled than a full on interpreter or debugger.
Providing a means for interactive access to running processes is something I have toyed with in trying to help people debug WSGI applications. Following on from PyCon US this year I finally sat down and created a package incorporating some of the ideas I had played with and had code lying around for. Initially it was intended as a shell for WSGI applications, but it can be used in any long running service. Eventually the package was called ispyd.
Depending on your application architecture, the process would listen on either an INET or UNIX domain socket. To hide the details, an ispy client program is used to make and manage the connection. The command interface is driven using the cmd module from Python. Once connected you can list all the plugins which you have configured the system to make available.
Change to the context of a specific plugin and you can then issue the specific commands which the plugin makes available. Because it isn't a full interpreter prompt, you control which commands are available through the plugins you enable. This way you restrict what can be done and ensure that you can't do too much damage.
If you are addicted to power however, then no problem. Enable the optional embedded interpreter support from the configuration file and you can jump into the plugin for Python, fire one up and do as much damage as you want.
If you are comfortable monkey patching a live web application there is a range of other things one could do. One could introduce a wrapper that catches details of exceptions and enables you to then perform post-mortem debugging within the live process. This is similar to tools like the Flask debugger, but done using pdb directly in the live process.
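A bare-bones sketch of such a wrapper is shown here. It keeps the most recent traceback per function so that a later console command could hand it to `pdb.post_mortem()`. The names are invented, and a real console would need pdb's input and output connected to the client connection rather than the process's own terminal.

```python
import functools
import pdb
import sys

last_failure = {}  # function name -> traceback of its most recent exception


def capture_post_mortem(func):
    """Remember the traceback of the last exception raised by func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            last_failure[func.__name__] = sys.exc_info()[2]
            raise  # the request still fails as it would have anyway
    return wrapper


def debug_last_failure(name):
    """Open a post-mortem pdb session on the saved traceback (interactive)."""
    pdb.post_mortem(last_failure[name])
```

Holding on to a traceback keeps every frame in it alive, along with their local variables, so only the latest failure per function is retained.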
Finally, monkey patching can also help with our original problem of how one changes what is being monitored by a live web application without a restart. With an interactive console like this it becomes feasible to have commands that would allow us to monkey patch the live system to add the additional function traces. These would only exist until the process exited, but it does at least provide us some coverage until we can make a more permanent change.
A further problem area where monitoring can be useful is in answering the perennial question of how many processes and threads you should configure your WSGI server to use. Capacity can be viewed relative to normal traffic loads, but can also be used to gauge whether you have sufficient capacity in a farm of servers when you need to perform a rolling restart during a deploy.
If you have done your homework and have the available capacity, then although you will see a jump in how much of your capacity is used when some servers are taken offline, application response times will not be affected. Get it wrong though and you could start to see a backlog, with an increase in request queuing time and overall response times, and with users subsequently getting increasingly frustrated as the site slows down.
A further cause of backlogs due to inadequate capacity is when requests block and the effective number of available threads drops. Monitoring systems will often only report on a web transaction once it completes though. If a request never completes, you will not get any metrics nor a slow transaction trace.
This is where an interactive console can again help. In particular you could run a command to dump out details on all active WSGI requests, including request environ details and a Python stack trace. You then just need to find those which have been running for a longer than expected time and see where in the code they appear to be stuck.
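The data behind such a command could be gathered along these lines. This is a sketch, not ispyd's implementation: a WSGI wrapper records which thread is handling which request, and a dump function pairs that with `sys._current_frames()`. All names are hypothetical.

```python
import sys
import threading
import time
import traceback

_active = {}  # thread id -> (start time, short request summary)
_lock = threading.Lock()


class ActiveRequestTracker:
    """WSGI wrapper that records requests while they are being handled."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        thread_id = threading.get_ident()
        summary = "%s %s" % (environ.get("REQUEST_METHOD", "?"),
                             environ.get("PATH_INFO", "?"))
        with _lock:
            _active[thread_id] = (time.monotonic(), summary)
        try:
            return self.application(environ, start_response)
        finally:
            with _lock:
                _active.pop(thread_id, None)


def dump_active_requests(min_age=0.0):
    """Describe requests running for at least min_age seconds, with stacks."""
    now = time.monotonic()
    frames = sys._current_frames()
    with _lock:
        snapshot = dict(_active)
    lines = []
    for thread_id, (started, summary) in snapshot.items():
        frame = frames.get(thread_id)
        age = now - started
        if frame is None or age < min_age:
            continue
        lines.append("%s (running %.1fs)\n" % (summary, age))
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)
```

Calling `dump_active_requests(min_age=10.0)` from a console command would list only the requests that have been running for ten seconds or more. A request is treated as finished once the application callable returns, so time spent streaming the response body is not covered by this simple version.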
Being an interactive console though, we can only talk to one process at a time. What do we do about multi process web applications? Obviously if interacting with an embedded interpreter or debugger session the answer is that there is nothing we can do. What though if we only wish to dump out details of a process or perform monkey patching? What we want here is an ability in the client program to automatically apply a set of commands across a set of servers.
Because a console oriented interface is being used rather than trying to wrap up things in some higher level message oriented service abstraction, writing new plugins is relatively easy. All that is necessary is to provide a method for each command that writes the response to the output stream object set up for that instance of the shell. For more complicated plugins which require further input, such as an embedded interpreter, the input stream would also be used.
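The shape of such a plugin can be illustrated with nothing more than the standard `cmd` module. This is a sketch of the pattern only. How ispyd actually registers and names plugins is not shown here, and the class below is hypothetical.

```python
import cmd
import io
import os


class ProcessShell(cmd.Cmd):
    """One method per command; each writes its response to self.stdout."""

    prompt = "(process) "

    def do_cwd(self, line):
        """Display the current working directory of the process."""
        print(os.getcwd(), file=self.stdout)

    def do_pid(self, line):
        """Display the process ID."""
        print(os.getpid(), file=self.stdout)

    def do_exit(self, line):
        """Leave this shell."""
        return True  # a true return value ends cmdloop()


# The output stream is whatever the console hands in; here, a string buffer.
output = io.StringIO()
shell = ProcessShell(stdout=output)
shell.onecmd("cwd")
```

Because `cmd.Cmd` builds its `help` listing from the method docstrings, each command is documented in the console for free.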
By virtue of ispyd trying to define one generic infrastructure for managing the console and interaction with it, the goal then is that the wider community will get behind it and develop additional plugins which could be downloaded from PyPI. One could see lots of useful plugins being developed. A good in-process memory analysis tool for tracking memory growth would for example be particularly interesting and valuable when trying to debug memory problems.
In conclusion, what am I trying to say? It is that production systems need not be treated as this special sanctum that only the anointed operations people can touch. Use monitoring systems so you know when problems arise, but be prepared and also put in place mechanisms to help you debug the issues that do arise. Do it in a way though that is controllable and scriptable so that results are predictable. Doing debugging then becomes a normal procedure in the same way deploys are.
Obviously we would hope that you would see New Relic as a part of your tool set. Whatever you do though, use some level of monitoring. If you have no monitoring at all then not only will you not know immediately when there is a problem, but you will not even know where to start looking to debug it. So become a data nerd and deploy New Relic today. If you are interested in ispyd and want to help with that then contact me afterwards.