Don't Let a Bad Trigger Ruin Your Checkin!
Mark Harrison
Pixar Animation Studios
Our Trigger Goals.
Perforce checkin triggers are very useful to us. Users can check in their files in any way
they see fit, and we provide our services using post-commit triggers (a minimal trigger
sketch follows the list below). This approach worked well and had several benefits:
We could guarantee that the triggers would run on every file checkin; there was no
path by which the trigger code could be bypassed.
We were decoupled from the front end application checking in the file. We did
not need to be linked in with or share a release schedule with that code base.
Trigger code could be replayed on a checkin in case of error.
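A post-commit trigger lives in the Perforce triggers table and fires after each submit;
here is a minimal Python sketch. The change-commit trigger type and the %change%
variable are standard Perforce, while the script name and paths are hypothetical:

    #!/usr/bin/env python
    # notify_checkin.py -- minimal post-commit trigger sketch (hypothetical).
    # Registered in the triggers table with a line such as:
    #   notify change-commit //depot/... "python /p4/scripts/notify_checkin.py %change%"
    # Perforce replaces %change% with the submitted changelist number.
    import subprocess
    import sys

    def main():
        change = sys.argv[1]
        # Ask the server for a short description of the committed changelist.
        desc = subprocess.run(["p4", "describe", "-s", change],
                              capture_output=True, text=True, check=True).stdout
        # ... hand the changelist off to whatever service needs it ...
        print("changelist %s committed:\n%s" % (change, desc))
        return 0

    if __name__ == "__main__":
        sys.exit(main())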
Lesson learned: triggers are good!
First Try: Pure Triggers.
But as we went along, we hit a couple of problems.
As the number of repositories grew (much faster than we anticipated!), it became
more work to keep the triggers in sync. Adding a new trigger likewise became a
much larger task, since we had to install it on each of several repositories.
Triggers can hang. Sometimes NFS mounts can go bad, or a bad database state
(e.g. an open transaction holding a write lock) can block a trigger.
The most ironic problem: triggers worked so well for us, everybody wanted one! There
were numerous projects that could benefit from being informed when movie assets were
created or modified; many of these would update some database tables or cache some of
the data in the assets.
This amplified the two problems we had with our own triggers. Calling code outside of
our control meant that we couldn't even fix things ourselves when checkin errors
occurred, and our trigger configurations began to resemble the old adventure-game
maze: "you are in a twisty maze of little passages, all different."
Which depots were supposed to have which triggers?
We had to hand-edit numerous trigger specs whenever somebody changed their
software.
As more triggers appeared, checkins got slower. Each trigger is run sequentially,
so we couldn't even take advantage of multiple boxes or processors to speed
things up. Some of the triggers would scrape metadata out of each file checked in
(image formatting, color profiles, etc.), so we could conceivably end up reading
each file multiple times before the checkin returned to the user.
Having to "slightly" modify trigger parameters ("oh, for that depot can you set the
option --bargle=4, but if it's on a box without NFS patches can you instead use
--bargle=4 and --nopts=2?")
As more triggers started appearing, the number of checkin problems caused by the
triggers started to rise. We certainly didn't want that to happen, since one of
Perforce's selling points is that it's really stable.
Lesson learned: lots of triggers are bad!
Second Try: Using Triggers to Enqueue Work.
We looked at the problem again, focusing on these questions:
How can we allow multiple groups to benefit from check-in driven triggers?
How can we avoid the slowness involved with running multiple triggers?
How can we eliminate the administrative overhead of managing triggers?
How can we eliminate triggers' runtime errors and the troubleshooting they
require?
We came up with these two rules:
The set of post-submit triggers must be identical across all depots.
The post-submit triggers must execute as quickly as possible.
Additionally, we wanted to ensure:
We would be able to accommodate any groups that needed special backend
execution.
We would have some means of telling front-end systems that their trigger was
finished or that it failed. Preferably this would be a non-blocking mechanism, so
that the applications could, for example, keep their GUIs alive. For non-interactive
applications (e.g. thumbnail generation) we would log the errors and provide an
error notification.
We could execute these tasks in parallel on different boxes for speed.
Our solution was to execute exactly two post-submit triggers:
The LINKATRON (presented at the 2009 conference), which would ensure that
the trigger-like programs would have access to the checked-in files via NFS, and
they wouldn't have to check out a file to process it. This was especially
important for media files... think of a several-gigabyte video clip where some
information needed to be extracted from a header record in the file.
Our database backend, which would handle the enqueuing of the files and
changelists to other backend applications (sketched just below).
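Though this paper doesn't include the enqueuing code, the shape of it is easy to sketch:
the trigger records the checkin in a queue table and returns immediately, so the user's
submit is never held up by backend work. A minimal sketch, assuming a SQLite-backed
queue; the database path, table layout, and trigger arguments are all hypothetical:

    #!/usr/bin/env python
    # enqueue_checkin.py -- sketch of a fast, enqueue-only post-commit trigger.
    # All real work happens later in the backend queue processors; the
    # trigger itself inserts one row per checkin and exits.
    import sqlite3
    import sys
    import time

    QUEUE_DB = "/var/p4/queue.db"  # hypothetical location

    def enqueue(depot, change):
        db = sqlite3.connect(QUEUE_DB)
        try:
            db.execute("""CREATE TABLE IF NOT EXISTS work_queue (
                              id INTEGER PRIMARY KEY AUTOINCREMENT,
                              depot TEXT, change_num INTEGER,
                              submitted REAL, done INTEGER DEFAULT 0)""")
            db.execute("INSERT INTO work_queue (depot, change_num, submitted)"
                       " VALUES (?, ?, ?)", (depot, int(change), time.time()))
            db.commit()
        finally:
            db.close()

    if __name__ == "__main__":
        # Invoked once per submit, e.g.: python enqueue_checkin.py mydepot %change%
        enqueue(sys.argv[1], sys.argv[2])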
We would ensure the backend processors would be first-class members of our Perforce
infrastructure by writing all of our own processors as plugins. This also gave us the
advantage of being able to process certain items (e.g. thumbnail generation) in parallel.
Our Implementation and Usage.
We implemented this system as a workflow queue manager. There are several
off-the-shelf queueing systems that could be used, but due to our particular
requirements and development environment we ended up implementing our own.
Each application has its own queue, and can register to receive notifications at one of
two levels (a consumer sketch follows the list):
The file level. This allowed an application such as our thumbnail generator to
start processing files quickly, without the extra work of reading a changelist,
breaking it apart, and processing each item. It also has the advantage that each
file can be treated as an atomic work unit -- if a thumbnail fails for one file,
there's no reason the other thumbnails shouldn't be generated.
The changelist level. For some other applications, it was better to receive exactly
one notification per checkin. For these notifications, we included the depot name
and the changelist number; if the application wanted to see the contents of the
changelist, it could examine that on its own.
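Our registration API isn't shown in this paper, but a file-level consumer can be sketched
as follows; because each item is one file, a pool of workers can process items in parallel
and a single failure doesn't block the rest. Every name here, including the workqueue
client module, is hypothetical:

    # thumbnail_worker.py -- sketch of a file-level queue consumer.
    from concurrent.futures import ThreadPoolExecutor

    import workqueue  # hypothetical in-house queue client

    def make_thumbnail(depot_path):
        # ... read the image via its NFS path and render a thumbnail ...
        pass

    def process(item):
        try:
            make_thumbnail(item.depot_path)
            item.ack()           # mark this one file as done
        except Exception as exc:
            item.fail(str(exc))  # leave it queued for retry and alerting

    def main():
        # File-level registration: one queue item per submitted file.
        # (A changelist-level consumer would register with level="changelist"
        # and receive just the depot name and changelist number.)
        queue = workqueue.register(app="thumbnails", level="file")
        with ThreadPoolExecutor(max_workers=8) as pool:
            for item in queue:   # blocks until new items arrive
                pool.submit(process, item)

    if __name__ == "__main__":
        main()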
This has several advantages, both for the end user and for the groups providing the
triggers:
A single broken queue processor does not break a checkin. Of course, if your
workflow depends on work being done by that processor you will be blocked, but
many tasks (e.g. thumbnail generation or keyword mining) can be done after the
fact.
It is easy to identify a queue processor that is broken, and notify the responsible
party. If a queue is filling up and nothing is being processed, we issue a warning
to the queue owner (a monitoring sketch follows this list).
It is easy to see what work needs to be caught up when breakage is repaired. By
the nature of the queue system, all uncompleted work is still in the queue, ready to
be processed when the processor is restarted.
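As a concrete illustration of the stuck-queue warning, here is a monitoring sketch
against the hypothetical work_queue table from the enqueue sketch above; the stall
threshold and notification mechanism are likewise made up:

    # queue_monitor.py -- sketch of the "queue filling up" warning.
    import sqlite3
    import time

    QUEUE_DB = "/var/p4/queue.db"   # hypothetical location
    STALL_SECONDS = 15 * 60         # hypothetical stall threshold

    def stalled_queues():
        """Return (depot, backlog) pairs whose oldest unfinished item is too old."""
        db = sqlite3.connect(QUEUE_DB)
        try:
            rows = db.execute("SELECT depot, MIN(submitted), COUNT(*)"
                              " FROM work_queue WHERE done = 0"
                              " GROUP BY depot").fetchall()
        finally:
            db.close()
        now = time.time()
        return [(depot, n) for depot, oldest, n in rows
                if now - oldest > STALL_SECONDS]

    if __name__ == "__main__":
        for depot, backlog in stalled_queues():
            # ... warn the queue owner; uncompleted work stays in the queue ...
            print("WARNING: %s has %d unprocessed items" % (depot, backlog))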
Synchronous Operation.
In order to handle the requirement that the queue processors operate in a synchronous
manner, we use our internally developed Templar Broadcasting System. This messaging
system uses multicast UDP. Measurements on our network showed that there was
minimal (microsecond) latency, and we could handle a sustained rate of 30,000 or more
messages/second reliably. Of course, delivery is not guaranteed, so applications need to
provide an alternate method for verifying that their work has been completed. A typical
application might query the database for a particular file or changelist.
However, since in our environment multicast is "mostly reliable", we can set a relatively
long timeout period before having to fall back to the polling mechanism. Most
applications are therefore able to continue almost immediately when the notification is
sent.
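The Templar system itself is internal, but the "fast path on multicast, slow path on
polling" pattern is easy to sketch with standard multicast UDP; the group address, port,
message format, and fallback function below are all hypothetical:

    # wait_for_done.py -- sketch of notification with a polling fallback.
    import json
    import socket
    import struct

    MCAST_GRP = "239.192.0.42"  # hypothetical multicast group
    MCAST_PORT = 4242           # hypothetical port

    def wait_for_completion(change, timeout=30.0):
        """Return True once the backend announces that `change` is processed."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", MCAST_PORT))
        mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        sock.settimeout(timeout)
        try:
            while True:
                msg = json.loads(sock.recvfrom(65535)[0])
                if msg.get("change") == change and msg.get("status") == "done":
                    return True   # fast path: the announcement arrived
        except socket.timeout:
            # slow path: delivery is only "mostly reliable", so after the
            # timeout we fall back to polling the queue database instead
            return poll_database_for(change)
        finally:
            sock.close()

    def poll_database_for(change):
        # ... hypothetical fallback: query the work queue for this changelist ...
        return False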
Summary.
We followed these steps in our implementation process and are happy with the results.
They allow several groups to write checkin-time code, and protect checkins from any
breakage in that code.
Triggers
Lots of triggers
Small number of triggers, feeding work queues
Lesson learned: triggers + work queues are great!