The document summarizes research on formal verification of transactional interaction contracts (TICs). It presents frameworks for specifying TICs using statecharts and activity diagrams. It then describes using computation tree logic (CTL) model checking to verify properties such as exactly-once execution semantics for TICs modeled as Kripke structures. An example demonstrates verifying exactly-once semantics for a web service transaction involving multiple application servers.
Formal Verification of Transactional Interaction Contract
3. E-Business Scenario. Page title: "Review and place your order!" Error message shown to the user: "Your server command (process id #20) has been terminated. Re-run your command (severity 13) in /opt/www/your-reliable-eshop.biz/mb_1300_db.mb1"
6. Transactions are great. However, … [Sequence diagram over a timeline with three participants: Web Client, Web Application Server, Database Server. Messages shown: Purchase Request, Start Transaction, SQL Request, SQL Response, SQL Request, SQL Response, Commit Transaction, ACK, Transaction Restart, Purchase Request Resubmission, Order Confirmation. Annotation: Non-idempotent execution!]
8. Real-World n-Tier Application [Architecture diagram. Components shown: Client, Web Server, Expedia, Expedia App Server, Sabre Server, Sabre App Server, Amadeus, Amadeus App Server, DB 1, DB 2, DB 3, DB 4.]
15. Committed IC Sender [Statechart CIC_SNDR_SC. State names on the slide: RECOVERY, SENDING, MSG_LOOKUP, PREPARE_PERSISTENCE, STABLE_S, INSTALLED_S; termination connector T, reached on SNDR_CRASH. Transition labels on the slide: SNDR_TRIGGER [SNDR_LAST_LOGGED=='']/ SNDR_ND | SNDR_ND/ SEND_MSG | SNDR_MSG_TM and not (STABLE_OK or INSTALLED_OK)/ SEND_MSG | STABLE_OK | SNDR_STABLE_TM and not (INSTALLED_OK or GET_MSG_OK)/ IS_INSTALLED | GET_MSG_OK | MSG_RECOVERED_TM/ SEND_MSG | [SNDR_LAST_LOGGED=='INSTALLED'] | INSTALLED_OK/ SNDR_LAST_LOGGED:='INSTALLED'. Legend: EVENT_OK = EVENT and not LINK_OUTAGE; _TM means TIMEOUT.]
16. Committed IC Receiver [Statechart CIC_RCVR_SC. State names on the slide: RECOVERY, MSG_RECOVERY, MSG_RECEIVED, MSG_PROCESSED, STABLE_R, INSTALLED_R; termination connector T, reached on RCVR_CRASH. Transition labels on the slide: [RCVR_LAST_LOGGED==''] | [RCVR_LAST_LOGGED=='STABLE'] | [RCVR_LAST_LOGGED=='INSTALLED'] | SEND_MSG_OK [RCVR_LAST_LOGGED=='STABLE']/ GET_MSG | not SEND_MSG_OK and GET_MSG_TM/ GET_MSG | SEND_MSG_OK [ICIC]/ RCVR_LAST_LOGGED:='INSTALLED'; INSTALLED | (RCVR_STABLE_TM or RCVR_ND [MSG_ORDER_MATTERS]) [not ICIC and RCVR_LAST_LOGGED=='']/ RCVR_LAST_LOGGED:='STABLE'; STABLE | MSG_EXEC_TM/ RECEIVED | RCVR_INSTALL_TM/ RCVR_LAST_LOGGED:='INSTALLED'; INSTALLED | SEND_MSG or IS_INSTALLED/ STABLE | SEND_MSG or IS_INSTALLED/ INSTALLED. Legend: EVENT_OK = EVENT and not LINK_OUTAGE; _TM means TIMEOUT.]
Welcome to my colloquium. Today I present the research results of my dissertation, entitled "Integrated Data, Process, and Message Recovery for Failure Masking in Web Services".
My presentation is structured as follows. I will state the problem of providing recovery guarantees for multi-tier applications. Then I will introduce our solution, a family of recovery protocols called the "interaction contracts framework". I will show you a generic state-and-activity chart specification of the committed IC that is easily adaptable to a concrete application scenario. First we verify a single instance of the generic specification. Then we prove that it also behaves correctly in a composed Web Service model that uses IC instances as building blocks. In the second part of my talk I present EOS, a prototype system I have built to demonstrate the viability of the IC framework for Web services. It enables failure masking in arbitrarily distributed Web applications written in the PHP programming language. Beyond that, it provides recovery guarantees for the end user by incorporating the IC functionality into the Web browser, specifically Microsoft Internet Explorer. I conclude the talk with a short summary.
The problem of doing business over the Internet, or with a distributed application infrastructure in general, can be characterized by the term "non-idempotence". The mathematical definition of this term is rather simple: applying a function once and applying it several times do not yield the same result. With a distributed information system, developers and users need to realize that a request timeout may simply result from long delays during peak load rather than from a failure. Users have learned that hitting the refresh or submit button several times is tempting but leads to unexpected results. For instance, a friend of mine applied for new health insurance and got 8 smart cards for his 3-member family. Ordering one item and getting many does not always sound like a bad deal, unless you have to pay for all of them.
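To make the definition concrete, here is a minimal sketch contrasting an idempotent operation with a non-idempotent one. The shopping-cart functions are hypothetical names chosen for illustration; they do not come from the talk.

```python
def set_quantity(cart, item, qty):
    """Idempotent: applying it twice leaves the same cart as applying it once."""
    updated = dict(cart)
    updated[item] = qty
    return updated


def add_item(cart, item):
    """Non-idempotent: every application adds one more unit."""
    updated = dict(cart)
    updated[item] = updated.get(item, 0) + 1
    return updated


# A resubmitted "add" request changes the outcome; a resubmitted "set" does not.
once = add_item({}, "smart card")
twice = add_item(once, "smart card")
set_once = set_quantity({}, "smart card", 3)
set_twice = set_quantity(set_once, "smart card", 3)
```

A blindly retried `add_item` request is exactly the situation that produces the surplus smart cards.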
The traditional approach to doing business in a failure-prone environment manages the application state in a transactional database. Suppose we have a banking application with accounts stored in a relational table that maps account numbers to the corresponding balances. The transaction shown on this slide transfers 100 euros from account 1 to account 2, as indicated by these 2 SQL statements. Declaring this operation sequence as a transaction, using begin and commit statements, guarantees that the sequence is executed atomically: either completely or not at all. A situation where account 1 simply loses 100 euros is not possible even if the transaction is interrupted in the middle. To achieve this, each operation is logged before it takes effect (write-ahead logging). The log entry contains the log sequence number (LSN) and the information needed to undo and redo the operation. Logging is initially done in main memory. However, on transaction commit all log entries have to be written to disk synchronously, which is 6 orders of magnitude slower. This operation is called log forcing. After a failure the log on disk is analyzed and the operations of committed transactions are redone, whereas transactions without a commit log entry are undone. Since the database server may fail several times before recovery completes, we need to make sure that undo and redo operations are not applied more than once. This is achieved by stamping the disk pages with the LSN of the most recent operation they reflect. A simple LSN test guarantees recovery idempotence.
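The LSN test can be illustrated with a small sketch. It assumes a toy model in which each account lives on its own page and a log record stores a balance delta; the names `Page`, `LogRecord`, and `redo`, and the starting balances, are illustrative and do not describe a real database engine.

```python
from dataclasses import dataclass


@dataclass
class Page:
    balance: int
    page_lsn: int = 0  # LSN of the most recent operation reflected on this page


@dataclass
class LogRecord:
    lsn: int
    txn: str
    page_id: int
    delta: int  # redo adds delta to the balance


def redo(pages, log, committed):
    """Reapply operations of committed transactions, skipping those already applied."""
    for rec in log:
        page = pages[rec.page_id]
        # The LSN test: a page that already reflects this record is left alone.
        if rec.txn in committed and rec.lsn > page.page_lsn:
            page.balance += rec.delta
            page.page_lsn = rec.lsn


# Transfer of 100 euros from account 1 to account 2 (assumed starting balances).
pages = {1: Page(500), 2: Page(200)}
log = [LogRecord(1, "T1", 1, -100), LogRecord(2, "T1", 2, +100)]
redo(pages, log, committed={"T1"})
redo(pages, log, committed={"T1"})  # a repeated recovery pass changes nothing
```

Running `redo` twice models a server that crashes during recovery and starts over; the page LSNs prevent the transfer from being applied a second time.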
Consider now a scenario with a 3-tier Web application where an end user submits a purchase request to the Web application server. A transaction is started on the database server on behalf of the user. Assume that the database successfully commits the transaction, but the acknowledgement message does not reach the Web application server, either because of a database server crash or because of a network failure. Developers usually handle this failure by retrying the transaction because they assume that the transaction has been aborted, which is not necessarily true, as we have seen. Unfortunately this is not the end of the story. How is the end user supposed to react to the server timeout message? People love hitting the refresh button of the browser. I am aware of some of those in this room. It is a very bad idea because Web servers normally do not eliminate duplicates. The bottom line is that recovery needs to treat messages as well as states to ensure correct execution.
If that simple 3-tier system was already complicated, how long does it take to analyze all possible failure combinations and their implications in a system with 10 components spread over 4 tiers? How about ad-hoc interactions in a peer-to-peer network?
These problems have motivated the IC framework. It considers applications as consisting of a set of components that exchange messages. In this talk we concentrate on persistent components. They can recreate state and messages after a failure and can determine whether they have executed a particular message. Another relevant component type (external) covers the end users and conventional components outside the IC framework. Interaction contracts define how components need to exchange messages to keep the interactions recoverable. We will cover the Committed IC (CIC) in this talk, as it is the most important IC in the framework. The main design goal is to ensure exactly-once semantics, which guarantees that once an interaction has started, it will be executed exactly once. All failures are masked.
To provide recovery guarantees, all persistent components (Pcoms), such as client and server components, need to be equipped with logging and recovery capabilities. Unlike database systems, we do not want and do not need to enable undo. Components are piecewise deterministic: they execute deterministically between two consecutive nondeterministic events, such as incoming messages from other components or reading the system clock. So logging the nondeterministic events turns piecewise-deterministic components into truly deterministic ones. We can recreate a Pcom's state and messages by simply replaying the log from some initial state. To accelerate the deterministic replay, the component needs to truncate the log on a regular basis. Before doing this it has to dump its current state to disk. We call such state dumps "installation points". Our failure model includes crashes of the sending and receiving components as well as network failures causing message losses. Such transient failures are due to nondeterministic so-called Heisenbugs, which are too hard to reproduce to be eliminated. We do not consider malicious manipulations, called commission failures. And we do not deal with the corruption of stable storage, as this can be avoided by sufficient replication.
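The replay idea can be sketched as follows. This is a toy model under stated assumptions: the class `Pcom`, the function `replay`, and the state update rule are made up for illustration, and the in-memory list stands in for a log on stable storage. The two nondeterministic events are the incoming message payload and a random draw.

```python
import random


class Pcom:
    """Toy persistent component: deterministic apart from the events it logs."""

    def __init__(self):
        self.state = 0
        self.log = []  # nondeterministic events in the order they occurred

    def on_message(self, payload, draw=None):
        if draw is None:
            # Normal operation: the draw is nondeterministic, so log it with the message.
            draw = random.randint(1, 100)
            self.log.append((payload, draw))
        # Everything below is deterministic given (payload, draw).
        self.state = self.state * 3 + payload + draw


def replay(log):
    """Recreate the component state from the initial state by replaying the log."""
    fresh = Pcom()
    for payload, draw in log:
        fresh.on_message(payload, draw)
    return fresh


original = Pcom()
for payload in (5, 7, 11):
    original.on_message(payload)
recovered = replay(original.log)
```

An installation point would correspond to saving `state` and discarding the log prefix it already reflects, so that replay starts from the saved state instead of from zero.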
The CIC can be informally described as follows. By sending a message to a different component, the CIC sender commits its state. Usually, it forces the log to disk to make its state and the message recoverable. The sender deterministically tags its message with a unique id, a message sequence number (MSN). The sender keeps sending the message periodically until it gets a stable notification from the receiver. It keeps the message because the receiver may request it again after a failure. The sender is released from all of its obligations when it gets an installed notification from the receiver. The CIC receiver eliminates message duplicates based on the MSN. It persists an interaction before sending a stable notification to the sender. Normally this is done by logging the message header and forcing the log. After a failure, the receiver requests the original message from the sender if its log contains only the message header. The receiver ensures its autonomous recovery by forcing the complete message to disk or creating an installation point before sending an installed notification to the sender.
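The resend and duplicate-elimination rules can be sketched in a few lines. This is a deliberately simplified model, not the specification from the talk: it ignores component crashes, collapses the stable and installed notifications into one round trip, and uses in-memory dictionaries where a real component would force a log to disk. The classes `Sender` and `Receiver` and the outcome strings are hypothetical.

```python
class Receiver:
    def __init__(self):
        self.log = {}       # msn -> "STABLE" or "INSTALLED" (stands in for the forced log)
        self.executed = []  # payloads that were actually processed

    def deliver(self, msn, payload):
        if msn not in self.log:  # first copy: persist, then execute exactly once
            self.log[msn] = "STABLE"
            self.executed.append(payload)
            self.log[msn] = "INSTALLED"
        return self.log[msn]     # notification returned to the sender


class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.next_msn = 0
        self.pending = {}  # msn -> payload, kept until the receiver reports INSTALLED

    def send(self, payload, outcomes):
        """Resend until a notification arrives; `outcomes` scripts the link behavior."""
        msn = self.next_msn
        self.next_msn += 1
        self.pending[msn] = payload  # stands in for forcing state and message to the log
        attempts = 0
        for outcome in outcomes:
            attempts += 1
            if outcome == "msg_lost":
                continue             # timeout, resend
            note = self.receiver.deliver(msn, payload)
            if outcome == "ack_lost":
                continue             # receiver acted, but the sender cannot know yet
            if note == "INSTALLED":
                del self.pending[msn]  # released from all obligations
                return attempts
        raise RuntimeError("link never recovered")


receiver = Receiver()
sender = Sender(receiver)
attempts = sender.send("buy 1 book", ["msg_lost", "ack_lost", "ok"])
```

The message reaches the receiver twice in this run (once with a lost acknowledgement), yet the MSN check ensures it is executed only once.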
We use the state-and-activity chart language to formally specify the interaction contracts. The language is supported by Statemate, a leading tool for the specification of reactive systems. The specification process begins with an activity chart providing the functional view of the system. Internal activities are represented by solid-line boxes. Dashed-line boxes specify external activities, an execution environment, and external applications. The arrows represent the data flow. Labels indicate which data or events are concerned. In this concrete scenario we specify an activity ensuring that a message is passed from one CIC component to another according to the CIC rules, in a failure-prone environment that nondeterministically supplies failure events (crashes and link outages). All the application needs to know is that it should activate the "sender trigger" and await an occurrence of the event "message processed". This is important, please memorize that. The system administrator specifies the timeout values suitable for the given application, along with some other options. The manager may stop the specification process at this stage. Activities are hierarchical and allow for step-wise refinement. The next employee will say that the behavior of the cic activity is actually controlled by a so-called control activity cic_sc (sc stands for statechart), depicted as a green rounded box, and that it has two further sub-activities, cic_sender and cic_receiver, which exchange the messages and notifications as I have described informally before. The behaviors of these sub-activities are defined by the corresponding control activities.
A control activity is defined by a statechart. A statechart is basically a finite state automaton with some additional features. First, as with activities, states can be nested. Dashed lines separate so-called orthogonal components, which represent processes that run simultaneously. In this case, the orthogonal components are the sndr and rcvr. The system is initialized by entering states through a default transition, a transition without a source state. A state targeted by a default transition is called a default substate. When a state is entered, its orthogonal substates are entered within the very same step. When a state is entered, its default substate is entered in the same step as well. Usual transitions are labeled with event-condition-action rules. The transition is taken if the event was generated in the previous step while the condition was true. When the transition is taken, the action is executed. The action might be as simple as generating an event or starting an activity, as in this example, or as complex as a branching or loop statement. The only purpose of the given statechart is to restart the sender and the receiver activities after a crash. The condition "not active" guards the system from starting duplicate activity instances while the original one is still running. The set of entered states is called a configuration. The current variable valuations define the execution context of the system. Based on the current configuration and execution context, the system performs a step by computing a new configuration and execution context.
This is the statechart controlling the behavior of a CIC sender. It is of course impossible to work out all the details in this short talk. Let us, however, take a look at some important specification techniques. The system starts in the default substate recovery. Further behavior depends on the content of the log. If the log is empty, the sender does not start sending; it awaits a trigger event. The log is modeled by a string variable, SNDR_LAST_LOGGED in this example. Log forcing is represented by value assignments to the log variable. A regular message or an acknowledgement is considered delivered if its generation does not coincide with a LINK_OUTAGE event, which is represented by compound events suffixed _OK. Before sending a message, the sender signals sender nondeterminism. Sending out a message usually commits the order of the received messages. Normal operation can be nondeterministically interrupted by a sender crash event. Transitions originating in a higher-level state dominate all transitions connecting substates. So the sender activity stops by entering the termination connector, represented by an orange circle labeled T. The activity terminates logically when it enters the state "installed".
This statechart defines the behavior of the sender's counterpart, the receiver component. The difference from the sender is that the log variable can assume two values: stable and installed. In addition, the log is force-written only when we have a nondeterministic situation and the message order matters for the given application, as specified by the developer. The receiver nondeterminism event is usually coupled with the sender nondeterminism events generated by the sender activities running on the same component. Again, the receiver activity terminates logically in the state "installed".
Before we start with the verification of the IC we need some additional definitions. A finite state computational system, e.g. a Statemate specification, can be represented as a Kripke structure. It contains a finite state transition graph with nodes labeled with atomic propositions that are valid in this node. These atomic propositions would refer to individual memory bits in a software system. If we unwind the state transition diagram we obtain a computation tree with potentially infinite branches.
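As an illustration of these definitions, the sketch below encodes a tiny Kripke structure and unwinds it into finite prefixes of its computation tree. The three states loosely echo the sender statechart, but the structure, the proposition names, and the helper `unwind` are assumptions made for the example, not the model verified in the dissertation.

```python
from dataclasses import dataclass


@dataclass
class Kripke:
    initial: set  # initial states
    succ: dict    # state -> set of successor states (every state has at least one)
    labels: dict  # state -> set of atomic propositions that hold in that state


def unwind(kripke, state, depth):
    """All paths with `depth` transitions that start in `state`."""
    if depth == 0:
        return [[state]]
    return [[state] + rest
            for nxt in sorted(kripke.succ[state])
            for rest in unwind(kripke, nxt, depth - 1)]


sender = Kripke(
    initial={"RECOVERY"},
    succ={
        "RECOVERY": {"SENDING"},
        "SENDING": {"SENDING", "INSTALLED"},  # resend loop or completion
        "INSTALLED": {"INSTALLED"},
    },
    labels={
        "RECOVERY": set(),
        "SENDING": {"msg_pending"},
        "INSTALLED": {"logged_installed"},
    },
)
paths = unwind(sender, "RECOVERY", 2)
```

Because of the self-loop on SENDING, the number of paths keeps growing with the depth, which is why the fully unwound tree has infinite branches.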
A computation tree over the set of atomic propositions P can be characterized by the temporal logic called CTL. Its syntax is inductively defined as shown on this slide. The temporal aspects of the execution paths originating in a given state can be characterized by the path quantifiers Exists and All combined with the temporal modalities Next and Until, as well as Finally and Globally. The modality Finally is used in the sense that some property holds eventually. Globally means that a property holds in every state of a path.
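The grammar shown on the slide is not part of this transcript. For reference, the standard CTL syntax over atomic propositions p in P can be written as follows, with Finally (F) and Globally (G) introduced as abbreviations:

```latex
\begin{align*}
\varphi ::=\;& p \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2
  \mid \mathbf{EX}\,\varphi \mid \mathbf{AX}\,\varphi
  \mid \mathbf{E}[\varphi_1 \,\mathbf{U}\, \varphi_2]
  \mid \mathbf{A}[\varphi_1 \,\mathbf{U}\, \varphi_2] \\
\mathbf{EF}\,\varphi \equiv\;& \mathbf{E}[\mathit{true} \,\mathbf{U}\, \varphi], \qquad
\mathbf{AF}\,\varphi \equiv \mathbf{A}[\mathit{true} \,\mathbf{U}\, \varphi] \\
\mathbf{EG}\,\varphi \equiv\;& \neg\mathbf{AF}\,\neg\varphi, \qquad
\mathbf{AG}\,\varphi \equiv \neg\mathbf{EF}\,\neg\varphi
\end{align*}
```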
In my dissertation, I have proved many interesting safety and liveness properties using Statemate's integrated model checker. I present the most important ones here. I show that my CIC specification, for the sender as well as for the receiver, never logs an interaction twice. We show for all execution paths that if a value is written to a log variable, as indicated by the internal Statemate event written, it is never written again. To show liveness we use the Statemate-specific bounded modality "F less than n", meaning that the property holds eventually, after at most n steps. I have proved that if failures no longer occur after at most 500 steps, the CIC terminates after at most 700 steps, provided the maximum timeout value does not exceed 30 steps. Altogether this shows the exactly-once character of the CIC specification.
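The formulas themselves are not part of this transcript. As a hedged reconstruction of the safety property only, let written_v be a hypothetical atomic proposition that holds in a step where the value v is written to the log variable; the actual Statemate notation may differ. "Never written again" can then be read as:

```latex
\[
\mathbf{AG}\,\bigl(\mathit{written}_v \rightarrow \mathbf{AX}\,\mathbf{AG}\,\neg\mathit{written}_v\bigr)
\]
```

In words: on all paths, whenever v is written, it is not written in any later step.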
As the next step, we would like to specify and verify the interaction contract framework applied to a complex Web Service scenario. We consider a 4-tier application encompassing a browser, a Web server, two application servers, and, last but not least, a database server. Internal activities are instances of the generic IC specifications. The arrows couple the event MSG_PROCESSED in one interaction with the SNDR_TRIGGER in another one. The user submits a request to the Web server. The Web server calls both application servers asynchronously. One application server starts a transaction on the database server. The other responds immediately. When both application server replies arrive, the Web server generates a reply to the browser that is displayed to the user. An interesting observation here is that some instances share the same failure events. For example, the sender crash in the Web server reply is the same as the receiver crash in the application server reply. Analogously, the sender nondeterminism event of the Web server reply and the receiver nondeterminism event in the application server replies are identical. Consequently, the Web server reply commits the order of the application server reply messages. We can verify this by stating the following CTL formula. It says that when the Web server reply is sent, the application server interactions are already captured in the log.
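The formula on the slide is not reproduced in this transcript. One plausible shape for it, using hypothetical proposition names for the send event of the Web server reply and for the logged state of the two application server replies, is:

```latex
\[
\mathbf{AG}\,\bigl(\mathit{send}_{\mathit{ws\_reply}} \rightarrow
  (\mathit{logged}_{\mathit{as1\_reply}} \wedge \mathit{logged}_{\mathit{as2\_reply}})\bigr)
\]
```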
Explicit model checking is a rather simple recursive algorithm with quadratic run time. There are heuristic solutions using ordered binary decision diagrams, as in Statemate's symbolic model checker. Other model checkers use SAT solvers.
In the end, we learned that we need to make compromises between the realism of the models and their verifiability. A Web service model that uses integer expressions to generate timeouts periodically, as would happen in a real system, could not be verified. We succeeded after replacing the integer-based timeouts with nondeterministic 1-bit timeouts, which is a more general case. However, no engineering tricks helped us obtain any results for a multi-user model or for the liveness of the single-user model.
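To give a feel for the explicit approach, here is a minimal sketch that labels states by naive fixpoint iteration over a successor map. It covers only EX, EU, EG, and the derived AF; the function names and the three-state example graph are assumptions for illustration and are unrelated to Statemate's implementation.

```python
def pre_exists(succ, target):
    """States with at least one successor in `target` (the EX operator)."""
    return {s for s, nxt in succ.items() if nxt & target}


def check_eu(succ, phi, psi):
    """E[phi U psi]: least fixpoint, growing backwards from the psi-states."""
    result = set(psi)
    while True:
        grown = result | (pre_exists(succ, result) & phi)
        if grown == result:
            return result
        result = grown


def check_eg(succ, phi):
    """EG phi: greatest fixpoint, dropping phi-states that cannot stay inside phi."""
    result = set(phi)
    while True:
        shrunk = result & pre_exists(succ, result)
        if shrunk == result:
            return result
        result = shrunk


def check_af(succ, phi):
    """AF phi, computed as the complement of EG (not phi)."""
    states = set(succ)
    return states - check_eg(succ, states - phi)


# Hypothetical sender abstraction: SENDING may loop forever or reach INSTALLED.
succ = {
    "RECOVERY": {"SENDING"},
    "SENDING": {"SENDING", "INSTALLED"},
    "INSTALLED": {"INSTALLED"},
}
states = set(succ)
can_terminate = check_eu(succ, states, {"INSTALLED"})  # EF INSTALLED
must_terminate = check_af(succ, {"INSTALLED"})         # AF INSTALLED
```

In this toy graph every state can reach INSTALLED, but only INSTALLED itself satisfies AF INSTALLED, because the resend loop may run forever. This is in line with the liveness result above holding only under the assumption that failures eventually stop.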
Now I would like to briefly introduce the prototype system EOS.
I implemented the committed and external interaction contracts for PHP-based Web services. PHP is a scripting language that is embedded into ordinary HTML pages. PHP is interpreted by the Zend engine, which has a great variety of modules extending the capabilities of the PHP language. With PHP we can manage the application state across multiple HTTP requests using the Session module. There are a number of options for invoking remote Web services to build a complex multi-tier application. In my work I concentrated on the CURL module. A reply message of a PHP script is normally an HTML page that is displayed by the browser.
Our prototype implements exactly-once semantics. It delivers the recovery guarantees to the end user by implementing the external and the committed interaction contracts for Internet Explorer. On the PHP side we can recover concurrent requests accessing shared objects. We can recover calls to the nondeterministic functions time, curl_exec, and the random number generator rand. We really do support n tiers for any n, with any fanout in the call structure. We have enhanced the performance of the original PHP implementation with regard to disk I/O and refined its concurrency control. For instance, it is now possible to access the session data read-only.
We performed measurements to evaluate the overhead of the interaction contracts in a 3-tier application that has a structure similar to an eBay-like auction service. The front-end server manages private user settings that are accessed simultaneously without contention. The back-end server manages the current highest bids for auction items, which are accessed concurrently. The load was generated by a synthetic load generator, Apache JMeter, from 5 different machines.
The run-time overhead of EOS-PHP is on average about 100% in terms of both elapsed time and CPU time. At this price we support failure masking, which radically simplifies the development process and provides a correct and highly available service to customers.
I conclude my talk. I presented formal specifications of recently proposed interaction contracts that had been described only informally in the original literature. We mathematically proved many safety and liveness properties of the ICs. We have learned that model checking technology has its limitations due to the state-explosion problem. There are several ways to cope with this. For example, some researchers have explored opportunities for combining manual induction proofs with the model checker. Last but not least, there are other verification technologies such as theorem proving. Another major part of my dissertation is a viable implementation of the IC framework for PHP-based Web services. We provided rigorous recovery guarantees for applications and end users at a reasonable price. In the context of this work, we added some brand-new features, as well as optimizations of existing ones, to both Internet Explorer and the PHP language.
Thank you very much for your attention. And I know you have questions.