The document discusses static analysis, which allows checking program code before execution. It describes the static analysis process, which involves lexing, parsing, and abstract syntax tree (AST) creation. Various static analysis techniques are then performed on the AST, including AST walker analysis, data flow analysis, and path-sensitive data flow analysis. The document provides examples of how ASTs represent code and are used in static analysis.
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Â
Using Static Analysis in Program Development
1. Using Static Analysis in Program
Development
Authors: Alexey Kolosov
Date: 31.01.2008
Abstract
Static analysis allows checking program code before the tested program is executed. The static analysis
process consists of three steps. First, the analyzed program code is split into tokens, i.e. constants,
identifiers, reserved symbols, etc. This operation is performed by lexer. Second, the tokens are passed
to parser, which builds an abstract syntax tree (AST) based on the tokens. Finally, the static analysis is
performed over the AST. This article describes three static analysis techniques: AST walker analysis, data
flow analysis and path-sensitive data flow analysis.
Introduction
Application testing is an important part of software development process. There are many different
types of software testing. Among them there are two types involving the application's code: static
analysis and dynamic analysis.
Dynamic analysis is performed on executable code of a compiled program. Dynamic analysis checks only
user-specific behavior. That is, only the code, executed during a test is checked. Dynamic analyzer can
provide the developer with information on memory leaks, program's performance, call stack, etc.
Static analysis allows checking program code before the tested program is executed. Compiler always
performs static analysis during the compilation process. However, in large, real-life projects it is often
necessary to make the entire source code fulfill some additional requirements. These additional
requirements may vary from variables naming to portability (for example, the code should be
successfully executed both on x86 and x64 platforms). The most common requirements are:
• Reliability - a lower amount of bugs in the tested program.
• Maintainability - better understanding of the source code by others so that it is easier to
upgrade/change the source code.
• Testability - shorter testing time due to more effective testing process.
• Portability - flexibility when the tested program is launched on different hardware platforms (for
example, x86 and x64, as it has already been mentioned above).
• Readability - better understanding of the code by others and therefore shorter review times and
clearer code reading[1].
All requirements can be divided into two categories: rules and guidelines. Rules describe what is
mandatory, while guidelines describe what is recommended (by analogy with errors and warnings
produced by built-in code analyzers of standard compilers).
Rules and guidelines form a coding standard. A coding standard defines the way a developer must and
should write program code.
2. A static analyzer finds source code lines, which presumably do not fulfill the specified coding standard
and displays diagnostic messages so that the developer can understand what is wrong with these lines.
The static analysis process is similar to compilation except that no executable or object code is
generated. This article describes the static analysis process step by step.
The Analysis Process
Static analysis process consists of two steps: abstract syntax tree creation and abstract syntax tree
analysis.
In order to analyze source code, a static analysis tool should "understand" the code, that is, parse it and
create a structure, describing the code in a convenient form. This form is named abstract syntax tree
(often referred to as AST). To check, whether source code fulfils a coding standard, this tree should be
built.
In general case, an abstract syntax tree is built only for an analyzed fragment of a source code (for
example, for a specific function). Before the tree can be built, the source code is first processed by a
lexer and then by a parser.
The lexer is responsible for dividing the input stream into individual tokens, identifying the type of the
tokens, and passing tokens one at a time to the next stage of the analysis. The lexer reads text data line
by line and splits a line to reserved words, identifiers and constants, which are called tokens. After a
token is retrieved, the lexer identifies the type of the token.
If the first character of the token is a digit the token is a number, or if the first character is a minus sign
the token is a negative number. If the token is a number it might be a real or an integer. If it contains a
decimal point or the letter E (which indicates scientific notation) then it is a real, otherwise it is an
integer. Note that this could be masking a lexical error. If the analyzed source code contains a token
"4xyz" the lexer will turn it into an integer 4. It is likely that any such error will cause a syntax error,
which the parser can catch. However, such errors can also be processed by lexer.
If the token is not a number it could be a string. String constants can be identified by quote marks, single
quote marks or other symbols, depending on the syntax of the analyzed language.
Finally, if the token is not a string, it must be an identifier, a reserved word, or a reserved symbol. If the
token is not identified as one of them, it is a lexical error. The lexer does not handle errors itself, so it
simply notifies the parser that an unidentified token type has been found. The parser will handle the
error[2].
The parser has an understanding of the language's grammar. It is responsible for identifying syntax
errors and for translating an error free program into internal data structures, abstract syntax trees, that
can be processed by static analyzer.
While lexer knows only language's syntax, parser also recognizes context. For example, let's declare a C
function::
int Func(){return 0;}
Lexer will process this line in the following way (see table 1):
int Func ( ) { return 0 ; }
3. reserved identifier reserved reserved reserved reserved integer reserved reserved
word symbol symbol symbol word constant symbol symbol
Table 1. Tokens of the "int Func(){return 0};" string.
The line will be identified as 8 correct tokens and these tokens will be passed to parser. The parser will
check the context and find out that it is a declaration of function, which takes no parameters, returns an
integer, and always returns 0.
The parser will find it out by creating an abstract syntax tree from the tokens provided by the lexer and
analyzing the tree. If the tokens and the tree built from them will be considered to be correct, the tree
will be used for the static analysis. Otherwise, the parser will report an error.
However, building an abstract syntax tree is not just organizing a set of tokens in a tree form.
Abstract Syntax Trees
An abstract syntax tree captures the essential structure of the input in a tree form, while omitting
unnecessary syntactic details. ASTs can be distinguished from concrete syntax trees by their omission of
tree nodes to represent punctuation marks such as semi-colons to terminate statements or commas to
separate function arguments. ASTs also omit tree nodes that represent unary productions in the
grammar. Such information is directly represented in ASTs by the structure of the tree.
ASTs can be created with hand-written parsers or by code produced by parser generators. ASTs are
generally created bottom-up.
When designing the nodes of the tree, a common design choice is determining the granularity of the
representation of the AST. That is, whether all constructs of the source language are represented as a
different type of AST nodes or whether some constructs of the source language are represented with a
common type of AST node and differentiated using a value. One example of choosing the granularity of
representation is determining how to represent binary arithmetic operations. One choice is to have a
single binary operation tree node, which has as one of its attributes the operation, e.g. "+". The other
choice is to have a tree node for every binary operation. In an object-oriented language, this would
results in classes like: AddBinary, SubtractBinary, MultiplyBinary, etc. with an abstract base class of
Binary[3].
For example, let us parse two expressions: 1 + 2 * 3 + 4 * 5 and 1+ 2 * (3 + 4) * 5 (see figure 1):
4. Figure 1. Parsed expressions: 1 + 2 * 3 + 4 * 5 (left) and 1+ 2 * (3 + 4) * 5 (right).
As one can see from the figure, the expression can be restored to its original form if you walk the tree
from left to right.
After the abstract syntax tree is created and verified, the static analyzer will be able to check, whether
the source code fulfils the rules and guidelines specified by the code standard.
Static Analysis Techniques
There are many different analysis techniques, such as AST walker analysis, dataflow analysis, path-
sensitive data flow analysis, etc. Concrete implementations of these techniques vary from tool to tool.
Static analysis tools for different programming languages can be based on various analysis frameworks.
These frameworks contain core sets of common techniques, which can be used in static analysis tools so
that these tools reuse the same infrastructure. The supported analysis techniques and the way these
techniques are implemented varies from framework to framework. For example, a framework may
provide easy way to create an AST walker analyzer, but has no support for data-flow analysis[4].
Although all the three above mentioned analysis techniques use the AST created by parser, the
techniques differ by their algorithms and purposes.
AST walker analysis, as one can see from the term, is performed by walking the AST and checking
whether it fulfils the coding standard, specified as a set of rules and guidelines. This is the analysis
performed by compilers.
Data flow analysis can be described as a process to collect information about the use, definition, and
dependencies of data in programs. The data flow analysis algorithm operates on a control flow graph
(CFG), generated from the source code AST. The CFG represents all possible execution paths of a given
computer program: the nodes represent pieces of code and the edges represent possible control
transfers between these code pieces. Since the analysis is performed without executing the tested
program, it is impossible to determine the exact output of the program, i.e. to find out which execution
path in the control flow graph is actually taken. That is why data flow analysis algorithms make
approximations of this behavior, for example, by considering both branches of an if-then-else statement
and by performing a fixed-point computation for the body of a while statement. Such a fixed-point
always exists because the data flow equations compute sets of variables and there are only a finite
number of variables available since we only consider programs with a finite number of statements.
Therefore, there is a finite upper limit to the number of elements of the computed sets which means
that a fixed-point always exists. In terms of control flow graphs, static analysis means that all possible
5. execution paths are considered to be actual execution paths. The result of this assumption is that one
can only obtain approximate solutions for certain data flow analysis problems[5].
The data flow analysis algorithm described above is path-insensitive, because it contributes all execution
paths - whether feasible or infeasible, heavily or rarely executed - to a solution. However, programs
execute only a small fraction of their potential paths and, moreover, execution time and cost is usually
concentrated in a far smaller subset of hot paths. Therefore, it is natural to reduce the analyzed CFG
and, therefore, to reduce the amount of calculations so that only a subset of the CFG paths are
analyzed. Path-sensitive analysis operates on a reduced CFG, which does not include infeasible paths
and does not contain "dangerous" code. The paths selection criteria are different in different tools. For
example, a tool may analyze only the paths containing dynamic arrays declaration, which is considered
to be "dangerous" according to the tool's settings.
Conclusion
The number of static analysis tools and techniques grows from year to year and this proves the growing
interest in static analyzers. The cause of the interest is that the software under development becomes
more and more complex and, therefore, it becomes impossible for developers to check the source code
manually.
This article gave a brief description of the static analysis process and analysis techniques.
References
1. Dirk Giesen Philosophy and practical implementation of static analyzer tools [Electronic
resource]. -Electronic data. -Dirk Giesen, cop. 1998. -Access mode:
http://www.viva64.com/go.php?url=63
2. James Alan Farrell Compiler Basics [Electronic resource]. -Electronic data. -James Alan Farrell,
cop 1995. -Access mode: http://www.viva64.com/go.php?url=64
3. Joel Jones Abstract syntax tree implementation idioms [Electronic resource]. -Proceedings of the
10th Conference on Pattern Languages of Programs 2003, cop 2003.
4. Ciera Nicole Christopher Evaluating Static Analysis Frameworks [Electronic resource].- Ciera
Nicole, cop. 2006. - Access mode: http://www.viva64.com/go.php?url=64
5. Leon Moonen A Generic Architecture for Data Flow Analysis to Support Reverse Engineering
[Electronic resource]. - Proceedings of the 2nd International Workshop on the Theory and
Practice of Algebraic Specifications, cop. 1997.