Program Analysis Notes 1


1 Introduction

What is static program analysis (SPA)? Before talking about SPA, let's take a quick overview of programming languages (PL). The field of PL can be divided into three parts:

  1. Theory

    1. Language Design

    2. Type System

    3. Semantics and Logics

  2. Environment

    1. Compilers

    2. Runtime System

  3. Application

    1. Program Analysis

    2. Program Verification

    3. Program Synthesis

In the last few years, language cores have changed little, but programs have become much larger and more complex.

To ensure reliability, we need static program analysis: static analysis reasons about the behaviors of a program P and determines whether it satisfies some properties before running P. For example:

  • Does P contain any private information leaks?

  • Does P dereference any null pointers?

  • Are all the cast operations in P safe?

  • Can v1 and v2 in P point to the same memory location?

  • Will certain assert statements in P fail?

  • Is this piece of code in P dead (so that it could be eliminated)?

  • ...

However, Rice's Theorem tells us that: "Any non-trivial property of the behavior of programs in a recursively enumerable (r.e.) language is undecidable."

What do Sound and Complete mean?

Sound = over-approximate. E.g., if a program contains 10 real errors, a sound analysis reports all 10, possibly along with some spurious ones.

Complete = under-approximate. E.g., if a program contains 10 real errors, a complete analysis may report fewer than 10, but every reported error is real.

Truth = Sound && Complete = exactly the set of true program behaviors

In practice we usually compromise completeness, which introduces "false positives": a sound but incomplete analysis may report errors that can never actually occur.

Static Analysis = Abstraction + Over-approximation

Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed.

1. Abstraction maps the concrete values of a program to abstract ones; e.g., a sign analysis abstracts every variable of a given program to its sign (+, -, or 0).

2. Over-approximation contains two parts:

transfer function: defines how to evaluate different program statements on abstract values (the transfer function is defined according to the "analysis problem" and the "semantics" of different program statements).

control flow: defines how flows are merged when control-flow paths meet (see the sketch below).
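A minimal sketch of these two parts for a toy sign analysis, assuming an abstract domain {'+', '-', '0', 'T'} where 'T' means "unknown"; all function names are hypothetical, not from any real analyzer:

```python
# Toy sign analysis: abstraction, a transfer function, and a merge.
# All names here are hypothetical, for illustration only.

def abstract(n: int) -> str:
    """Abstraction: map a concrete integer to its sign."""
    return '+' if n > 0 else '-' if n < 0 else '0'

def transfer_add(a: str, b: str) -> str:
    """Transfer function for 'v = x + y' on abstract sign values."""
    table = {('+', '+'): '+', ('+', '0'): '+', ('0', '+'): '+',
             ('-', '-'): '-', ('-', '0'): '-', ('0', '-'): '-',
             ('0', '0'): '0'}
    return table.get((a, b), 'T')   # e.g. '+' plus '-' may be any sign

def merge(a: str, b: str) -> str:
    """Control-flow merge: over-approximate on disagreement."""
    return a if a == b else 'T'

# One branch yields sign '+', the other '0'; merging over-approximates:
print(merge(transfer_add('+', '+'), abstract(0)))   # prints 'T'
```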

Q for readers:

  • What are the differences between static analysis and (dynamic) testing?

  • Understand soundness, completeness, false negatives, and false positives.

  • Why is soundness usually required by static analysis?

  • How to understand abstraction and over-approximation?

  • The relation between compilers and static analyzers

  • Understand 3AC and its common forms

  • How to build basic blocks on top of IR

  • How to construct control flow graphs on top of BBs?

A compiler is a single box that maps a source program into a semantically equivalent target program. If we open up this box a little, we see there are two parts to this mapping: analysis and synthesis.

Analysis (front end)

  1. Lexical Analysis: reads the input characters of the source program, groups them into lexemes, and produces a token for each lexeme.

  2. Syntax Analysis: checks whether the input program conforms to the grammar rules of the programming language.

  3. Semantic Analysis: checks whether the input program is semantically consistent with the language definition.

Static program analysis sits in the machine-independent optimization phase, working on the intermediate representation (IR) produced by the front end. Synthesis (back end) then constructs the desired target program from the intermediate representation and the information in the symbol table.

After the front end of compilation, we need a proper representation for further analysis. The reasons for choosing 3AC instead of AST are as follows:

  1. 3AC is low-level while AST is high-level

  2. 3AC is language independent while AST is language dependent

  3. 3AC contains control flow information while AST does not

  4. and more
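For instance, here is a sketch of how a small loop might be translated into 3AC; the temporaries t1, t2, t3 and the 4-byte element size are assumptions for illustration:

```
// Source: do i = i + 1; while (a[i] < v);
L: t1 = i + 1
   i = t1
   t2 = i * 4        // assuming 4-byte array elements
   t3 = a[t2]
   if t3 < v goto L
```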

Definitions of Basic Block and CFG are omitted (full details are in the PDF).

2 Data Flow Analysis -- Applications

What is data flow analysis? It asks how application-specific data flows through the nodes (basic blocks/statements) and edges (control flows) of a CFG (i.e., a program).

Why do we need data flow analysis? Programs in high-level languages contain a lot of redundancy, and data flow analysis uncovers the facts that optimizations need to remove it.

There are two kinds of DFA:

  • may analysis: outputs information that may be true (over-approximation)

  • must analysis: outputs information that must be true (under-approximation)

Both aim for safe approximation.

In each data-flow analysis application, we associate with every program point a data-flow value that represents an abstraction of the set of all possible program states that can be observed for that point.

Data-flow analysis is to find a solution to a set of safe-approximation-directed constraints on the IN[s]'s and OUT[s]'s, for all statements s.

  • constraints based on the semantics of statements (transfer functions)

  • constraints based on the flows of control

2.1 Reaching Definitions Analysis

A definition d at program point p reaches a point q if there is a path from p to q such that d is not "killed" (overwritten) along that path.

Application: Reaching Definitions can be used to detect possibly undefined variables. E.g., introduce a dummy definition for each variable v at the entry of the CFG; if the dummy definition of v reaches a point p where v is used, then v may be used before its definition (the undefined value reaches p).

Understanding Reaching Definitions. For a definition D: v = x op y, this statement "generates" a definition D of variable v and "kills" all the other definitions in the program that define variable v, while leaving the remaining incoming definitions unaffected.

  • Transfer function: OUT[B] = gen[B] ∪ (IN[B] - kill[B])

  • Control flow: IN[B] = ∪ (OUT[P]), for all predecessors P of B. The reason for using ∪: a definition reaches a program point as long as there exists at least one path along which the definition reaches it.
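A quick worked example: suppose the program has definitions d1: x = 1, d2: y = 2, and d3: x = 3, and block B contains only d3. Then gen[B] = {d3} and kill[B] = {d1} (d3 kills every other definition of x). With IN[B] = {d1, d2}, the transfer function gives OUT[B] = {d3} ∪ ({d1, d2} - {d1}) = {d2, d3}.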

Algorithm of Reaching Definitions Analysis: Version-Stanford

input: control flow graph CFG = (N, E, Entry, Exit)
// Boundary condition
   out[Entry] = ∅

// Initialization for iterative algorithm
   for each basic block B other than Entry
       out[B] = ∅

// iterate
   while (changes to any out[] occur) {
       for each basic block B other than Entry {
           in[B]  = ∪(out[P]), for all predecessors P of B
           out[B] = fB(in[B]) // out[B] = gen[B] ∪ (in[B] - kill[B])
       }
   }

Version-NJU

INPUT: CFG (killB and genB computed for each basic block B)
OUTPUT: IN[B] and OUT[B] for each basic block B

METHOD:
  OUT[entry] = ∅
  for (each basic block B\entry)
      OUT[B] = ∅
  while (changes to any OUT occur)
      for (each basic block B\entry) {
          IN[B] = ∪ P a predecessor of B OUT[P];
          OUT[B] = genB ∪ (IN[B] - killB);
      }
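To make the iteration concrete, here is a minimal runnable sketch in Python; the CFG shape and the gen/kill sets are made-up examples, not computed from real IR:

```python
# Iterative reaching-definitions solver (sketch).
def reaching_definitions(blocks, preds, gen, kill):
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}      # OUT[entry] stays empty
    changed = True
    while changed:                        # iterate until a fixed point
        changed = False
        for b in blocks:
            if b == 'entry':
                continue
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT

# entry -> B1 -> B2 -> exit, plus a back edge B2 -> B1;
# d1 (in B1) and d2 (in B2) define the same variable, so they kill each other.
blocks = ['entry', 'B1', 'B2', 'exit']
preds = {'entry': [], 'B1': ['entry', 'B2'], 'B2': ['B1'], 'exit': ['B2']}
gen  = {'entry': set(), 'B1': {'d1'}, 'B2': {'d2'}, 'exit': set()}
kill = {'entry': set(), 'B1': {'d2'}, 'B2': {'d1'}, 'exit': set()}
IN, OUT = reaching_definitions(blocks, preds, gen, kill)
print(OUT)  # {'entry': set(), 'B1': {'d1'}, 'B2': {'d2'}, 'exit': {'d2'}}
```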


Summary of Reaching Definitions

  • Domain: Sets of definitions

  • Transfer function (forward): out[B] = fB(in[B]), where fB(x) = gen[B] ∪ (x - kill[B]); gen[B] = definitions generated in B, kill[B] = definitions killed by B

  • Meet operation: in[B] = ∪ out[P] over all predecessors P

  • Boundary condition: out[Entry] = ∅

  • Initial interior points: out[B] = ∅


2.2 Live Variable Analysis

Live variable analysis tells whether the value of variable v at program point p could be used along some path in the CFG starting at p. If so, v is live at p; otherwise, v is dead at p.

The use of a name in a three-address statement is defined as follows. Suppose three-address statement i assigns a value to x. If statement j has x as an operand, and control can flow from statement i to j along a path that has no intervening assignments to x, then we say statement j uses the value of x computed at statement i. We further say that x is live at statement i.

  • Information on live variables can be used for register allocations. e.g., at some point all registers are full and we need to use one, then we should favor using a register with a dead value.

Trace uses back to the definitions: backward analysis, in[B] = fB(out[B]). Transfer function for a statement s: x = y + z:

  • generate live variables: use[s] = {y, z}

  • propagate live variables: out[s] - def[s], where def[s] = {x}

  • in[s] = use[s] ∪ (out[s] - def[s])
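For instance, if out[s] = {x, z} for s: x = y + z, then in[s] = {y, z} ∪ ({x, z} - {x}) = {y, z}: x is overwritten by s, so it is not live before s, while y and z become live because s uses them.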

Algorithm of Live Variable Analysis: Version-Stanford

input: control flow graph CFG = (N, E, Entry, Exit)

// Boundary condition
   in[Exit] = ∅
// Initialization for iterative algorithm
   for each basic block B other than Exit
       in[B] = ∅

// iterate
   while (changes to any in[] occur) {
       for each basic block B other than Exit {
           out[B] = ∪(in[S]), for all successors S of B
           in[B]  = fB(out[B]) // in[B] = use[B] ∪ (out[B] - def[B])
       }
   }

Version-NJU

INPUT: CFG (defB and useB computed for each basic block B)
OUTPUT: IN[B] and OUT[B] for each basic block B

METHOD:
  IN[exit] = ∅
  for (each basic block B\exit)
      IN[B] = ∅
  while (changes to any IN occur)
      for (each basic block B\exit) {
          OUT[B] = ∪ S a successor of B IN[S];
          IN[B] = useB ∪ (OUT[B] - defB);
      }
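A backward analogue of the earlier Python sketch; again, the CFG and the use/def sets are hypothetical examples:

```python
# Iterative live-variables solver (backward; sketch).
def live_variables(blocks, succs, use, defs):
    IN = {b: set() for b in blocks}       # IN[exit] stays empty
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == 'exit':
                continue
            OUT[b] = set().union(*(IN[s] for s in succs[b]))
            new_in = use[b] | (OUT[b] - defs[b])
            if new_in != IN[b]:
                IN[b] = new_in
                changed = True
    return IN, OUT

# entry -> B1 -> exit; B1 computes x = y + z, so use = {y, z}, def = {x}.
blocks = ['entry', 'B1', 'exit']
succs = {'entry': ['B1'], 'B1': ['exit'], 'exit': []}
use  = {'entry': set(), 'B1': {'y', 'z'}, 'exit': set()}
defs = {'entry': set(), 'B1': {'x'}, 'exit': set()}
print(live_variables(blocks, succs, use, defs)[0]['B1'])  # {'y', 'z'}
```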

2.3 Available Expressions Analysis

An expression x op y is available at program point p if (1) all paths from the entry to p must pass through the evaluation of x op y, and (2) after the last evaluation of x op y, there is no redefinition of x or y.

  • This definition means that at program point p, we can replace the expression x op y with the result of its last evaluation

  • The information on available expressions can be used for detecting global common subexpressions

For the safety of the analysis, it may report an expression as unavailable even if it is truly available (must analysis -> under-approximation)

  • OUT[B] = genB ∪ (IN[B] - killB)

  • IN[B] = ∩ P a predecessor of B OUT[P]. The reason for using ∩: all paths from the entry to point p must pass through the evaluation of x op y. Since this is a must analysis, OUT[B] is initialized to the set of all expressions (top) for every block except the entry, whose OUT is ∅.
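A sketch of the corresponding forward must-analysis solver, emphasizing the different initialization; the universe U of expressions and the gen/kill sets would come from the program being analyzed:

```python
# Iterative available-expressions solver (forward must analysis; sketch).
# Assumes every block other than 'entry' has at least one predecessor.
def available_expressions(blocks, preds, gen, kill, U):
    OUT = {b: set(U) for b in blocks}   # initialize to top: all expressions,
    OUT['entry'] = set()                # not to the empty set -- otherwise the
    IN = {b: set() for b in blocks}     # intersection would wrongly stay empty
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == 'entry':
                continue
            IN[b] = set.intersection(*(OUT[p] for p in preds[b]))
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT
```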

  • Understand the three data flow analyses:

    • reaching definitions

    • live variables

    • available expressions

  • Can tell the differences and similarities of the three data flow analyses

  • Understand the iterative algorithm and can tell why it can terminate

3 Data Flow Analysis Foundation

Another view of Iterative Algorithm:

  • Given a CFG (program) with k nodes, each node has IN and OUT facts, and the algorithm updates the OUT fact of every node in each iteration.

  • Assume the domain of the values in data flow analysis is V; then we can define a k-tuple ==(OUT[n1], OUT[n2], ..., OUT[nk])== as an element of the set ==(V1 × V2 × ... × Vk)==, denoted ==V^k==, to hold the values of the analysis after each iteration.

  • Each iteration can be considered as taking an action that maps an element of V^k to a new element of V^k, through applying the transfer functions and control-flow handling, abstracted as a function ==F: V^k -> V^k==

  • Then the algorithm outputs a series of k-tuples iteratively until the k-tuple is the same in two consecutive iterations (i.e., a fixed point of F is reached)

A data flow analysis framework (D, L, F) consists of:

  • D: a direction of data flow: forward or backward

  • L: a lattice, including the domain of the values V and a meet (⊓) or join (⊔) operator

  • F: a family of transfer functions from V to V

Monotonicity: a function f: L -> L (where L is a lattice) is monotonic if for all x, y ∈ L, x ⊑ y => f(x) ⊑ f(y)

Fixed-Point Theorem

Given a complete lattice (L, ⊑), if (1) f: L -> L is monotonic and (2) L is finite, then the least fixed point of f can be found by iterating f(⊥), f(f(⊥)), ..., f^k(⊥) until a fixed point is reached, and the greatest fixed point of f can be found by iterating f(⊤), f(f(⊤)), ..., f^k(⊤) until a fixed point is reached.
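A generic rendering of this iteration as a minimal Python sketch, assuming a finite powerset lattice with ⊑ as ⊆ and a hypothetical monotonic f:

```python
# Iterate f from bottom until a fixed point is reached.
def least_fixed_point(f, bottom):
    x = bottom
    while True:
        fx = f(x)
        if fx == x:           # f(x) = x: fixed point reached
            return x
        x = fx

# Hypothetical monotonic f on the powerset lattice of {1, 2, 3}:
f = lambda s: s | {1} | ({2} if 1 in s else set())
print(least_fixed_point(f, frozenset()))   # frozenset({1, 2})
```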

Assignments

Assignments-1-LiveVariableAnalysis

Assignments-2-ConstantProp
