Type Inference Algorithms and Call Graph Construction

Home | DMP | Project | Journal | Contact
Relevant Java Classes | Delete Call Site | Template Call Graph | Research | Final Report
Previous (Polymorphism)
Next (Flow-Sensitive Analysis of Variable Accesses)
Through dependency analysis and type inference along call chains, a call graph of the analyzed program takes shape. Many type inference algorithms have been developed to handle the difficulties of parametric polymorphism, each improving upon the last. The Cartesian product algorithm is one of the most powerful; it produces a minimum of unrealizable call chains and a very precise call graph. But the discussion must start with a brief review of the basic type inference algorithm, which is the core of the improved algorithms.

THE BASIC ALGORITHM:
INTRODUCTION TO CONSTRAINT-BASED ANALYSIS
The basic type inference algorithm was presented by Palsberg and Schwartzbach in 1991. Agesen discusses this algorithm from a control and data flow perspective, dividing the process into three major steps:

Allocation of type variables
Seeding of type variables
Establishment and propagation of constraints
Constraint-based analysis begins by allocating a type variable for each variable slot and expression in the target program. Initially, all type variables hold the empty set, but the remaining steps in the algorithm assign them meaningful values, adding possible object types to ensure that the type variables hold sound types for the expressions and variables with which they are associated. No types are ever removed from a type variable - an important property of type analysis algorithms called monotonicity. The type inferencer allocates type variables when it first encouters variables and expressions during inference; only the program constructs that the inferencer determines may take part in target execution are given type variables.
The second step alters the type variables allocated in the first step of the analysis. The goal is to capture the initial state - the state just before execution - of the target program through the seeding of type variables. In seeding, type variables that correspond to the initial locations of variables and expressions are initialized. Upon completion of this step, some type variables will hold a set with a single element (one type), and the rest will remain empty. In practice, both seeding and allocation of type variables are performed incrementally, with seeding occuring immediately upon allocation.
In the final step, type constraints are established and propagated. While the preceding stage captured the initial state of the target program, this stage will reflect target execution, building a data flow graph around the type variables previously allocated and seeded. A directed graph - with the type variables as nodes - is constructed gradually, as edges representing constraints are added one by one.
According to Agesen, a constraint is the type inference-time equivalent of an execution-time data flow. If the target program executes the assignment y = x, for instance, there is a data flow from x to y; any object possible for the result of x can also be assigned to the variable y. Therefore, to ensure soundness of inferred types, any type in the type variable of x should also be included in y's type variable. When this data flow is encountered by the analysis algorithm, an edge is added from x's type variable to y's type variable. In mathematical terms, type(x) is a subset of type(y).
Types are propagated along each edge (constraint) added to the graph. A constraint is generated for every possible data flow in the target program to ensure soundness of inferred types. Since a new constraint involves adding an edge to the flow graph, more propagation becomes possible; types may now propagate along the new edge. And when propagation widens the type of a receiver object, the send may be able to invoke new methods through dynamic dispatch. Additional constraints are then needed to follow the flow of these invocations. Thus, step 3 is a repeated process of constraint generation and propagation, until all possibilities have been exhausted.
As more and more constraints are added, the types in the type variables seeded in step 2 reach more and more other type variables. The subset relation described is maintained through analysis; as a new type is added to type(x) from the example above, it is immediately propagated to type(y). Upon completion of the basic algorithm, a sound type for an expression or variable can be located by examining the corresponding type variable node in the graph.

TEMPLATES
Even for small programs, constraint-based analysis yields a large network of thousands of type variables and constraints. The abundance of details and lack of structure make it very difficult to compare the effectiveness of the various type inference algorithms simply by examining these unwieldy graphs. Agesen developed the concept of templates to "impose a structure on the constraint network and raise the abstraction level by encapsulating constraints, just as methods raise the abstraction level of programming by encapsulating expressions".
But what are templates? We will examine Agesen's definition as well as the simplified version used in the FLEX compiler and Professor Souter's work with incremental call graph updates.

AGESEN
Templates are, on a basic level, subgraphs of the unstructured constraint graph produced by the basic algorithm, each corresponding to a method in the analyzed program. The template for a method M is the subgraph consisting of:

The type variable nodes corresponding to the expressions, local variables, formal parameters, and return value of M
The edges - constraints - originating from these nodes
In other words, a template is the network generated by the basic algorithm from a single method. Nodes corresponding to instance variables are not contained in a template, because the instance variables do not "belong" to any particular method. The type inference algorithm determines the type at a call site by propagating the actual arguments through the template(s) of the methods that may be invoked at that call site. When a call site has several potential callees, its type is the union of the result types of each callee method.
Templates were created to provide an analysis-time parallel for familiar runtime concepts. Just as classes are the analysis-time representations of objects and constraints the analysis-time representation of data flows, templates correspond to activation records. While activation records are created to evaulate methods, templates are generated to analyze methods. Moreover, types of parameters and local variables can be found using the type variables in templates, while the contents of these entities can be determined by examining the fields in activation records at exeuction time.
The Cartesian product algorithm often creates several templates for a single method. An activation record is represented by one template, regardless of the number of templates generated for the corresponding method, and a template generally represents many activation records. A potentially unbounded number of activation records may be generated during execution. Thus, Agesen concludes, templates are templates for activation records.

SIMPLIFIED VIEW
A template is viewed as an instance of a method representing a particular combination of actual parameter types. The template has a singleton set of types - a concrete type - for each formal parameter, including the receiver (for non-static methods, in the case of Java), equal to or narrower than the declared type. Thus, a template is a specialization of a given method - an exact type signature of that method. This view, adopted by the FLEX compiler, allows one to concentrate on relationships between callers and callees in the target program without worrying about individual type variables.
The FLEX data structure representing a template is the meta-method. According to the Javadoc documentation for the MetaMethod class, a meta-method is a method specialization, a function of the types of its arguments. In other words, it is simply a method plus types for its parameters. As mentioned in the previous paragraph, these types should be identical to or narrower than the types declared for that base method. Remember, many meta-methods can be generated from one method (represented by an HMethod in FLEX).
More information on templates as used in the current research, along with concrete examples, will be provided in later sections, notably that featuring the Cartesian product algorithm.

SHORTCOMINGS OF THE BASIC ALGORITHM
The basic type inference algorithm works well when all call sites invoking a method use arguments of the same concrete type, but it loses precision when calls to the same method supply different argument types, even if all of these types are monomorphic. This failure stems from the fact that the algorithm generates only one template for each method; the result type of this template must be general (and imprecise) enough to cover all possible uses of the method in the target program. Improved algorithms aim to avoid this problem by creating multiple templates for each method.

POLYVARIANT ALGORITHMS
Algorithms - like the basic type inference algorithm - that analyze each method only once are monovariant. The improved algorithms presented in Agesen's dissertation are all polyvariant; they may analyze a method multiple times in order to improve precision, creating many different templates for some or all methods in the target program. FLEX's meta-methods fit in well with polyvariant analysis, as many meta-methods can be specialized from a single method.

THE CARTESIAN PRODUCT ALGORITHM (CPA)
CARTESIAN PRODUCTS
CPA relies on the computation of the Cartesian product of the receiver type and the types of the arguments at a call site. The Cartesian product of two sets A and B is defined as follows:
A × B = {(a, b) | a is an element of A and b is an element of B}
This definition can be extended to products of any number of sets; the Cartesian product simply includes all possible combinations containing one element from each set. The elements of the Cartesian product set are ordered n-tuples, where n is the number of original sets. As an example, the Cartesian product of four sets A, B, C, and D:
A × B × C × D = {(a, b, c, d) | a is in A, b is in B, c is in C, and d is in D}

CPA IN ACTION
The key insight of CPA is turning the analysis of each call site into a case analysis. During the analysis of a call site, the algorithm computes the Cartesian product of the types of the actual arguments (including the receiver) and then analyzes each tuple in the product as an independent case. This eliminates the dependency on iteration of many other type inference algorithms, as exact type information for each case is immediately available. The type information is used to curb undesired type merging and to avoid redundant analysis through the sharing of cases.
Agesen makes the important observation that there is no such thing as a polymorphic method; only polymorphic call sites may exist. A call site that can invoke a method f() with either int or float receivers, for example, will sometimes invoke f with an integer receiver and other times with a floating point receiver. But, given a particular invocation, the receiver is either an int or a float; it cannot be both. CPA makes use of this property by generating templates in which all formal arguments have monomorphic types.
Let receiver.f(argument) be a call site, and let R and A represent the type sets of the receiver and argument, respectively. Suppose that R = {r₁, r₂,..., r_m} and A = {a₁, a₂,..., a_n}. To analyze the call site, CPA takes the Cartesian product of the receiver types and the argument types:
R × A = {(r₁, a₁),..., (r₁, a_n),..., (r_i, a_j),..., (r_m, a₁),..., (r_m, a_n)}
The algorithm then produces an f template for each element of R × A. If previous analysis has already generated a template for a given (r, a) pair, it is reused; otherwise, a new template is created and will be available for future sends. The union of the result types of the templates connected to a call site yields the type of that call site!
Each method has what Agesen refers to as a "repository of templates". All template specializations of the particular method are included in this repository, regardless of the call sites that initiated their creation. Such a relation is maintained to facilitate the sharing of templates among different call sites with Cartesian products having tuples in common.
With the templates and repositories in place, Agesen's approach to the analysis of a call site can be summarized by four steps:

Determine the methods that the call may invoke (callees).
Generate the Cartesian product for current type information. (See the following subsection, "Avoiding Iteration".)
For each callee M and each tuple in the Cartesian product, look up the tuple in M's template repository. If no match is present, create a new template and add it to the repository.
Connect the send to the (old or newly created) template.
Template repositories are not filled in advance; rather, templates are added gradually as new argument combinations are encountered during analysis of the target program's call sites.

AVOIDING ITERATION
How can CPA use the types of receivers and arguments while these types are still being computed? Other algorithms have tackled the problem through iterating and approximating types based on previous iterations, but CPA uses neither iteration nor approximation. The Cartesian product (unlike a type comparison to determine whether a template may be shared) can be computed correctly - gradually, but correctly - even if the types are not fully known until analysis has been completed. When a call site is processed initially, the Cartesian product computed is that of the current types of the receiver and arguments. The call site is connected to the resulting templates. If one or more of the types involved later grows, CPA returns to the call site and extends the Cartesian product, adding the new combinations. Because types grow monotonically during type inference, such an approach produces a correct result. (The Cartesian product is a monotonic function; if either |A| or |B| increases, |A × B| increases.)

ADDITIONAL READING
For more information on CPA and its advantages over other type inference algorithms in handling parametric polymorphism, see the following documents. Information on resolution of dynamic dispatch and inheritance in CPA is also provided:

Ole Agesen. The Cartesian Product Algorithm: Simple and Precise Type Inference of Parametric Polymorphism
Ole Agesen. Concrete Type Inference: Delivering Object-Oriented Applications

Home | DMP | Project | Journal | Contact
Relevant Java Classes | Delete Call Site | Template Call Graph | Research | Final Report
Previous (Polymorphism)
Next (Flow-Sensitive Analysis of Variable Accesses)