Final Report
Kathleen Repine

The goal of my project was to research, design and implement an algorithm for differencing two programs.  The main reason for this is that the results could be very useful for determining the effects of changes to a program (impact analysis).

Initially, I read a lot of papers in the field.  Then after some discussions with Mary Jean and Alex, I started focusing on the algorithm described in [2].  The definition of cluster in the paper left out some cases I thought should be a cluster.  After reading some more papers, I found that the definition of cluster I thought was appropriate because it didn't exclude cases.  The definition was not for a cluster, but a hammock [1], and stated that a hammock was a group of nodes with all incoming edges going to one node in the hammock and all the outgoing edges going to one node not in the hammock.  However, I didn't want every hammock, only ones that had at least two nodes and were minimal for a given decision node in the hammock.

After researching the problem, I started working on the design of the algorithm.  I modified the general hammock algorithm in [1] to get an algorithm to find the minimal hammock for any decision node in a single entry, single exit graph.  Then I wrote up the differencing algorithm from [2] in pseudo code and made sure it made sense because in [2] the description of some important parts of the algorithm were very short and vague or completely omitted.  Next, we discussed what the structure of the graph fed to the hammock algorithm should be.  Since the hammock algorithm needs a single entry single exit graph, several modifications to the CFG must be made.  Mainly, anywhere there are method calls, the call site must be expanded.  This means an entry node for every possible method that could be called must be inlined and any uncaught exceptions must be propagated.  We called this new graph the DIG.

Once the algorithms were designed, I worked on implementation.  To implement, I needed to use and work with JABA.  I spent some time familiarizing myself with JABA and DOTTY, a graph drawing program.  First I implemented the hammock algorithm, this includes a method that creates a dotty file that has each hammock is in a subgraph box.  Then I implemented the method level differencing.  The visualization I implemented for that is a dotty file with dotted edges representing the matchings between nodes.  I also added labels to the matching edges describing the differences, if any.  I had only been concentrating on the method level differences up to this point.  I hadn't really considered or thought about whole program differencing, but after Alex and I discussed program level differencing, I quickly threw together class and program level differencing.  Both still have some work that needs done, for example currently, both only categorize components as added, removed, or perfectly matched.  Then when the analysis is run on the perfectly matched components, the component could be changed to modified.  Then I very quickly partially implemented the DIG, it still needs exception propagation information.

There are several areas of future work: continued implementation, visualization, optimization, and impact analysis.  The dig needs to have exception propagation implemented.  The program and class level differencing need better matchings, and difference descriptions.  Class level differencing needs to create the method differences in the order specified by the call graph.  The dotty output might need to be modified because sometimes the subgraphs don't work right, even though I can't find an error in the file.  For visualization, it would be nice to have some method level output for tarantella where lines were color coded based on perfect match, matched mod, added or deleted.  It would also be nice to be able to view everything at an upper level and click to view more internal details.  The dotty visualization output could be improved by having the matching edges be different colors, perhaps green for perfect match and red for matched mod, and by having the added and deleted nodes distinguishable, perhaps outlining added and deleted nodes in red.  Some ideas for optimization are hammocks on demand, a cache for globally qualified class name look ups, and making dejavoo only search between two nodes (we could give it the actual nonplace holder hammock header and exit).  For impact analysis, back propagation needs fixed for recursion (other than that it is done), and it would be nice to have a mapping from a type difference back to the declaration line number.

[1] R. Komondoor and S. Horwitz. Semantic-Preserving Procedure Extraction. Symposium on Principles of Programming Languages. (University of Wisconsin-Madison, 2000). 155-169.

[2] J. Laski and W. Szermer. Identification of Program Modifications and its Applications in Software Maintenance. Proceedings of the Conference on Software Maintenance - 1992. (November 1992). 282-290.