More About Access Paths

The Problem Thus Far

I want to use access paths in Cornelius to address heap locations. This turns out to be tricky (as evidenced in my waffling in my last post). In this post I want to refine the problem, and I identify several properties I want my heap addressing scheme to have.

Desired Properties of Heap Addressing Schemes

Syntax/rewrite rules should be sound (obviously)
Access path representation should be sensitive to changes in stack memory
Access path representation should be independent from heap state
Access path representation should handle weak path decomposition

Access-Path Representations

Representation 1: Strictly Lexical Access Paths

Traditionally, I’d have an abstract heap state H, and I’d index into this with an access path v.a.b.c, where v is a variable that reaches into the heap (say a method parameter, a local variable, or the this keyword), and each of .a, .b, and .c are field dereferences. I’ll call this representation of access path a lexical access-path. This would make sense if I were tracking data at each program point, but I’m not. Consider the following program:

boolean example1(Foo f, Foo g) {
   int x = f.a;    // POINT 1
   f = g;
   int y = f.a;    // POINT 2
   return x == y;
}

Using lexical access-paths makes this method always return true. First, notice that this method doesn’t update the heap. Suppose H is the initial heap state at the start of the method. The PEG representation of the value stored to x at POINT 1, namely f.a, is (rd "f.a" H). Likewise, the PEG representation of the value stored to y at POINT 2, namely f.a, is (rd "f.a" H). This is syntactically equivalent to the value stored in x at POINT 1, and using this strictly lexical access-paths leads to unsoundness (property 1) because it doesn’t track changes on the stack (property 2).

I’m emphasizing strictly in “strictly lexical access-paths” because the only part of the access path that gives us problems is the variable: we can use lexical field dereferences (we might run into aliasing issues, but we can handle these soundly).

Representation 2: Access Paths with Base Values

In my previous post I suggested using base values instead of variables. As I mentioned above, the problem with using lexical access-paths in the above example is that the "f" in access-path "f.a" is ambiguous: it is a local variable and is stored on the stack. When it is updated on the stack, say via an assignment f = g;, this is not tracked in the heap parameter H. To fix this, I can replace the formal name of the variable at the base of an access path with the value that is stored at that variable.

I immediately run into another choice that I need to make: do I track access paths as a single entity of the form (path BASE THE-WHOLE-DEREFERENCING-PATH), e.g.:

(path (var f) "a.b.c")

Or do I decompose paths into dereferencing chains, e.g.:

(deref (deref (deref (var f) "a") "b") "c")

If I use the path form, I might end up violating property 4 (see the double dereferencing example) from my previous post.

Conversely, if I use the deref form, I run into the problem that I’m using paths as values. The (var f) node is clearly a value, but (deref (var f) "a") is a path, not a value; to be a value, I’d need to read that location in a heap state. This becomes

(deref (rd (deref (rd (deref (var f) "a") H) "b") H) "c")

But now my representation depends on heap information, and this feels very fragile.

Another Approach: Using Egg’s Analyses

I’m pretty convinced that there isn’t a nice clean syntax-only solution to this. Instead I think I’m going to rely on egg’s Analysis feature. I already use some basic Analysis to do constant folding and tracking of vars in eclasses.

I need to think more about how I can use Analyses, but I think that I’ll modify the path solution above. This has the following advantages:

It actually encapsulates the deref version as a special case where every path is length 1.
It won’t have arbitrary length nesting of path nodes since this will all be handled implicitly by theta nodes (again, assuming I don’t do anything like loop unrolling in my rewrite rules)
It’s probably the easiest to implement
While I’d like my representation to handle weak path decomposition, it isn’t necessary, and might not actually be that useful in a lot of the cases Cornelius will be handling.

Summary

I don’t need to make this perfect yet. A good starting point is to use the path form of the access-paths with base values.