Potential Challenges With Modeling Heaps in EGraphs
EDIT: This post is concerned with unbounded heap sizes and access paths lengths in the presence of loops. The semantics of theta nodes don't actually construct these, though, so there is no problem. I'm holding on to this post for posterity (pardon the pun)...
Handling Heaps in EGraphs
Cornelius is trying to handle arbitrary Java code, and this means dealing with the heap. Due to the nature of the heap (unbounded, dynamically allocated, etc), Cornelius won’t be able to reason about it fully (…Cornelius works statically). So the question is, how much can we reason about?
To begin with I’m not considering loops. This makes the problem a lot easier. I’m also not considering method calls yet, but I have some ideas for how to handle those in a very rough way.
While my first stab at handling heapy code will elide loops and method calls, I still need to think about the challenges these language features introduce. That is the point of this posting.
Peggy
Peggy uses sigma nodes as a heap summary, which aren’t really explained very well. Something about lists of reads and writes, where reads commute? I think this seems like the right direction.
Cornelius and Heap Models
There are two primary approaches to modeling the heap: store-based and
storeless. Store-based heap models use abstract symbolic addresses to
represent concrete addresses. Storeless heap models use access paths. An access
path is a variable followed by zero or more field dereferences, such as
list.size
or x.y.z
, or foo
. The first item in the access path is a
variable that points to the heap (a reference) while each other item is a field
name, also pointing to the heap.
Cornelius is a rewrite-based approach and reasons about the syntax of a program. Therefore I think it’s logical to use a storeless heap model. Cornelius will be receiving access paths explicitly in PEG inputs, so I think I should use this form directly.
Here is my first stab at defining a heap:
A heap is a list of writes.
A write takes three arguments: an access path, a value, and a heap. Thus
if I have a heap which I know nothing about, represented as nil
, and I want to
process field assignment x.y = 3;
, I will represent the heap after the field
assignment as:
(wr "x.y" 3 nil)
The field access path is just a string right now, but later I will need to update it to be a list.
An important point: in this formulation, I’m not tracking allocations explicitly on the heap. Consider the following (not-actually-legal-)Java program:
The following table lists the state after each line is executed
Line | Heap After Executing | Comments |
---|---|---|
ENTRY | nil |
Start off with unknown heap nil |
Foo f = this.foo |
nil |
This line didn't update the heap---we don't know anything new about
it's structure. You could argue that we now know that there is a
something called this.foo , but that doesn't actually matter
to our analysis. |
f.x = 3 |
(wr "f.x" 3 nil) |
write value 3 to access path "f.x" |
new Y |
(wr "f.x" 3 nil) |
This doesn’t change our heap since we can’t actually access this value through an access path |
f.y = new Y |
(wr "f.y" (new Y) (wr "f.x" 3 nil)) |
This time we’ve created a new Y and wr it to access path "f.y" |
Creating a new object without invoking a constructor isn’t actually legal in
Java (but it is legal in JVM bytecode). In practice, invoking a constructor
might actually update the heap in some way, just as any other method invocation
might update the heap in some way. I’ll reason about this later, but the easiest
(and least precise) thing to do is to replace the heap with nil
every time we
invoke a method.
Over-identifications
There is a problem with the above. nil
can mean different things in
different places. Consider the following example:
Line | Heap After Executing | Local Vars |
---|---|---|
ENTRY | nil |
NA |
x1 = this.x |
nil |
x1 = (rd this.x nil) |
methodThatMessesWithHeap() |
nil |
x1 = (rd this.x nil) |
x3 = this.x |
nil |
x1 = (rd this.x nil) ,x2 = (rd this.x nil) |
In practice we don’t know if this returns true
or false
, but an EGraph will
think that this program alwasy returns true
!
Really nil
just means “I don’t know anything about this heap”. However, since
EGraphs will identify two different nil
s through deduplication, for now I’ll
assume that each nil
is parameterized (i.e., (nil 1)
, (nil 2)
, etc). Once
we introduce loops and method calls this will become a problem due to an
arbitrary number of unknown heaps (i.e., every time a method is invoked). This
will be a theme in the interplay between EGraphs and heaps: abstract states
lead to unsound identifications. This is the fact that I’ll need to overcome.
Cornelius and Heap Summaries
I won’t need to summarize heaps until I include loops or recursion. This is because all access paths will be known statically. However at some point I’ll need to handle access paths of arbitrary length. For instance, consider the following Java code summing the elements of a linked list.
The following table gives the access path of n
for each iteration of the loop
condition:
Iteration | Access Path | k-limiting, k = 2 |
---|---|---|
0 | (var n) |
(var n) |
1 | (var n).next |
(var n).next |
2 | (var n).next.next |
(var n).next.next |
3 | (var n).next.next.next |
(var n).next.next(.next)* |
4 | (var n).next.next.next.next |
(var n).next.next(.next)* |
… | … | … |
It’s plain that once we introduce loops we can get unbounded access paths. To address this we should use some sort of heap summary. A heap summary gives a bounded representation of a potentially unbounded heap model. There are a number of ways to do this, but they all involve the same technique: represent multiple abstract heap locations from the model by a single summary node.
k-limiting
One such summarization technique is called k-limiting. In k-limiting a
positive integer k is chosen. All access paths with at most k field
dereferences are treated normally. However, if more than k dereferences take
place, these are summarized with a Kleene closure *
.
Summaries and EGraphs
There is a pretty severe problem with summaries and EGraphs, and it’s a problem
we already encountered with nil
. Each summary node by definition represents
more than a single abstract heap state but has a single syntactic
representation. As a result, congruence can misfire! Taking the above table,
suppose that we wanted to read the access paths:
Iteration | Reading Access Path | Reading w/ k-limiting, k = 2 |
---|---|---|
0 | (rd (var n)) |
(rd (var n)) |
1 | (rd (var n).next) |
(rd (var n).next) |
2 | (rd (var n).next.next) |
(rd (var n).next.next) |
3 | (rd (var n).next.next.next) |
(rd (var n).next.next(.next)*) |
4 | (rd (var n).next.next.next.next) |
(rd (var n).next.next(.next)*) |
… | … | … |
The main problem with heap summaries as they relate to EGraphs is that they take an infinite set and turn it into a finite set (that’s the point, after all). This means that if I’m not careful I will lose soundness.