Time Co-op Writeup

Overview
Mutation Testing
- Example
  - POP QUIZ 1
Mutant Kill Ratio
Equivalent and Redundant Mutants
Detecting Equivalent Mutants with Egraphs
- POP QUIZ
- Handling Local Equalities
Solutions to Pop Quizes
- Pop Quiz 1
- Pop Quiz 2

Overview

This week I want to introduce you to Cornelius, a static analysis tool to detect equivalent and redundant mutants in Java programs. Cornelius takes in the Java source files of a program and some mutants of that program, translates them into Program Expression Graphs (PEGs), stuffs them into an Egraph, and runs a rewrite system until equality saturation is reached. Finally, Cornelius checks if any of the programs (the original program or the mutants) have become equivalent under the rewrites. Cornelius is heavily inspired by Tate et al’s Peggy tool. Under the hood, Cornelius uses egg.

For this Time Co-op I want to

Introduce you to mutation analysis
Introduce you to Cornelius
Get your input on how Cornelius can find some equivalent and redundant mutants

If you’re interested in tinkering, check out the Cornelius Repository. It will require Java 8 to use the shell script wrapper. I haven’t gotten around to writing a full tutorial but I’ll try to get that together for next time.

Mutation Testing

Imagine you’ve just finished writing a test suite for one of your projects. How good is your test suite? If you’re anything like me, then not very good. Still, it would be nice to quantify just how not-good our test suites are.

What should tests do? Catch bugs, of course! Finding real live bugs in code is hard, so we can’t measure this property directly. However, we can seed a bunch of syntactic faults in our program, producing new broken programs called mutants. Each of these mutants is a proxy for a real world bug.

After mutating the original program, we try to kill mutants by running the test suite on each mutant. If a mutant causes one of the tests in the suite to fail, that mutant is killed. We killed it. It ded.

Example

Consider the following program which computes the max of two ints:

int max(int a, int b){
  if (a > b) return a;
  return b;
}

Let’s look at four of this program’s mutants:

Mutant 1:

int max(int a, int b){
  // `>` -> `!=`
  if (a != b) return a;
  return b;
}

Mutant 2:

int max(int a, int b){
  // `>` -> `>=`
  if (a >= b) return a;
  return b;
}

Mutant 3:

int max(int a, int b){
  // `>` -> `false`
  if (false) return a;
  return b;
}

Mutant 4:

int max(int a, int b){
  // `return a;` -> `;`
  if (a > b) ;
  return b;
}

Let Clarence be a test suite with only a single test.

public class ClarenceTheTestSuite {
    @Test
    public void theOnlyTestClarenceHas() {
        assertEquals(max(1,2), 2);
    }
}

POP QUIZ 1

Which of the above mutants does Clarence kill?
What proportion (written as a decimal) of mutants does Clarence kill? This number is called the mutant kill ratio.
Can you write tests to kill the other mutants? If not, why not? If so, what’s your favorite color?
(mandatory bonus question) Can you write tests to differentiate all mutants? That is, can you write a test suite such that for every pair of mutants m1 and m2, there is a test t that behaves differently on m1 and m2? If not, why not? If so, what is your favorite smell?

Mutant Kill Ratio

Problem 2 asks you to compute the mutant kill ratio. You should have gotten 0.25: ClarenceTheTestSuite kills mutant 1 but not mutants 2-4. Thus it kills 1/4 = 0.25 mutants. In general we want the mutant kill ratio to be close to 1; let’s say that for the rest of the day, any mutation kill ratio of above 0.85 is “good”.

Equivalent and Redundant Mutants

Unless you broke reality you should have found it impossible to kill mutant 2. This is because it is semantically equivalent to the original program, even though it differs syntactically. Such a mutant is called an equivalent mutant.

Likewise, you should have found it impossible to write a test that behaves differently on mutants 3 and 4. These two mutants are semantically equivalent to one another. Such mutants are called redundant mutants.

Problems with Equivalent And Redundant Mutants

Equivalent and redundant mutants cause two problems in mutation testing.

They skew metrics: The best a test suite can possibly do is to kill mutants 1, 3, and 4. This means that at best, the mutant kill ratio will be 0.75, which is not ‘good’ according to our above definition.

Similarly, a killing a redundant mutant kills all the mutants in its redundancy class. This means that as long as a redundancy class is left unkilled, it as has an overly negative effect on the mutant kill ratio, and once it has been killed it has an overly positive effect on the mutant kill ratio.
They waste resources: We have to run testing infrastructure on each mutant. At scale this is expensive, and especially so with equivalent mutants. This is because they can never be detected, so we have to run the entire testing infrastructure, wasting CPU cycles. What’s worse, a human developer might have to come in and waste human cycles trying to kill it by hand.

The equivalent mutant problem is one of the reasons why mutation testing isn’t more widely adopted in practice.

How We Handle the Equivalent And Redundant Mutant Problem

The only way to handle equivalent and redundant mutants is to detect them before running the test suite. This is undecidable in general, but there are a lot of mutants that are ‘obviously’ equivalent or redundant.

There have been a number of attempts to detect equivalent and redundant mutants. These attempts fall into two categories:

Use of constraints: by modeling the semantics of two programs with constraints, we can a query a constraint solver to determine if there are inputs that force the programs to behave differently.
Using compiler techniques: compiler optimizations rewrite programs to more efficient forms. These rewrites preserve program semantics. Therefore, if two different programs are rewritten to the same optimized version they must have begun as equivalent.

Cornelius uses a new approach: store all mutants in an Egraph and compute equality saturation.

How we Actually Handle The Equivalent And Redundant Mutant Problem

Use test coverage metrics.

Detecting Equivalent Mutants with Egraphs

Java is complicated, and I won’t be doing anything with heaps or referency type things. Instead we will focus on loop free programs with primitive types.

Cornelius starts by translating a program into a PEG. This currently involves regularizing the AST so that there is a single return location. For instance, after regularization the original program becomes:

int max(int a, int b) {
    // Default value to satisfy Javac's flow checking. Definitely NOT a hack
    int __RETURN_RESULT__ = -2147483648;
    boolean __method_has_returned__ = false;
    if (a > b) {
        __method_has_returned__ = true;
        __RETURN_RESULT__ = a;
    }
    /* --- Auto-generated guard statement --- */
    if (!__method_has_returned__) {
        __method_has_returned__ = true;
        __RETURN_RESULT__ = b;
    }
    return __RETURN_RESULT__;
}

And let’s be honest, this is really an improvement, and is probably how the developer should have written the code in the first place.

It is future work to make this transformation implicit, both saving time and getting rid of pesky artifacts like the default initialization value in __RETURN_RESULT__. This shouldn’t be too hard, but it will be subtle so I haven’t gotten around to it yet.

Cornelius then translates the reguralized AST into a PEG. I’ll represent PEGs as S-expressions, but this is inaccurate: PEGs use deduplication or node sharing to cut down on size and allow for very fast application of rewrite rules. I can’t actually represent this as an S-expression so I’ll just undeduplicate the PEG.

(method-root
  ;; Return value
  (phi (! (phi (> (var a) (var b)) true false))
       (var b) 
       (phi (> (var a) (var b)) (var a) -2147483648))
  ;; Resulting heap (pay no mind!)
  (heap 0))

Since I like you I’m gonna rewrite this to a nicer form; I’ll extract the return value, omitting the heapy bit, and I’ll apply the rewrite rule (phi ?c true false) => ?c to make this more legible:

(phi (! (> (var a) (var b)))
     (var b)
     (phi (> (var a) (var b))
          (var a)
          -2147483648))

The phi nodes are if-then-else expressions, and the var nodes represent free variables (in this case, parameters passed in to the method).

The (simplified) pegs for the four mutants are:

Mutant 1:

(phi (! (!= (var a) (var b)))
     (var b)
     (phi (!= (var a) (var b))
          (var a)
          -2147483648))

Mutant 2:

(phi (! (>= (var a) (var b)))
     (var b)
     (phi (>= (var a) (var b))
          (var a)
          -2147483648))

Mutant 3:

(phi (! false) 
     (var b)
     (phi false
          (var a)
          -2147483648))

Mutant 4:

(var b)

POP QUIZ

Can you find a set of rewrite rules that will discover the equivalence between mutants 3 and 4? Generalize your results.
Can you find a set of rewrite rules that will discover the equivalence between the original program and mutant 2? (Warning! This is tricky!)

Solving the first problem is easier than solving the second. The following two rewrite rules will get us there:

(! false) => true
(phi true ?a ?b) => ?a

Handling the second problem, though, involves reasoning about local equalities.

Handling Local Equalities

First, if you got this far, you’re awesome! You get a free mutant!!!!! If you’ve reached your hour (which you probably have), GET OUT OF HERE! YOU’RE DONE! GO WALK YOUR MUTANT!

Alright, let’s take a look at a contrived program and some of its mutants.

    boolean areEqual(int a, int b) {
        if (a == b) {
            if (a == b) {
                return true;
            }
            return false;
        }
        return false;
    }

There are 8 mutants generated for this subject, and when Cornelius runs on them it outputs the following discovered equivalence classes (each line contains a single equivalence class, and mutant id 0 corresponds to the original program):

That means that Cornelius detected that mutants 3, 6, and 7 are redundant to one another, and no other equivalences were detected.

This is ground truth:

3 6 7
0 1 2 4 5 8

The good news is Cornelius hasn’t reported any incorrect equivalences! SWEET!

The bad news is that Cornelius missed a bunch of easy equivalences. Here are a couple of the equivalent mutants Cornelius missed:

Mutant 1:

    boolean areEqual(int a, int b) {
        if (a <= b) {                // `==` -> `<=`
            if (a == b) {
                return true;
            }
            return false;
        }
        return false;
    }

Mutant 5:

    boolean areEqual(int a, int b) {
        if (a == b) {
            if (a >= b) {            // `==` -> `>=`
                return true;
            }
            return false;
        }
        return false;
    }

Mutant 8:

    boolean areEqual(int a, int b) {
        if (a == b) {
            if (a == b) {
                return true;
            }
                ;                    // `return false;` -> `;`
        }
        return false;
    }

Let’s take a closer look at mutant 8 as it compares to the original program. The original program has the following PEG (for readability I’ve applied a couple rewrite rules):

Original Program:

(phi (! (phi (== (var a) (var b))
             (phi (! (== (var a) (var b)))
                  true
                  (== (var a) (var b)))
             false))
     false
     (phi (== (var a) (var b)) 
          (phi (! (== (var a) (var b)))
               false
               (== (var a) (var b)))
          false))

Mutant 8:

(phi (! (phi (== (var a) (var b)) 
             (== (var a) (var b))
             false))
     false
     (phi (== (var a) (var b)) 
          (== (var a) (var b))
          false))

These should be rewritable to one another. We can break these into sub problems:

Can we find a way to rewrite

(phi (== (var a) (var b))
     (phi (! (== (var a) (var b)))
          true
          (== (var a) (var b)))
     false)

(phi (== (var a) (var b)) 
     (== (var a) (var b))
     false)

Can we find a way to rewrite

(phi (== (var a) (var b))
     (phi (! (== (var a) (var b)))
          false
          (== (var a) (var b)))
     false)

(phi (== (var a) (var b)) 
     (== (var a) (var b))
     false)

In the above questions, is there a way to do this purely with rewrite rules? Remember that we don’t want any term to rewrite to multiple ‘value’ terms. So, for instance, rewriting (== (var a) (var b)) to true will rewrite it globally.

Solutions to Pop Quizes

Pop Quiz 1

Clarence kills mutant 1.
Clarence has a mutant kill ratio of 0.25
The following test kills mutants 3 and 4
```
 @Test
 public void aNewTestForClarence() {
     assertEquals(max(2,1), 2);
 }
```
There are no tests can kill mutant 2 because it is equivalent.
No, mutants 3 and 4 are in the same redundancy class.

Pop Quiz 2

The rewrite rules

(! false) => true
(phi true ?a ?b) => ?a

Let’s try to solve problem 2 from above. First, I’m gonna use a rule to rewrite the above programs to get rid of that stupid default value:

rule1: (phi (! ?c) (phi ?c ?x ?y) ?z) => (phi (! ?c) ?y ?z)

This makes sense because if we’re in the then branch of a phi node we know the condition is true.

This gives us

;; Original
(phi (! (> (var a) (var b)))
     (var b)
     (var a))
   
;; Mutant 2
(phi (! (>= (var a) (var b)))
     (var b)
     (var a))

Next, let’s distribute the negation over the comparison operators:

rule2: (! (>  ?a ?b)) => (<= ?a ?b)
rule3: (! (>= ?a ?b)) => (< ?a ?b)

This gives us

;; Original
(phi (<= (var a) (var b))
     (var b)
     (var a))
   
;; Mutant 2
(phi (< (var a) (var b))
     (var b)
     (var a))

Now I want to split the <= into two cases: < and ==:

rule4: (<= ?a ?b) => (|| (< ?a ?b) (== ?a ?b))

This produces

;; Original
(phi (|| (< (var a) (var b)) (== (var a) (var b)))
     (var b)
     (var a))
   
;; Mutant 2
(phi (< (var a) (var b))
     (var b)
     (var a))

Okay, let’s turn our || into nested phi nodes.

rule5: (phi (|| ?c1 ?c2) ?a ?b) => (phi ?c1 ?a (phi ?c2 ?a ?b))

This produces

;; Original
(phi (< (var a) (var b))
     (var b)
     (phi (== (var a) (var b))
          (var b)
          (var a)))
   
;; Mutant 2
(phi (< (var a) (var b))
     (var b)
     (var a))

Finally, if we can rewrite (phi (== (var a) (var b)) (var b) (var a)) to (var a), we win. One way to do this would be to rewrite the (var b) in the then branch of the phi node to (var a), and use the rule

rule6: (phi ?c ?a ?a) => ?a

This is where I’m stuck. I want to do local rewrites , but all rewrites are global in an Egraph.

I’ve tried to use a technique called equality refinement to make local rewrites, but this still introduces soundness errors. Another idea is to use a second Egraph to do local reasoning, and this seems like it could be promising, but would also have a few downsides as well.