Gojomo

2006-02-11
Wilkerson & McPeak, Delta @ CodeCon 2006 12:30pm Saturday

Continuing prejudicial CodeCon session previews:

Daniel S. Wilkerson, Scott McPeak: Delta, 12:30pm Saturday @ CodeCon 2006

Delta assists you in minimizing "interesting" files subject to a test of their interestingness. A common such situation is when attempting to isolate a small failure-inducing substring of a large input that causes your program to exhibit a bug.
'Delta' has a number of glowing testimonials from the gcc team and users on their project page. The two big wins there seem to be: (1) When a giant range of proprietary code shows a gcc bug, minimizing it via delta makes it concisely reportable, where it couldn't be effectively reported at all otherwise; (2) people often submit larger-than-minimal trigger cases, and delta is used to (automatically always?) minimize these.

Very neat. There must be other applications, too. Considering yesterday's theme of dissecting novel malware, perhaps this could be used to discover the minimal effective countermeasures against a new threat? (Apply all draconian countermeasures. Loosen progressively until finding smallest set that works against the threat.) Or bioinformatics? One can easily imagine real heredity, mutation, and immunology working this way at the microscopic level to explore the survivability possibility space.

I'm most curious to see the actual efficient 'winnowing' processes used by the tool. (I hope there's a visualization or animation of some sort.) Is it completely systematic, guaranteed to determine the smallest possible trigger case possible by dropping lines? That seems impossible, given the size of input files: a thousand-line input has 2^1000 possible minimizations, too many to search exhaustively. How many trials does it run, how fast, in a typical minimization (independent of the user-supplied interestingness predicate)? Perhaps there's a random element -- so that repeated reruns (with different seeds) or a willingness to wait longer could discover different/better minimal cases.

Update (Saturday 1:26pm): The search for lines that are droppable is dumb and deterministic -- essentially a binary search for logarithmically-sized subranges that can be eliminated. It's thus somewhat sensitive to the alignment of content with respect to the subranges. Its performance is thus improved a lot by a 'flattening' preprocessing which manages to group units of the input (eg C code blocks) into a single line. Wilkerson suggested a randomization element might help.

This project, too, deserves a better name. (Am I obsessed with names or what?) Something like 'occamizer'. It's available and descriptive.

Technorati Tags: , , , ,


Comments: Post a Comment