SourceForge.net Logo

What needs to be done

Making a boomerang This page describes the major changes that need to be made to Boomerang in the medium to long term to make it a decent binary decompiler. There is no particular plan to implement any of these, or any implied order. They are noted here so that prospective developers can get a feel for the issues, and whether they might be able to contribute.

1. Overall design

The original intent was for Boomerang to be very modular, using "plug ins". This has not been achieved, largely because of the differences between Unix and Windows with respect to the way that dynamically linked libraries are handled.

It would be great to sort out a proper dynamic linking system compatible with all the various systems, and all automatically configured. Then more parts of the decompiler could be dynamically linked: e.g. the type analysis module, each back end could be invoked dynamically (when we have more than one), etc.

Boomerang could be recast more as a series of transformations, each of which is generated from a concise, pattern matching or parser style file. For a discussion of ideas on redesigning Boomerang, see the Boomerang 2 forum on Sourceforge. Note that Booemrang 2 is unlikely to happen in the immediate future.

Ordering of analyses (phase ordering) is really important. It would be great to document the ordering and the reasons why. Mike has a dotty diagram that could make a good start.

To generate good output, either a good interactive environment is needed (a la IDA Pro), or perhaps good machine learning. The current QT GUI (in the qtgui2/ directory) needs some basic work, such as handling the long list of Boomerang switches in a settings dialog. The current code is not 64-bit clean; this will require some thought because the assumption that sizeof(int) == sizeof(pointer) is made throughout the code. Getting Boomerang to even compile succesfully for 64-bits might be a good beginner's project. You could not have to understand too much of the details, but you'd need to scan nearly all the code, so you'd get a bit of a feel for how it works.

2. Front end

The front end needs a cleaner separation between handling machines and loader formats. At present, some loader formats imply processors, but they should be completely independent.

There are only two main loader formats handled (ELF and Windows/PE). With a little work, three more loaders could be ported (UQBT, on which Boomerang is largely based, supported PA/RISC, Palm files, and a DOS/80286 EXE loader). It would be good to add support for popular debug (symbol) formats, such as for gcc and MSVC/Visual Studio. Perhaps these symbols would be converted into a format suitable for the -sf switch (symbol file).

The DOS loader (which needs to be renamed to something like DosBinaryFile, as ExeBinaryFile is confusing) might not be too much work. The loader probably works well enough; it just needs adding to the MSVC project for Windows users. However, it needs more support. It needs a 16-bit decoder (not the same as a subset of the Pentium decoder; there are differences in the mod/rm format in 16-bit mode). I think there is a 16-bit decoder around; contact Mike. It also needs some support for segments; this may be as simple as just giving the four segment registers arbitrary values, or initial values from the loader, and allowing constant propagation to take care of the details. But I suspect that there will be more pain than that. For example, you may need to tag offsets with segment numbers somehow. (But surely adding in the shifted segment register value would amount to the same thing.)

Only three input processors are reasonably fully handled: X86, SPARC, and PPC. The PPC decoder needs more work. There is the beginnings of an ST-20 decoder. The X86 decoder does not handle SIMD instructions (MMX, SSE, etc) or exotic processor variants (e.g. X86/64). It does not handle 16 bit instructions either, which is becoming more important as DOS applications become legacy code.

A possible replacement for the NJMC toolkit, at least for x86 instructions, is diStorm. It may be that we need to wait for version 3 which promises the ability to output expression trees.

Mike Van Emmerik has the beginnings of an ARM decoder (a .spec and .ssl file). There is a significant issue with guards covering multiple statements, so this will not be a trivial implementation. Anyone interested should contact Mike.

A bytecode target would be interesting; we could then compare the performance of Boomerang with that of the successful Java decompilers.

The present instruction decoders are automatically generated by the New Jersey Machine Code (NJMC) Toolkit. The machine semantics are specified in a language called SSL (Semantic Specitication Language), which was designed to work with the NJMC instruction names. It could be argued that these could be replaced by something more widely known, such as the GNU opcodes library. This would be a lot of work, and we would lose many years of careful bug fixing. Whatever we end up with, the use of machine specifications should be retained, so that a new machine doesn't have to have a lot of code written and debugged. (As an example, the generated X86 decoder is over 60,000 lines of C code, while the spec files are under 2500 lines total.)

We also need a way of detecting statically linked library functions, a la the dcc compiler, or FLIRT in IDA Pro. Without these, library code adds a lot of code to be looked at, and ultimately thrown away.

The front end expects a single entry point ("main") that all code to be decoded can be found from. (Unless the -e switch is used, but we'd like things to be as automatic as possible). There needs to be some more patterns to detect main for some executable formats, e.g. Boomerang currently does not find the main procedure for Cygwin executables.
For file formats like Windows executables, we should be interpreting things message maps and the like to find potentially hundreds of entry points.

There needs to be a mechanism to allow labels in SSL. This is needed for the more complex instructions, e.g. bit scanning. There is a hack at present which suffices for the Pentium bit scan instructions, but labels would allow a much cleaner, more general implementation.

The Mac OS/X frontend needs some work, e.g. the loader/MachOBinaryFile.cpp doesn't compile under Cygwin. Also, there is no support as yet for Intel Mac programs, or for Universal binaries. There is even the possibility of somehow making use of both binaries in a fat Mac binary to resolve things like pointers verses integers. For example, if an immediate value is say a CRC value or a floating point constant, then it will be identical in the two binaries, whereas if it is a pointer, it will be different in the two binaries. How to compare the two will be a challenge, though.

3. Parsers

There are several parsers in Boomerang: for the SSL language, for symbol/header files, for transformations, and so on. These are all basic Yacc-like parsers, but unfortunately GNU Bison has weak support for C++ parsers. So the parsers currently use the Coetmeur Bison++ parser and Flex++ lexer. Note that the Coetmeur Flex++ is not compatible (even at the command line switch level) with the GNU Flex lexer included on most Unix and Linux systems.

There is information at http://www.progtools.org/compilers/tutorials/cxx_and_bison/cxx_and_bison.html indicating how the GNU Bison and Flex can be used with C++ parsers. The main problem seems to be keeping track of objects to sensibly delete them, but Boomerang uses a garbage collector, and so doesn't have that problem.

It would be good to port the existing parsers over to the GNU tools, for several reasons. Firstly, finding those Coetmeur tools is one more thing people have to do if they need to modify any of the parsers. Secondly, the Coetmeur tools are causing some problems with Windows versions of Boomerang (a multiply defined symbol at link time), and there are hundreds of warnings when compiling the generated code on picky modern compilers. Thirdly, there is confusion over the different flex++ tools (typically, even after you install the Coetmeur flex++, the GNU one is earlier in your $PATH).

There appears to be a modern alternative to Coetmeur's bison++: Brokken's bisonc++. It may be able to use Boomerang's parse files (.y and maybe .l) almost as-is. Unfortunately, making bisoncpp requires icmake, and won't compile on my Fedora Core 4 distribution. I did manage to use alien to cludge together an rpm file (download here), which installed for me. I tried just drop-in replacing bison++ with bisonc++, but it didn't understand any of the switches, and produced several errors in the file. I suspect that all the meta commands like %union and %define CONSTRUCTOR_INIT have to be translated. This should not be such a chore, for anyone interested.

4. Back end

There is only one target language at present (the C language). It would certainly be good to get a second back end working; this should also clarify the backend interface. It is likely that some assumptions about the back end (in the intermediate representation, for example) will be revealed, and these should be eliminated as far as possible.

The existing C back end was only meant to be temporary, and needs a good cleanup. Some time ago, the fixed length string code was replaced with ostringstream code. Parentheses code has been written, and seems to work well. Perhaps a tree-based code generator would be good.

There is considerable interest in a C++ back end. The back end should be easy, but the support in the analyses ("middle end?") is currently lacking.

An ML or other functional programming language back end could make an interesting project. The correspondance between the SSA form and functional programs is well known (e.g. see A. Appel, SSA is functional programming, ACM Sigplan Notices, 33(4):17-20, Apr 1998).

5. Database and GUI

It was always intended that Boomerang become an interactive tool, along the lines of IDA Pro. The intention was to have some sort of database of transformations, so that automatic analyses and user changes alike could be viewed, changed, and reversed if necessary. (This is actually the reason for the "db" directory; it was to have the database code. There have been as many as three different GUIs for Boomerang, but so far there is no interaction with the decompiler itself. The latest thinking is that perhaps the GUI and (batch mode) decompiler engine should be separate, and the two should communicate with command line switches and XML files. Boomerang does at present generate XML files, but the flow of information is all one way (from the engine to the GUI, not back again). The design of this interface needs to be carefully considered.

A decent, free (as in free speech) cross platform GUI needs to be found. One abandoned GUI uses Qt (in the qtgui/ directory), for which the Windows version costs money (to develop on), and the ability to use the Unix version for commercial purposes is not certain.

One possibility may be to leave Boomerang as a console mode program (we would probably leave the current Windows GUI), and make a separate GUI in something very machine independent, such as Java or Tcl/Tk. We would need an interface of some sort, although the command line mode (see Boomerang's -k option) might even be extended for this. A student project interfaced Boomerang with Eclipse using this technique.

6. Analyses

Much remains to be done in the analyses that transform the machine instruction semantics in to higher level code. The main one of these is Type Analysis. One of the main authors of Boomerang (Mike Van Emmerik) is working on this as part of his PhD. It is requsted that this area be left to Mike for the time being, so that he has no problems with claims of originality. (When the PhD is complete, hopefully some time in 2005, then as for all other parts of Boomerang, contrubutions are welcome).

The current analysis relies heavily on Static Single Assignment (SSA) representation. This seems to be fairly robust, although there are problems with "phi loops". There is a simple theorem prover, which among other thigs attempts to prove whether a register is preserved in a function. (This fails a bit more than we would like).

Current ad-hoc prevention of propagation (e.g. into conditionals) has to be fixed (it is preventing some expressions from transforming to local variables, for example). Perhaps "propagate absolutely everything then reverse some" might work. "Reverse some" probably implies some form of Common Subexpression Elimination (CSE).

The code to "simplifying" an expression is long and somewhat unwieldy. It performs what in a compiler would be called peephole optimisation (e.g. x+0 -> x), and canonicalisation (e.g. 4+x is always rewritten as x+4, so we don't have to test for both cases later on). This work is partly started; see the transform/ directory.

Idiomatic code sequences need to be recognised and replaced by equivalent RTL. This should probably be done after propagation, so that minor variations and irrelevant instructions don't affect the results. There are plenty of examples from MSVC alone: inlining calls to exp(), using a multiply instruction and various shifts and copies to divide by a constant, various idioms related to the ?: operator, etc.

The entire issue of aliasing is pretty much ignored at present. It is possible that some alias analysis will be incorporated into the type analysis.

Internal representation of predication and paralellism (e.g. that exposed explicitly in VLIW and EPIC) are not considered as yet. To be able to decompile VLIW code, this will be necessary.

Overhapped registers (e.g. al/ax/eax) are currently only handled in X86, (-O) and not in a general way. This might better be handled with alias analysis. The technique used for X86 is not suitable for SPARC, for example, since you can't really do arithmetic opertations on floating point registers. A more general solution is needed.

Swich code recognition was changed (in early 2004) from being part of the front end, to happening after propagations. This makes the analysis more "high level" and resistant to changes in the representation (SSA and simplification take care of that). It has also been easy to extend the code to recognise high level patterns for virtual function code (so it handles both indirect jumps and indirect calls). This code needs refinement to handle all the various implementations of switch code (there are surprising variations). The virtual call recognition is embryonic, and needs more work.

The Statement class hierarchy was implemented in late 2003. Earlier, RTLs (Register Transfer Lists) were lists of RTs (Register Transfers), which were essentially expressions. Now, an RTL is a list of Statements (there are Assigns, GotoStatements, CaseStatements, CallStatements, ReturnStatements, etc). One RTL still represents the semantics of one source machine instruction. The RTL class is now more or less redundant, and should be removed.

In UQBT (the University of Queensland Binary Translator), from which Boomerang is derived (especially the front end), expressions are represented in "linearised strings". These were considered unsatisfactory in various ways (in particular, search and replace was awkward). So from the beginning, Boomerang replaced these with the more common tree representation. There were surprising problems, all of which have been overcome. Overall, the verdict seems to be that while each representation has it good and bad points, the tree representation is slightly better. In any case, the effort of changing the representation of expressions was found to be much greater than any potential rewards.

. Alias analysis

. More on type analysis, particularly with respect to running pointers. It may be necessary to abstract away the control flow to achieve this in a general way; see Eben Upton's work and hopefully soon his thesis.
. Excessive propagation causes unreadable code, but some propagation is essential. A sort of propagation on demand is required, and is partly implemented in Boomerang (you need -X to activate it, as it doesn't always work).

. Translating out of SSA form can generate extraneous variables. At present, Boomerang can generate thousands of such temporary variables per procedure, bogging it down and causing decompilations to use gigabytes of memory and days of cpu time. Code motion may be able to minimise these extraneous variables, as well as limiting the propagation of statements of the form x := f(x) (e.g. increment/decrement, two operand adds, etc). Or perhaps SSA is not the best IR for a decompiler after all. Also when generating a copy statement to resolve a liveness overlap, it may be possible to choose the one which will lead to minimal extraneous temporary variables.

. It would be nice to be able to recover global registers as global variables, rather than passing them as reference parameters to many procedures.

. Subfields are a problem for traditional data flow analyses. For example, if cl and ch are allocated to ecx, or if four boolean variables are allocated to edx, they are not treated separately. Similarly, if at the source code level, they are allocated to a struct which happens to fit one word of memory.

. Decompilation seems to be about 1.5 orders of magnitude slower than compilation, and the IR for the whole program is required at once. For the largest programs, this may require some form of parallelism. How to do this effectively, how to partition the problem, and how to communicate the necessary information from thread to thread or machine to machine is an interesting problem.

. How to deal with obfuscated code.

. How to deal with operators such as short circuit & &, ||, and the comma operator in the predicates of conditionals and loops.

. Is def-use information required in a decompiler? If needed a lot, how to represent it; perhaps the dependence flow graph combined with SSA may make a lot of sense.

. Recovery of class hierarchy, removal of auto generated code like contructors and destructors, and recovery of exception processing code for C++ programs.

7. Structuring

Structuring is the process of replacing low level control flow (such as branch instructions) with high level constructs (such as if then statements and do while loops).

The current Boomerang structuring code was imported from a student honours project. As a result, it does not fit particularly well with the rest of the code. For example, dominators are calculated in this code, independently of the SSA dominators code. It seems likely that these could be combined.
Also, the structuring does not handle some cases very well. For example, MSVC for loops end up with this structure:
  if (precicate) {
    do {
      loop code;
    } while (predicate);
  }
While perfectly valid, this code is obviously harder to read than a single while or for loop.

The current structuring code doesn't handle short cut conditionals (e.g. if (predicate1 && predicate2) generate gotos). Way too much data is presently held in the BasicBlock class.

For a long time, I believed that structuring issues were largely solved, mainly as shown in chapter 6 of [Cif94]. However, it is possible that while the issue of compound and/or short circuit conditional (if/then/else) statements is treated there, the issue of compound and/or short circuit predicates in loop statements may be more complex, and it seems to me is as yet unaddressed. Certainly, Boomerang does a very poor job of such code at present. It may require significant research to solve this issue. There were changes around July 2006 which may have solved the problem; simple conditionals with short circuit are now handled correctly. Time and further testing will tell if there are more problems, particularly where short circuit predicates are used in loops.

I've decided that the whole idea of the ImplicitReference is wrong. The idea is this: suppose you pass esp-16 to a procedure that happens to be a CString constructor. Isn't that in effect implicitly defining the location m[esp-16]? Don't we need a place to put the type for this implicit definition? Well, yes it is and yes we do, but we don't need an ImplicitReference object to do it. It is part of the summary of the call; as well as taking say ecx as a parameter, it also defines *ecx as a sort of modified or return. It's the definition in the call that should store the type (in this case, class CString). The implicit references are somewhat untidy and take up a lot of memory unnecessarily. So they need to be pulled out, and the CallStatement needs to be improved to take care of these indirect definitions.

7. Testing and debugging

There is still much to be done with both functional and unit (module) testing. The current unit testing generates a giant executable linking all object files of Boomerang, and is such a pain to keep updated that it is no longer useful. It would be convenient to automatically generate stub files so that testing could be separated into separate programs. (For example, you could make a change to the expression code which causes other parts of the code to break, but still test the expression code in isolation.) Several of the modules (e.g. Cfg, Proc) need many more tests. At this stage of Boomerang's development, however, functional testing is much more important.

The functional test script (functest.sh) works pretty well at present for Unix and variants, including Cygwin and (by using "sh functest.sh") MinGW. However, there is no native Windows port. It would be nice to have a .bat file for running at a standard Windows command prompt, using GnuWin32 versions of grep and sed, and using %errorlevel% in place of $?. This might not be very hard to do. Some system whereby changes to functest.sh could be transferred automatically to functest.bat would be great but not essential.

Debugging is another issue. At present, one of the main debugging tools is the log file; there are various command line switches to direct diagnostic information from various modules to the log file. This is not ideal, and the log file can become quite large and difficult to search. Some sort of interactive debugger would be quite useful, although the QT GUI is becoming useful for this.

Last modified 19/Nov/06: Implicit References are a bad idea; 64-bit cleaness; update testing; diStorm