WHOPR - A scalable whole program optimizer for GCC

Thursday, November 20, 2008

Traditional compilation proceeds one file at a time. The compiler optimizes and generates code for each file in isolation and then the final executable is created by linking all the individual files together.

This model of compilation has the advantage that all the files can be compiled concurrently, which greatly reduces build time. This is particularly useful with current multiprocessor machines and applications consisting of hundreds and even thousands of files. However, this model presents a fairly significant barrier for optimization.

Consider this program:

for (;;) {
x += g (f (i, j), f (j, i));

float f(float i, float j)
return i * (i - j);

float g(float x, float y)
return x - y;

From an optimization perspective, inlining f and g inside foo is likely going to provide significant performance improvements. However, the compiler never sees both files at the same time. When it is compiling foo.c it does not have access to the functions in bar.c and vice-versa. Therefore, inlining will never take place.

One could get around this problem by moving the bodies of f and g to a common header file and declaring them inline. But that is not always desirable or even possible. So, several optimizing compilers introduce a new feature called Link-Time Optimization (LTO) that allows these kinds of cross-file manipulations by the compiler.

The LTO model essentially splits code generation in two major phases

1. Generation of Intermediate Representation (IR). The original source code is parsed as usual and an intermediate representation (IR) for the program is generated. The IR is a succinct representation of the original program, symbols and types. This contains all the information that the compiler needs to generate final code, except that instead of using it to generate final code for each file, the compiler saves it to an intermediate file for later processing.

2. Once the IR has been emitted for all the source files, the intermediate files generated in the previous step are loaded in memory and the whole set is analyzed and optimized at once.

To visualize this process, imagine that you simply concatenated all the source files together and compiled the result. Since the compiler has visibility over every function in every compilation unit, decisions that would normally use conservative estimates can instead be based on data-flow information crossing file boundaries. Additionally, the compiler is able to perform cross-language optimizations that are not possible when the compilation scope is restricted to individual files.

The LTO model of compilation is useful but it has severe scalability issues. A basic implementation has the potential to incur massive memory consumption during compilation. Since every function body in every file may be needed in memory, only relatively small programs will be able to be compiled in whole program mode.

At Google, we deal with several massively large applications, so we are working on a scalable alternative to traditional LTO called WHOPR (WHOle Program optimizeR), which introduces parallelism and distribution to be able to handle arbitrarily large programs. The basic observation is that to do many whole program optimizations, the compiler rarely needs to have all the functions loaded in memory, and final code generation can be parallelized by partitioning the program into independent sets.

WHOPR then proceeds in 3 phases

1. Local Generation (LGEN). This is the same as traditional LTO. Every source file is parsed and its IR saved to disk. This phase is trivially parallelizable using make -j or distcc or any other similar technique.

2. Whole program analysis (WPA). After all the IR files are generated, they are sent to the linker, but in this case the linker will not know what to do with them (there is no object code in them). So, the linker turns around and passes them back to the compiler which will collect summary information from every function in every file. This per-function summary information contains things like number of instructions, symbols accessed, functions called, functions that call it, etc. It is used to decide what optimizations to apply, but no optimizations are applied at this time. The compiler simply decides what to do and partitions the input files into new files that contain the original IR plus an optimization plan for each new file.

3. Local transformations (LTRANS). The new IR files generated by the previous phase are now compiled to object code using the optimization plan decided by WPA. Since each file contains everything needed to apply the optimization plan, it can also proceed in parallel.

This diagram shows the process. The only sequential step during optimization is the WPA phase, which does not need to operate on too much data and it is not computationally expensive. Everything else proceeds in parallel, suitable for multiprocessors or distributed machines.

After all the LTRANS processes are finished, the final object files are returned to the linker and the final executable is generated.

We are currently in the initial stages of implementation. The work is being implemented in the LTO branch in GCC. We expect to have an initial prototype by summer 2009. The branch can currently deal with some applications, but there are the usual rough spots. You can read more information about the project at