Efficient Predicated Execution with CoSy

In a compiler, programming constructs such as if-statements, for-loops, and do-loops are translated to conditional branches. Runtime performance gets worse because of pipeline effects of branch instructions. For instance, a mispredicted branch on the CELL SPE processor costs 18 cycles [See: IBM Research and Development]. Conditional branches also prohibit certain compiler optimizations. Predicated Execution of instructions is an effective technique to remove the control flow that results from branch instructions. Thus, under short-time to market constraints, it is crucial that the compiler framework directly supports predicated execution to deliver high-performance compilers for processors with predicated execution [See: Retargetable Code Optimization for Predicated Execution, Leupers et al.]. The latest CoSy release provides a unique and efficient solution to implement Predicated Execution. That is, it provides a target-independent framework for predicated execution. Not only does it result in better runtime performance as shown in Figure 1, but also in better retargetability as it does not rely on a target-specific implementation.

Figure 1: The CoSy Predicated Execution framework results in up to 28% improvement.

How is it done?

CoSy's predicated execution algorithms operate at the CCMIR Intermediate Representation level storing their decisions in annotations. These decisions are steered via CoSy's unique feedback mechanism for decision making, exploiting the necessary target information from the code generator. In this way ease of retargeting is combined with high-quality code generation driven by the annotations. Because the annotation is done at the abstraction level of CoSy's CCMIR, the predicated execution framework is decoupled: on the one hand, the bulk of the algorithms are implemented independently from the target of the compiler in the middle-end; and on the other hand, the target parameterization is implemented in the code generator.

For all if-then-end and if-then-else-end sequences of basic blocks the cost of branching or executing conditionally is estimated with feedback from the code generator. It provides realistic and accurate cost-metrics to help the decision making of the predicated execution algorithms. This unique CoSy feature allows the code generator to be called during high-level transformations. The assembly code that results from this call is never used; but the size and other characteristics can be accurately measured at the instruction level. These metrics are then fed back into the high-level decision flow. The metrics used for predicated execution are qualitative (i.e. do predicated instructions exist at all for the then and else branches) and quantitative (i.e. cycle-counts for the two branches).

Predicated Execution Scenarios

CoSy extends a range of high-level optimizations with support for predicated execution. This highlights three of them:

  • Software Pipelining works best on loop bodies without control flow. Predicated execution removes control flow, thus enables more loops to be Software Pipelined.
  • Loop unrolling increases a program's speed by reducing (or eliminating) instructions that control the loop. For loops with low iteration count, we have found that the efficiency of the prologue created by loop unrolling can be critical to application performance. In these prologues, predicated execution can be put to good use to avoid a chain of branches to be executed. CoSy's loop unroller can generate several variants for the prologue, all based on plain control code. The compiler developer chooses the best variant and that code is then automatically converted as applicable for the target architecture.
  • Providing early decisions on predication candidates in high level CoSy engines improves the quality of the generated code. Figure 2 shows a code example where the right decision to perform loop transformations is critical to obtain the best performance possible.
    Because the annotations are set on the intermediate representation, it is easy to write your own engine in CoSy that has knowledge of predicated execution.

Code Snippet from matinv.

Figure 2: This code-snippet is the kernel of a benchmark in which matrices are inverted. The body of this loop contains 2 conditional branches; CoSy makes the decision at a high level to transform these branches into predicated execution. This results in better code.

Benchmarking

The selected benchmarks in Figure 3 represent different forms of control flow encapsulated within loops. Therein predicated execution is crucial for best performing code. The bubble-sort benchmark contains two-level loops with nested if-statements. Livermore 15, selected from the Livermore Loops, consists of different loop forms with complicated control flow. The telecom benchmark is an audio processing filter - Adaptive Differential Pulse-code Modulation - which contains nested control flow that offers many opportunities for predicated execution.

Figure 3: Performance of predicated execution intensive benchmarks compiled with CoSy improve up to 75% over LLVM 3.0, architecture ARM9 processor.

Alternatives? No!

Other compiler frameworks lack efficient and reliable support for predicated execution in comparison to ACE's CoSy compiler development system. In LLVM's case each time a new backend is implemented the compiler builder has to take care of implementing its own predicated execution support. This is cumbersome and makes it uncertain to the compiler builder to meet with performance requirements. [See: 2012 LLVM Conference, LLVM mailing list]

Spending a lot of time crafting a target-specific solution in the code generator not only adds development overhead but does not necessarily pay off in terms of performance. The contrary is observed when comparing the LLVM compiler with hand-crafted predicated execution support in the code generator against CoSy's solution on the 5-billion-selling ARM9 processor [See: ARM.com] as shown in Figure 3. GCC's predicated execution for ARM is found to be better than LLVM; but it is also specific for the code generator.

One might wonder why other compiler frameworks do not have support for predication at the high level IR. The answer is that this is non-trivial. To predicate or not can be a thin line and accurate knowledge of the cost of instructions is needed. This is much easier to determine in the code generator. But the consequence is a target specific solution. This is where CoSy has its unique edge, it gives you the ability to reason about high level IR and have the cost of low level instructions at your disposal at the same time.

Less Work, Superior Results

Implementing predicated execution within the CoSy framework is not just easy, it also delivers outstanding performance.

Whatever job your compiler has to accomplish, for example compiling software for real-time systems with data-intensive computations, CoSy's high-level predicated execution support boosts run-time performance to the limit.