Compiler optimizations are often driven by specific assumptions about
the underlying architecture and implementation of the target machine. For
example, when targeting shared-memory multiprocessors, parallel programs
are compiled to minimize sharing in order to reduce costly inter-processor
communication.
This paper reexamines several compiler optimizations in the context of
simultaneous multithreading (SMT), a processor architecture that issues
instructions from multiple threads to the functional units each cycle.
Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained
sharing of processor and memory system resources; unlike current uniprocessors,
SMT exposes and benefits from inter-thread instruction-level parallelism
when hiding latencies. Therefore, optimizations that are appropriate for
these conventional machines may be inappropriate for SMT. We revisit three
optimizations in this light: loop-iteration scheduling, software speculative
execution, and loop tiling. Our results show that all three optimizations
should be applied differently in the context of SMT architectures: loops
should be parallelized across threads with a cyclic, rather than a blocked,
algorithm; non-loop programs should not be software speculated; and compilers
no longer need to be concerned about precisely sizing tiles to match cache sizes.
By following these new guidelines, compilers can generate code that improves
the performance of programs executing on SMT machines.
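
To make the loop-iteration scheduling guideline concrete, the sketch below
contrasts the two ways of assigning loop iterations to threads. It is an
illustration only: the function names and the Python rendering are assumptions
made here, not code from the paper.

```python
def blocked_partition(n_iters, n_threads):
    """Blocked: each thread gets one contiguous chunk of iterations."""
    chunk = -(-n_iters // n_threads)  # ceiling division
    return [
        list(range(t * chunk, min((t + 1) * chunk, n_iters)))
        for t in range(n_threads)
    ]


def cyclic_partition(n_iters, n_threads):
    """Cyclic: iterations are dealt out round-robin, one per thread in turn."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


if __name__ == "__main__":
    # Hypothetical example: 8 iterations spread over 4 hardware threads.
    print("blocked:", blocked_partition(8, 4))
    print("cyclic: ", cyclic_partition(8, 4))
```

With a cyclic assignment, threads that run together work on neighboring
iterations, and so, in a typical array loop, on neighboring data. That fits
SMT's fine-grained sharing of memory system resources, whereas a blocked
assignment keeps each thread's data apart, as a shared-memory multiprocessor
would prefer.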