To achieve high performance, contemporary computer systems rely on two
forms of parallelism: instruction-level parallelism (ILP) and thread-level
parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing
multiple instructions from a single program in a single cycle. Multiprocessors
(MP) exploit TLP by executing different threads in parallel on different
processors. Unfortunately, both parallel-processing styles statically
partition processor resources, thus preventing them from adapting to dynamically changing
levels of TLP and ILP in a program. With insufficient TLP, processors in
an MP will be idle; with insufficient ILP, multiple-issue hardware on a
superscalar is wasted.
This paper explores parallel processing on an alternative architecture,
simultaneous multithreading (SMT), which allows multiple threads
to compete for and share all of the processor's resources every
cycle. The most compelling reason for running parallel applications on
an SMT processor is its ability to use thread-level parallelism and
instruction-level parallelism interchangeably. By permitting multiple threads to share
the processor's functional units simultaneously, the processor can use
both ILP and TLP to tolerate variations in parallelism. When a program
has only a single thread, all of the SMT processor's resources can be dedicated
to that thread; when more TLP exists, this parallelism can compensate for
a lack of per-thread ILP.
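
The contrast between static partitioning and dynamic sharing can be illustrated with a toy issue-slot model. This is only a sketch under simplifying assumptions, not the simulator used in this study: the function names, the 8-wide issue width, and the 4 x 2-wide multiprocessor configuration are all hypothetical, and `ready` simply lists how many instructions each thread could issue in one cycle.

```python
# Toy model (illustrative assumption, not the paper's simulator):
# count how many instructions each organization can issue in ONE cycle,
# given the number of ready instructions per thread.

def superscalar_issue(ready, width=8):
    """Wide-issue superscalar: only the first thread runs, up to `width` slots."""
    return min(ready[0], width) if ready else 0

def multiprocessor_issue(ready, processors=4, width_per_proc=2):
    """Statically partitioned MP: each thread is confined to its own
    processor's slots; threads beyond `processors` do not run."""
    return sum(min(r, width_per_proc) for r in ready[:processors])

def smt_issue(ready, width=8):
    """SMT: all threads compete for one shared pool of issue slots."""
    return min(sum(ready), width)

if __name__ == "__main__":
    scenarios = [
        ("single thread, high ILP", [6]),
        ("four threads, low ILP", [1, 1, 1, 1]),
        ("mixed ILP", [5, 1, 1, 1]),
    ]
    for label, ready in scenarios:
        print(label,
              "superscalar:", superscalar_issue(ready),
              "MP:", multiprocessor_issue(ready),
              "SMT:", smt_issue(ready))
```

In this simplified model, the superscalar wastes slots when the single thread has little ILP, the multiprocessor wastes slots when one thread has more ILP than its processor can issue, and the shared-slot SMT matches or exceeds both in every scenario listed.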
In this work, we examine two alternative on-chip parallel architectures
for the next generation of processors. We compare SMT and small-scale,
on-chip multiprocessors (MP) in their ability to exploit both ILP and TLP.
First, we identify the hardware bottlenecks that prevent multiprocessors
from effectively exploiting ILP. Then, we show that because of its dynamic
resource sharing, SMT avoids these inefficiencies and benefits from being
able to run more threads on a single processor. The use of TLP is especially
advantageous when per-thread ILP is limited. The ease of adding
thread contexts on an SMT (relative to adding processors on
an MP) allows simultaneous multithreading to expose more parallelism, further
increasing functional unit utilization and attaining a 52% average speedup
(versus a four-processor, single-chip multiprocessor with comparable execution
resources).
This study also addresses an often-cited concern regarding the use of thread-level
parallelism or multithreading: interference in the memory system and branch
prediction hardware. We find that multiple threads cause inter-thread interference
in the caches and also place greater demands on the memory system, increasing
average memory latencies. By exploiting thread-level parallelism, however,
SMT hides these additional latencies, so that they have only a small impact
on total program performance. We also find that for parallel applications,
the additional threads have minimal effects on branch prediction.