Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 53
2
What Is Computer Performance?
F
ast, inexpensive computers are now essential to numerous human
endeavors. But less well understood is the need not just for fast
computers but also for ever-faster and higher-performing comput-
ers at the same or better costs. Exponential growth of the type and scale
that have fueled the entire information technology industry is ending. 1
In addition, a growing performance gap between processor performance
and memory bandwidth, thermal-power challenges and increasingly
expensive energy use, threats to the historical rate of increase in transistor
density, and a broad new class of computing applications pose a wide-
ranging new set of challenges to the computer industry. Meanwhile, soci -
etal expectations for increased technology performance continue apace
and show no signs of slowing, and this underscores the need for ways
to sustain exponentially increasing performance in multiple dimensions.
The essential engine that has met this need for the last 40 years is now in
considerable danger, and this has serious implications for our economy,
our military, our research institutions, and our way of life.
Five decades of exponential growth in processor performance led to
1 Itcan be difficult even for seasoned veterans to understand the effects of exponential
growth of the sort seen in the computer industry. On one level, industry experts, and even
consumers, display an implicit understanding in terms of their approach to application and
system development and their expectations of and demands for computing technologies.
On another level, that implicit understanding makes it easy to overlook how extraordinary
the exponential improvements in performance of the sort seen in the information technology
industry actually are.
53
OCR for page 54
54 THE FUTURE OF COMPUTING PERFORMANCE
the rise and dominance of the general-purpose personal computer. The
success of the general-purpose microcomputer, which has been due pri-
marily to economies of scale, has had a devastating effect on the develop-
ment of alternative computer and programming models. The effect can be
seen in high-end machines like supercomputers and in low-end consumer
devices, such as media processors. Even though alternative architectures
and approaches might have been technically superior for the task they
were built for, they could not easily compete in the marketplace and
were readily overtaken by the ever-improving general-purpose proces-
sors available at a relatively low cost. Hence, the personal computer has
been dubbed “the killer micro.”
Over the years, we have seen a series of revolutions in computer
architecture, starting with the main-frame, the minicomputer, and the
work station and leading to the personal computer. Today, we are on
the verge of a new generation of smart phones, which perform many
of the applications that we run on personal computers and take advan-
tage of network-accessible computing platforms (cloud computing) when
needed. With each iteration, the machines have been lower in cost per
performance and capability, and this has broadened the user base. The
economies of scale have meant that as the per-unit cost of the machine has
continued to decrease, the size of the computer industry has kept growing
because more people and companies have bought more computers. Per-
haps even more important, general-purpose single processors—which all
these generations of architectures have taken advantage of—can be pro-
grammed by using the same simple, sequential programming abstraction
at root. As a result, software investment on this model has accumulated
over the years and has led to the de facto standardization of one instruc -
tion set, the Intel x86 architecture, and to the dominance of one desktop
operating system, Microsoft Windows.
The committee believes that the slowing in the exponential growth
in computing performance, while posing great risk, may also create a
tremendous opportunity for innovation in diverse hardware and software
infrastructures that excel as measured by other characteristics, such as low
power consumption and delivery of throughput cycles. In addition, the
use of the computer has becomes so pervasive that it is now economical
to have many more varieties of computers. Thus, there are opportunities
for major changes in system architectures, such as those exemplified by
the emergence of powerful distributed, embedded devices, that together
will create a truly ubiquitous and invisible computer fabric. Investment in
whole-system research is needed to lay the foundation of the computing
environment for the next generation. See Figure 2.1 for a graph showing
flattening curves of performance, power, and frequency.
Traditionally, computer architects have focused on the goal of creating
OCR for page 55
55
WHAT IS COMPUTER PERFORMANCE?
10,000,000
Num Transistors (in Thousands)
Relative Performance
1,000,000
Clock Speed (MHz)
Power Typ (W)
100,000
NumCores/Chip
10,000
1,000
100
10
1
0
1985 1990 1995 2000 2005 2010
Year of Introduction
FIGURE 2.1 Transistors, frequency, power, performance, and cores over time
(1985-2010). The vertical scale is logarithmic. Data curated by Mark Horowitz with
input from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris
Batten, and Krste Asanoviç.
computers that perform single tasks as fast as possible. That goal is still
important. Because the uniprocessor model we have today is extremely
powerful, many performance-demanding applications can be mapped to
run on networks of processors by dividing the work up at a very coarse
granularity. Therefore, we now have great building blocks that enable us
to create a variety of high-performance systems that can be programmed
with high-level abstractions. There is a serious need for research and
education in the creation and use of high-level abstractions for parallel
systems.
However, single-task performance is no longer the only metric of
interest. The market for computers is so large that there is plenty of eco-
nomic incentive to create more specialized and hence more cost-effective
machines. Diversity is already evident. The current trend of moving com-
putation into what is now called the cloud has created great demands for
high-throughput systems. For those systems, making each transaction run
as fast as possible is not the best thing to do. It is better, for example, to
have a larger number of lower-speed processors to optimize the through-
put rate and minimize power consumption. It is similarly important to
conserve power for hand-held devices. Thus, power consumption is a
OCR for page 56
56 THE FUTURE OF COMPUTING PERFORMANCE
BOX 2.1
Embedded Computing Performance
The design of desktop systems often places considerable emphasis on gen-
eral CPU performance in running desktop workloads. Particular attention is
paid to the graphics system, which directly determines which consumer games
will run and how well. Mobile platforms, such as laptops and notebooks, at-
tempt to provide enough computing horsepower to run modern operating
systems well—subject to the energy and thermal constraints inherent in mobile,
battery-operated devices—but tend not to be used for serious gaming, so high-
end graphics solutions would not be appropriate. Servers run a different kind of
workload from either desktops or mobile platforms, are subject to substantially
different economic constraints in their design, and need no graphics support
at all. Desktops and mobile platforms tend to value legacy compatibility (for ex-
ample, that existing operating systems and software applications will continue
to run on new hardware), and this compatibility requirement affects the design
of the systems, their economics, and their use patterns.
Although desktops, mobile, and server computer systems exhibit important
differences from one another, it is natural to group them when comparing them
with embedded systems. It is difficult to define embedded systems accurately
because their space of applicability is huge—orders of magnitude larger than
the general-purpose computing systems of desktops, laptops, and servers. Em-
bedded computer systems can be found everywhere: a car’s radio, engine con-
troller, transmission controller, airbag deployment, antilock brakes, and dozens
of other places. They are in the refrigerator, the washer and dryer, the furnace
controller, the MP3 player, the television set, the alarm clock, the treadmill and
stationary bike, the Christmas lights, the DVD player, and the power tools in
the garage. They might even be found in ski boots, tennis shoes, and greeting
cards. They control the elevators and heating and cooling systems at the office,
the video surveillance system in the parking lot, and the lighting, fire protection,
and security systems.
Every computer system has economic constraints. But the various systems
tend to fall into characteristic financial ranges. Desktop systems once (in 1983)
cost $3,000 and now cost from a few hundred dollars to around $1,000. Mobile
systems cost more at the high end, perhaps $2,500, down to a few hundred dol-
lars at the low end. Servers vary from a few thousand dollars up to hundreds of
thousands for a moderate Web server, a few million dollars for a small corporate
key performance metric for both high-end servers and consumer hand-
held devices. See Box 2.1 for a discussion of embedded computing per-
formance as distinct from more traditional desktop systems. In general,
power considerations are likely to lead to a large variety of specialized
processors.
The rest of this chapter provides the committee’s views on matters
related to computer performance today. These views are summarized in
OCR for page 57
57
WHAT IS COMPUTER PERFORMANCE?
server farm, and 1 or 2 orders of magnitude more than that for the huge server
farms fielded by large companies, such as eBay, Yahoo!, and Google.
Embedded systems tend to be inexpensive. The engine controller under the
hood of a car cost the car manufacturer about $3-5. The chips in a cell phone
were also in that range. The chip in a tennis shoe or greeting card is about 1/10
that cost. The embedded system that runs such safety-critical systems as eleva-
tors will cost thousands of dollars, but that cost is related more to the system
packaging, design, and testing than to the silicon that it uses.
One of the hallmarks of embedded systems versus general-purpose com-
puters is that, unlike desktops and servers, embedded performance is not an
open-ended boon. Within their cost and power budgets, desktops, laptops, and
server systems value as much performance as possible—the more the better.
Embedded systems are not generally like that. The embedded chip in a cell
phone has a set of tasks to perform, such as monitoring the phone’s buttons,
placing various messages and images on the display, controlling the phone’s
energy budget and configuration, and setting up and receiving calls. To ac-
complish those tasks, the embedded computer system (comprising a central
processor, its memory, and I/O facilities) must be capable of a some overall
performance level. The difference from general-purpose computers is that once
that level is reached in the system design, driving it higher is not beneficial; in
fact, it is detrimental to the system. Embedded computer systems that are faster
than necessary to meet requirements use more energy, dissipate more heat,
have lower reliability, and cost more—all for no gain.
Does that mean that embedded processors are now fast enough and have
no need to go faster? Are they exempt from the emphasis in this report on
“sustaining growth in computing performance”? No. If embedded processor
systems were to become faster and all else were held equal, embedded-system
designers would find ways of using the additional capability, and delivering
new functionalities would come to be expected on those devices. For example,
many embedded systems, such as the GPS or audio system in a car, tend to
interface directly with human beings. Voice and speech recognition capability
greatly enhance that experience, but current systems are not very good at the
noise suppression, beam-forming, and speech-processing that are required to
make this a seamless, enjoyable experience, although progress is being made.
Faster computer systems would help to solve that problem. Embedded systems
have benefited tremendously from riding an improvement curve equivalent to
that of the general-purpose systems and will continue to do so in the future.
the bullet points that follow this paragraph. Readers who accept the com -
mittee’s views may choose to skip the supporting arguments and move
on to the next chapter.
· Increasing computer performance enhances human productivity.
· One measure of single-processor performance is the product of
operating frequency, instruction count, and instructions per cycle.
OCR for page 58
58 THE FUTURE OF COMPUTING PERFORMANCE
· Performance comes directly from faster devices and indirectly
from using more devices in parallel.
· Parallelism can be helpfully divided into instruction-level paral-
lelism, data-level parallelism, and thread-level parallelism.
· Instruction-level parallelism has been extensively mined, but
there is now broad interest in data-level parallelism (for example,
due to graphics processing units) and thread-level parallelism
(for example, due to chip multiprocessors).
· Computer-system performance requires attention beyond pro-
cessors to memories (such as,dynamic random-access memory),
storage (for example, disks), and networking.
· Some computer systems seek to improve responsiveness (for
example, timely feedback to a user’s request), and others seek
to improve throughput (for example, handling many requests
quickly).
· Computers today are implemented with integrated circuits
(chips) that incorporate numerous devices (transistors) whose
population (measured as transistors per chaip) has been doubling
every 1.5-2 years (Moore’s law).
· Assessing the performance delivered to a user is difficult and
depends on the user’s specific applications.
· Large parts of the potential performance gain due to device inno -
vations have been usefully applied to productivity gains (for
example, via instruction-set compatibility and layers of software).
· Improvements in computer performance and cost have enabled
creative product innovations that generated computer sales
that, in turn, enabled a virtuous cycle of computer and product
innovations.
WHY PERFORMANCE MATTERS
Humans design machinery to solve problems. Measuring how well
machines perform their tasks is of vital importance for improving them,
conceiving better machines, and deploying them for economic bene-
fit. Such measurements often speak of a machine’s performance, and many
aspects of a machine’s operations can be characterized as performance.
For example, one aspect of an automobile’s performance is the time it
takes to accelerate from 0 to 60 mph; another is its average fuel economy.
Braking ability, traction in bad weather conditions, and the capacity to
tow trailers are other measures of the car’s performance.
Computer systems are machines designed to perform information
processing and computation. Their performance is typically measured
by how much information processing they can accomplish per unit time,
OCR for page 59
59
WHAT IS COMPUTER PERFORMANCE?
but there are various perspectives on what type of information process -
ing to consider when measuring performance and on the right time scale
for such measurements. Those perspectives reflect the broad array of
uses and the diversity of end users of modern computer systems. In gen -
eral, the systems are deployed and valued on the basis of their ability to
improve productivity. For some users, such as scientists and information
technology specialists, the improvements can be measured in quantitative
terms. For others, such as office workers and casual home users, the per-
formance and resulting productivity gains are more qualitative. Thus, no
single measure of performance or productivity adequately characterizes
computer systems for all their possible uses.2
On a more technical level, modern computer systems deploy and
coordinate a vast array of hardware and software technologies to pro-
duce the results that end users observe. Although the raw computational
capabilities of the central processing unit (CPU) core tend to get the most
attention, the reality is that performance comes from a complex balance
among many cooperating subsystems. In fact, the underlying perfor-
mance bottlenecks of some of today’s most commonly used large-scale
applications, such as Web searching, are dominated by the character-
istics of memory devices, disk drives, and network connections rather
than by the CPU cores involved in the processing. Similarly, the interac -
tive responsiveness perceived by end users of personal computers and
hand-held devices is typically defined more by the characteristics of the
operating system, the graphical user interface (GUI), and the storage com-
ponents than by the CPU core. Moreover, today’s ubiquitous networking
among computing devices seems to be setting the stage for a future in
which the computing experience is defined at least as much by the coor-
dinated interaction of multiple computers as it is by the performance of
any node in the network.
Nevertheless, to understand and reason about performance at a high
level, it is important to understand the fundamental lower-level contribu-
tors to performance. CPU performance is the driver that forces the many
other system components and features that contribute to overall perfor-
mance to keep up and avoid becoming bottlenecks
PERFORMANCE AS MEASURED BY RAW COMPUTATION
The classic formulation for raw computation in a single CPU core
identifies operating frequency, instruction count, and instructions per cycle
2 Consider the fact that the term “computer system” today encompasses everything from
small handheld devices to Netbooks to corporate data centers to massive server farms that
offer cloud computing to the masses.
OCR for page 60
60 THE FUTURE OF COMPUTING PERFORMANCE
(IPC) as the fundamental low-level components of performance.3 Each
has been the focus of a considerable amount of research and discovery
in the last 20 years. Although detailed technical descriptions of them are
beyond the intended scope of this report, the brief descriptions below will
provide context for the discussions that follow.
· Operating frequency defines the basic clock rate at which the CPU
core runs. Modern high-end processors run at several billion
cycles per second. Operating frequency is a function of the low-
level transistor characteristics in the chip, the length and physi-
cal characteristics of the internal chip wiring, the voltage that
is applied to the chip, and the degree of pipelining used in the
microarchitecture of the machine. The last 15 years have seen
dramatic increases in the operating frequency of CPU cores. As
an unfortunate side effect of that growth, the maximum operat-
ing frequency has often been used as a proxy for performance by
much of the popular press and industry marketing campaigns.
That can be misleading because there are many other important
low-level and system-level measures to consider in reasoning
about performance.
· Instruction count is the number of native instructions—instructions
written for that specific CPU—that must be executed by the CPU
to achieve correct results with a given computer program. Users
typically write programs in high-level programming languages—
such as Java, C, C++ , and C#—and then use a compiler to translate
the high-level program to native machine instructions. Machine
instructions are specific to the instruction set architecture (ISA) that
a given computer architecture or architecture family implements.
For a given high-level program, the machine instruction count
varies when it executes on different computer systems because
of differences in the underlying ISA, in the microarchitecture that
implements the ISA, and in the tools used to compile the pro-
gram. Although this section of the report focuses mostly on the
low-level raw performance measures, the role of the compiler and
other modern software system technologies are also necessary to
understand performance fully.
· Instructions per cycle refers to the average number of instructions
that a particular CPU core can execute and complete in each cycle.
IPC is a strong function of the underlying microarchitecture,
or machine organization, of the CPU core. Many modern CPU
3 John L. Hennessy and David A. Patterson, 2006, Computer Architecture: A Quantitative
Approach, fourth edition, San Francisco, Cal.: Morgan Kauffman.
OCR for page 61
61
WHAT IS COMPUTER PERFORMANCE?
cores use advanced techniques—such as multiple instruction dis-
patch, out-of-order execution, branch prediction, and speculative
execution—to increase the average IPC.4 Those techniques all
seek to execute multiple instructions in a single cycle by using
additional resources to reduce the total number of cycles needed
to execute the program. Some performance assessments focus on
the peak capabilities of the machines; for example, the peak per-
formance of the IBM Power 7 is six instructions per cycle, and that
of the Intel Pentium, four. In reality, those and other sophisticated
CPU cores actually sustain an average of slightly more than one
instruction per cycle when executing many programs. The differ-
ence between theoretical peak performance and actual sustained
performance is an important aspect of overall computer-system
performance.
The program itself provides different forms of parallelism that differ-
ent machine organizations can exploit to achieve performance. The first
type, instruction-level parallelism, describes the amount of nondependent
instructions5 available for parallel execution at any given point in the
program. The program’s instruction-level parallelism in part determines
the IPC component of raw performance mentioned above. (IPC can be
viewed as describing the degree to which a particular machine organiza-
tion can harvest the available instruction-level performance.) The second
type of parallelism is data-level parallelism, which has to do with how data
elements are distributed among computational units for similar types of
processing. Data-level parallelism can be exploited through architectural
and microarchitectural techniques that direct low-level instructions to
operate on multiple pieces of data at the same time. This type of process-
ing is often referred to as single-instruction-multiple-data. The third type
is thread-level parallelism and has to do with the degree to which a program
can be partitioned into multiple sequences of instructions with the intent
of executing them concurrently and cooperatively on multiple processors.
To exploit program parallelism, the compiler or run-time system must
map it to appropriate parallel hardware.
Throughout the history of modern computer architecture, there have
been many attempts to build machines that exploit the various forms of
4 Providing the details of these microarchitecture techniques is beyond the scope of
this publication. See Hennessey & Patterson for more information on these and related
techniques.
5 An instruction X does not depend on instruction Y if X can be performed without using
results from Y. The instruction a = b + c depends on previous instructions that produce
the results b and c and thus cannot be executed until those previous instructions have
completed.
OCR for page 62
62 THE FUTURE OF COMPUTING PERFORMANCE
parallelism. In recent years, owing largely to the emergence of more gen -
eralized and programmable forms of graphics processing units, the inter-
est in building machines that exploit data-level parallelism has grown
enormously. The specialized machines do not offer compatibility with
existing programs, but they do offer the promise of much more per-
formance when presented with code that properly exposes the avail-
able data-level parallelism. Similarly, because of the emergence of chip
multiprocessors, there is considerable renewed interest in understanding
how to exploit thread-level parallelism on these machines more fully.
However, the techniques also highlight the importance of the full suite of
hardware components in modern computer systems, the communication
that must occur among them, and the software technologies that help to
automate application development in order to take advantage of parallel -
ism opportunities provided by the hardware.
COMPUTATION AND COMMUNICATION’S
EFFECTS ON PERFORMANCE
The raw computational capability of CPU cores is an important com -
ponent of system-level performance, but it is by no means the only one. To
complete any useful tasks, a CPU core must communicate with memory, a
broad array of input/output devices, other CPU cores, and in many cases
other computer systems. The overhead and latency of that communica -
tion in effect delays computational progress as the CPU waits for data to
arrive and for system-level interlocks to clear. Such delays tend to reduce
peak computational rates to effective computational rates substantially.
To understand effective performance, it is important to understand the
characteristics of the various forms of communication used in modern
computer systems.
In general, CPU cores perform best when all their operands (the inputs
to the instructions) are stored in the architected registers that are internal
to the core. However, in most architectures, there tend to be few such
registers because of their relatively high cost in silicon area. As a result,
operands must often be fetched from memory before the actual compu -
tation specified by an instruction can be completed. For most computer
systems today, the amount of time it takes to access data from memory
is more than 100 times the single cycle time of the CPU core. And, worse
yet, the gap between typical CPU cycle times and memory-access times
continues to grow. That imbalance would lead to a devastating loss in
performance of most programs if there were not hardware caches in these
systems. Caches hold the most frequently accessed parts of main memory
in special hardware structures that have much smaller latencies than the
main memory system; for example, a typical level-1 cache has an access
OCR for page 63
63
WHAT IS COMPUTER PERFORMANCE?
time that is only 2-3 times slower than the single cycle time of the CPU
core. They leverage a principle called locality of reference that characterizes
common data-access patterns exhibited by most computer programs. To
accommodate large working sets that do not fit in the first-level cache,
many computer systems deploy a hierarchy of caches. The later levels
of caches tend to be increasingly large (up to several megabytes), but as
a result they also have longer access times and resulting latencies. The
concept of locality is important for computer architecture, and Chapter 4
highlights the potential of exploiting locality in innovative ways.
Main memory in most modern computer systems is typically imple -
mented with dynamic random-access memory (DRAM) chips, and it
can be quite large (many gigabytes). However, it is nowhere near large
enough to hold all the addressable memory space available to appli -
cations and the file systems used for long-term storage of data and
programs. Therefore, nonvolatile magnetic-disk-based storage 6 is com-
monly used to hold this much larger collection of data. The access time
for disk-based storage is several orders of magnitude larger than that of
DRAM, which can expose very long delays between a request for data
and the return of the data. As a result, in many computer systems, the
operating system takes advantage of the situation by arranging a “con -
text switch” to allow another pending program to run in the window of
time provided by the long delay in many computer systems. Although
context-switching by the operating system improves the multiprogram
throughput of the overall computer system, it hurts the performance of
any single application because of the associated overhead of the context-
switch mechanics. Similarly, as any given program accesses other system
resources, such as networking and other types of storage devices, the
associated request-response delays detract from the program’s ability to
make use of the full peak-performance potential of the CPU core. Because
each of those subsystems displays different performance characteristics,
the establishment of an appropriate system-level balance among them is
a fundamental challenge in modern computer-system design. As future
technology advances improve the characteristics of the subsystems, new
challenges and opportunities in balancing the overall system arise.
Today, an increasing number of computer systems deploy more than
one CPU core, and this has the potential to improve system performance.
In fact, there are several methods of taking advantage of the potential of
parallelism offered by additional CPU cores, each with distinct advantages
and associated challenges.
6 Nonvolatile storage does not require power to retain its information. A compact disk
(CD) is nonvolatile, for example, as is a computer hard drive, a USB flash key, or a book
like this one.
OCR for page 69
69
WHAT IS COMPUTER PERFORMANCE?
for a given program is important because it provides the results
that are an integral part of the iterative research loop directed by
the researcher. That is an example of performance as time to solu -
tion. (See Box 2.4 for more on time to solution.)
· In small-business settings, the computer system tends to be used
for a very wide array of applications, so high general-purpose
performance is valued.
· For computer systems used in banking and other financial mar-
kets, the reliability and accuracy of the computational results,
even in the face of defects or harsh external environmental con-
ditions, are paramount. Many deployments value gross transac -
tional throughput more than the turnaround time of any given
program, except for financial-transaction turnaround time.
· In some businesses, computer systems are deployed into mission-
critical roles in the overall operation of the business, for example,
e-commerce-based businesses, process automation, health care,
and human safety systems. In those situations, the gross reliabil-
ity and "up time" characteristics of the system can be far more
important than the instantaneous performance of the system at
any given time.
· At the very high end, supercomputer systems tend to work on
large problems with very large amounts of data. The underlying
performance of the memory system can be even more important
than the raw computational capability of the CPU cores involved.
That can be seen as an example of throughput as performance
(see Box 2.5).
Complicating matters a bit more, most computer-system deployments
define some set of important physical constraints on the system. For
example, in the case of a notebook-computer system, important energy-
consumption and physical-size constraints must be met. Similarly, even in
the largest supercomputer deployments, there are constraints on physical
size, weight, power, heat, and cost. Those constraints are several orders
of magnitude larger than in the notebook example, but they still are fun -
damental in defining the resulting performance and utility of the system.
As a result, for a given market opportunity, it often makes sense to gauge
the value of a computer system according to a ratio of performance to
constraints. Indeed, some of the metrics most frequently used today are
such ratios as performance per watt, performance per dollar, and perfor-
mance per area. More generally, most computer-system customers are
placing increasing emphasis on efficiency of computation rather than on
gross performance metrics.
OCR for page 70
70 THE FUTURE OF COMPUTING PERFORMANCE
BOX 2.3
Hardware Components
A car is not just an engine. It has a cooling system to keep the engine running
efficiently and safely, an environmental system to do the same for the drivers
and passengers, a suspension system to improve the ride, a transmission so
that the engine’s torque can be applied to the drive wheels, a radio so that the
driver can listen to classic-rock stations, and cupholders and other convenience
features. One might still have a useful vehicle if the radio and cupholders were
missing, but the other features must be present because they all work in har-
mony to achieve the function of propelling the vehicle controllably and safely.
Computer systems are similar. The CPU tends to get much more than its
proper share of attention, but it would be useless without memory and I/O
subsystems. CPUs function by fetching their instructions from memory. How
did the instructions get into memory, and where did they come from? The in-
structions came from a file on a hard disk and traversed several buses (commu-
nication pathways) to get to memory. Many of the instructions, when executed
by the CPU, cause additional memory traffic and I/O traffic. When we speak of
the overall performance of a computer system, we are implicitly referring to
the overall performance of all those systems operating together. For any given
workload, it is common to find that one of the “links in the chain” is, in fact,
the weakest link. For instance, one can write a program that only executes CPU
operations on data that reside in the CPU’s own register file or its internal data
cache. We would refer to such a program as “CPU-bound,” and it would run
as fast as the CPU alone could perform it. Speeding up the memory or the I/O
system would have no discernible effect on measured performance for that
benchmark. Another benchmark could be written, however, that does little else
but perform memory load and store operations in such a way that the CPU’s
internal cache is ineffective. Such a benchmark would be bound by the speed
THE INTERPLAY OF SOFTWARE AND PERFORMANCE
Although the amazing raw performance gains of the microproces-
sor over the last 20 years has garnered most of the attention, the overall
performance and utility of computer systems are strong functions of both
hardware and software. In fact, as computer systems have deployed more
hardware, they have depended more and more on software technologies
to harness their computational capability. Software has exploited that
capability directly and indirectly. Software has directly exploited increases
in computing capability by adding new features to existing software,
by solving larger problems more accurately, and by solving previously
unsolvable problems. It has indirectly exploited the capability through the
use of abstractions in high-level programming languages, libraries, and
virtual-machine execution environments. By using high-level program -
ming languages and exploiting layers of abstraction, programmers can
OCR for page 71
71
WHAT IS COMPUTER PERFORMANCE?
of memory (and possibly by the bus that carries the traffic between the CPU
and memory.) A third benchmark could be constructed that hammers on the
I/O subsystem with little dependence on the speed of either the CPU or the
memory.
Handling most real workloads relies on all three computer subsystems, and
their performance metrics therefore reflect the combined speed of all three.
Speed up only the CPU by 10 percent, and the workload is liable to speed up,
but not by 10 percent—it will probably speed up in a prorated way because only
the sections of the code that are CPU-bound will speed up. Likewise, speed up
the memory alone, and the workload performance improves, but typically much
less than the memory speedup in isolation. Numerous other pieces of com-
puter systems make up the hardware. The CPU architectures and microarchi-
tectures encompass instruction sets, branch-prediction algorithms, and other
techniques for higher performance. Storage (disks and memory) is a central
component. Memory, flash drives, traditional hard drives, and all the technical
details associated with their performance (such as bandwidth, latency, caches,
volatility, and bus overhead) are critical for a system’s overall performance. In
fact, information storage (hard-drive capacity) is understood to be increasing
even faster than transistor counts on the traditional Moore’s law curve,1 but it is
unknown how long this will continue. Switching and interconnect components,
from switches to routers to T1 lines, are part of every level of a computer system.
There are also hardware interface devices (keyboards, displays, and mice). All
those pieces can contribute to what users perceive of as the “performance” of
the system with which they are interacting.
1 This phenomenon has been dubbed Kryder’s law after Seagate executive Mark Kryder (Chip
Walter, 2005, Kryder’s law, Scientific American 293: 32-33, available online at http://www.
scientificamerican.com/article.cfm?id=kryders-law).
express their algorithms more succinctly and modularly and can compose
and reuse software written by others. Those high-level programming
constructs make it easier for programmers to develop correct complex
programs faster. Abstraction tends to trade increased human programmer
productivity for reduced software performance, but the past increases in
single-processor performance essentially hid much of the performance
cost. Thus, modern software systems now have and rely on multiple lay -
ers of system software to execute programs. The layers can include oper-
ating systems, runtime systems, virtual machines, and compilers. They
offer both an opportunity for introducing and managing parallelism and a
challenge in that each layer must now also understand and exploit paral-
lelism. The committee discusses those issues in more detail in Chapter 4
and summarizes the performance implications below.
The key performance driver to date has been software portability. Once
OCR for page 72
72 THE FUTURE OF COMPUTING PERFORMANCE
BOX 2.4
Time To Solution
Consider a jackhammer on a city street. Assume that using a jackhammer is
not a pastime enjoyable in its own right—the goal is to get a job done as soon
as possible. There are a few possible avenues for improvement: try to make
the jackhammer’s chisel strike the pavement more times per second; make
each stroke of the jackhammer more effective, perhaps by putting more power
behind each stroke; or think of ways to have the jackhammer drive multiple
chisels per stroke. All three possibilities have analogues in computer design,
and all three have been and continue to be used. The notion of “getting the
job done as soon as possible” is known in the computer industry as time to
solution and has been the traditional metric of choice for system performance
since computers were invented.
Modern computer systems are designed according to a synchronous, pipe-
lined schema. Synchronous means occurring at the same time. Synchronous
digital systems are based on a system clock, a specialized timer signal that
coordinates all activities in the system. Early computers had clock frequencies
in the tens of kilohertz. Contemporary microprocessor designs routinely sport
clocks with frequencies of over about 3-GHz range. To a first approximation,
the higher the clock rate, the higher the system performance. System designers
cannot pick arbitrarily high clock frequencies, however—there are limits to the
speed at which the transistors and logic gates can reliably switch, limits to how
quickly a signal can traverse a wire, and serious thermal power constraints that
worsen in direct proportion to the clock frequency. Just as there are physical
limits on how fast a jackhammer’s chisel can be driven downward and then
retracted for the next blow, higher computer clock rates generally yield faster
time-to-solution results, but there are several immutable physical constraints
on the upper limit of those clocks, and the attainable performance speedups
are not always proportional to the clock-rate improvement.
How much a computer system can accomplish per clock cycle varies widely
from system to system and even from workload to workload in a given system.
More complex computer-instruction sets, such as Intel’s x86, contain instruc-
tions that intrinsically accomplish more than a simpler instruction set, such as
that embodied in the ARM processor in a cell phone; but how effective the com-
plex instructions are is a function of how well a compiler can use them. Recent
a program has been created, debugged, and put into practical use, end
users’ expectation is that the program not only will continue to operate
correctly when they buy a new computer system but also will run faster
on a new system that has been advertised as offering increased perfor-
mance. More generally, once a large collection of programs have become
available for a particular computing platform, the broader expectation is
that they will all continue to work and speed up in later machine genera -
OCR for page 73
73
WHAT IS COMPUTER PERFORMANCE?
additions to historical instruction sets—such as Intel’s SSE 1, 2, 3, and 4—attempt
to accomplish more work per clock cycle by operating on grouped data that are
in a compressed format (the equivalent of a jackhammer that drives multiple
chisels per stroke). Substantial system-performance improvements, such as fac-
tors of 2-4, are available to workloads that happen to fit the constraints of the
instruction-set extensions.
There is a special case of time-to-solution workloads: those which can be
successfully sped up with dedicated hardware accelerators. Graphics process-
ing units (GPUs)—such as those from NVIDIA, from ATI, and embedded in some
Intel chipsets—are examples. These processors were designed originally to
handle the demanding computational and memory bandwidth requirements of
3D graphics but more recently have evolved to include more general program-
mability features. With their intrinsically massive floating-point horsepower, 10
or more times higher than is available in the general-purpose (GP) micropro-
cessor, these chips have become the execution engine of choice for some im-
portant workloads. Although GPUs are just as constrained by the exponentially
rising power dissipation of modern silicon as are the GPs, GPUs are 1-2 orders
of magnitude more energy-efficient for suitable workloads and can therefore
accomplish much more processing within a similar power budget.
Applying multiple jackhammers to the pavement has a direct analogue in the
computer industry that has recently become the primary development avenue
for the hardware vendors: “multicore.” The computer industry’s pattern has
been for the hardware makers to leverage a new silicon process technology to
make a software-compatible chip that is substantially faster than any previous
chips. The new, higher-performing systems are then capable of executing soft-
ware workloads that would previously have been infeasible; the attractiveness
of the new software drives demand for the faster hardware, and the virtuous
cycle continues. A few years ago, however, thermal-power dissipation grew to
the limits of what air cooling can accomplish and began to constrain the attain-
able system performance directly. When the power constraints threatened to
diminish the generation-to-generation performance enhancements, chipmak-
ers Intel and AMD turned away from making ever more complex microarchitec-
tures on a single chip and began placing multiple processors on a chip instead.
The new chips are called multicore chips. Current chips have several processors
on a single die, and future generations will have even more.
tions. Indeed, not only has the remarkable speedup offered by industry
standard (×86-compatible) microprocessors over the last 20 years forged
compatibility expectation in the industry, but its success has hindered the
development of alternative, noncompatible computer systems that might
otherwise have kindled new and more scalable programming paradigms.
As the microprocessor industry shifts to multicore processors, the rate of
improvement of each individual processor is substantially diminished.
OCR for page 74
74 THE FUTURE OF COMPUTING PERFORMANCE
BOX 2.5
Throughput
There is another useful performance metric besides time to solution, and
the Internet has pushed it to center stage: system throughput. Consider a Web
server, such as one of the machines at search giant Google. Those machines run
continuously, and their work is never finished, in that new requests for service
continue to arrive. For any given request for service, the user who made the
request may care about time to solution, but the overall performance metric for
the server is its throughput, which can be thought of informally as the number
of jobs that the server can satisfy simultaneously. Throughput will determine
the number and configuration of servers and hence the overall installation cost
of the server “farm.”
Before multicore chips, the computer industry’s efforts were aimed primarily
at decreasing the time to solution of a system. When a given workload required
the sequential execution of several million operations, a faster clock or a more
capable microarchitecture would satisfy the requirement. But compilers are not
generally capable of targeting multiple processors in pursuit of a single time-to-
solution target; they know how to target one processor. Multicore chips there-
fore tend to be used as throughput enhancers. Each available CPU core can pop
the next runnable process off the ready list, thus increasing the throughput of
the system by running multiple processes concurrently. But that type of concur-
rency does not automatically improve the time to solution of any given process.
Modern multithreading programming environments and their routine suc-
cessful use in server applications hold out the promise that applying multiple
threads to a single application may yet improve time to solution for multicore
platforms. We do not yet know to what extent the industry’s server multithread-
ing successes will translate to other market segments, such as mobile or desktop
computers. It is reasonably clear that although time-to-solution performance is
topping out, throughput can be increased indefinitely. The as yet unanswered
question is whether the buying public will find throughput enhancements as
irresistible as they have historically found time-to-solution improvements.
The net result is that the industry is ill prepared for the rather sudden
shift from ever-increasing single-processor performance to the presence
of increasing numbers of processors in computer systems. (See Box 2.6
for more on instruction-set architecture compatibility and possible future
outcomes.)
The reason that industry is ill prepared is that an enormous amount
of existing software does not use thread-level or data-level parallelism—
software did not need it to obtain performance improvements, because
users simply needed to buy new hardware to get performance improve -
ments. However, only programs that have these types of parallelism will
experience improved performance in the chip multiprocessor era. Fur-
OCR for page 75
75
WHAT IS COMPUTER PERFORMANCE?
thermore, even for applications with thread-level and data-level parallel -
ism, it is hard to obtain improved performance with chip multiprocessor
hardware because of communication costs and competition for shared
resources, such as cache memory. Although expert programmers in such
application domains as graphics, information retrieval, and databases
have successfully exploited those types of parallelism and attained per-
formance improvements with increasing numbers of processors, these
applications are the exception rather than the rule.
Writing software that expresses the type of parallelism that hardware
based on chip multiprocessors will be able to improve is the main obstacle
because it requires new software-engineering processes and tools. The pro-
cesses and tools include training programmers to solve their problems
with “parallel computational thinking,” new programming languages
that ease the expression of parallelism, and a new software stack that can
exploit and map the parallelism to hardware that is evolving. Indeed, the
outlook for overcoming this obstacle and the ability of academics and
industry to do it are primary subjects of this report.
THE ECONOMICS OF COMPUTER PERFORMANCE
There should be little doubt that computers have become an indis -
pensable tool in a broad array of businesses, industries, research endeav-
ors, and educational institutions. They have enabled profound improve -
ment in automation, data analysis, communication, entertainment, and
personal productivity. In return, those advances have created a virtu-
ous economic cycle in the development of new technologies and more
advanced computing systems. To understand the sustainability of con-
tinuing improvements in computer performance, it is important first to
understand the health of this cycle, which is a critical economic underpin-
ning of the computer industry.
From a purely technological standpoint, the engineering community
has proved to be remarkably innovative in finding ways to continue to
reduce microelectronic feature sizes. First, of course, industry has inte -
grated more and more transistors into the chips that make up the com-
puter systems. Fortunate side effects are improvements in speed and
power efficiency of the individual transistors. Computer architects have
learned to make use of the increasing numbers and improved characteris-
tics of the transistors to design continually higher-performance computer
systems. The demand for the increasingly powerful computer systems
has generated sufficient revenue to fuel the development of the next
round of technology while providing profits for the companies leading
the charge. Those relationships form the basis of the virtuous economic
OCR for page 76
76 THE FUTURE OF COMPUTING PERFORMANCE
BOX 2.6
Instruction-Set Architecture: Compatibility
This history of computing hardware has been dominated by a few franchises.
IBM first noticed the trend of increasing performance in the 1960s and took ad-
vantage of it with the System/360 architecture. That instruction-set architecture
became so successful that it motivated many other companies to make com-
puter systems that would run the same software codes as the IBM System/360
machines; that is, they were building instruction-set-compatible computers. The
value of that approach is clearest from the end user’s perspective—compatible
systems worked as expected “right out of the box,” with no recompilation, no
alterations to source code, and no tracking down of software bugs that may
have been exposed by the process of migrating the code to a new architecture
and toolset.
With the rise of personal computing in the 1980s, compatibility has come
to mean the degree of compliance with the Intel architecture (also known as
IA-32 or x86). Intel and other semiconductor companies, such as AMD, have
managed to find ways to remain compatible with code for earlier generations
of x86 processors. That compatibility comes at a price. For example, the floating-
point registers in the x86 architecture are organized as a stack, not as a randomly
accessible register set, as all integer registers are. In the 1980s, stacking the
floating-point registers may have seemed like a good idea that would benefit
compiler writers; but in 2008, that stack is a hindrance to performance, and x86-
compatible chips therefore expend many transistors to give the architecturally
required appearance of a stacked floating-point register set—only to spend
more transistors “under the hood” to undo the stack so that modern perfor-
mance techniques can be applied. IA-32’s instruction-set encoding and its seg-
mented addressing scheme are other examples of old baggage that constitute
a tax on every new x86 chip.
There was a time in the industry when much architecture research was ex-
pended on the notion that because every new compatible generation of chips
must carry the aggregated baggage of its past and add ideas to the architecture
to keep it current, surely the architecture would eventually fail of its own ac-
cord, a victim of its own success. But that has not happened. The baggage is
there, but the magic of Moore’s law is that so many additional transistors are
made available in each new generation, that there have always been enough to
reimplement the baggage and to incorporate enough innovation to stay com-
petitive. Over time, such non-x86-compatible but worthy competitors as DEC’s
Alpha, SGI’s MIPS, Sun’s SPARC, and the Motorola/IBM PowerPC architectures
either have found a niche in market segments, such as cell phones or other
embedded products, or have disappeared.
and technology-advancement cycles that have been key underlying driv -
ers in the computer-systems industry over the last 30 years.
There are many important applications of semiconductor technol-
ogy beyond the desire to build faster and faster high-end computer sys-
tems. In particular, the electronics industry has leveraged the advances
OCR for page 77
77
WHAT IS COMPUTER PERFORMANCE?
The strength of the x86 architecture was most dramatically demonstrated
when Intel, the original and major supplier of x86 processors, decided to in-
troduce a new, non-x86 architecture during the transition from 32-bit to 64-bit
addressing in the 1990s. Around the same time, however, AMD created a proces-
sor with 64-bit addressing compatible with the x86 architecture, and customers,
again driven by the existence of the large software base, preferred the 64-bit x86
processor from AMD over the new IA-64 processor from Intel. In the end, Intel
also developed a 64-bit x86-compatible processor that is now far outselling its
IA-64 (Itanium) processor.
With the rise of the cell phone and other portable media and computing
appliances, yet another dominant architectural approach has emerged: the
ARM architecture. The rapidly growing software base for portable applications
running on ARM processors has made the compatible series of processors
licensed by ARM the dominant processors for embedded and portable applica-
tions. As seen in the dominance of the System/360 architecture for mainframe
computers, x86 for personal computers and networked servers, and the ARM
architecture for portable appliances, there will be an opportunity for a new
architecture or architectures as the industry moves to multicore, parallel com-
puting systems. Initial moves to chip-multiprocessor systems are being made
with existing architectures based primarily on the x86. The computing industry
has accumulated a lot of history on this subject, and it appears safe to say that
the era in which compatibility is an absolute requirement will probably end not
because an incompatible but compellingly faster competitor has appeared but
only when one of the following conditions takes hold:
· oftware translation (automatic conversion from code compiled for one
S
architecture to be suitable for running on another) becomes ubiquitous
and so successful that strict hardware compatibility is no longer neces-
sary for the user to reap the historical benefits.
· ulticore performance has “topped out” to the point where most buyers
M
no longer perceive enough benefit to justify buying a new machine to
replace an existing, still-working one.
· he fundamental hardware-software business changes so substantially
T
that the whole idea of compatibility is no longer relevant. Interpreted
and dynamically compiled languages—such as Java, PHP, and JavaScript
(“write once run anywhere”)—are harbingers of this new era. Although
their performance overhead is sometimes enough for the performance
advantages of compiled code to outweigh programmer productivity, Ja-
vaScript and PHP are fast becoming the languages of choice on the client
side and server side, respectively, for Web applications.
to create a wide variety of new form-factor devices (notebook computers,
smart phones, and GPS receivers, to name just a few). Although each of
those devices tends to have more substantial constraints (size, power, and
cost) than traditional computer systems, they often embody the computa-
tional and networking capabilities of previous generations of higher-end
OCR for page 78
78 THE FUTURE OF COMPUTING PERFORMANCE
computer systems. In light of the capabilities of the smaller form-factor
devices, they will probably play an important role in unleashing the
aggregate performance potential of larger-scale networked systems in
the future. Those additional market opportunities have strong economic
underpinnings of their own, and they have clearly reaped benefits from
deploying technology advances driven into place by the computer-sys-
tems industry. In many ways, the incredible utility of computing not only
has provided direct improvement in productivity in many industries but
also has set the stage for amazing growth in a wide array of codependent
products and industries.
In recent years, however, we have seen some potentially troublesome
changes in the traditional return on investment embedded in this virtuous
cycle. As we approach more of the fundamental physical limits of tech-
nology, we continue to see dramatic increases in the costs associated with
technology development and in the capital required to build fabrication
facilities to the point where only a few companies have the wherewithal
even to consider building these facilities. At the same time, although we
can pack more and more transistors into a given area of silicon, we are
seeing diminishing improvements in transistor performance and power
efficiency. As a result, computer architects can no longer rely on those
sorts of improvements as means of building better computer systems and
now must rely much more exclusively on making use of the increased
transistor-integration capabilities.
Our progress in identifying and meeting the broader value proposi-
tions has been somewhat mixed. On the one hand, multiple processor
cores and other system-level features are being integrated into monolithic
pieces of silicon. On the other hand, to realize the benefits of the multi -
processor machines, the software that runs on them must be conceived
and written in a different way from what most programmers are accus-
tomed to. From an end-user perspective, the hardware and the software
must combine seamlessly to offer increased value. It is increasingly clear
that the computer-systems industry needs to address those software and
programmability concerns or risk the ability to offer the next round of
compelling customer value. Without performance incentives to buy the
next generation of hardware, the economic virtuous cycle is likely to break
down, and this would have widespread negative consequences for many
industries.
In summary, the sustained viability of the computer-systems indus-
try is heavily influenced by an underlying virtuous cycle that connects
continuing customer perception of value, financial investments, and new
products getting to market quickly. Although one of the primary indica-
tors of value has traditionally been the ever-increasing performance of
each individual compute node, the next round of technology improve -
OCR for page 79
79
WHAT IS COMPUTER PERFORMANCE?
ments on the horizon will not automatically enhance that value. As a
result, many computer systems under development are betting on the
ability to exploit multiple processors and alternative forms of parallel -
ism in place of the traditional increases in the performance of individual
computing nodes. To make good on that bet, there need to be substantial
breakthroughs in the software-engineering processes that enable the new
types of computer systems. Moreover, attention will probably be focused
on high-level performance issues in large systems at the expense of time
to market and the efficiency of the virtuous cycle.