2 What Is Computer Performance?

Fast, inexpensive computers are now essential to numerous human endeavors. But less well understood is the need not just for fast computers but also for ever-faster and higher-performing computers at the same or better costs. Exponential growth of the type and scale that have fueled the entire information technology industry is ending.1 In addition, a growing gap between processor performance and memory bandwidth, thermal-power challenges and increasingly expensive energy use, threats to the historical rate of increase in transistor density, and a broad new class of computing applications pose a wide-ranging new set of challenges to the computer industry. Meanwhile, societal expectations for increased technology performance continue apace and show no signs of slowing, and this underscores the need for ways to sustain exponentially increasing performance in multiple dimensions. The essential engine that has met this need for the last 40 years is now in considerable danger, and this has serious implications for our economy, our military, our research institutions, and our way of life.

1 It can be difficult even for seasoned veterans to understand the effects of exponential growth of the sort seen in the computer industry. On one level, industry experts, and even consumers, display an implicit understanding in terms of their approach to application and system development and their expectations of and demands for computing technologies. On another level, that implicit understanding makes it easy to overlook how extraordinary the exponential improvements in performance of the sort seen in the information technology industry actually are.

Five decades of exponential growth in processor performance led to
THE FUTURE OF COMPUTING PERFORMANCE

the rise and dominance of the general-purpose personal computer. The success of the general-purpose microcomputer, which has been due primarily to economies of scale, has had a devastating effect on the development of alternative computer and programming models. The effect can be seen in high-end machines like supercomputers and in low-end consumer devices, such as media processors. Even though alternative architectures and approaches might have been technically superior for the task they were built for, they could not easily compete in the marketplace and were readily overtaken by the ever-improving general-purpose processors available at a relatively low cost. Hence, the personal computer has been dubbed "the killer micro." Over the years, we have seen a series of revolutions in computer architecture, starting with the mainframe, the minicomputer, and the workstation and leading to the personal computer. Today, we are on the verge of a new generation of smart phones, which perform many of the applications that we run on personal computers and take advantage of network-accessible computing platforms (cloud computing) when needed. With each iteration, the machines have been lower in cost per performance and capability, and this has broadened the user base. The economies of scale have meant that as the per-unit cost of the machine has continued to decrease, the size of the computer industry has kept growing because more people and companies have bought more computers. Perhaps even more important, general-purpose single processors—which all these generations of architectures have taken advantage of—can be programmed by using the same simple, sequential programming abstraction at root.
As a result, software investment on this model has accumulated over the years and has led to the de facto standardization of one instruction set, the Intel x86 architecture, and to the dominance of one desktop operating system, Microsoft Windows. The committee believes that the slowing in the exponential growth in computing performance, while posing great risk, may also create a tremendous opportunity for innovation in diverse hardware and software infrastructures that excel as measured by other characteristics, such as low power consumption and delivery of throughput cycles. In addition, the use of the computer has become so pervasive that it is now economical to have many more varieties of computers. Thus, there are opportunities for major changes in system architectures, such as those exemplified by the emergence of powerful distributed, embedded devices, that together will create a truly ubiquitous and invisible computer fabric. Investment in whole-system research is needed to lay the foundation of the computing environment for the next generation. See Figure 2.1 for a graph showing flattening curves of performance, power, and frequency.

Traditionally, computer architects have focused on the goal of creating
computers that perform single tasks as fast as possible. That goal is still important. Because the uniprocessor model we have today is extremely powerful, many performance-demanding applications can be mapped to run on networks of processors by dividing the work up at a very coarse granularity. Therefore, we now have great building blocks that enable us to create a variety of high-performance systems that can be programmed with high-level abstractions. There is a serious need for research and education in the creation and use of high-level abstractions for parallel systems.

FIGURE 2.1 Transistors, frequency, power, performance, and cores over time (1985-2010). The vertical scale is logarithmic. Data curated by Mark Horowitz with input from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. [The figure plots number of transistors (in thousands), relative performance, clock speed (MHz), typical power (W), and number of cores per chip against year of introduction.]

However, single-task performance is no longer the only metric of interest. The market for computers is so large that there is plenty of economic incentive to create more specialized and hence more cost-effective machines. Diversity is already evident. The current trend of moving computation into what is now called the cloud has created great demands for high-throughput systems. For those systems, making each transaction run as fast as possible is not the best thing to do. It is better, for example, to have a larger number of lower-speed processors to optimize the throughput rate and minimize power consumption. It is similarly important to conserve power for hand-held devices. Thus, power consumption is a key performance metric for both high-end servers and consumer hand-held devices. See Box 2.1 for a discussion of embedded computing performance as distinct from more traditional desktop systems. In general, power considerations are likely to lead to a large variety of specialized processors.

BOX 2.1 Embedded Computing Performance

The design of desktop systems often places considerable emphasis on general CPU performance in running desktop workloads. Particular attention is paid to the graphics system, which directly determines which consumer games will run and how well. Mobile platforms, such as laptops and notebooks, attempt to provide enough computing horsepower to run modern operating systems well—subject to the energy and thermal constraints inherent in mobile, battery-operated devices—but tend not to be used for serious gaming, so high-end graphics solutions would not be appropriate. Servers run a different kind of workload from either desktops or mobile platforms, are subject to substantially different economic constraints in their design, and need no graphics support at all. Desktops and mobile platforms tend to value legacy compatibility (for example, that existing operating systems and software applications will continue to run on new hardware), and this compatibility requirement affects the design of the systems, their economics, and their use patterns.

Although desktop, mobile, and server computer systems exhibit important differences from one another, it is natural to group them when comparing them with embedded systems. It is difficult to define embedded systems accurately because their space of applicability is huge—orders of magnitude larger than the general-purpose computing systems of desktops, laptops, and servers. Embedded computer systems can be found everywhere: a car's radio, engine controller, transmission controller, airbag deployment, antilock brakes, and dozens of other places. They are in the refrigerator, the washer and dryer, the furnace controller, the MP3 player, the television set, the alarm clock, the treadmill and stationary bike, the Christmas lights, the DVD player, and the power tools in the garage. They might even be found in ski boots, tennis shoes, and greeting cards. They control the elevators and heating and cooling systems at the office, the video surveillance system in the parking lot, and the lighting, fire protection, and security systems.

Every computer system has economic constraints. But the various systems tend to fall into characteristic financial ranges. Desktop systems once (in 1983) cost $3,000 and now cost from a few hundred dollars to around $1,000. Mobile systems cost more at the high end, perhaps $2,500, down to a few hundred dollars at the low end. Servers vary from a few thousand dollars up to hundreds of thousands for a moderate Web server, a few million dollars for a small corporate server farm, and 1 or 2 orders of magnitude more than that for the huge server farms fielded by large companies, such as eBay, Yahoo!, and Google. Embedded systems tend to be inexpensive. The engine controller under the hood of a car costs the car manufacturer about $3-5. The chips in a cell phone are also in that range. The chip in a tennis shoe or greeting card is about 1/10 that cost. The embedded system that runs such safety-critical systems as elevators will cost thousands of dollars, but that cost is related more to the system packaging, design, and testing than to the silicon that it uses.

One of the hallmarks of embedded systems versus general-purpose computers is that, unlike desktops and servers, embedded performance is not an open-ended boon. Within their cost and power budgets, desktop, laptop, and server systems value as much performance as possible—the more the better. Embedded systems are not generally like that. The embedded chip in a cell phone has a set of tasks to perform, such as monitoring the phone's buttons, placing various messages and images on the display, controlling the phone's energy budget and configuration, and setting up and receiving calls. To accomplish those tasks, the embedded computer system (comprising a central processor, its memory, and I/O facilities) must be capable of some overall performance level. The difference from general-purpose computers is that once that level is reached in the system design, driving it higher is not beneficial; in fact, it is detrimental to the system. Embedded computer systems that are faster than necessary to meet requirements use more energy, dissipate more heat, have lower reliability, and cost more—all for no gain.

Does that mean that embedded processors are now fast enough and have no need to go faster? Are they exempt from the emphasis in this report on "sustaining growth in computing performance"? No. If embedded processor systems were to become faster and all else were held equal, embedded-system designers would find ways of using the additional capability, and delivering new functionalities would come to be expected on those devices. For example, many embedded systems, such as the GPS or audio system in a car, tend to interface directly with human beings. Voice and speech recognition capability greatly enhances that experience, but current systems are not very good at the noise suppression, beam-forming, and speech processing that are required to make this a seamless, enjoyable experience, although progress is being made. Faster computer systems would help to solve that problem. Embedded systems have benefited tremendously from riding an improvement curve equivalent to that of the general-purpose systems and will continue to do so in the future.

The rest of this chapter provides the committee's views on matters related to computer performance today. These views are summarized in the bullet points that follow this paragraph. Readers who accept the committee's views may choose to skip the supporting arguments and move on to the next chapter.

· Increasing computer performance enhances human productivity.
· One measure of single-processor performance is the product of operating frequency, instruction count, and instructions per cycle.
· Performance comes directly from faster devices and indirectly from using more devices in parallel.
· Parallelism can be helpfully divided into instruction-level parallelism, data-level parallelism, and thread-level parallelism.
· Instruction-level parallelism has been extensively mined, but there is now broad interest in data-level parallelism (for example, due to graphics processing units) and thread-level parallelism (for example, due to chip multiprocessors).
· Computer-system performance requires attention beyond processors to memories (such as dynamic random-access memory), storage (for example, disks), and networking.
· Some computer systems seek to improve responsiveness (for example, timely feedback to a user's request), and others seek to improve throughput (for example, handling many requests quickly).
· Computers today are implemented with integrated circuits (chips) that incorporate numerous devices (transistors) whose population (measured as transistors per chip) has been doubling every 1.5-2 years (Moore's law).
· Assessing the performance delivered to a user is difficult and depends on the user's specific applications.
· Large parts of the potential performance gain due to device innovations have been usefully applied to productivity gains (for example, via instruction-set compatibility and layers of software).
· Improvements in computer performance and cost have enabled creative product innovations that generated computer sales that, in turn, enabled a virtuous cycle of computer and product innovations.

WHY PERFORMANCE MATTERS

Humans design machinery to solve problems. Measuring how well machines perform their tasks is of vital importance for improving them, conceiving better machines, and deploying them for economic benefit. Such measurements often speak of a machine's performance, and many aspects of a machine's operations can be characterized as performance.
For example, one aspect of an automobile’s performance is the time it takes to accelerate from 0 to 60 mph; another is its average fuel economy. Braking ability, traction in bad weather conditions, and the capacity to tow trailers are other measures of the car’s performance. Computer systems are machines designed to perform information processing and computation. Their performance is typically measured by how much information processing they can accomplish per unit time,
but there are various perspectives on what type of information processing to consider when measuring performance and on the right time scale for such measurements. Those perspectives reflect the broad array of uses and the diversity of end users of modern computer systems. In general, the systems are deployed and valued on the basis of their ability to improve productivity. For some users, such as scientists and information technology specialists, the improvements can be measured in quantitative terms. For others, such as office workers and casual home users, the performance and resulting productivity gains are more qualitative. Thus, no single measure of performance or productivity adequately characterizes computer systems for all their possible uses.2

On a more technical level, modern computer systems deploy and coordinate a vast array of hardware and software technologies to produce the results that end users observe. Although the raw computational capabilities of the central processing unit (CPU) core tend to get the most attention, the reality is that performance comes from a complex balance among many cooperating subsystems. In fact, the underlying performance bottlenecks of some of today's most commonly used large-scale applications, such as Web searching, are dominated by the characteristics of memory devices, disk drives, and network connections rather than by the CPU cores involved in the processing. Similarly, the interactive responsiveness perceived by end users of personal computers and hand-held devices is typically defined more by the characteristics of the operating system, the graphical user interface (GUI), and the storage components than by the CPU core.
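The point about bottlenecks can be made concrete with a back-of-the-envelope latency budget. The short Python sketch below totals hypothetical per-component delays for a single request; every number is invented purely for illustration (none is a measurement of a real system), but the exercise shows why speeding up the CPU barely helps when the disk or network dominates.

```python
# Hypothetical latency budget for one request (illustrative values, ms).
COMPONENTS = {
    "cpu_compute": 0.5,          # raw computation on the CPU core
    "dram_stalls": 2.0,          # aggregate main-memory stall time
    "disk_io": 8.0,              # one seek-bound read from disk
    "network_round_trip": 40.0,  # round trip to the remote client
}

def total_latency(components):
    """End-to-end latency is the sum of the per-component delays."""
    return sum(components.values())

def dominant_component(components):
    """The bottleneck is the component contributing the most delay."""
    return max(components, key=components.get)

if __name__ == "__main__":
    print(f"total: {total_latency(COMPONENTS):.2f} ms, "
          f"bottleneck: {dominant_component(COMPONENTS)}")
    # Halving CPU time removes only 0.25 ms of a ~50 ms total.
    faster_cpu = dict(COMPONENTS, cpu_compute=COMPONENTS["cpu_compute"] / 2)
    print(f"with 2x faster CPU: {total_latency(faster_cpu):.2f} ms")
```

Under these assumed numbers, doubling CPU speed shaves a fraction of a percent off the total; only improving the dominant component moves the result noticeably.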
Moreover, today's ubiquitous networking among computing devices seems to be setting the stage for a future in which the computing experience is defined at least as much by the coordinated interaction of multiple computers as it is by the performance of any node in the network. Nevertheless, to understand and reason about performance at a high level, it is important to understand the fundamental lower-level contributors to performance. CPU performance is the driver that forces the many other system components and features that contribute to overall performance to keep up and avoid becoming bottlenecks.

2 Consider the fact that the term "computer system" today encompasses everything from small handheld devices to Netbooks to corporate data centers to massive server farms that offer cloud computing to the masses.

PERFORMANCE AS MEASURED BY RAW COMPUTATION

The classic formulation for raw computation in a single CPU core identifies operating frequency, instruction count, and instructions per cycle
(IPC) as the fundamental low-level components of performance.3 Each has been the focus of a considerable amount of research and discovery in the last 20 years. Although detailed technical descriptions of them are beyond the intended scope of this report, the brief descriptions below will provide context for the discussions that follow.

· Operating frequency defines the basic clock rate at which the CPU core runs. Modern high-end processors run at several billion cycles per second. Operating frequency is a function of the low-level transistor characteristics in the chip, the length and physical characteristics of the internal chip wiring, the voltage that is applied to the chip, and the degree of pipelining used in the microarchitecture of the machine. The last 15 years have seen dramatic increases in the operating frequency of CPU cores. As an unfortunate side effect of that growth, the maximum operating frequency has often been used as a proxy for performance by much of the popular press and industry marketing campaigns. That can be misleading because there are many other important low-level and system-level measures to consider in reasoning about performance.
· Instruction count is the number of native instructions—instructions written for that specific CPU—that must be executed by the CPU to achieve correct results with a given computer program. Users typically write programs in high-level programming languages—such as Java, C, C++, and C#—and then use a compiler to translate the high-level program to native machine instructions. Machine instructions are specific to the instruction set architecture (ISA) that a given computer architecture or architecture family implements.
For a given high-level program, the machine instruction count varies when it executes on different computer systems because of differences in the underlying ISA, in the microarchitecture that implements the ISA, and in the tools used to compile the program. Although this section of the report focuses mostly on the low-level raw performance measures, the role of the compiler and other modern software system technologies must also be considered if performance is to be understood fully.

3 John L. Hennessy and David A. Patterson, 2006, Computer Architecture: A Quantitative Approach, fourth edition, San Francisco, Cal.: Morgan Kaufmann.

· Instructions per cycle refers to the average number of instructions that a particular CPU core can execute and complete in each cycle. IPC is a strong function of the underlying microarchitecture, or machine organization, of the CPU core. Many modern CPU
cores use advanced techniques—such as multiple instruction dispatch, out-of-order execution, branch prediction, and speculative execution—to increase the average IPC.4 Those techniques all seek to execute multiple instructions in a single cycle by using additional resources to reduce the total number of cycles needed to execute the program. Some performance assessments focus on the peak capabilities of the machines; for example, the peak performance of the IBM Power 7 is six instructions per cycle, and that of the Intel Pentium, four. In reality, those and other sophisticated CPU cores actually sustain an average of slightly more than one instruction per cycle when executing many programs. The difference between theoretical peak performance and actual sustained performance is an important aspect of overall computer-system performance.

The program itself provides different forms of parallelism that different machine organizations can exploit to achieve performance. The first type, instruction-level parallelism, describes the amount of nondependent instructions5 available for parallel execution at any given point in the program. The program's instruction-level parallelism in part determines the IPC component of raw performance mentioned above. (IPC can be viewed as describing the degree to which a particular machine organization can harvest the available instruction-level parallelism.) The second type of parallelism is data-level parallelism, which has to do with how data elements are distributed among computational units for similar types of processing. Data-level parallelism can be exploited through architectural and microarchitectural techniques that direct low-level instructions to operate on multiple pieces of data at the same time. This type of processing is often referred to as single-instruction-multiple-data.
The third type is thread-level parallelism and has to do with the degree to which a program can be partitioned into multiple sequences of instructions with the intent of executing them concurrently and cooperatively on multiple processors. To exploit program parallelism, the compiler or run-time system must map it to appropriate parallel hardware.

4 Providing the details of these microarchitecture techniques is beyond the scope of this publication. See Hennessy & Patterson for more information on these and related techniques.
5 An instruction X does not depend on instruction Y if X can be performed without using results from Y. The instruction a = b + c depends on previous instructions that produce the results b and c and thus cannot be executed until those previous instructions have completed.

Throughout the history of modern computer architecture, there have been many attempts to build machines that exploit the various forms of
parallelism. In recent years, owing largely to the emergence of more generalized and programmable forms of graphics processing units, the interest in building machines that exploit data-level parallelism has grown enormously. The specialized machines do not offer compatibility with existing programs, but they do offer the promise of much more performance when presented with code that properly exposes the available data-level parallelism. Similarly, because of the emergence of chip multiprocessors, there is considerable renewed interest in understanding how to exploit thread-level parallelism on these machines more fully. However, the techniques also highlight the importance of the full suite of hardware components in modern computer systems, the communication that must occur among them, and the software technologies that help to automate application development in order to take advantage of parallelism opportunities provided by the hardware.

COMPUTATION AND COMMUNICATION'S EFFECTS ON PERFORMANCE

The raw computational capability of CPU cores is an important component of system-level performance, but it is by no means the only one. To complete any useful tasks, a CPU core must communicate with memory, a broad array of input/output devices, other CPU cores, and in many cases other computer systems. The overhead and latency of that communication in effect delays computational progress as the CPU waits for data to arrive and for system-level interlocks to clear. Such delays tend to reduce peak computational rates to effective computational rates substantially. To understand effective performance, it is important to understand the characteristics of the various forms of communication used in modern computer systems.

In general, CPU cores perform best when all their operands (the inputs to the instructions) are stored in the architected registers that are internal to the core.
However, in most architectures, there tend to be few such registers because of their relatively high cost in silicon area. As a result, operands must often be fetched from memory before the actual computation specified by an instruction can be completed. For most computer systems today, the amount of time it takes to access data from memory is more than 100 times the single cycle time of the CPU core. And, worse yet, the gap between typical CPU cycle times and memory-access times continues to grow. That imbalance would lead to a devastating loss in performance of most programs if there were not hardware caches in these systems. Caches hold the most frequently accessed parts of main memory in special hardware structures that have much smaller latencies than the main memory system; for example, a typical level-1 cache has an access
time that is only 2-3 times slower than the single cycle time of the CPU core. They leverage a principle called locality of reference that characterizes common data-access patterns exhibited by most computer programs. To accommodate large working sets that do not fit in the first-level cache, many computer systems deploy a hierarchy of caches. The later levels of caches tend to be increasingly large (up to several megabytes), but as a result they also have longer access times and resulting latencies. The concept of locality is important for computer architecture, and Chapter 4 highlights the potential of exploiting locality in innovative ways.

Main memory in most modern computer systems is typically implemented with dynamic random-access memory (DRAM) chips, and it can be quite large (many gigabytes). However, it is nowhere near large enough to hold all the addressable memory space available to applications and the file systems used for long-term storage of data and programs. Therefore, nonvolatile magnetic-disk-based storage6 is commonly used to hold this much larger collection of data. The access time for disk-based storage is several orders of magnitude larger than that of DRAM, which can expose very long delays between a request for data and the return of the data. As a result, in many computer systems, the operating system takes advantage of the situation by arranging a "context switch" to allow another pending program to run in the window of time provided by the long delay. Although context-switching by the operating system improves the multiprogram throughput of the overall computer system, it hurts the performance of any single application because of the associated overhead of the context-switch mechanics.
Similarly, as any given program accesses other system resources, such as networking and other types of storage devices, the associated request-response delays detract from the program's ability to make use of the full peak-performance potential of the CPU core. Because each of those subsystems displays different performance characteristics, the establishment of an appropriate system-level balance among them is a fundamental challenge in modern computer-system design. As future technology advances improve the characteristics of the subsystems, new challenges and opportunities in balancing the overall system arise. Today, an increasing number of computer systems deploy more than one CPU core, and this has the potential to improve system performance. In fact, there are several methods of taking advantage of the potential of parallelism offered by additional CPU cores, each with distinct advantages and associated challenges.

6 Nonvolatile storage does not require power to retain its information. A compact disk (CD) is nonvolatile, for example, as is a computer hard drive, a USB flash key, or a book like this one.
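The arithmetic behind the cache discussion above can be sketched with the standard average-memory-access-time (AMAT) recurrence, in which each cache level is described by a hit time and a hit rate and misses fall through to the next level. The cycle counts and hit rates below are illustrative assumptions only, chosen to echo the text (a level-1 access a few times the CPU cycle time, main memory more than 100 times):

```python
def amat(levels, memory_latency):
    """Average memory access time, in CPU cycles.

    `levels` is a list of (hit_time, hit_rate) pairs ordered from the
    level-1 cache outward; `memory_latency` is the main-memory access
    time. Every access that reaches a level pays that level's hit
    time; misses fall through to the next level (and finally to DRAM).
    """
    expected = 0.0
    reach = 1.0  # fraction of accesses that reach the current level
    for hit_time, hit_rate in levels:
        expected += reach * hit_time
        reach *= 1.0 - hit_rate
    return expected + reach * memory_latency

# A hypothetical two-level hierarchy: L1 hits in 3 cycles 95% of the
# time, L2 hits in 15 cycles 90% of the time, DRAM costs 120 cycles.
hierarchy = [(3, 0.95), (15, 0.90)]
print(amat(hierarchy, 120))  # roughly 4.35 cycles on average
print(amat([], 120))         # no caches: every access pays full DRAM latency
```

With the caches in place, the average access costs a few cycles instead of more than 100, which is exactly the "devastating loss" the text says caches avert.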
for a given program is important because it provides the results that are an integral part of the iterative research loop directed by the researcher. That is an example of performance as time to solution. (See Box 2.4 for more on time to solution.)
· In small-business settings, the computer system tends to be used for a very wide array of applications, so high general-purpose performance is valued.
· For computer systems used in banking and other financial markets, the reliability and accuracy of the computational results, even in the face of defects or harsh external environmental conditions, are paramount. Many deployments value gross transactional throughput more than the turnaround time of any given program, except for financial-transaction turnaround time.
· In some businesses, computer systems are deployed into mission-critical roles in the overall operation of the business, for example, e-commerce-based businesses, process automation, health care, and human safety systems. In those situations, the gross reliability and "up time" characteristics of the system can be far more important than the instantaneous performance of the system at any given time.
· At the very high end, supercomputer systems tend to work on large problems with very large amounts of data. The underlying performance of the memory system can be even more important than the raw computational capability of the CPU cores involved. That can be seen as an example of throughput as performance (see Box 2.5).

Complicating matters a bit more, most computer-system deployments define some set of important physical constraints on the system. For example, in the case of a notebook-computer system, important energy-consumption and physical-size constraints must be met. Similarly, even in the largest supercomputer deployments, there are constraints on physical size, weight, power, heat, and cost.
Those constraints are several orders of magnitude larger than in the notebook example, but they still are fundamental in defining the resulting performance and utility of the system. As a result, for a given market opportunity, it often makes sense to gauge the value of a computer system according to a ratio of performance to constraints. Indeed, some of the metrics most frequently used today are such ratios as performance per watt, performance per dollar, and performance per area. More generally, most computer-system customers are placing increasing emphasis on efficiency of computation rather than on gross performance metrics.
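Such ratio metrics are easy to compute once a benchmark score and the constraint values are in hand. The sketch below compares two hypothetical systems; every figure (the scores, wattages, and prices) is invented purely for illustration.

```python
# Two hypothetical systems; all figures are invented for illustration.
# "score" is a result on some agreed benchmark (higher is better).
systems = {
    "fast_uniprocessor": {"score": 1000.0, "watts": 400.0, "dollars": 8000.0},
    "many_slow_cores":   {"score":  700.0, "watts": 120.0, "dollars": 3000.0},
}

def ratios(spec):
    """Performance-to-constraint ratios for one system."""
    return {
        "score_per_watt": spec["score"] / spec["watts"],
        "score_per_dollar": spec["score"] / spec["dollars"],
    }

for name, spec in systems.items():
    r = ratios(spec)
    print(f"{name}: {r['score_per_watt']:.2f} score/W, "
          f"{r['score_per_dollar']:.3f} score/$")
```

Under these assumed figures, the system with lower raw performance wins on both efficiency ratios, mirroring the chapter's observation that customers increasingly weigh efficiency over gross performance.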
70 THE FUTURE OF COMPUTING PERFORMANCE

BOX 2.3 Hardware Components

A car is not just an engine. It has a cooling system to keep the engine running efficiently and safely, an environmental system to do the same for the drivers and passengers, a suspension system to improve the ride, a transmission so that the engine's torque can be applied to the drive wheels, a radio so that the driver can listen to classic-rock stations, and cupholders and other convenience features. One might still have a useful vehicle if the radio and cupholders were missing, but the other features must be present because they all work in harmony to achieve the function of propelling the vehicle controllably and safely.

Computer systems are similar. The CPU tends to get much more than its proper share of attention, but it would be useless without memory and I/O subsystems. CPUs function by fetching their instructions from memory. How did the instructions get into memory, and where did they come from? The instructions came from a file on a hard disk and traversed several buses (communication pathways) to get to memory. Many of the instructions, when executed by the CPU, cause additional memory traffic and I/O traffic. When we speak of the overall performance of a computer system, we are implicitly referring to the overall performance of all those subsystems operating together.

For any given workload, it is common to find that one of the "links in the chain" is, in fact, the weakest link. For instance, one can write a program that executes only CPU operations on data that reside in the CPU's own register file or its internal data cache. We would refer to such a program as "CPU-bound," and it would run as fast as the CPU alone could perform it. Speeding up the memory or the I/O system would have no discernible effect on measured performance for that benchmark. Another benchmark could be written, however, that does little else but perform memory load and store operations in such a way that the CPU's internal cache is ineffective. Such a benchmark would be bound by the speed of memory (and possibly by the bus that carries the traffic between the CPU and memory). A third benchmark could be constructed that hammers on the I/O subsystem with little dependence on the speed of either the CPU or the memory.

Handling most real workloads relies on all three computer subsystems, and their performance metrics therefore reflect the combined speed of all three. Speed up only the CPU by 10 percent, and the workload is liable to speed up, but not by 10 percent—it will probably speed up in a prorated way, because only the sections of the code that are CPU-bound will speed up. Likewise, speed up the memory alone, and workload performance improves, but typically by much less than the memory speedup in isolation.

Numerous other pieces of computer systems make up the hardware. The CPU architectures and microarchitectures encompass instruction sets, branch-prediction algorithms, and other techniques for higher performance. Storage (disks and memory) is a central component. Memory, flash drives, traditional hard drives, and all the technical details associated with their performance (such as bandwidth, latency, caches, volatility, and bus overhead) are critical for a system's overall performance. In fact, information storage (hard-drive capacity) is understood to be increasing even faster than transistor counts on the traditional Moore's law curve,1 but it is unknown how long this will continue. Switching and interconnect components, from switches to routers to T1 lines, are part of every level of a computer system. There are also hardware interface devices (keyboards, displays, and mice). All those pieces can contribute to what users perceive as the "performance" of the system with which they are interacting.

1 This phenomenon has been dubbed Kryder's law after Seagate executive Mark Kryder (Chip Walter, 2005, Kryder's law, Scientific American 293: 32-33, available online at http://www.scientificamerican.com/article.cfm?id=kryders-law).

THE INTERPLAY OF SOFTWARE AND PERFORMANCE

Although the amazing raw performance gains of the microprocessor over the last 20 years have garnered most of the attention, the overall performance and utility of computer systems are strong functions of both hardware and software. In fact, as computer systems have deployed more hardware, they have depended more and more on software technologies to harness their computational capability. Software has exploited that capability directly and indirectly. Software has directly exploited increases in computing capability by adding new features to existing software, by solving larger problems more accurately, and by solving previously unsolvable problems. It has indirectly exploited the capability through the use of abstractions in high-level programming languages, libraries, and virtual-machine execution environments. By using high-level programming languages and exploiting layers of abstraction, programmers can express their algorithms more succinctly and modularly and can compose and reuse software written by others. Those high-level programming constructs make it easier for programmers to develop correct complex programs faster. Abstraction tends to trade increased programmer productivity for reduced software performance, but past increases in single-processor performance essentially hid much of that performance cost.

Thus, modern software systems now have and rely on multiple layers of system software to execute programs. The layers can include operating systems, runtime systems, virtual machines, and compilers. They offer both an opportunity for introducing and managing parallelism and a challenge in that each layer must now also understand and exploit parallelism. The committee discusses those issues in more detail in Chapter 4 and summarizes the performance implications below.

The key performance driver to date has been software portability. Once
BOX 2.4 Time To Solution

Consider a jackhammer on a city street. Assume that using a jackhammer is not a pastime enjoyable in its own right—the goal is to get the job done as soon as possible. There are a few possible avenues for improvement: try to make the jackhammer's chisel strike the pavement more times per second; make each stroke of the jackhammer more effective, perhaps by putting more power behind each stroke; or think of ways to have the jackhammer drive multiple chisels per stroke. All three possibilities have analogues in computer design, and all three have been and continue to be used. The notion of "getting the job done as soon as possible" is known in the computer industry as time to solution and has been the traditional metric of choice for system performance since computers were invented.

Modern computer systems are designed according to a synchronous, pipelined schema. Synchronous means occurring at the same time: synchronous digital systems are based on a system clock, a specialized timer signal that coordinates all activities in the system. Early computers had clock frequencies in the tens of kilohertz; contemporary microprocessor designs routinely sport clock frequencies of 3 GHz or more. To a first approximation, the higher the clock rate, the higher the system performance. System designers cannot pick arbitrarily high clock frequencies, however—there are limits to the speed at which transistors and logic gates can reliably switch, limits to how quickly a signal can traverse a wire, and serious thermal-power constraints that worsen in direct proportion to the clock frequency. Just as there are physical limits on how fast a jackhammer's chisel can be driven downward and then retracted for the next blow, higher computer clock rates generally yield faster time-to-solution results, but there are several immutable physical constraints on the upper limit of those clocks, and the attainable performance speedups are not always proportional to the clock-rate improvement.

How much a computer system can accomplish per clock cycle varies widely from system to system and even from workload to workload in a given system. More complex computer-instruction sets, such as Intel's x86, contain instructions that intrinsically accomplish more than a simpler instruction set, such as that embodied in the ARM processor in a cell phone; but how effective the complex instructions are is a function of how well a compiler can use them. Recent additions to historical instruction sets—such as Intel's SSE 1, 2, 3, and 4—attempt to accomplish more work per clock cycle by operating on grouped data in a compressed format (the equivalent of a jackhammer that drives multiple chisels per stroke). Substantial system-performance improvements, such as factors of 2-4, are available to workloads that happen to fit the constraints of the instruction-set extensions.

There is a special case of time-to-solution workloads: those which can be successfully sped up with dedicated hardware accelerators. Graphics processing units (GPUs)—such as those from NVIDIA and ATI and those embedded in some Intel chipsets—are examples. These processors were designed originally to handle the demanding computational and memory-bandwidth requirements of 3D graphics but more recently have evolved to include more general programmability features. With their intrinsically massive floating-point horsepower, 10 or more times that available in a general-purpose (GP) microprocessor, these chips have become the execution engine of choice for some important workloads. Although GPUs are just as constrained by the exponentially rising power dissipation of modern silicon as GPs are, GPUs are 1-2 orders of magnitude more energy-efficient for suitable workloads and can therefore accomplish much more processing within a similar power budget.

Applying multiple jackhammers to the pavement has a direct analogue in the computer industry that has recently become the primary development avenue for the hardware vendors: "multicore." The computer industry's pattern has been for the hardware makers to leverage a new silicon process technology to make a software-compatible chip that is substantially faster than any previous chip. The new, higher-performing systems are then capable of executing software workloads that would previously have been infeasible; the attractiveness of the new software drives demand for the faster hardware, and the virtuous cycle continues. A few years ago, however, thermal-power dissipation grew to the limits of what air cooling can accomplish and began to constrain attainable system performance directly. When the power constraints threatened to diminish the generation-to-generation performance enhancements, chipmakers Intel and AMD turned away from making ever more complex microarchitectures on a single chip and began placing multiple processors on a chip instead. The new chips are called multicore chips. Current chips have several processors on a single die, and future generations will have even more.

a program has been created, debugged, and put into practical use, end users' expectation is that the program not only will continue to operate correctly when they buy a new computer system but also will run faster on a new system that has been advertised as offering increased performance. More generally, once a large collection of programs has become available for a particular computing platform, the broader expectation is that they will all continue to work and speed up in later machine generations. Indeed, not only has the remarkable speedup offered by industry-standard (x86-compatible) microprocessors over the last 20 years forged a compatibility expectation in the industry, but its success has hindered the development of alternative, noncompatible computer systems that might otherwise have kindled new and more scalable programming paradigms. As the microprocessor industry shifts to multicore processors, the rate of improvement of each individual processor is substantially diminished.
BOX 2.5 Throughput

There is another useful performance metric besides time to solution, and the Internet has pushed it to center stage: system throughput. Consider a Web server, such as one of the machines at search giant Google. Those machines run continuously, and their work is never finished, in that new requests for service continue to arrive. For any given request for service, the user who made the request may care about time to solution, but the overall performance metric for the server is its throughput, which can be thought of informally as the number of jobs that the server can satisfy simultaneously. Throughput will determine the number and configuration of servers and hence the overall installation cost of the server "farm."

Before multicore chips, the computer industry's efforts were aimed primarily at decreasing the time to solution of a system. When a given workload required the sequential execution of several million operations, a faster clock or a more capable microarchitecture would satisfy the requirement. But compilers are not generally capable of targeting multiple processors in pursuit of a single time-to-solution target; they know how to target one processor. Multicore chips therefore tend to be used as throughput enhancers. Each available CPU core can pop the next runnable process off the ready list, thus increasing the throughput of the system by running multiple processes concurrently. But that type of concurrency does not automatically improve the time to solution of any given process.

Modern multithreading programming environments and their routine successful use in server applications hold out the promise that applying multiple threads to a single application may yet improve time to solution on multicore platforms. We do not yet know to what extent the industry's server multithreading successes will translate to other market segments, such as mobile or desktop computers. It is reasonably clear that although time-to-solution performance is topping out, throughput can be increased indefinitely. The as-yet-unanswered question is whether the buying public will find throughput enhancements as irresistible as they have historically found time-to-solution improvements.

The net result is that the industry is ill prepared for the rather sudden shift from ever-increasing single-processor performance to the presence of increasing numbers of processors in computer systems. (See Box 2.6 for more on instruction-set-architecture compatibility and possible future outcomes.) The reason that the industry is ill prepared is that an enormous amount of existing software does not use thread-level or data-level parallelism—software did not need it to obtain performance improvements, because users simply needed to buy new hardware to get them. However, only programs that have these types of parallelism will experience improved performance in the chip-multiprocessor era. Furthermore, even for applications with thread-level and data-level parallelism, it is hard to obtain improved performance with chip-multiprocessor hardware because of communication costs and competition for shared resources, such as cache memory. Although expert programmers in such application domains as graphics, information retrieval, and databases have successfully exploited those types of parallelism and attained performance improvements with increasing numbers of processors, these applications are the exception rather than the rule.

Writing software that expresses the type of parallelism that chip-multiprocessor hardware can exploit is the main obstacle, because it requires new software-engineering processes and tools. The processes and tools include training programmers to solve their problems with "parallel computational thinking," new programming languages that ease the expression of parallelism, and a new software stack that can exploit and map the parallelism to evolving hardware. Indeed, the outlook for overcoming this obstacle and the ability of academics and industry to do so are primary subjects of this report.

THE ECONOMICS OF COMPUTER PERFORMANCE

There should be little doubt that computers have become an indispensable tool in a broad array of businesses, industries, research endeavors, and educational institutions. They have enabled profound improvements in automation, data analysis, communication, entertainment, and personal productivity. In return, those advances have created a virtuous economic cycle in the development of new technologies and more advanced computing systems. To understand the sustainability of continuing improvements in computer performance, it is important first to understand the health of this cycle, which is a critical economic underpinning of the computer industry.
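Returning to the distinction drawn in Box 2.5, a toy model (with invented job times and core counts) makes the point concrete: adding cores multiplies the number of independent jobs completed per second, but it leaves the time to solution of any single job unchanged.

```python
def throughput_jobs_per_sec(cores, job_seconds):
    """Independent jobs completed per second when each core runs
    one job at a time. A deliberately simplified model: it ignores
    contention for shared resources such as memory and cache."""
    return cores / job_seconds

JOB_SECONDS = 2.0  # time to solution for any one job (invented)

for cores in (1, 4, 16):
    rate = throughput_jobs_per_sec(cores, JOB_SECONDS)
    # Throughput scales with core count; per-job latency does not.
    print(f"{cores:2d} cores: {rate:5.1f} jobs/s, latency {JOB_SECONDS} s")
```

A server farm cares about the jobs/s column, so multicore chips help it directly; a user waiting on one long computation sees only the unchanged latency column unless the program itself is parallelized.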
From a purely technological standpoint, the engineering community has proved remarkably innovative in finding ways to continue to reduce microelectronic feature sizes. First, of course, industry has integrated more and more transistors into the chips that make up computer systems. Fortunate side effects are improvements in the speed and power efficiency of the individual transistors. Computer architects have learned to make use of the increasing numbers and improved characteristics of the transistors to design continually higher-performance computer systems. The demand for increasingly powerful computer systems has generated sufficient revenue to fuel the development of the next round of technology while providing profits for the companies leading the charge. Those relationships form the basis of the virtuous economic
BOX 2.6 Instruction-Set Architecture: Compatibility

The history of computing hardware has been dominated by a few franchises. IBM first noticed the trend of increasing performance in the 1960s and took advantage of it with the System/360 architecture. That instruction-set architecture became so successful that it motivated many other companies to make computer systems that would run the same software as the IBM System/360 machines; that is, they were building instruction-set-compatible computers. The value of that approach is clearest from the end user's perspective—compatible systems worked as expected "right out of the box," with no recompilation, no alterations to source code, and no tracking down of software bugs that may have been exposed by the process of migrating the code to a new architecture and toolset.

With the rise of personal computing in the 1980s, compatibility has come to mean the degree of compliance with the Intel architecture (also known as IA-32 or x86). Intel and other semiconductor companies, such as AMD, have managed to find ways to remain compatible with code for earlier generations of x86 processors. That compatibility comes at a price. For example, the floating-point registers in the x86 architecture are organized as a stack, not as a randomly accessible register set, as all the integer registers are. In the 1980s, stacking the floating-point registers may have seemed like a good idea that would benefit compiler writers; but in 2008, that stack is a hindrance to performance, and x86-compatible chips therefore expend many transistors to give the architecturally required appearance of a stacked floating-point register set—only to spend more transistors "under the hood" to undo the stack so that modern performance techniques can be applied. IA-32's instruction-set encoding and its segmented addressing scheme are other examples of old baggage that constitute a tax on every new x86 chip.

There was a time in the industry when much architecture research was expended on the notion that because every new compatible generation of chips must carry the aggregated baggage of its past and add ideas to the architecture to keep it current, surely the architecture would eventually fail of its own accord, a victim of its own success. But that has not happened. The baggage is there, but the magic of Moore's law is that so many additional transistors become available in each new generation that there have always been enough to reimplement the baggage and to incorporate enough innovation to stay competitive. Over time, such non-x86-compatible but worthy competitors as DEC's Alpha, SGI's MIPS, Sun's SPARC, and the Motorola/IBM PowerPC architectures either have found niches in market segments, such as cell phones or other embedded products, or have disappeared.

and technology-advancement cycles that have been key underlying drivers in the computer-systems industry over the last 30 years.

There are many important applications of semiconductor technology beyond the desire to build faster and faster high-end computer systems. In particular, the electronics industry has leveraged the advances
computer systems. In light of the capabilities of the smaller form-factor devices, they will probably play an important role in unleashing the aggregate performance potential of larger-scale networked systems in the future. Those additional market opportunities have strong economic underpinnings of their own, and they have clearly reaped benefits from deploying technology advances driven into place by the computer-systems industry. In many ways, the incredible utility of computing not only has provided direct improvement in productivity in many industries but also has set the stage for amazing growth in a wide array of codependent products and industries.

In recent years, however, we have seen some potentially troublesome changes in the traditional return on investment embedded in this virtuous cycle. As we approach more of the fundamental physical limits of technology, we continue to see dramatic increases in the costs associated with technology development and in the capital required to build fabrication facilities, to the point where only a few companies have the wherewithal even to consider building these facilities. At the same time, although we can pack more and more transistors into a given area of silicon, we are seeing diminishing improvements in transistor performance and power efficiency. As a result, computer architects can no longer rely on those sorts of improvements as means of building better computer systems and now must rely much more exclusively on making use of the increased transistor-integration capabilities.

Our progress in identifying and meeting the broader value propositions has been somewhat mixed. On the one hand, multiple processor cores and other system-level features are being integrated into monolithic pieces of silicon.
On the other hand, to realize the benefits of the multiprocessor machines, the software that runs on them must be conceived and written in a different way from what most programmers are accustomed to. From an end-user perspective, the hardware and the software must combine seamlessly to offer increased value. It is increasingly clear that the computer-systems industry needs to address those software and programmability concerns or risk losing the ability to offer the next round of compelling customer value. Without performance incentives to buy the next generation of hardware, the economic virtuous cycle is likely to break down, and this would have widespread negative consequences for many industries.

In summary, the sustained viability of the computer-systems industry is heavily influenced by an underlying virtuous cycle that connects continuing customer perception of value, financial investments, and new products getting to market quickly. Although one of the primary indicators of value has traditionally been the ever-increasing performance of each individual compute node, the next round of technology improvements on the horizon will not automatically enhance that value. As a result, many computer systems under development are betting on the ability to exploit multiple processors and alternative forms of parallelism in place of the traditional increases in the performance of individual computing nodes. To make good on that bet, there must be substantial breakthroughs in the software-engineering processes that enable the new types of computer systems. Moreover, attention will probably be focused on high-level performance issues in large systems at the expense of time to market and the efficiency of the virtuous cycle.