memory in units of cache lines wastes a large fraction (as much as 94 percent) of local memory bandwidth when only a single word of the cache line is needed.
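The size of this waste follows from simple arithmetic, sketched below. The text does not say which line and word sizes lie behind the 94 percent figure, so the 8-byte word and the list of line sizes are assumptions chosen for illustration; an 8-byte word fetched in a 128-byte line is one combination that gives about that value.

```python
def wasted_fraction(line_bytes: int, word_bytes: int = 8) -> float:
    """Fraction of a cache-line transfer left unused when only one word is needed.

    The 8-byte default word size is an assumption, not a figure from the text.
    """
    if word_bytes <= 0 or line_bytes < word_bytes:
        raise ValueError("a cache line must hold at least one word")
    return 1.0 - word_bytes / line_bytes


if __name__ == "__main__":
    # Illustrative line sizes; 512 bytes matches the level 3 line size
    # that the notes give for the IBM Power 4.
    for line_bytes in (32, 64, 128, 512):
        print(f"{line_bytes:4d}-byte line: {wasted_fraction(line_bytes):.1%} wasted")
```

Under these assumptions the wasted fraction grows with line size, so longer cache lines penalize single-word accesses more heavily.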

Many scientific applications have sufficient spatial and temporal locality that they provide better performance per unit cost on commodity processors than on custom processors. Some scientific applications can be solved more quickly using custom processors but at a higher cost. Some users will pay that cost; others will tolerate longer times to solution or restrict the problems they can solve to save money. A small set of scientific applications that are bandwidth-intensive can be solved both more quickly and more cheaply using custom processors. However, because this application class is small, the market for custom processors is quite small.[4]
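The three classes of application described here can be told apart by comparing time to solution and cost on each kind of system. The sketch below is purely illustrative: the `Run` type, the `classify` function, and every number in the usage lines are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class Run:
    hours: float  # time to solution (hypothetical units)
    cost: float   # total cost of obtaining the solution (hypothetical units)


def classify(commodity: Run, custom: Run) -> str:
    """Place an application in one of the three classes described in the text."""
    custom_faster = custom.hours < commodity.hours
    custom_cheaper = custom.cost < commodity.cost
    if custom_faster and custom_cheaper:
        # The small, bandwidth-intensive class.
        return "custom faster and cheaper"
    if custom_faster:
        # Users either pay the premium or accept a longer time to solution.
        return "custom faster, costlier"
    # Enough locality that the custom system brings no speed advantage.
    return "commodity at least as fast"


if __name__ == "__main__":
    # Made-up figures, chosen only to exercise each branch.
    print(classify(Run(hours=10, cost=100), Run(hours=12, cost=400)))
    print(classify(Run(hours=10, cost=100), Run(hours=4, cost=300)))
    print(classify(Run(hours=40, cost=500), Run(hours=5, cost=300)))
```

A real comparison would use measured times and full system costs; the point here is only that the market argument rests on how few applications land in the first branch.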

In summary, commodity processors optimized for commercial applications meet the needs of most of the scientific computing market. For the majority of scientific applications that exhibit significant spatial and temporal locality, commodity processors are more cost effective than custom processors, making them better capacity machines. For those bandwidth-intensive applications that do not cache well, custom processors are more cost effective and therefore offer better capacity on just those applications. They also offer better turnaround time for a wider range of applications, making them attractive capability machines. However, the segment of the scientific computing market that needs custom processors (bandwidth-intensive and capability computing) is too small to support the free-market development of such processors.

The above discussion focuses on hardware and on the current state of affairs. As the gap between processor speed and memory speed continues to widen, custom processors may become competitive for an increasing range of applications. From the software perspective, systems with fewer, more powerful processors are easier to program. Scaling software applications and tools to systems with tens of thousands or hundreds of thousands of processors is a difficult problem, and its difficulty does not grow linearly with system size. The cost of using, developing, and maintaining applications on custom systems can be substantially less than the comparable cost on commodity systems and, for applications that will run only on custom systems, may cancel out the apparent hardware cost advantage of commodity-based high-performance systems. These issues are discussed in more detail in Chapter 5.


[1] L. Carrington, A. Snavely, X. Gao, and N. Wolter. 2003. A Performance Prediction Framework for Scientific Applications. ICCS Workshop on Performance Modeling and Analysis (PMA03). Melbourne, June.

[2] S. Goedecker and A. Hoisie. 2001. Performance Optimization of Numerically Intensive Codes. Philadelphia, Pa.: SIAM Press.

[3] The IBM Power 4 has a 512-byte level 3 cache line.

[4] This categorization of applications is not immutable. Since commodity systems are cheaper and more broadly available, application programmers have invested significant effort in adapting applications to these systems. Bandwidth-intensive applications are those that are not easily adapted to achieve acceptable performance on commodity systems. In many cases the difficulty seems to be intrinsic to the problem being solved.

The National Academies of Sciences, Engineering, and Medicine
Copyright © National Academy of Sciences. All rights reserved.