1

Mother's Day, Victoria's Secret, and Napster: Network Traffic Modeling

The Nature of the Internet

Andrew Odlyzko of AT&T-Research opened the workshop's session on network traffic modeling by drawing a contrast between modeling of traditional networks (such as telephone and electrical networks) and modeling of the Internet. He noted that for telephone service networks, the needs are well defined—good service requires open channels, no interruptions, and low noise. Their modeling is well understood, and the network characteristics are relatively static. Indeed, one of the few significant changes in voice traffic in the United States is that the peak in phone usage has shifted from Mother 's Day greetings to responses to television call-ins (such as those related to the show “Who Wants to Be a Millionaire”).

The Internet, in contrast, while driven by a set of relatively simple protocols, has dynamics that are much more complex. Walter Willinger (also from AT&T-Research) and others on a workshop panel (see the agenda on p. 16) helped to explain the basics of Internet terminology and operations. The Internet's host computers and routers (the internal switching stations) can be considered as the nodes of a graph. The edges in the graph are links between nodes, some of which are physical and some wireless. Within the Internet are autonomous systems, which are subnets under the control of a single authority. In contrast to the telephone network, the Internet is constantly changing and growing, and the mixes of nodes, traffic, and other network-related attributes are characterized by extreme heterogeneity.

Data traffic is governed by several protocols, with TCP/IP being the most important and most widely supported protocol suite. The Internet Protocol (IP) determines the routing of data. Using an analogy he attributed to Vinton Cerf, Odlyzko likened IP to a postal system within which postcards are sent but are not guaranteed to reach their destination: that is, IP is a communication system for which some likelihood of delivery failure is tolerated. There are usually multiple paths through the network, and routing between a particular source –destination pair can change suddenly, although it tends to be relatively stable over time. Long messages are partitioned into several packets for transmission, with the packet size determined by physical characteristics such as the bandwidth of the link. A typical World Wide Web page might contain about 20,000 bytes, which, for example, might be partitioned into 15 packets for transmission. The distribution of packet sizes is multimodal, typically with pronounced peaks at around 50, 500, and 1,500 bytes.

The Transmission Control Protocol (TCP) works on top of IP and controls the transfer of data across the Internet. TCP is an acknowledgment-based system: when a sender's packet arrives safely at its destination, an acknowledgment packet (ACK) is sent back to the sender. Such an ACK-based system serves a dual purpose. First, if no ACK is received, the sender can assume that the packet has been lost, and re-transmittal can be automatic. Thus, TCP provides a reliable connection for the transfer of data, a functionality that is not provided by IP. Second, measuring the delay between sending a packet and receiving an acknowledgment of its receipt enables the user to estimate the communications delay between sender and receiver, which in turn allows a time limit to be set for declaring that a given packet has not been received. Thus, once a link is established, the rate at which a sender is allowed to send packets can be adjusted up or down depending on whether or not loss of a packet has been inferred. The aim is to avoid network congestion while fully exploiting the available bandwidth within the network, all through the use of information available locally at each server.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop 1 Mother's Day, Victoria's Secret, and Napster: Network Traffic Modeling The Nature of the Internet Andrew Odlyzko of AT&T-Research opened the workshop's session on network traffic modeling by drawing a contrast between modeling of traditional networks (such as telephone and electrical networks) and modeling of the Internet. He noted that for telephone service networks, the needs are well defined—good service requires open channels, no interruptions, and low noise. Their modeling is well understood, and the network characteristics are relatively static. Indeed, one of the few significant changes in voice traffic in the United States is that the peak in phone usage has shifted from Mother 's Day greetings to responses to television call-ins (such as those related to the show “Who Wants to Be a Millionaire”). The Internet, in contrast, while driven by a set of relatively simple protocols, has dynamics that are much more complex. Walter Willinger (also from AT&T-Research) and others on a workshop panel (see the agenda on p. 16) helped to explain the basics of Internet terminology and operations. The Internet's host computers and routers (the internal switching stations) can be considered as the nodes of a graph. The edges in the graph are links between nodes, some of which are physical and some wireless. Within the Internet are autonomous systems, which are subnets under the control of a single authority. In contrast to the telephone network, the Internet is constantly changing and growing, and the mixes of nodes, traffic, and other network-related attributes are characterized by extreme heterogeneity. Data traffic is governed by several protocols, with TCP/IP being the most important and most widely supported protocol suite. The Internet Protocol (IP) determines the routing of data. Using an analogy he attributed to Vinton Cerf, Odlyzko likened IP to a postal system within which postcards are sent but are not guaranteed to reach their destination: that is, IP is a communication system for which some likelihood of delivery failure is tolerated. There are usually multiple paths through the network, and routing between a particular source –destination pair can change suddenly, although it tends to be relatively stable over time. Long messages are partitioned into several packets for transmission, with the packet size determined by physical characteristics such as the bandwidth of the link. A typical World Wide Web page might contain about 20,000 bytes, which, for example, might be partitioned into 15 packets for transmission. The distribution of packet sizes is multimodal, typically with pronounced peaks at around 50, 500, and 1,500 bytes. The Transmission Control Protocol (TCP) works on top of IP and controls the transfer of data across the Internet. TCP is an acknowledgment-based system: when a sender's packet arrives safely at its destination, an acknowledgment packet (ACK) is sent back to the sender. Such an ACK-based system serves a dual purpose. First, if no ACK is received, the sender can assume that the packet has been lost, and re-transmittal can be automatic. Thus, TCP provides a reliable connection for the transfer of data, a functionality that is not provided by IP. Second, measuring the delay between sending a packet and receiving an acknowledgment of its receipt enables the user to estimate the communications delay between sender and receiver, which in turn allows a time limit to be set for declaring that a given packet has not been received. Thus, once a link is established, the rate at which a sender is allowed to send packets can be adjusted up or down depending on whether or not loss of a packet has been inferred. The aim is to avoid network congestion while fully exploiting the available bandwidth within the network, all through the use of information available locally at each server.

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop Although TCP is the most widely used transport protocol, others are also in use. Some applications, such as transmission of voice or video information, do not require lossless transmissions and would not be able to incorporate packets that were retransmitted after loss. Applications such as these might use the User Data Protocol (UDP) which does not guarantee delivery. Losses in video information might be dealt with by encoding in packets of multiple classes and transmitting a low-resolution picture using TCP, with supplemental low-priority or high-resolution packets sent using UDP. Thus, the control of traffic in the Internet is fully distributed, and feedback greatly affects traffic. Empirical results show that traffic is “bursty” and has fractal behavior over a wide range of time scales. This characteristic was discovered by analysis of data at Bellcore in the early 1990s and subsequently verified by a number of independent studies. The main cause of this self-similarity is the heavy-tailed property in the distribution of lengths of files or Web documents. (For a heavy-tailed distribution, P[X>x] ~ x−k, where k is a positive number; i.e., the probability that the length of a file is at least x decays very slowly, namely hyperbolically in x.) Queuing systems with exponential-type distributions are rather easy to model using Markovian techniques, but systems with heavy-tailed distributions, while more realistic, are quite difficult to analyze. Predictive Models of the Internet Currently, it is becoming feasible to create temporal models good enough for studying traffic performance at a single node or router. It would be desirable to have spatio-temporal models that enable study of the performance of the network as a whole (Problem 1). Problem 1. Create spatio-temporal-layered models that represent various communication levels (application, transport, physical layer, and so on) and that capture and predict useful information about Internet behavior. To date, single TCP sessions have been modeled as stationary ergodic loss processes. Many recently developed models assume that a single queue is the bottleneck, with several long-lived and homogeneous TCP sessions coming through; this results in a fluid-like model. TCP connections in a realistic Internet-like context, with its naturally occurring heterogeneities, cannot be modeled yet. Willinger also noted that there has been some study of a new variation of TCP that randomly deletes packets at a linearly increasing rate when the queue length is too great, even if buffers are available. Debasis Mitra of Bell Laboratories, Lucent Technologies, noted that if cost is additive, and if treatment at the next node is not dependent on treatment at past ones, this protocol can be modeled as a Markov decision process. The use of stochastic differential equations would also be an approach to modeling router behavior. If the characteristics of routers (e.g., buffer capacity, number of outward connections, protocols) are poorly modeled, then network designers are likely to overdesign (resulting in networks that are more expensive than necessary) or otherwise have difficulty ensuring good network performance. The cost implications are significant, owing to the number of routers in a network and the expense of wasting bandwidth, especially in wireless systems.

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop New Protocol Designs There was some controversy at the workshop over whether the study of protocols such as TCP was worth a mathematician's time, or whether effort should be devoted instead to developing protocols that would also be easier to model. Willinger, however, believed that many protocols would lead to models that are similar to those developed for TCP, and that TCP would continue to be used for many years. One participant noted the need for control theory expertise in the creation of new protocols, and Jitendra Malik of the University of California at Berkeley proposed a related problem (Problem 2). Problem 2. Is there an effective scalable protocol that does not require global coordination of network traffic? What are necessary and what are the sufficient conditions for scalability? How should we measure the complexity (Hamming window size) of a protocol? Security on the Internet and Understanding Traffic Patterns Sugih Jamin of the University of Michigan discussed some Internet security issues. The Internet was designed to be resilient, available, and scalable; only recently has security become a priority. There are two kinds of attacks: those against information content (violation of the secrecy, the integrity, or the authentication of content) and those against the infrastructure (intrusion or denial of service). There are well-developed tools to deal with attacks on information, including encryption, electronic signatures, and firewalls. But there are no truly effective countermeasures for denial of service. Denial-of-service attacks can be accomplished by inundating a victim computer with so many bogus message packets that useful work is brought to a halt. This can be accomplished by taking over other machines, termed “zombies,” which then all send messages to the targeted computer. For instance, the zombies can all request a connection with the victim by sending a synchronization message (a SYN flood) with a “spoofed ” (false) return address. SYN requests are a standard, and normally benign, part of Internet traffic, but a flood of messages can effectively disable the victim. Alternatively, the zombies can request that the victim echo a message to a spoofed address, a “smurfing” attack. Currently, the only defenses against such attacks are good computer hygiene (to keep machines from being misused as zombies) and, when possible, prosecution of offenders under the criminal justice system (for example, the U.S. Computer Fraud and Abuse Act). Neither of these options prevents harm to the victim of distributed denial-of-service (DDOS) attacks. Problem 3. Can we detect correlated traffic streams across the Internet in real time, transparent to legitimate uses, so that we can detect distributed denial-of-service attacks before they shut down targeted computers? The goal of Problem 3 is to identify patterns of behavior that are suspicious, when the individual components of the attack each look quite legitimate. Such a capability would also be useful in guarding against legitimate overloads, such as the run that shut down the Victoria's Secret Web site in 1999 (when too many individuals attempted to contact it at one time to see a live online fashion show), so that at least some of the users can still be served.

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop One possible approach to this problem is to model the incoming traffic at a server using a hidden Markov model and attempt to find parameters that characterize DDOS attacks, but this can be done only at the autonomous system level. John Baras (University of Maryland) noted that IBM has made available a software system that is capable of analyzing log files in order to detect DDOS. Interestingly enough, this software is similar to a system that is used for identifying subsequences in DNA. But thus far no general solution to DDOS detection is available, leading to open Problem 4. Problem 4. Define and detect the signatures of activities that significantly affect network traffic, such as distributed denial of service and use of Napster. Napster is a mechanism for harvesting audio files from widely distributed servers, that is, all of the machines that have used its services. Napster has created tremendous traffic overload on some autonomous systems, such as university computer systems, and its use has been banned in some places. While Napster itself might disappear because of copyright rulings, it is yet another example of the dynamic nature of Internet traffic characteristics. Network Tomography Donald Towsley of the University of Massachusetts posed an open problem concerning “network tomography” (Problem 5). Problem 5. From observations of point-to-point traffic characteristics, can we reconstruct characteristics of the network? Towsley has done some work on the topic, and spirited discussion at the workshop centered on whether the problem was more analogous to electrical impedance tomography (EIT)1 or to discrete Radon transform tomography. Signal paths for network tomography are sometimes uncertain (as is the case for EIT), but the discrete nature of the problem is more akin to Radon transform tomography, leading workshop participant Tony Chan (UCLA) to ask whether there is a radon transform theory for graphs. Relying on earlier work in this area by Y. Yardi, statisticians at Bell Laboratories2 have worked on a related problem and have obtained results on identifiability. Work at Telcordia Technologies by F. Chung, M. Garrett, R. Graham, and D. Shallcross has shown that reconstructing a parsimonious graph consistent with given distance (or packet round-trip) measurements is an NP-complete problem. The Modeling Problem Ingrid Daubechies (Princeton University) and David Donoho (Stanford University) abstracted additional mathematical problems from the network traffic presentations and discussion at the workshop. Daubechies posed a series of interconnected problems (Problem 6, Problem 7, and Problem 8), some of them close in nature to the problems highlighted above. 1   See, for example, National Research Council, Mathematics and Physics of Emerging Biomedical Imaging, National Academy Press, Washington, D.C., 1996, pp. 143-146, 209, and the references therein. 2   For example, Vander Wiel, S., Cao, J., Davis, D., and Yu, B., Time-varying network tomography: Router link data, Symposium on the Interface: Computing Science and Statistics, Schaumburg, Ill., June 1999.

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop Problem 6. Develop graph theoretical results and multiresolution approaches suitable for modeling the layered Internet. In the multiresolution approach, each node has structure. It would be desirable to make layers within the Internet correspond to the multiresolution, but a different decomposition might be more fruitful. The topology of the network and the bandwidth are both relevant; as Tony Chan noted, there may be connections with the automatic aggregation procedures in algebraic multigrid. There are opportunities for experts in graph theory, statistics, numerical analysis, and graphics to collaborate with network experts. Problem 7. Statisticians and probabilists need to build “structural” models, simpler than the TCP model, that mimic the Internet's important characteristic that the loss probability is a function of the entire network. There are opportunities for experts in statistics, probability, mathematical modeling, and statistical mechanics to collaborate with network experts. Problem 8. Statisticians and probabilists could study data to characterize “invariants” of Internet traffic in order to be able to evaluate models. This is partly a data-mining problem. Donoho summarized the underlying themes as discovering invariants, identifying anomalies, and designing the Internet in a rational way. He noted some cultural differences between network researchers and mathematical scientists: the short time scale over which problems are interesting, massive statistical data sets that can never be completely understood, and the lack of generations of older results to build upon. But he noted how fascinating the Internet is to a statistician accustomed to dealing with naturally occurring systems: the Internet is a humanly designed system whose performance, at least in principle, can be fully measured. He said that he could imagine larger systems evolving in the near future: sensors on every car, for instance, would result in a network with some similar characteristics. The traffic from such ad hoc networks would be poorly structured, just as Internet traffic is, so the Internet is a particularly apt model system. Problems of this sort should also influence statistics curricula, whose focus today is often on how one extracts maximal information from small data sets rather than how one can deal with and extract information from huge amounts of ill-structured data. The network modeling attempts made to date also point up the compelling need for taking new approaches to stochastic modeling and suggest the possible value of involving operations researchers and experts in economic analysis. In the future, the Internet may provide various classes of services, at different prices, creating whole new modeling problems (e.g., Problem 9). Problem 9. What is the simplest mechanism, requiring the least cooperation from routers (in order to ensure scalability), for computing shadow costs (Lagrange multipliers) upon which to base traffic pricing and priority setting?

OCR for page 1
The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop Some work on such issues has been done by S. Low (now at the California Institute of Technology), by the Berkeley group of Anantharam, Varaiya, and Lau, and especially by F. Kelly's group at the University of Cambridge in the United Kingdom. Summary of Network Traffic Modeling Session As Ingrid Daubechies noted, work in network traffic modeling must be interdisciplinary, and there will not be much progress from mathematical scientists working in isolation. The federal Information Technology Research initiative is one funding program that recognizes this reality and encourages collaboration.