Read "Proceedings of a Workshop on Statistics on Networks (CD-ROM)" at NAP.edu

« Previous: Front Matter

Page 1 Cite

Suggested Citation:"Keynote Address, Day 1 Network Complexity and Robustness--John Doyle, California Institute of Technology." National Research Council. 2007. Proceedings of a Workshop on Statistics on Networks (CD-ROM). Washington, DC: The National Academies Press. doi: 10.17226/12083.

Page 2 Cite

Page 3 Cite

Page 4 Cite

Page 5 Cite

Page 6 Cite

Page 7 Cite

Page 8 Cite

Page 9 Cite

Page 10 Cite

Page 11 Cite

Page 12 Cite

Page 13 Cite

Page 14 Cite

Page 15 Cite

Page 16 Cite

Page 17 Cite

Page 18 Cite

Page 19 Cite

Page 20 Cite

Page 21 Cite

Page 22 Cite

Page 23 Cite

Page 24 Cite

Page 25 Cite

Page 26 Cite

Page 27 Cite

Page 28 Cite

Page 29 Cite

Page 30 Cite

Page 31 Cite

Page 32 Cite

Page 33 Cite

Page 34 Cite

Page 35 Cite

Page 36 Cite

Page 37 Cite

Page 38 Cite

Page 39 Cite

Page 40 Cite

Page 41 Cite

Page 42 Cite

Page 43 Cite

Page 44 Cite

Page 45 Cite

Page 46 Cite

Page 47 Cite

Page 48 Cite

Page 49 Cite

Page 50 Cite

Page 51 Cite

Page 52 Cite

Page 53 Cite

Page 54 Cite

Page 55 Cite

Page 56 Cite

Page 57 Cite

Page 58 Cite

Page 59 Cite

Page 60 Cite

Page 61 Cite

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Keynote Address, Day 1 1

Network Complexity and Robustness John Doyle, California Institute of Technology DR. DOYLE: I am going to try to set the stage for this meeting. The work that I am going to talk about is the result of a lot of collaborations. I certainly wonât give justice to those people here. All the good ideas and all the good work that I am going to talk about is done with them, so when I say âweâ I mean âthey.â A lot of them are here today; two of them have posters and Jean Carlson is going to talk later. Walter Willinger is here. I am going to talk a lot about work with them. There are buzzwords related to network complexityânetwork-centric, systems biology, pervasive, embedded. What I am interested in is the core theory challenges that underlie those common themes. I am going to be a little narrow in my focus in the sense that I am going to concentrate on biological networks and technological networks: I want to stick with stuff where we mostly know how the parts work, so that means not much social networks. There is a remarkably common core of theoretical challenges, so from the math-stat side I think there are some really common themes here. There has been recent dramatic progress in laying the foundation, yet there has also been amazingly, at the same time, a striking increase in what I would call unnecessary confusion. I will talk a little bit about both of these. One of the common themes I am going to talk about is the fact that we see power laws all over. I think a lot of people at this workshop are going to be talking about that, because it is a common trait across all advanced technology and biology. Another common theme is that many of these systems in biology and advanced technology are robust yet fragile. What I mean by that is they work really well most of the time but that they fail occasionally, and when they do fail it can be catastrophically. We will hear more about that from other speakers and poster presenters. What I am going to do today, though, is talk about motivation for new theory and also education. We need to educate each other about even some fairly elementary aspects of statistics and mathematics. To get this started on the broadest level possible, I am going to stick with stuff that only assumes the background provided by a Caltech undergraduate education. Letâs start with some data. The figures below show the 20th centuryâs hundred largest disasters worldwide. What I have plotted here are three kinds of disasters: technological disasters in tens of billions of dollars, natural disasters in hundreds of billions of dollars (and natural disasters can be even bigger), and power outages over some period of time and tens of millions of customers. What you see on a log-log chart are roughly straight lines with slopes of minus one. 2

Figure 2 shows the raw data. I have just taken the event sizes and plotted them; the worst events are along the bottom, while the more common events are higher up and to the left. The significance of this is that the worst event is orders of magnitude worse than the median event, which we are learning much more about this last year. It also means that demonstrations of these behaviors (events of the type represented in the upper left corner of Figures 1 and 2) and reality (which includes the events toward the lower right) can be very different, both in scale and in character, with the more common behaviors being robust and the rare behavior displaying the great fragility of these systems. When we take and build our new network-centric embedded- everywhere systems that some of us in this room are going to provide to all of us, they will degrade, and the reality may be much, much worse. This is also an indication of robust yet fragile. The typical behavior is much smaller than the worst case, and the worst case is very, very bad. FIGURE 1 3

FIGURE 2 I am not going to focus so much on disasters. Jean Carlson is going to talk about that tomorrow, but I do want to get some of the ideas behind these kinds of statistics. They are either all you study or you wonât study it at all, and there is not much conversation back and forth between those groups of people. The reason you get straight lines is that logarithms of a power law give you straight lines, and that is the slope alpha. That just describes the data, and I donât assume any kind of model. If I said I am going to think of this as samples from a stochastic process, that is a model. What I have written down is just data and I have done no statistics on it. The fact that I have drawn a line there is just to help your eye visualize things. One important thing to do is to look at data. One of the things that isnât often done in papers is people donât show you their data in a way that you can make judgments about the statistics they have made. We will see that there are lots of errors that arise because of this. Power laws are really ubiquitous. Why is that and what do probability theory and statistics tell us about power laws? Well, if I were to ask you why are Gaussians so ubiquitous you would say it is central limit theorem. If I asked you why the exponential distribution is so popular or so common you would probably talk about the Markov properties or the marginalization properties, which are essentially, mathematically the same thing. They both occur in situations of low variability, so expect to see Gaussians and exponentials all over the 4

place. When it comes to systems with high variability, power laws actually have even more strong statistical properties than Gaussians or exponentials. In fact, they have all that Gaussians and exponentials have, and even more. They are in some senseâI think Walter Willinger coined this phraseâmore normal than normal. We should not call Gaussians normal; we should call power laws normal. They arise for no reason other than the way you measure data, and the way you look at the world will make them appear all over the place, provided there is high variability. Much of science goes out of its way to get rid of high variability. We study in our labs low-variability phenomena, but high variability is everywhere. With the former, you get convergent moments, while the latter corresponds to divergent moments. Typical notions of moments and mean and variance donât mean anything. We talk about the coefficient of variation in a Gaussian. That is a well-defined concept in an exponential; it is actually 1. In power laws, of course, it is divergent. The issue is that power laws are not exceptional, but the really important issue is low versus high variability. What about mechanisms? We have large disasters because we have uncertain environments, we put assets at risk, and we only devote a certain amount of resources to ameliorating those risks, and thus sometimes those resources get overwhelmed. So, large events are not really surprising. If we were just to plot the large power outage of August 14, 2003, it is completely consistent with this data. The ramifications of the attacks on September 11, 2001, as a technological disaster, are not off the map. Hurricane Katrina is not off the map. These are all more or less consistent with the largest event. Like I said, I am not going to discuss disasters very much. High variability is much more fundamental and is very ubiquitous, particularly in highly engineered or evolved systems. Power laws are more normal. They shouldnât be thought of as signatures of any specific mechanisms any more than Gaussians are. It also means that their statistical properties lead people to find them where they arenât, through statistical errors. They are very abundant in networks, and I will come back to try to talk about in biology, in particular, why they are everywhere in biology. Because of all these properties, there are major errors that are not just isolated instances, but typical in high-impact journals in science, presumably because mathematicians and statisticians play little or no role in writing, reviewing, or on the editorial boards of these journals. One of these errors is variability in power laws. They are badly misestimating the slope or basically, even more profoundly, misunderstanding where high variability comes from in the first place. A typical mistake that is made, taking data that are actually exponential and plotting them in such a way that suggests a power law, is exemplified in the following series of figures. I numerically generated random data with the little Matlab code shown, using Matlabâs random 5

number generator (actually a pseudo-random number generator). I generated exponentially distributed data, and on Figure 3âs semi-log plot you see it is nearly a straight line. I checked that I had written my Matlab program right, and that the Matlab random number generator works. It is linear on a semi-log plot because, obviously, you take logs of an exponential and you get linear with slope -a. FIGURE 3 Instead of these rank plots you might think that we could just look at frequencies. Frequencies are obviously related to the rank as shown in Figure 4, even in this case where they are integers. You could write a little program to calculate the frequency plot and do a frequency plot and, lo and behold, a power law as shown in Figure 5, slope 1.5. All of a sudden, I think I have a power law here. In this case I know I donât, but in some sense I have made a classic error. The idea is that I just plotted in a silly wayâagain, you are differentiating dataâit is noisy and you are going to get problems. I donât want to belabor this point too much because it is well understood why this happens, there is no mystery about it; it is just a matter of proper education. The statistics community has to explain this to people. 6

FIGURE 4 FIGURE 5 7

FIGURE 6 8

FIGURE 7 We really have got to not do that. You would think that it was an isolated instance but, in fact, it is almost always the case that you will see the plots on the right above are systematic errors. There are a bunch of networks that have been claimed to have power laws in them, and the data that was presented actually didnât. Figure 8 comes from a paper-on-protein interaction power law, an article from Nature. I got the data and re-plotted it, but this is basically a plot that appears in a supplement. The paper concluded that it is a power law with a slope 1. It doesnât even have finite mean or variance, if you think of it as a probability model. What does it really have? Roughly it has exponential. Again, just to comment, this is an undergraduate sort of error. If a Caltech undergraduate did this they wouldnât last long. Of course, this is in Nature and in some senseâI have reviewed a few papers where I have said, no, you have got to plot itâthey say, no, if we donât plot it the way we do on the right, then the editors wonât accept them. I said we have got to change that. 9

FIGURE 8 10

FIGURE 9 Figures 9-13 deal with an analogous error, from Science, does the power grid have a power law? No, itâs almost a perfect exponential. You canât get a better exponential than that and yet it was called a power law. You can ask, what about the World Wide Web? Letâs re-plot to make sure of what they were doing. They did logarithmic binning. You might think that helps. In this case, it really might be a power law, but you get the wrong slope. For a power law, slopes of 1.1 versus 1.7 are really different. You are off by orders of magnitude. So, once you get into power laws, the slopes are really importantâagain the point is, it doesnât exactly look like a power law, the real data. That is not the big deal. The big deal is it does have high variability. The Web has high variability. In fact, when you look at the Internet, practically everything has high variability. It is almost impossible to find something that doesnât have high variability, although Internet routers provide an exception. If you look at the router size frequency plot, you could come up with a power law whereas, in fact, it is clearly exponential in the tail because there is an excess of small-degree routers. That is probably because, either at the edge of the network there are a lot of degree-one routers or, in fact, it is probably an artifact of the way the data was taken, and you have to check this, but it is certainly not a power law. These errors are typical, but the big one, I think, is badly misunderstanding the origins of high variability, because high 11

variability is everywhere. FIGURE 10 FIGURE 11 12

FIGURE 12 13

FIGURE 13 Is it even possible for a router topology to have high variability and no degree? You could do it. Figure 14 is a chart of Abilene, which is the Internet 2, and Figure 15 is actually a weather map taken on November 8. There are a lot of Internet experts here; if you look at the Abilene weather map, you might find this picture quite peculiar. If you ask why, it is because it is running at about 70 to 80 to 90 percent, in some places, saturation. You never see that. Why is that? This is the day we broke the land speed record, and we were using this part of the network, and nobody even noticed. 14

FIGURE 14 15

FIGURE 15 Here is a little advertisement for the FAST protocol. Figure 16 shows the topology. Everybody in the U.S. academic system uses this network, and it is typical of an ISP backbone. There are lots of these. We are going to pull out that backbone and make a greatly compressed network out of it by just adding edge systems. Yes, you can create a network that has a power law degree distribution, and the next slide shows such a power law. But following that, Figure 16 is another graph with exactly the same degree distribution. These couldnât be more different, so the degree distribution of a network tells you almost nothing about it, particularly if it is high variability. If it is high variability it gives you more possibility to do more and more bizarre things, and you can quantify the extent to which, in the space of graphs, these are at opposite extremes. Again, they couldnât be more different. On the left we have a low-degree core and high-performance robustness, which is typical of the Internet. On the right-hand side you have got these high-degree hubs, horrible performance and robustness. If the Internet looked anything like the right, it wouldnât work at all. Of course it would fix our problem with spam, because you wouldnât get e-mail; you wouldnât get anything. So, the Internet looks nothing like that, and yet, apparently, a lot of scientists think it does. Certainly no one in the Internet community does. In 16

fact, there is high variability everywhere. FIGURE 16 17

FIGURE 17 I have shown you a bunch of cases. You have to work pretty hard to find an example on the Internet or an example in biology that doesnât have high variability. Ironically, the weird thing is, it is like those with low variability have been found and then presented as having power lawsâvery strange. Anyway, there are enormous sources of high variability, and it is connected with this issue of robust yet fragile, and I want to talk about that. Everything I am going to say should appear obvious and trivial provided you remember what you studied as an undergrad either in biochemistry, molecular biology, or engineering textbooks. I donât know if everybody has had this undergraduate level, but I am going to assume only undergraduate level. You have to continue to ignore what you read in high-impact journals on all these topics. We have already seen that for power laws, but you have to temporarily ignore most of these things. Maybe you can go back and think about it later. Imagine that your undergraduate biochemistry and molecular biology textbook was just okay, and maybe like Internet for Dummies or whatever you read about the Internet. Maybe other people will talk about it. Figure 18 is a typical graph we get of a metabolism, where we have 18

enzymes and metabolites. Again, I am going to hope that you know a little bit about this, because I will talk about it at a very simple level. FIGURE 18 In Figure 19, I zoom in on the center part of this, which is glycolysis. Experts in the room will notice that I am doing a little weird thing like an abstract version of what bacteria do, and I am going to stick to bacteria since we know the most about them. 19

FIGURE 19 What is left out on a cartoon like this? First of all, it is autocatalytic loops, which are fueling loops. In this case you make ATP at the end, but you have to use it up at the beginning, and you will see that isnât quite right, but it gets you the spirit. There are also regulatory loops. Inserting them, we get a more complicated cartoon, shown in Figure 20. Cartoons are useful, but you have to understand what this means. Those lines mean very different things. They certainly wouldnât be comparable, but the problem is if you put all the lines in this thing it would just be black with lines as in Figure 21. People donât draw it that way simply because if you put all the action in you wouldnât see anything but black lines everywhere, but you would know what is missing. 20

FIGURE 20 21

FIGURE 21 Now look at Figure 22: The thing on the left can be modeled to some extent as a graph, but the item on the right cannot; there isnât any sense in which this is a graph. It is not even a bipartite graph. You canât think of it that way; it makes no sense. An enormous amount of confusion results from trying to make this a graph; it is not a graph. At the minimum, you could think about stoichiometry plus regulation. 22

FIGURE 22 Here is what I am talking about: the change in concentrations of mass and energy in the cell is a product of a mass and energy balance matrix, which we usually know exactly, and a reaction flux component which we donât know: It depends on the concentrations. FIGURE 23 Figures 24-26 show elements need to be represented by these two terms. The stoichiometry matrix could be represented as a bipartite graph or a matrix of integers. Again, this is standard stuff. The reaction flux has two components, and that is one of the things I want to talk about. First of all, you have regulation of enzymes. The enzyme levels themselves, which are the enzymatic reactions, are controlled by transcription, translation, and degradation. On top 23

of that they had another layer of control, which is allosteric regulation of the enzymes themselves. Enzyme levels are controlled, and the enzymes themselves are controlled. The rates of the enzymes themselves are controlled also. There are two layers of control, and you might talk about how those two layers of control relate to the two layers of control on the Internet, which is TCP and IP, why they are separated, and what they do. FIGURE 24 24

FIGURE 25 FIGURE 26 25

What I want to talk about is that these systems have universal organizational structures and architecture. As shown on Figure 27, one of these we will call the bow tie, which is the flow of energy and materials, and another is the hour glass, which is the organization and control. These both coexist and we will see how they fit together. I want to look at Figure 26 and talk about precursors and carriers, again, the standard terminology. Precursors are those things connected by the blue lines, the basic building blocks for the cell, and the carriers are things like ATP, NADH, carriers of energy, redox and small moieties, again, standard terminology. FIGURE 27 Figures 28-32 illustrate how to map catabolism into this framework. What catabolism does is take whatever nutrients are available and break them down to make precursors and carriers. Those are then fed into biosynthesis pathways to create the larger building blocks of the cell. This is core metabolism. 26

FIGURE 28 FIGURE 29 That core metabolism then feeds into the polymerization and complex assembly process by which all of these things are made into the parts of the cell, as shown in Figure 30 below. 27

FIGURE 30 Of course, if you are a bacterium you have to do taxis and transport to get those nutrients into catabolism, and then there are autocatalytic loops, because not only are those little autocatalytic loops inside for the carriers but, of course, you must make all the proteins that are the enzymes to drive metabolism, and then there are lots of regulations and controls on top of that. So, schematically, we get the nested bow ties in Figure 31. 28

FIGURE 31 What is important about this is that you have huge variety at the edges and basically no variety in the middle. What do I mean by that? If you look at all the things that bacteria encounter in the environment, all the chemicals, all the ions and all the possibilities, you couldnât list them all; it is basically infinite. What do they eat? Well, they eat nearly anything. I mean anything that has a reasonable redox status, some bug will eat it. Since we have cultured only 1 percent of all the bacteria in the world, we have no idea. They probably eat just about anything. They eat uranium, which is a cool thing to do, too. There is a huge variety in what they will eat. What are these final building blocks? There are about 100, and they are the same in every organism on the planet, every cell on the planet, exactly the same, but there are only about 20 in the middle, again, the same in every cell on the planet, and those are the precursors and the carriers, about 12. Again, it depends on how you count. I am counting AMP, ADP and ATP all as the same. It is like a little charged up battery, so that is how I am counting. If you want to count every one of them, you will get 30 or 40, but it is a small number. When you go on to this side again you get an almost unlimited number of macromolecules. I am grossly underestimating, but in a given cell you will have on the order of a million distinct macromolecules that are produced by the polymerization and complex assembly. What you have is this huge variety. If you go and look at almost any kind of data from these systemsâtime constants rates, molecular 29

constants, fluxes, variable, everything you look atâyouâll find a huge variability, variability over orders of magnitude. If you take this data in a particular way you are likely to get power laws, again, simply for the same reason that, if there was low variability, you would likely find Gaussians. FIGURE 32 There is a very limited critical point of homogeneity. It is homogeneity within the cell, and within every cell on the planet, so this critical point of homogeneity is essential. What you get is this nest of bow ties. We have this big bow tie and inside it you have little bow ties, and you have also regulation and control on top of it. The question is why? Why this structure? What is it about this? What does it confer? Since every cell on the planet has the same architecture, you could say it is just an evolution accident or it is an organizational principle. Which is it? What do you compare it with? Hopefully, we will find something on Mars that will have different handedness, and then we know that it didnât have common descent and then we can actually compare them. But in the meantime, we donât have much to compare. What to compare with and how to study it? I think what you compare it with is technology. How do you 30

study it? You use math, which is why we are here, I assume. It turns out we build everything this way, so if we look at manufacturing, this is how we build everything. We collect and import raw materials, you make common currencies and building blocks, and you undergo complex assembly. As an example of an engineered system, letâs look at power production, as shown in Figure 33 and Figure 34, the bowtie figure abstracted from it. There is a huge number of ways we make power and lots of ways we consume power, at least in the United States. When I come to the East Coast I donât have to worry about whether my laptop will work. It works because you have this common carrier providing power. If you go to Europe, power is carried in a little different way. Gasoline is a common carrier, ATP, glucose, proton motive force. This is the way we build buildings. There is a huge variety of things you can make stuff from, a huge variety of ways you can do it, and in the middle there is more or less a standard set of tools. FIGURE 33 31

FIGURE 34 32

FIGURE 35 With the engineered system, we can go back and see how it used to be. What we look back and see is the evolution of protocols, getting a glimpse of evolving evolvability. Evolvability is the robustness of lineages to large changes in long time scales. I want to think of robustness and evolvability as just a continuum. We will see there is no conflict between them. The idea that a system being very robust makes it not evolvable is an error. The problem is that all this stuff in the middle is usually hidden from the user. The user sees the hardware and they see the applications, but they donât see the protocols in between. It is easy to make up all sorts of nonsense about how it works. This is a universal architecture only in advanced technologies. It has only been since the industrial revolution that we did things this way. In biology you get extreme heterogeneity; it is self dissimilar; it looks different at every scale. If you zoom in on the middle, the core is highly efficient, but the edges is where you get robustness and flexibility because that is where the uncertainty is. This is shown schematically in Figure 36. If you look at the core you get highly efficient, very special purpose enzymes that are controlled by competitive inhibition and allostery, and small metabolites. It is like this machine is sitting there and the metabolites are running through itâa big machine, little metabolites. At 33

the edges it is exactly the opposite: you get robustness and flexibility, but you have general purpose polymerases, and they are controlled entirely differently, usually by regulated recruitment. So, you have an entirely different control mechanism, very different styles, and again, this is all standard undergraduate biochemistry, but it is an important observation. If you think about the distinction between this and riveting, you have a machine that sits on the floor and makes rivets. It just sits there and spits out rivets; it doesnât move and you control it by a knob on the back. Once you take the rivets and you want to rivet an airplane, you now take the rivet gun and move around on the airplane. You take the rivet to the system. It is very much similar to core metabolism versus complex molecular assembly. Again, there are these universal arcs. FIGURE 36 Figure 32 isnât to scale. To show its high variability you may ask how would I scale it? We could scale it by flows. We could scale it by whatever. Suppose we scale it by the genome. We can think of this as a gene network, and it holds for e. coli with its 4,400 genes (see Figure 37, which is not exactly up to date). About half of those e. coli genes have known function. Of the ones with known function, about 10 percent are called essential. Again, I am using the biological notion of essential, which just means if you take out that gene, you canât keep it alive 34

on life support no matter what. If you scale up that way it turns out the whole genome, to a first approximation, is regulation and control. This is a radically different scaling than I started out with. All the complexity is down here at the bottom. That is not unusual. FIGURE 37 In technology it is the same way. If you take a modern car, a modern high end Lexus, Mercedes or so on, essentially all the complexity is in these elaborate control systems. If you knock one of those out what you lose is robustness, not minimal functionality. For example, brakes are not essential. If you remove the brakes from a car the car will still move, particularly in a laboratory environment. What will happen if you take it out on the road? You are more likely to crash. There is a standard joke about the biochemist and the geneticist and their brakes. This is not because brakes are redundant; brakes and seat belts are not redundant. In some sense they are, but no engineer would think of them as redundant. They are layered control systems and that is the same story with cells. These non-essential knockouts arenât due to redundancy. That is a fiction. If you knock out this stuff you lose robustness not minimal functionality, and then there are a few things in here that, if you knock out, they are often lethal. If redundancy was 35

a strategy that bacteria adopted for robustness where would they put it? In the core, but they donât, for the most part, so that is not the explanation. The explanation is straightforward, straight out of standard undergraduate biochemistry. This supplies materials and energy; this supplies robustness. Robustness contributes to complexity much more than the basic functionality so we have to understand this bottom part. We canât just look at the stoichiometry if we want to understand this. FIGURE 38 36

FIGURE 39 We are familiar with the idea that we have this exploding internal complexity, even though it is supposed to appear very simple. Our cars appear simpler. It used to be if you wanted to be a driver you had to be a mechanic. Then you could be a mechanic as an option. Now you canât be a mechanic. You canât fix your own car. In fact, the mechanic canât fix your car without a bank of computers. So, these systems have become much more robust. Your modern high-end luxury vehicle that has all of these control systems will almost never have a mechanical failure. How does it fail? The answer is software. Mercedes Benz automobiles now have terrible maintenance records, but they have eliminated mechanical failures. They have replaced them with software errors, and now they have worse performance than ever. Imagine that some day all our cars on the freeway turn left. That is why we call it robust yet fragile. They are extremely robust most of the time but when they fail they fail big time. What is the nightmare? The possibility is that we will never sort this out and biology might just accumulate the parts list and never really understand how it works. In technology we might build these increasingly complex systems and it will have increasingly arcane failures. Apparently, nothing in the sort of orthodox view of complexity says this wonât happen. I believe it wonât happen if we do the math right, but we 37

need new math to do this right. What are the greatest fragilities of this architecture? It is hijacking, because the presence of common parts makes it easier to develop the ability to hijack. If you have a common user interface anybody can come in and hijack it. I can come in, plug in, and steal your energy. Viruses come and plug right in, and they donât have to worry about which e. coli they are in. They are all the same with respect to this; viruses only have to worry about the immune systems. So, what you see is that these things have very characteristic fragilities. Let me briefly talk about eukaryotes. In fact, letâs talk about us for a minute. I donât know much about them but since we are staying at the undergraduate level I am probably okay. There are plenty of people in the room to fix all the things I say wrong. What is the fragility of us as a whole? It is obesity and diabetes. At a physiological level we have a bow tie architecture with glucose and oxygen sitting in the middle, as shown in Figure 40 below. FIGURE 40 38

Human complexity Robust Yet Fragile âº Efficient, flexible metabolism Obesity and diabetes âº Complex development and Rich parasite ecosystem âº Immune systems Inflammation, Auto-Im. âº Regeneration & renewal Cancer Complex societies Epidemics, war, â¦ Advanced technologies Catastrophic failures â¢ Evolved mechanisms for robustness allow for, even facilitate, novel, severe fragilities elsewhere (often involving hijacking/exploiting the same mechanism) â¢ Universal challenge: Understand/manage/overcome this robustness/complexity/fragility spiral FIGURE 41 What happens is if you look at all the problems on the right (Figure 41), they are the direct consequences of mechanisms on the left. So, we have efficient flexible metabolism, great 500,000 years ago when you are starving between trips to hunting or whatever. You put that in a modern environment and you are not running around any more and you get obesity and diabetes. You need complex development in order to create us. That makes a rich parasitic ecosystem. You then have an immune system, so now you have horrible autoimmune diseases, some of the worst diseases. One form of diabetes is caused by autoimmune disease. You need regeneration and renewal in the adult and that, of course, can get hijacked. Hijacking is the greatest fragility. You get cancer; you get complex societies and we have epidemics. Perhaps others will talk about that. It looks like robustness and fragility are a conserved quantity in the sense that when you try to make something more robust, you make something else more fragile, and that is actually true. If I have time I may say something about it. Again, there is an undergraduate-level story that you can talk about, that there is a conservation law for robustness and fragility. Letâs look a little bit at transcripts for regulation for just a second. Some of my favorite work is this work on network motifs by Uri Alon and his co-workers. What is a motif? Figure 42 shows a transcriptional regulatory motif for heat shock. What you have is the gene which codes for the alternative signal factor that then regulates a bunch of operons. I am going to focus on an abstract version of an operon. What does this do? Typically, your proteins are supposed to be 39

folded, and this is a reason why you have fevers, because your body is trying to unfold the proteins in the bacteria that are attacking you. That is probably not all there is to it but, as I said, there are people here who can fix the things I say wrong. What is happening is you have chaperons that both fold nascent proteins and refold unfolded proteins, and you also have proteases that degrade particular aggregates and then they can be reused. This is a system and if this is all you had, you could survive heat shock provided you made enough of these things. FIGURE 42 That is not how e. coli does it; what e. coli does is builds elaborate control systems, as shown next. First of all, the RNA itself is sensitive to heat; it unfolds and is transcribed more rapidly under the presence of heat. That makes more sigma 32 that makes more chaperons. In addition, the chaperons sequester the sigma 32. It has hydrophobic residues so it looks like an unfolded protein. If the chaperons suddenly get busy having to fold unfolded proteins, the sigma 32 is free to go make more, a negative feedback loop, a very cleverly designed one, and then there is an additional feedback loop associated with degradation. So, there is an elaborate control system that is involved in protein protein interactions. What you see here are two layers of 40

control. There is the transcription regulation on the left, and there is this protein-protein interaction, which allosteric regulation of enzymes is a similar thing on the right. You have got these reaction fluxes, regulation enzymes, and then you have got this two level control, allosteric regulation of enzymes. FIGURE 43 What I want to do is compare that with the Internet. Hopefully, you are familiar with the Internet hourglass shown below. You have a huge variety of applications and a huge variety of linked technologies, but they all run IP under everything, IP on everything. 41

FIGURE 44 You have this common core that is universal, every computer on the planet, more or less. What is universal about this? How does it compare with biology? Well, you have a huge variety of components. Think of all the different things you could build a bacteria from. It is an almost unlimited list. Then there is a huge variety of environments in which bacteria can live. The same thing is true for the Internet, a huge variety of applications at the top and a huge variety of different physical components at the bottom. If you are going to build architecture you have got to deal with enormous uncertainty at both edges. Again, itâs similar to the bow tie, but now we are talking about it a little bit differently: it is the control system and not the flow of material and energy. If you are bacteria you pick a genome, or you go to the Cisco Systems catalog and you make your network. You now have a physical network, but it doesnât work because it doesnât do anything yet. What does it need? It needs this feedback control system or it doesnât work. Then, it is interesting. 42

FIGURE 45 There is huge variety everywhere except in the feedback control. This architecture fits every cell on the planet, every computer on the planet. You have got the hardware at the bottom: routers, links, servers, or DNA, genes, enzymes. You pick a raw physical network or you pick a raw genome. Then to turn it into a network you need to have routes on it so you have an IP route rerouting. It has two things: it has a feedback control system that deals with uncertainty below it. You put it together, you run IP on it, and it gives you routes. It now makes you a very robust network to whatever is going on at the lower levels. If you had a very benign environment that is all you would need. You simply run your applications right on top of IP; you donât need anything else. On the biology side you need transcriptional regulation to create the enzyme levels that are necessary to run the cell. In a benign environment this is all you would need. In fact, though, you are usually in an uncertain environment with complicated applications coming and going, supply and demand of nutrients and products and, as a consequence, on the Internet you need TCP, which gives you robustness of supply and demand and packet losses, you do congestion control and it activates re-transmission of lost packets. The resulting comparison is shown on Figure 46. On the right-hand side you have allosteric regulation, all that post-translational modification, all the different things that you do in protein- 43

protein interactions and regulation. This all happens in the cytoplasm. It makes you robust to changes in supply and demand in a much shorter time scale. What you see is a layer of control that gives you primarily robustness to what is below a layer of control that gives you robustness to what is above, and you stick those two together, and you get a very robust plug-and-play modularity. FIGURE 46 It isnât an accident that TCP and IP are separate and do different things, and it is not an accident that we haveâI have used allosteric controls as a stand in for all the different ways that you can do modifications of proteins, again, hidden from the user. We donât know all that is going on. When you use your computer, you donât see this going on. Again, it is easy to make up fantastic stories about these. What is important about this is you have a vertical decomposition of this protocol stack. Each layer independently follows the rules, and if everybody else does good enough it works. The Internet does it this way and so does biology. Also, what you have is a horizontal decomposition. Each level of decentralized and asynchronous, so there is no centralized control. It all runs based on these protocols, so you get this bow tie picture. 44

FIGURE 47 45

FIGURE 48 Where does the protein and hourglass fit together? It is the regulation and control that are organized as an hourglass. If you are thinking about this, how does this relate to the Internet? The cell metabolism is an application that runs on this bow tie architectureâit is the bow tie architecture that runs on this hourglass. They sit together in the same cell. Right now we donât do a lot of this in technology. We keep these things fairly separate. We are starting to build control systems that run on top of IP networks. Something I thought I would never work on is a disaster, but it is going to happen whether I like it or not. We are going to have to make it work. What this does is gives you robustness on every time scale. Again, there is no trade off in these architectures between robustness in the short run and robustness in the long run. 46

FIGURE 49 47

FIGURE 50 Viruses and worms: What is the worst form of spam? It is the spam that looks like real e-mail. So for the worst attack on a complex system, how does a virus get into a cell or into our network? It has to look okay and it gets through. Those are the attacks that we need to worry about most and on the Internet file transfers you have got navigation, you have got congestion control, and you have ack-based re-transmission. On a little broader time scale, if routers fail you can reroute around failures, you can do hop swaps and rebooting of hardware. On longer time scales you can rearrange the components. On the longest time scale you can bring in entirely new applications and hardware, and it plugs right in. So, on every time scale you get enormous robustness created by this architecture. You also get fragility on every time scale. If the end systems donât run the congestion control there is nothing to prevent the whole network from collapsing. If everybody turns off TCP tomorrow we are dead. The worst thing is fail-on of components, not things failing off. If you take a sledgehammer to the Internet nothing happens; it is not a big deal. If you fail-on, or hijack, that is the worst thing that can happen. Distributed denial of service is an example, black holing, and other fail-on cascades. So things failing on are the worst things. In fact, the worst thing is to nearly obey the protocols and deviate subtly. What does that do? Suppose a nutrient fluctuates? All you have to do is control the pathway that feeds 48

the core and everything else works. You donât have to do control except locally. That is why these centralized control schemes work. A new nutrient comes in, the whole regulatory apparatus comes in, it comes over here, kicks in transcription, translation, makes a new protein, all of a sudden, that comes back, that is all regulated and control, you have got a solid catalytic feedback, you make a new pathway and you eat that nutrient. Nothing else had to change. If you lose a router it is no big deal. You have all these regulatory mechanisms to protect you with that. You are going to incorporate new components. Suppose you need a pathway but it isnât there. You are a bug and somebody is dosing you with antibiotics, and you want to consume it and you canât; this is no big deal. If somebody else has it, you can grab it and stick it in so bacteria can do this. This is why it can acquire antibiotic resistance on such a fast time scale. We canât do this; we have this lineage problem. If we could do this, I know I would look like some combination of Michael Jordan and Lance Armstrong, not like this. We canât; we are stuck with the genes our parents give us and we canât go grab more, although I guess you guys are working on it. On longer time scales the idea is that you can create copies and then evolve them, and they work the whole time. They work the whole time because they run the same protocols. It makes it easy to evolve. Also, they are fragile on time scales. The fluctuations of demand and supply can actually exceed regulatory capabilities so you have heard of glycolytic oscillations. You can actually kill bacteria by simply fluctuating the amount of glucose in its environment, and it is killed because its control system reacts improperly and, again, in well known ways. On the short term we have failed on components. You get indiscriminate kinases, loss of suppression, and loss of tumor suppression. Again, the worst thing is fail-on. The worst thing for kinase is not to stop phosphorylating its target, but to phosphorylate targets that it is not supposed to. That is a major cause of cancer. Failing on is a much bigger problem in these systems. In fact, this hijacking that we talked about, that is diabetes and obesity. It turns out the same mechanisms responsible for robustness and most perturbation allows possible extreme fragility failures. So, what we are seeing is extreme variabilities in all the measurable quantities, but also this extreme variability and robustness and fragility, and it is created by these architectures. High variability everywhere and, of course, once there is high variability, you can find power laws if you are looking. These are universal organization structures and architectures. I have a whole story on signal transduction. Time is getting late, and I knew that was going to be the case so I put it at the end. What I talked about was the theoretical foundations a little bit. We want to get into some discussion, but let me just do a little bit of math staying at the undergraduate level. What are the theoretical foundations for studying these kinds of systems? 49

They are the most rigorous theories we have around; the most sophisticated and applicable theories come out of computation, computer science, control theory, and information theory. Those subjects have become fragmented and isolated. The weird thing was when I started working on the Internet I realized here was the biggest communication network ever built, and it doesnât use anything from Shannon, more or less, at the core, the TCP/IP. And that the people who work on TCP/IP and the people who work on forward error correction never talk to each other. It makes some sense to me, but we have got to get that all back together. There have been attempts to create new sciences, and they have pretty much been failures. The problem is we donât need new sciences; we need new mathematics; we need integrated mathematics. Recent progress has really been spectacular. I want to give you an undergraduate example of the kind of progress, and I will try to do this very quickly and, again, it will assume that you have a background on the topic I am talking about. The standard Shannon storyâ actually, this is the way Shannon first presented it, which is not the way it is usually presented in textbooks, if you go back and look at his original papersâsupposes you have a disturbance that is going to create some error, and you have a process of encoding and delay through a channel with capacity C, then decode. Then you are going to try to construct a signal to cancel that disturbance. (see Figure 51). That is treating communications like a control problem. This is actually how Shannon first thought about it. He said the entropy reduction is limited by the channel capacity, standard story, and the other big result is that, if the delay of a disturbance is long enoughâif the delay goes to infinityâand the entropy of the disturbance is less than the channel capacity, then you can make the error zero. These are the big results in Shannon. 50

FIGURE 51 Hard bounds: The bounds are achievable under certain assumptions, and there is a decomposition theorem that is well known in coding theory about how to achieve that. In control theory there is the corresponding thing which is a little bit less known, and I am beginning to realize the reason it is less well known is because people like me teach this stuff really badly. So, everybody takes their undergraduate course in controls. How many people took an undergraduate course in controls? It is always the course you hated most. I donât understand this, and it is a pathology and a virus that I have. We have got to get it out of some of this stuff. 51

FIGURE 52 Anyway, control theory has a similar thing. You take D and E, you take the transforms and you can define this thing called a sensitivity function. If you think of the other thing as a conservation law based on channel capacity, you canât exceed channel capacity here. You have another conservation law that says that the total sensitivity is conserved. So, I have this demo, shown in the following illustration. It is harder to move the tip of the pointer around if it is up than if it is down. If I want to go around here, I can move it around real easyâsame parts but way harder. This theorem tells us why. It turns out that there is an instability it adds to the problem, and it turns out that the instability gets worse as it gets shorter, and eventually I canât do it at all, and you can do the experiment yourself at home. You can check out this theorem at homeâsame thing, hard bounds, achievable solution, decomposable. 52

FIGURE 53 These two things have been sitting staring at us for 60 years, which is a total embarrassment. What happens if you stick them together? What would be the best thing that could possibly be true? You just stick them together? There is the cost of stabilization, there are the benefits of remote sensing and there is a need for low latency. That canât possibly be right, because it is so obvious you would have thought some undergraduate would have done it 50 years ago. The other weird thing is Shannon worked for Bode at Bell Labs. They actually sat and talked about this stuff together. Not only that but it is a hard bound; it is achievable under certain assumptions, with the usual, you first prove it for the added Gaussian case, and then the solution is decomposable under certain assumptions. It is possible to unify these things, and there are actually some undergraduate results that come out right away. This hasnât been published yet, but it is going to appear at the next conference on decision and control; it has been submitted to IEEE Transactions on Control, and so on. 53

FIGURE 54 54

FIGURE 55 55

FIGURE 56 Here is a claim or probably, more properly, irresponsible speculation. A lot of the complexity we see in biology is dominated by dealing with this trade off, and the deal for techno- networks. We are starting to use communication networks to do control. Biology already does that. Biology is latency driven everywhere. What does the Internet manage? Really latency. We need to have a latency theory of communications. Go to the FAST web site and read about this. This is pretty cool. It is a case where a very interesting new theory was done, global stability with arbitrary numbers, arbitrary areas and delays so you get very great robustness, global stability of a nonlinear system in the presence of time delays. I never thought I would see this. If I had to do it, I wouldnât. I have good students who did this. One of the co-authors of the paper is here, Lun Li, and you can talk to her about it. She, I think, actually understands it. Anyway, it has been very practical and it has led to the theory of new protocols that, again, here this is macho stuff; this is breaking the world record. What is the theory? The way you connect these things is it is really a constrained optimization problem. It starts out as constrained optimization, but then you really have to redo optimization theory to get it to work in this layer decomposition way. There is actually a coherent theory for the first time of these layered, decentralized, asynchronous protocols. It is a 56

nascent theory. That little theorem I showed you before is an example of the kind of unified theory that is now coming along, but it is very much in its beginnings because we had these stovepipe areas chugging along for 50 years with the smartest people I know cranking out theorems in an isolated way. We have to go back 50 years and start putting them back together again. It looks doable and as I tried to show you, we are going to have to think about high variability all over the place. The high-variability statistics is going to be a dominant issue, and it is not necessarily a bad thing. In our disaster world it is a bad thing, but a lot of times this high variability can be exploited. TCP exploits at a huge amount. My cell phone story, if that wasnât there, TCP wouldnât work. So, it is a situation where we have got to be able to learn to exploit that more systematically. I think you are probably going to hear about bits and pieces of this from others throughout the next couple of days. I will stop and see if you have any quick questions. QUESTIONS AND ANSWERS QUESTION: A very quick question. In your earlier plot you showed for the power law, you said that apparently it is not power law, but more like exponential. It looks like it is a better fit with the power law than exponential. How do you explain that? DR. DOYLE: It is a statistical error. So, it is a case where you simply are making a mistake. Again, I didnât really explain the details. This is a standard thing. The idea is that the thing on the right was a differentiated version of the thing on the left. The simple way of thinking about it is, you differentiate it and you create all this noise. There are two advantages to the picture on the left, the rank plot. First of all, I show you the data in its raw form, and you can say whether or not you think it is a power law. There are no statistics done. I show you the raw data. Then you can do a little bit of statistics by just sort of eyeballing the straight line. The idea is that you shouldn't think of a straight line it has to fit. The idea is that we use mean and variance to describe all sorts of low-variability phenomena. We know it isnât really Gaussian and we donât check the other moments. We need similarly to find better ways, in a broader sense, to describe high-variability data. All of a sudden you get means of 1 and variances of 100âit is not a useful statistic, it doesn't convert. You take the data over again. Means and variances donât mean anything in this high-variability world. What you need to do robust statistics is a solvable power law, but you would want to use that even when it wasn't a power law, just like you use mean and variance, so there are robust ways to do this. This is all decades old statistics, and we don't teach it very well. That is one of the 57

challenges that we need to do a better job, just of teaching these things. The problem is science has so focused on getting rid of high variability. High variability was thought to be a bad thing, like you are doing bad experiments. As Jean will point out, high variability exists all over in nature. In the laboratory, it only exists in very exotic circumstances, so it is associated with exotica. DR. HANDCOCK: I would like to pick up on that question a little bit, about the identification and characterization of power laws. I think one issue that needs to be discussed, or at least I would like to hear some discussion more of, is the case where statisticians in particular have looked at questions for a long period of time, but they havenât really been used, that knowledge has not been used in other areas of science. DR. DOYLE: That is absolutely right, it has not been used. DR. HANDCOCK: In this area, I think it is particularly important. Clearly, statisticians have had a lot to say about curve fitting and other issues like that, but what I am most reminded of is the recurrence of these debates. The classic example to my mind is Morris Kendellâs 1961 address to the Royal Statistical Society, where he essentially lambastes previous work done in a half a century before that time on essentially very similar questions here. DR. DOYLE: This is a very old story, a very old story. DR. HANDCOCK: I would like to hear people say more, how can the work of statisticians be more recognized and just routinely used by other sciences, to avoid occurrences of this kind? DR. DOYLE: I will tell you that my experience has been to give them good software tools do it right and make it easy for them to do it right. If you go to Matlab and get the stat tool box, itâs all low variability. So, my experience has certainly been, if we want to get good, new robust control theory into the hands of people, you make software. If you want to get biologists to be able to share models, you have to have a systems biology mark-up language, you have to make software. So, you have to turn your theories into software, and it has got to be usable, it has got to be user friendly. So, we use Matlab and call the SVD, and how many of us know actually how the SVD works, sort of? Well, you donât need to. The point is, if you do it right, you need to know what it does, but you donât need to know exactly how it does it. We need to do the same thing for high- variability statistics. It is one thing to lambaste everybody about it. It is another thing to try to create the tools, and show people, teach people. This is the thing we have got to do, we have got to teach people. There are people out there who are discovering this high-variability data everywhere. We are going to hear a lot about it in the next two days, and it is everywhere. The 58

irony is, because of its strong statistical properties, you end up finding it where it isnât, and the stats community can make a lot of contribution. The problem is that I am not sure anybody studies this any more. It is this old, old topic, old classical stability laws; it was hot in the 1920s, it was hot in the 1950s, and maybe it will come back. I think it should come back. DR. KLEINFELD: Implicit in your talk was this notion that in engineering systems things are components, and also implicit in your talk and sort of in the world is that, the way biology seems to design things, they are very much in modules. Is there some sort of theorem that says that one should really build things in modules, like some complexity diverges or . . . DR. DOYLE: Yes and no. I would say we have nascent, little toy theorems that suggest that these architecturesâhere the point is, you donât have modules unless you have protocols. If you look around, you discover the modules first, you see them, you see the parts, but they are meaningless unless there are protocols. None of this stuff would work or hook together unless they had protocols. What we can prove now is that some of these protocols are optimal in the sense that they are as good as they can be. For example, TCP properly run achieves a global utility sharing, that is fair sharing among all the users that use it. What that says is that you can get the same performance as if you had a centralized non-modular solution, but you can get it with a protocol that is both robust and evolvable. Now, is there any other way to do it? We don't know. You prove that this is optimal and that this is robust. So, there is a lot more work to be done. I mean, it is only the last few years that we have even had a coherent theory of how TCP/IP works. So, this is quite a new area. I think that is an important question. How much further can we go in proving properties? What we need to design now, as engineers, we need to design protocols. We need to design protocols that are going to run our world, and we need to do them and make them robust and verifiable, and scalable. Now we donât do that very well. In biology, what we are doing is reverse engineering the protocols that evolution has come up with. Unfortunately, the good news is, it looks like it uses more or less the same protocols. So, it is not going to be incomprehensible. It could just be this infinite parts list. If we just made a parts list of everything on the Internet, it would be completely bewildering, we would never make any sense of it. Because we know the architecture, and because we know the protocols, we know what everything is doing. We have got to do the same thing with biology. 59

REFERENCE Han, J.D., et al. 2004. âEvidence for Dynamically Organized Modularity in the Yeast Protein- Protein Interaction Network.â Nature 430. 60

Network Models 61

Next: Neurons, Networks, and Noise: An Introduction--Nancy Kopell, Boston University »

Proceedings of a Workshop on Statistics on Networks (CD-ROM) (2007)

Chapter: Keynote Address, Day 1 Network Complexity and Robustness--John Doyle, California Institute of Technology

Welcome to OpenBook!

Get Email Updates