Proceedings of a Workshop on Statistics on Networks (CD-ROM) (2007)

Chapter: Mixing Patterns and Community Structure in Networks--Mark Newman, University of Michigan and Santa Fe Institute

Suggested Citation: "Mixing Patterns and Community Structure in Networks--Mark Newman, University of Michigan and Santa Fe Institute." National Research Council. 2007. Proceedings of a Workshop on Statistics on Networks (CD-ROM). Washington, DC: The National Academies Press. doi: 10.17226/12083.

Mixing Patterns and Community Structure in Networks

Mark Newman, University of Michigan and Santa Fe Institute

DR. NEWMAN: I am going to be talking mostly about social networks because that is mostly what I know about. I hope that some of the ideas I talk about will be applicable to the other kinds of networks that others of you work on. First, I want to mention some of my collaborators, one of whom, Aaron Clauset, is in the audience. The others are Michelle Girvan, Cris Moore, and Juyong Park.

FIGURE 1

The story I am going to tell you starts with an old idea from network analysis. We believe that many of the networks we study divide up naturally into groups. Suppose I am talking about a social network. I will assume that you know what a social network is: a network whose nodes are, for example, people, connected together by some social interaction, such as being friends, having business acquaintanceships, or having been in close physical proximity recently. Choose the definition that is relevant to your particular problem. So, you have a network of vertices and edges, as shown in Figure 1, and we are going to assume that it divides up naturally into communities, groups, clusters, or whatever you would like to call them.

First of all, one thing we might be interested in doing is finding those groups, if they exist in the first place, and if they do exist, determining what kind of groups they are. For example, you could be interested in a social network and be looking for groups of people in that network. You could be interested in the World Wide Web and be looking for groups of related Web pages. You could be looking at a metabolic network and be looking for functional clusters of nodes.

There are a variety of situations in which you might be interested in this.

Finding groups in networks is an old problem. It is something that computer scientists, in particular, have looked at for many decades. But there is a difference between what we want to do with social networks, for example, and what computer scientists have traditionally done. First of all, in the traditional computer science problems of finding groups in networks, there are additional constraints that we don't have, such as knowing beforehand how many groups you want. A standard computer science problem might be that you have a bunch of tasks and you want to divide them up over many processors on a computer. You know how many processors you have, so you know how many groups you want to divide the tasks into. Very often you also want the groups to be of roughly equal sizes. That is a typical constraint in the kinds of algorithms that computer scientists look at because, for example, you want to balance the load among the processors.

In the problems we are looking at, that is often not true. What is more, there is a fundamental philosophical difference compared to the traditional approach of partitioning a computer network: we are assuming here that the social networks we are looking at naturally divide into these groups somehow. In the kinds of problems you want to solve in computer science, often you are just given the network and you want to find the best division of it into groups, whatever it is, such that you have a bunch of tightly knit groups with many connections within the groups and only a few connections between the groups, and you do the best you can with whatever network you are given. We are taking a slightly different point of view here and assuming from the outset that there is some good division, for some good sociological reason, into which your social network naturally falls, and we would like to find those natural divisions. The natural divisions might involve some small groups and some large groups, you don't know how many groups there will be beforehand, and so forth.

There are old algorithms for doing this from computer science, like the one shown in Figure 2, which you might be familiar with: the so-called Kernighan-Lin algorithm from the 1970s. In this algorithm, if you want to divide a network into two parts, you can start by dividing it any way you like, essentially at random, and then you repeatedly swap pairs of vertices between the two parts in an effort to reduce the number of edges that run between the left-hand group and the right-hand group in Figure 2. You continue doing that until you can't improve things any more, and then you stop. This is typical of the kinds of things that traditional algorithms do. It works very well, but it has some constraints. First of all, you need to know the sizes of the groups you want beforehand, because when you swap two vertices, you don't change the sizes of the groups: each group gains one vertex and loses one vertex. A minimal code sketch of this kind of bisection appears below.
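The following sketch uses the Kernighan-Lin-style bisection routine that ships with the networkx library, not the original 1970s code; the toy graph, two cliques joined by a single bridge, is an assumption chosen so that the right answer is obvious by eye.

```python
# A minimal sketch of Kernighan-Lin-style bisection using networkx's
# implementation. The toy graph (two 5-cliques joined by one bridge edge)
# is an illustrative assumption.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.union(nx.complete_graph(range(5)), nx.complete_graph(range(5, 10)))
G.add_edge(4, 5)  # the single bridge between the two cliques

left, right = kernighan_lin_bisection(G, seed=0)
cut_size = sum(1 for u, v in G.edges if (u in left) != (v in left))
print(sorted(left), sorted(right), "cut size:", cut_size)
```

As the talk notes, the swaps preserve the sizes of the two halves, so the group sizes are fixed by the initial split.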

FIGURE 2

Furthermore, the Kernighan-Lin algorithm only bisects. It starts off with two groups, and it ends up with two groups. If you want more than two groups, then you have to apply it repeatedly: bisect, divide into two, then divide each of the two into two, and so forth. However, there is no guarantee that the best division into three groups is found by first finding the best division into two and then dividing one or the other of those two groups.

Something that I like better is shown in Figure 3: spectral partitioning, a rather elegant method based on matrix-algebra ideas. This is also something that has been worked on since the 1970s. In this approach, we take the so-called graph Laplacian, a matrix representing the network in which the degrees of the vertices run down the diagonal. (The degree of a vertex is how many connections it has.) There is a -1 off the diagonal in each position corresponding to an edge in the network; a -1 in row 2, column 1, for example, means that there is an edge between vertex 2 and vertex 1. For an undirected network this is a symmetric matrix, and it has an eigenvector of all 1s with eigenvalue 0.

FIGURE 3

If you have a block diagonal matrix, like the one shown in Figure 4, one corresponding to a network that divides completely into two separate communities with no connections between them, then it has two eigenvectors with eigenvalue 0: one with 1s in the positions corresponding to the first block and one with 1s in the positions corresponding to the second block. And, of course, any linear combination of those two is also an eigenvector with eigenvalue 0.

FIGURE 4

If you have a network that is approximately block diagonal (precisely the sort of thing we are talking about), then it has two communities, two separate groups, that are not quite separate: there are a few connections between them, so there are a few off-diagonal entries. Then, as shown in Figure 5, you still have one eigenvector with eigenvalue 0, and the second one is slightly perturbed and has a slightly positive eigenvalue. (You can prove that it is always positive.) By looking at the properties of that second eigenvector you can deduce what the two blocks must have been. It is actually very simple: you just look at the signs of the elements of that eigenvector and assign vertices to the two groups according to their sign. This gives a very nice and principled way of bisecting a network into two groups. In this case, you don't have to specify the sizes of the two groups; you just find, if your matrix really is approximately block diagonal with two blocks, what those blocks are. But this algorithm only bisects, and it is rather slow, since it involves finding the leading eigenvectors of the matrix. Typically for large networks you do this by the Lanczos method, because the matrix is sparse, but it is still a pretty slow thing to do. A minimal sketch of the whole procedure is given below.
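Here is a minimal sketch of spectral bisection as just described: form the graph Laplacian, take the eigenvector of the second-smallest eigenvalue, and split vertices by the signs of its entries. The small adjacency matrix (two triangles joined by one edge) is an illustrative assumption.

```python
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],   # edge (2, 3) bridges the two triangles
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

L = np.diag(A.sum(axis=1)) - A                 # degrees on the diagonal, -1 per edge
eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigh: L is symmetric
second = eigenvectors[:, 1]                    # eigenvalue 0 comes first; take the next
in_group = second >= 0                         # assign vertices by sign
print("group 1:", np.nonzero(in_group)[0], "group 2:", np.nonzero(~in_group)[0])
```

For a large sparse network one would use a Lanczos-type sparse eigensolver (for example scipy.sparse.linalg.eigsh) rather than a dense eigendecomposition, exactly as the talk notes.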

FIGURE 5

Looking particularly at social networks, though potentially other networks too, we have been thinking about other methods of solving this problem for the kinds of networks we are looking at. Many people have worked on this in the last decade or so. I am going to tell you about some work that we have done, just because I was involved in it and I like it, but it is certainly not the only thing that has been done.

Figure 6 shows one method we looked at; this is work that I did with Michelle Girvan, who is at the Santa Fe Institute. Suppose you have a network like the one in Figure 1, and it divides into groups where there are many edges within the groups and only a few edges between the groups. The idea is this: suppose I live within one of the dotted circles and I want to drive to a node within one of the others. I would go along the edges within my circle to reach the edge that connects the groups, which is like a bridge. In fact, if I lived anywhere within my dotted circle and I wanted to drive anywhere within another circle, I would end up going over that bridge. If you live in San Francisco and you want to get to the East Bay, you go over the Bay Bridge; that is the way you have to go. That means there is a lot of traffic on the Bay Bridge, because everybody goes that way. These bridges between communities are going to carry a lot of traffic if we consider, in some simulation kind of way, traffic moving from every part of the network to every other part. A simple concept that captures this idea, betweenness, was first formulated by Lin Freeman, who is also here somewhere. So, the idea is that we find the shortest path between every pair of vertices in the network, and then we count how many of those shortest paths go along each edge. This is a primitive measure of how much traffic will be flowing along each edge of the network. Because some of the edges are the few that lie between communities, those ones are going to get a lot of traffic. We identify the edges between communities as those with the highest values of this betweenness measure, and then we remove them from the network. Once those edges are gone, you are left with the three clumps, and these are your communities.

FIGURE 6

It is a little bit more complicated than that in practice. The principal problem is that if you have two edges between two communities, like the two at the top in Figure 1, there is no guarantee that they will both get a high value of betweenness. It is certainly the case that the sum of the number of paths going along one and the number going along the other should be high, but under certain circumstances it is entirely possible that one of them could be high and one of them could be low.

If you simply removed all the edges with high betweenness, you wouldn't remove all the ones that connect groups, only some of them. To get around that, you have to find the edge with the highest betweenness, remove it, and then recalculate the betweenness measure, the traffic-flow measure. If I remove one edge and recalculate, any traffic that previously had to go along that edge will now go along the other bridge instead; even if that bridge didn't have a high count before, it does now. It is crucial to do this recalculation at each step of the algorithm, and that slows things down a bit. That is the primary disadvantage of this method: it takes a moderately long time to do the calculation. We did come up with a nice way of speeding it up. Naively you would expect this to be an O(n^4) calculation, but some of the calculations can be reused, and you can get it down to O(n^3). That is still prohibitively slow for very large networks; you couldn't do this on the World Wide Web graph, but it can typically be done on the sizes of social networks we are looking at. It works up to tens of thousands of vertices.

This works quite well, and I will give you some examples. Some of them are whimsical examples, but there is a reason for that. Figure 7 shows a famous network from the social science literature. At the top is a network from a study done in the 1970s of a karate club at a U.S. university. The vertices represent students in this club who were practicing their karate, and connections between them represent friendships as determined by direct observation of the members of the club by a sociologist who was working with them. This is an interesting example. The sociologist, Wayne Zachary, studied this club for two years, and fortuitously it turned out that during those two years a dispute arose between two factions within the club; eventually the club split in two, and one half went off and formed their own club. The theory is that you should be able to spot that coming split in this network, even though the network was constructed before the split occurred. In fact, you can do that. The white circles and the gray squares represent the factions as they were after the split. All we did was take this network and feed it through our algorithm. The output of the algorithm is a so-called dendrogram, which is a tree, as shown in (b) in Figure 7. It doesn't matter too much what it indicates; the main thing to notice is that it splits the network into two groups, and the coding of the circles and the squares is the same as in the network in (a). You can see that it has very neatly found the group of squares and the group of circles. There is only one error, vertex number three, which is the one right in the middle, on the boundary between the two factions. Maybe it is understandable why it got that one wrong. A code sketch applying this procedure to the same network follows below.
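This is a minimal sketch of the betweenness-based method applied to Zachary's karate club network, which is bundled with networkx. It is not the authors' original code: it recomputes edge betweenness after every removal, as the talk stresses, and stops at the first split into two components.

```python
import networkx as nx

G = nx.karate_club_graph()
while nx.number_connected_components(G) == 1:
    betweenness = nx.edge_betweenness_centrality(G)
    busiest = max(betweenness, key=betweenness.get)  # the most-used "bridge"
    G.remove_edge(*busiest)

for group in nx.connected_components(G):
    print(sorted(group))
```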

FIGURE 7

This is a network that was derived from observations made before the split, and yet the network clearly contains information about what is going to happen in the future, when the thing does split up. In a way, it is predicting what is going to happen, although it is predicting after we already knew the answer. Obviously, we wouldn't have put this example up here if it didn't work.

Figure 8 gives another example. I am at the University of Michigan, and so I need to have an example about football. This is a picture, one of these dendrograms, one of these trees, that shows what happens when you feed the schedule of games for Division I-A college football into our algorithm. As you might know, college football is played in conferences, which are groups of schools that play together on a regular basis. If you feed the schedule in, it picks out the conferences very nicely. They are coded by the different shapes and colors in this figure. The groups that the algorithm finds, as you can see, pretty much correspond to the conferences. In a way, this is a silly example, but we look at examples like this for a good reason: they are cases where we think we know what the answers should be. You want applications to real-world networks to show that this really is working, but when you are developing algorithms you also want to apply them to something where you can tell, afterwards, whether they are actually giving a sensible answer. This algorithm is always going to give some answer. If you plug in some network and it spits out an answer and you can't tell whether it is sensible, then it didn't really get you anywhere. Many of the examples we look at when developing the algorithm are ones where we think we know what it ought to do before we start.

FIGURE 8

Figure 9 gives one more example, a non-social one: a food web of marine organisms in the Chesapeake Bay. The interesting thing is that when we feed this one into the algorithm, it appears to split the food web into the pelagic organisms, the ones that live in the upper layers of the water, and the benthic ones, the ones that live in the mud down at the bottom. It appears to be indicating that this ecosystem is divided into two weakly interacting subsystems with not very much going on between them, so the method certainly does have applications beyond social networks.

FIGURE 9

One of the differences between what we do and what computer scientists do is wanting to know whether the solution we got is actually a good one. We don't want just the best solution, whatever it is; we also want to know if the solution we got is no good, in other words, if there wasn't really any community structure in the network in the first place. To do that, we use a measure that we call modularity. The idea of modularity is basically to measure what fraction of the edges in our network fall within the groups that we find. What we would like is that most of the edges are within groups and only a few of them between groups. However, that alone is not a good measure of what we want, because you could put all the vertices in a single group together, and then all the edges would be within that group and you would get 100 percent modularity, but that is obviously a trivial result, not a good division of the network. What we use instead is the fraction of edges that fall within the groups minus the expected fraction of edges that would fall within those groups. I can write down exactly how we calculate that if you like, but you can see the basic idea by examining that trivial partition: if I put all the vertices in one group, then all the edges are within that one group, but it is also expected that all the edges will be in that one group. You get 100 percent minus 100 percent, so the measure just defined is 0 rather than a high score.
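For reference, the measure described in words here can be written down; in the notation Newman uses in his published work on modularity,

```latex
Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),
```

where A is the adjacency matrix, k_i is the degree of vertex i, m is the total number of edges, c_i is the group to which vertex i is assigned, and the term k_i k_j / 2m is the expected number of edges between i and j in a random graph with the same degree sequence.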

So we think of modularity as something that is high if the number of edges within groups is much greater than we would expect on the basis of chance.

What we can do is take one of these dendrograms, which represents the order in which the network gets split into smaller and smaller pieces as we remove edges, and plot the modularity against it, as shown in Figure 10. What we typically find is that the modularity has a peak somewhere and then falls off, and this peak indicates where the best split is in the modularity sense, the best split in terms of getting most of the edges within groups. That is shown by the red line in Figure 10. The value at the peak indicates how good a split it is. The measure runs from 0 to 1; it is about 50 percent in Figure 10, which is pretty good. We usually find that there is something significant going on if it is anywhere above about 30 percent, and the highest values we have seen in real networks are about 80 percent.

FIGURE 10

Figure 11 shows what happens if you apply this method to the karate club network. It actually finds two different things. It has a peak for the split into two groups that I talked about before, the one we know about. It has another peak for a split into five groups, and it looks like that second peak is actually a little bit higher. Maybe there is some smaller-scale structure going on in the network that we didn't know about before.

FIGURE 11
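For concreteness, here is a direct, unoptimized transcription of the modularity formula above into code; a sketch for checking small examples, not the authors' software.

```python
import networkx as nx
from networkx.algorithms.community import modularity as nx_modularity

def modularity_Q(G, communities):
    """Q = (1/2m) * sum over within-group pairs of (A_uv - k_u * k_v / 2m)."""
    m = G.number_of_edges()
    k = dict(G.degree())
    Q = 0.0
    for community in communities:
        for u in community:
            for v in community:
                A_uv = 1.0 if G.has_edge(u, v) else 0.0
                Q += A_uv - k[u] * k[v] / (2.0 * m)
    return Q / (2.0 * m)

# Sanity check against networkx's built-in, on an arbitrary two-way split.
G = nx.karate_club_graph()
split = [set(range(17)), set(range(17, 34))]
print(modularity_Q(G, split), nx_modularity(G, split))
```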

Once you have this modularity, it allows you to say what the division of a network into groups is. Now I can draw a picture like Figure 12, where I have taken a network and colored the groups according to the division with maximum modularity. This particular one is a network of collaborations (meaning co-authorships) between scientists at the Santa Fe Institute. The colors correspond rather nicely to subject divisions within the institute, so it looks as though it is giving something sensible.

FIGURE 12

Figure 13 is a fun network. This one is the characters from Victor Hugo's lengthy and rather boring novel Les Miserables. It is a very big, thick book with a lot of characters, and what I have done is join up any two characters that were in the same scene at any point during the novel. This is another one of those examples where you think you know what should be happening: we know the plot of the novel, so you can look at this and see whether it is giving sensible results. There are well-known groups of characters that get lumped together here.

FIGURE 13

Figure 14 is another example, taking it one step further. Suppose I am given a network like the one shown in (a). Again, this is a collaboration network of who has written papers with whom; in this case, it is people in the physics community who work on networks, so it is a network of network people. If you were just given that, you couldn't pick out much structure. If you feed it into this algorithm that finds the groups in the network, then you begin to see some structure. It turns out that the groups it finds, indicated by the colors in (b), pretty much correspond to known groups of collaborators at known institutions. Taking it one step further, you can make the meta-network shown in (c), where there is a circle corresponding to each of the groups in (b), and the edges between them represent which groups have collaborated with which other groups. I have even made the sizes of the circles and the sizes of the edges represent the number of people and the number of collaborations.

I think this is the sort of thing that could be very useful for analyzing the very large network data sets we are getting now.

FIGURE 14

We are getting data sets that are so big that there is no way to draw a picture on the screen. In this case you can, but we are getting ones with millions of points now, and it would be very nice to have an automated way to coarse-grain a network and get the big picture that you see in panel (c) of Figure 14. Then, if you want to know the details, you can zoom in on one of those circles and see what the smaller-scale structure is. In theory, you can do this repeatedly and have several layers, so that you can visualize things more readily, seeing the biggest structure, the middle-scale structure, and the small-scale structure all separately, when you couldn't plot the whole network on the page because it would all just come out looking like a mess. A minimal sketch of this coarse-graining step appears below.
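Here is a minimal sketch of the coarse-graining idea: collapse each community to a single node carrying the community size, with edge weights counting the edges that run between communities. The `communities` argument is assumed to be a list of vertex sets, such as the output of a community-finding pass like the one above.

```python
import networkx as nx

def coarse_grain(G, communities):
    which = {v: i for i, c in enumerate(communities) for v in c}
    M = nx.Graph()
    for i, c in enumerate(communities):
        M.add_node(i, size=len(c))              # circle size = group size
    for u, v in G.edges:
        i, j = which[u], which[v]
        if i != j:                              # count only inter-group edges
            old = M.edges[i, j]["weight"] if M.has_edge(i, j) else 0
            M.add_edge(i, j, weight=old + 1)
    return M
```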

In the last few minutes I want to tie this idea of group structure in with another topic: the network of friendships between school children. Our data are taken from the U.S. National Longitudinal Study of Adolescent Health, the Add Health study. Jim Moody, who is here, is heavily involved in this. When the friendship data for a particular high school are plotted, they show clearly two groups of kids in the school, the white kids and the black kids, and the network of friendships is quite heavily divided along race lines. This is not a surprise to sociologists; it is something people have known about for decades, and it is very clear when you draw these network pictures. This is the phenomenon called homophily or assortative mixing: people tend to associate with other people they perceive as similar to themselves in some way. It could be race, educational level, income, nationality, what language you speak, what age you are; practically anything you name, people associate according to it. You can also have disassortative mixing, which is when people associate with people that are different from themselves in some way.

Here are some examples. Figure 15 shows a matrix of couples that I took from a paper by Martina Morris, whom I also saw here. This is from a study done in San Francisco, in which a lot of couples were interviewed and their ethnic groups tabulated. The fractions down the diagonal represent people who are in a couple with somebody from the same ethnic group; they are much larger than the fractions off the diagonal.

FIGURE 15
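A minimal sketch of tabulating a mixing matrix like the one in Figure 15 from network data follows. The tiny graph and its "group" labels are illustrative assumptions, not the San Francisco data.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from([0, 1, 2, 3], group="a")
G.add_nodes_from([4, 5, 6, 7], group="b")
G.add_edges_from([(0, 1), (1, 2), (2, 3), (0, 2),
                  (4, 5), (5, 6), (6, 7), (4, 6),
                  (3, 4)])                      # a single cross-group tie

# Fraction of edge ends joining each pair of groups; a heavy diagonal
# indicates assortative mixing (homophily) on the attribute.
print(nx.attribute_mixing_matrix(G, "group", mapping={"a": 0, "b": 1}))
print(nx.attribute_assortativity_coefficient(G, "group"))
```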

Figure 16 is a picture of the ages of husbands and wives from the National Survey of Family Growth, showing their ages at the time they got married. What you should pick out is that most of the dots fall along the diagonal, indicating that people tend to marry people of about the same age. Of course, marriage is not really a network; I assume people are only married to one person at a time. I am using it here as a proxy for friendship, since people become friends before they get married.

FIGURE 16

I am going to look at one type of assortative mixing. People tend to associate with other people like themselves, and one way in which they can do that is if people who have many connections in the network tend to be connected to other people with many connections: gregarious people hanging out with other gregarious people, and the hermits with other hermits. That would be mixing assortative by degree. (Remember, degree is how many connections you have.) In social networks, the gregarious people hang out with the gregarious people, while in all the other kinds of networks, at least the ones I have looked at, you have disassortative mixing, where the high-degree nodes are connected to the low-degree nodes. Assortative mixing could be a bad thing if, for example, you were concerned about the spread of disease. Certainly it is a bad thing if a person with many connections gets a disease, because they could pass it on to many other people. It is even worse if the people they are connected to are themselves people with many connections, because then the disease spreads very easily amongst the highly connected people and, hence, could do much more damage than if those people had fewer connections.
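Parenthetically, mixing by degree can be reduced to a single number, the assortativity coefficient r: the Pearson correlation of the degrees at the two ends of each edge, positive for assortative and negative for disassortative networks. A minimal sketch, on a generated graph that is an illustrative stand-in rather than data from the talk:

```python
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=0)
r = nx.degree_assortativity_coefficient(G)
print(r)   # near zero or slightly negative for this growth model
```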

It is very easy to spot the difference between these two types of networks. Networks that are assortative by degree tend to get all the people with large numbers of connections stuck together, forming a big clump in the center of the network with a peripheral ring of lower-connected people around them. This is shown on the left in Figure 17. In the disassortative case, you tend to get star-like structures, where a person with many connections is connected to a bunch of people with only a few connections. These things are very distinctive; you can just draw a picture of the network on the page and see that it is a disassortative network because it has these star-like structures in it.

This has practical repercussions. I just mentioned the one about disease. Figure 18 shows an analytic calculation representing the same kind of thing. What it shows, basically, is that disease spreads more easily if the network is assortative than if it is disassortative, and sadly, we find that most social networks are assortative. So, that could be a bad thing.

FIGURE 17

FIGURE 18

Here is another example, dealing with the robustness of networks. One thing people have been interested in, in the context of the spread of disease, is how robust the network is. If I want to break a network up so that it is not connected any more, how should I do that? Which nodes in the network should I look to first? For example, you might imagine that you had a vaccination campaign and you wanted to vaccinate certain people against a disease. Who should you vaccinate against the flu? Usually they vaccinate old people but, in a way, that is crazy, because in many cases old people hardly talk to anybody, so they are unlikely to pass the disease on. You vaccinate them to prevent them from getting the disease, but you don't get the knock-on effect of protecting the people they would have infected, because there are no such people. Rather, you should be targeting school children or others who have many contacts. Not only can you protect them, but since they are likely to spread the disease, you also protect the people they would have spread it to.

So, the question is who should be targeted? Figure 19 shows on the vertical axis the size of the largest component (roughly speaking, the maximum number of people that can catch the disease in the worst possible case). Along the horizontal axis I am removing vertices, precisely the ones with the most connections, so I am trying to get the ones that will pass the disease on to the most people.
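A minimal sketch of the experiment behind Figure 19: repeatedly remove the current highest-degree vertex and track the relative size of the largest component. The test graph is an illustrative stand-in for the networks in the talk.

```python
import networkx as nx

G = nx.barabasi_albert_graph(1000, 2, seed=1)
n = G.number_of_nodes()
giant_sizes = []
for _ in range(200):
    giant = max(nx.connected_components(G), key=len)
    giant_sizes.append(len(giant) / n)
    hub, _ = max(G.degree, key=lambda nd: nd[1])  # current highest-degree node
    G.remove_node(hub)

print(giant_sizes[::40])  # the decay profile as hubs are removed/vaccinated
```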

In the assortative case, we have to remove a much larger fraction of the vertices, preventing more of the spread through our vaccination campaign, before the size of the epidemic goes to zero, while well-targeted vaccination has more impact in the disassortative case. Again, unfortunately, the assortative case, which is common in real social networks, is much more robust to the removal of these vertices. The hand-waving explanation is that the high-degree vertices are all clumped together in that big clump in the middle, so you are removing a lot of people from the same spot, all in the middle of the network, and leaving the rest of the network untouched. Because they all clump together, you are attacking one local bit of social space and not all the rest of it.

FIGURE 19

To tie this in with what I started talking about at the beginning, I believe these two things could be connected. It is an interesting pattern that you have disassortative non-social networks and assortative social networks. How could that be? There are two sides to the answer to that question, and I will give you one of them: how could you get assortativity in social networks? I believe it comes from this idea of community structure. This is my explanation. If you have people in a bunch of different groups, as in Figure 20 (here the groups are not connected together at all, but I will change that in a moment), then someone in a small group can only have a few friends, because they are in the small group. What is more, the people they are connected to can only have a few friends, because they are in the same small group. If you are in a large group, then you can have lots of friends, and your friends can have lots of friends. Just by virtue of being divided into groups of varying sizes, you get this assortative mixing by degree: people in small groups are connected with other people in small groups, and the same for large groups.

FIGURE 20

You can actually make a slightly more sophisticated model of this, as follows. Figure 21 shows a bunch of people, A through K, and groups 1 through 4 that they belong to. This is a so-called bipartite graph, which I am going to project down onto just the people. In panel (b) you see just the people, with an edge between two people if they share a group in (a). Then I am going to keep just some fraction of those edges, because I am assuming that not everybody knows everybody else in the same group; just because you are in a group together doesn't mean you know everybody in that group. This makes a simple model of what I showed in Figure 20, and it is something we know how to handle mathematically: take a bipartite graph, do a one-mode projection onto one half of it, and then take out some fraction of the edges. In physics, that last step is what we would call bond percolation. We know how to do each of those steps analytically, and we can solve this model and prove that it always gives a positive correlation between the degrees. In theory, that could explain how you get this positive correlation in social networks. I don't have time to talk about the negative half. I would be happy to answer any questions, but I have run out of time and I will stop talking now. Thank you very much.

FIGURE 21
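For readers who want to experiment, here is a minimal sketch of the model just described: a bipartite graph of people and groups, a one-mode projection onto the people, then bond percolation (keeping each projected edge with probability p). The memberships and the value of p are illustrative assumptions, not those of the talk's figure.

```python
import random
import networkx as nx
from networkx.algorithms import bipartite

random.seed(0)
memberships = {1: "ABC", 2: "BDEF", 3: "EG", 4: "FGHIJK"}  # group -> members

B = nx.Graph()
for group, people in memberships.items():
    B.add_edges_from((group, person) for person in people)

people = set().union(*memberships.values())
P = bipartite.projected_graph(B, people)   # acquaintance-by-shared-group

p = 0.6                                    # probability a possible tie exists
P.remove_edges_from([e for e in P.edges if random.random() > p])

# Degree correlation of the result; positive on average, per the talk's
# claim, though noisy for an example this small.
print(nx.degree_pearson_correlation_coefficient(P))
```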

QUESTIONS AND ANSWERS

DR. BLEI: I have a quick question. Your modularity measure: where did that expectation come from? I must have missed it.

DR. NEWMAN: The expected number of connections? Basically, we calculate the expected number conditional on the degree sequence of the network and nothing else. You make a model in which you say: I know the assignment of the vertices to the groups, and I know the degree of each vertex; based only on that, it is relatively straightforward to calculate what the expected number of edges within the groups would be. One could certainly use a more sophisticated model if you knew other information about the network, such as correlations between the degrees of vertices, but in that particular measure we did not do that. So, it was conditional only on group membership and the degrees.

DR. HANDCOCK: Looking at clustering as a canonical problem in social networks, my question is, how does your method compare to other algorithmic methods such as block modeling and generalized block modeling, and also to model-based methods such as Nowicki and Snijders's method?

DR. NEWMAN: That is a good question. There are a lot of methods that have been used in sociology before, some of which work quite well. For me, the advantage of our method over some of these algorithmic methods is not necessarily how well they work but the transparency of the rationale. I feel there is a principle for why this thing should work, so it is clear, for example, when this might not be the appropriate method for your problem, or why it might not work in a particular case. I think it is important to understand those kinds of things when you are doing statistics on networks, because you don't want to be plugging things into a black box; you want a nice simple method so you can see whether you think it will work. There are, of course, other very simple methods that have been proposed by sociologists that work well, like standard hierarchical clustering methods. I think those have many of the same advantages as ours does, but they work in very different ways. For example, in a typical agglomerative hierarchical clustering method, you have to propose some measure of similarity between vertices before you can do the clustering, and the results you get out depend on the particular similarity measure you propose. Again, the good thing about that is that it allows you to see whether it would be appropriate for your problem or not; the bad thing is that it might not be, because you have to make some choice about what you are doing. I would say that our method is doing many of the same things these other methods are doing, and those other methods might indeed be more appropriate for some problems, but we also think our method is more appropriate for others. As I mentioned, many people have worked on this, and there are many other interesting methods, some of which I really like, that have been proposed recently and that I didn't have time to talk about.

DR. BROWN: Let's take one final question.

DR. FEINBERG: The stories you told for the examples, which is the rationale in your answer to Mark, don't quite get at a dimension that was part of the story. That is, are these models really generating the graph, or are they simply observational, descriptive models for the graph the way it appears? When you told the story about how you might create the graph and build bridges, that has a dynamic, generative character, but that is not what you did.

DR. NEWMAN: No, it is not. One could imagine doing that kind of thing; in some sense, the last model that I mentioned here is doing that kind of thing. But no, we didn't do any of those sorts of things. I think there is definitely room for doing them here.

One could, in theory, generate a network like this and then ask how well the generated network fits the real-world data. That would be the forward problem, and we are not doing that in most of the cases we were talking about. You are absolutely right; we are really doing the backward problem.

REFERENCES

Morris, M. 1995. "Data Driven Network Models for the Spread of Infectious Disease." In Epidemic Models: Their Structure and Relation to Data. Cambridge, U.K.: Cambridge University Press.

U.S. Department of Health and Human Services, National Center for Health Statistics. 1995. National Survey of Family Growth, Cycle V. Hyattsville, MD.

Dimension Selection for Latent Space Models of Social Networks

Peter Hoff, University of Washington

DR. HOFF: I am going to talk about estimating parameters in generative models of social networks and statistical models for relational data. What I mean by relational data is data in a two-dimensional array, with row objects and column objects. The row objects could be a variety of things. They could be tumor tissue samples, with the columns being genes or proteins that are being expressed. The rows could be countries, and the columns could also be countries, with the entries measuring interactions between countries; I am going to have a data analysis example of that.

FIGURE 1

In social network analysis the row labels are the same as the column labels, and we are measuring relationships between these nodes. I am going to use the old workhorse of multivariate analysis, the singular value decomposition, to analyze these types of data. Just to fix notation: suppose we have measurements as shown in Figure 1, where y_{i,j} is the measurement in the i-th row and the j-th column; we can think of it as the measurement of the relationship between row object i and column object j.

In a social network data set these are typically 1s and 0s indicating the presence or absence of a tie. When we are analyzing such data sets, what do we want to do? We want to look at patterns in the data set, perhaps relate the network data to some explanatory variables (the x's on the right in Figure 1), and perhaps make predictions, and so on.

Let's start off with a very naive strawman model. Consider what you might first do if you saw these data. You might say: binary data, so I am going to fit a logistic regression relating these explanatory variables to my network observations. You model the log odds of there being a tie between two nodes as β^T x_{i,j}, so β measures the relationship between these variables. A model like this can provide a measure of the association between X and Y, via regression coefficients and standard errors. You can also make predictions of missing or future observations: a lot of the time you have missing data, and you want to predict those data from the observed data. Often what you want to do after this is to find out what sort of patterns there might be after controlling for these explanatory variables.

The very naive logistic regression model says: you have all these data, I am going to model them as independent, and the log odds of each tie is given by β^T x_{i,j}. This type of model does not recognize some very obvious things that we know might be present from multivariate analysis. It doesn't recognize that y_{i,j} and y_{i,k} have something in common: they have the row object, the sender of the tie, in common. This model ignores potential within-node homogeneity. Node i, or row i, has some characteristic about it, and that characteristic might play a role in all of the relationships in that whole row. Similarly, you might have homogeneity within a column. The flip side of homogeneity within a node is across-node heterogeneity; those are two sides of the same coin. This is summarized in Figure 2.

FIGURE 2

Models that ignore this type of homogeneity can wreak havoc with our inferential goals. If we have this type of heterogeneity, the y_{i,j}'s are not statistically independent, so parameter estimates and confidence intervals for these coefficients are suspect. More importantly, if we are trying to make predictions, a model like this will have very bad predictive performance, because we know that y_{i,j} may give more information about, say, y_{i,k} than does an observation from a different row and a different column. We would like models that allow for this sort of within-node, within-row, or within-column homophily, as was discussed in the previous talk.

I guess Mark lifted an overhead from Martina's talk; I am going to lift an overhead from Mark Handcock's paper. Figure 3 shows a network and how within-node homogeneity or across-node heterogeneity might be manifested. One of the first things we might observe is that some of these nodes have just one ingoing or one outgoing tie, whereas other nodes have many, many ties.

FIGURE 3

The first thing we might do to model this type of heterogeneity is to add something to our logistic regression model, and this is perhaps the earliest sort of random-effect or latent variable model used for these types of data. We want to add parameters to model, say, the out-degree of node i and the in-degree of node j: the expansiveness of object i and the popularity of node j. If you include these in your model, you can fit these parameters, and in almost every data set I have ever played around with, doing this dramatically improves the fit of the model, both within sample and out of sample. It dramatically improves how well you are fitting the data, but also, if you measure out-of-sample prediction using cross-validation, you see that adding these parameters tends to dramatically improve the fit. A sketch of this extension in symbols is written out below.
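In symbols, the extension just described adds sender and receiver effects to the strawman logistic regression. The notation here follows common usage for such random-effects models rather than a specific slide:

```latex
\log \frac{P(y_{i,j} = 1)}{P(y_{i,j} = 0)} = \beta^{T} x_{i,j} + a_i + b_j,
```

where a_i captures the expansiveness of row object i and b_j the popularity of column object j.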

another node, that is, if node i ties to node j and node j ties to node k, then more often than not there is also a tie between i and k. Those are the sorts of possibilities we want to allow for in our models of network or relational data. How might we incorporate this in a statistical model? These things can't be represented by row and column effects, so we need something more complicated. You might say: let's just fit an interaction term. That is what a statistician might think, looking at these data: you have row effects, column effects, and any deviations from that would be fit with an interaction. But you don't have enough data to fit a separate parameter for every cell in the array. What people have done, or at least the few people cited here, is come up with reduced-rank models for these interactions, trying to simplify this interaction.

FIGURE 4

One such model is the stochastic blockstructure, or latent class, model of Nowicki and Snijders. This is the idea mentioned before: maybe nodes belong to some class, and nodes form ties preferentially with others in the same class or category. This is shown in Figure 4. The important thing to realize about these models is that this class variable is not observed. You see the data, and then you use the concept of a latent class variable to model the structure that you see. If you see a network with lots of clusters, you can represent that cluster structure with the latent class model. These models can describe the

patterns that you see in the network; you don't see the classes a priori. Another model is to imagine that the nodes lie in some unobserved latent social space, and that ties are formed preferentially between nodes that are close to one another. This type of structure can represent, statistically, the sort of transitivity and clustering we saw earlier. More recently I have grown quite fond of a different way to represent a similar idea, through latent inner-product models. The idea here is that every row has a set of characteristics as a row, every column has a set of characteristics as a column, and the relationship between a row and a column is strong if the row and column latent characteristics line up. I have used this as a symmetric model, and it is useful for relationships that are positive in a symmetric way.

FIGURE 5

This is all leading to what I am going to talk about now: the singular value decomposition, coming up with a model-based approach to it, and incorporating it into network models. As sketched in Figure 5, the idea is to represent this structure in a way that has a network interpretation, so that we can estimate the parameters accordingly. Many of you are familiar with the singular value decomposition. It says that any matrix has a decomposition into a set of left singular

vectors, which I am denoting U, right singular vectors, which I am denoting V, together with weights: this matrix D is a diagonal matrix of positive elements. You can write the (i,j)-th element of the matrix as u_i^T D v_j, with the weights given by D. This sort of decomposition has been used for relational data for many decades, and there is a large literature based on it. The basic idea is that γ_{i,j} is some measurement of the relationship between i and j, and we decompose that relationship by saying there are some row effects for unit i and some column effects for unit j, and the relationship is strong if those two vectors point in the same direction, if they are similar in that way. As for the nuts and bolts: K is the rank of this matrix of gammas, and the Us and Vs are typically constrained to be orthonormal matrices. The interpretation, again, is that each row has latent row characteristics and each column has latent column characteristics, and the model tries to represent the relationship between the nodes using these latent characteristics.

I want to point out that this representation can capture the types of structure we see in social networks. It can represent heterogeneity of out-degrees and of in-degrees, it can represent transitivity and clustering, and it is a well-known mathematical beast, which is nice. I also want to point out that the earlier model with row and column effects, shown in Figure 6, is just a very simple special case: a rank-2 matrix. The singular value decomposition is more general than that.

FIGURE 6
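For concreteness, here is a small numpy check of both claims, the element-wise form γ_{i,j} = u_i^T D v_j and the fact that an additive row-plus-column matrix has rank at most 2. The matrix here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(8, 8))
U, d, Vt = np.linalg.svd(G)                                  # G = U @ diag(d) @ Vt

# Element-wise form: gamma_{i,j} = u_i^T D v_j.
i, j = 2, 5
print(np.allclose(G[i, j], U[i] @ np.diag(d) @ Vt[:, j]))    # True

# An additive row-plus-column matrix a_i + b_j is (generically) rank 2.
a, b = rng.normal(size=8), rng.normal(size=8)
print(np.linalg.matrix_rank(a[:, None] + b[None, :]))        # 2
```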

FIGURE 7

Figure 7 shows how I spent my summer vacation. If you have a data matrix Y, it is well known how to decompose it into left singular vectors, singular values, right singular vectors, and so on. I didn't want to just do that decomposition. I wanted to incorporate the singular value decomposition into a statistical model, and I also wanted to figure out what an appropriate dimension is. Can we represent things with a rank-2 matrix, do we need a rank-3 matrix, and so on? Here is the setup. We observe a data matrix Y, and it is equal to some rank-K mean matrix (remember, any rank-K matrix has a decomposition of this form) plus Gaussian noise. The idea is that we get to observe Y, we know that Y equals some mean matrix plus noise, and we would like to make inference on the mean matrix; in other words, on the Us, the Vs, and the singular values D. That is actually not too hard a problem. The really hard problem was figuring out the dimension. If any of you have worked on making inference on the dimension of a model, you know that is kind of hard.
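The setting just described is easy to simulate. Here is a hedged sketch; the sizes and values are made up, with the true rank set to 5 to match the example that follows.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 20, 5
U0 = np.linalg.qr(rng.normal(size=(N, K)))[0]      # orthonormal latent row factors
V0 = np.linalg.qr(rng.normal(size=(N, K)))[0]      # orthonormal latent column factors
d0 = np.sort(rng.uniform(2.0, 6.0, size=K))[::-1]  # positive singular values
M = U0 @ np.diag(d0) @ V0.T                        # rank-K mean matrix
Y = M + rng.normal(size=(N, N))                    # observed data: mean plus Gaussian noise
```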

FIGURE 8

I know there are statisticians in the audience, so very quickly I am going to do a little statistics. We are going to use the tools of Bayesian inference to estimate these parameters. Figure 8 says a little about the prior distribution for U, which is the same as the prior distribution for V. Remember, U holds the vectors of latent characteristics for the rows, and we are constraining U and V to be orthonormal matrices. How do we come up with a prior distribution for an orthonormal matrix? Here is one way to do it; in many cases the first thing you think of turns out to be the thing you want. Saying that U is orthonormal means that the columns of U are all unit vectors and all orthogonal to each other. How do we generate such a thing? Each column is a point on the N-dimensional sphere. We sample the first one, z_1, uniformly from the N-dimensional sphere and set the first column of U equal to that vector. U_2 has to be orthogonal to U_1, so we sample z_2 uniformly from the (N-1)-dimensional sphere, multiply the null space of U_1 by z_2, and get a point on the sphere that is orthogonal to U_1, and so on. You keep doing that, and finally you have created your orthonormal matrix. The result is that you can work out the distribution of this thing, and it turns out to be the uniform distribution on the set of orthonormal matrices, which is called the Stiefel manifold. More importantly, this probability distribution is exchangeable, which basically means that the conditional distribution of any one column given the others is the same as the distribution of U_k given the first k-1 columns. This is the tool that allows us to do parameter estimation.
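Here is a minimal sketch of that sequential construction in Python. The function name and the QR-based null-space computation are my own choices; this is one way to realize the sampler described, not necessarily the speaker's implementation.

```python
import numpy as np

def runif_stiefel(N, K, rng):
    """Draw an N x K orthonormal matrix uniformly on the Stiefel manifold,
    column by column: each new column is a uniform point on a sphere,
    mapped into the null space of the columns drawn so far."""
    U = np.zeros((N, K))
    z = rng.normal(size=N)
    U[:, 0] = z / np.linalg.norm(z)                  # uniform on the N-sphere
    for k in range(1, K):
        Q = np.linalg.qr(U[:, :k], mode="complete")[0]
        null_basis = Q[:, k:]                        # orthonormal basis of the null space
        z = rng.normal(size=N - k)
        U[:, k] = null_basis @ (z / np.linalg.norm(z))
    return U

U = runif_stiefel(20, 5, np.random.default_rng(3))
print(np.allclose(U.T @ U, np.eye(5)))               # True: columns are orthonormal
```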

There is a lot of really interesting multivariate analysis going on here under the hood, which I am going to gloss over. Basically, the conditional distribution of the jth column given the others has to be orthogonal to all the others, and it is uniform on the space orthogonal to the others. So U_j, given the others, is equal to a random point on the (m - K + 1)-dimensional sphere, times the null space of the others.

Figure 9 shows more of the math; I tried to make it more transparent by using colors. Inference usually works like this: we have the data Y, we have the prior distributions for all the parameters of interest, and we combine those to get the posterior distribution of the parameters we want. We want the probability distribution of the unknown parameters U, V, and D given the data Y. As long as we can find the conditional distribution of each object given all the other objects, we can carry out parameter estimation. The result is that everything works out very nicely, and you get the posterior distribution of the jth column. Suppose you are interested in the jth column of the latent row characteristics. The posterior distribution is proportional to the prior distribution times the probability of the data given the jth column, as shown in Figure 9; this comes from the Gaussian model for Y. Remember, the prior says that the jth column has to be orthogonal to the other columns; it is orthogonal to all the other columns, but it has some non-uniform distribution over where it lies in that space. I would be happy to go over this in more detail later.
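In symbols, the Gibbs step just described has the form below; the notation is mine, reconstructed from the verbal description.

```latex
p\!\left(U_{[,j]} \mid Y,\, U_{[,-j]},\, V,\, D\right)
  \;\propto\;
  p\!\left(U_{[,j]} \mid U_{[,-j]}\right)\, p\!\left(Y \mid U, V, D\right)
```

The prior factor confines U_{[,j]} to the unit sphere within the orthogonal complement of the other columns, and the Gaussian likelihood tilts the distribution on that sphere away from uniformity.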

FIGURE 9

The hardest part of all this was trying to evaluate the dimension. As I said before, this model says each row has some latent characteristics, each column has some latent characteristics, and that is our model for how these things interact. If we have an N×N matrix, we don't want to say that each person has a vector of N characteristics; that model is saturated, and there would be too much noise and variability. On the other hand, we might need more than one or two dimensions. So we would like to evaluate the dimension of the latent characteristics and select a model in a principled way. Basically, the problem reduces to the one shown in Figure 10: is Y equal to a rank-1 matrix plus noise, or is Y equal to a rank-0 matrix plus noise? Trust me, if you can answer this you can address the problem of what the dimension is; if you know how to go from zero to one, you basically know how to go from k to k+1. You need to evaluate the relative probabilities of model 1 and model 0, and that boils down to the hard part: evaluating the probability of the data under model 1 relative to model 0. You have to do some really complicated integration to do this, but it is fun for some people.
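As an illustration only: the marginal-likelihood ratio just described can be crudely approximated by Monte Carlo, averaging the likelihood ratio over uniform draws of u and v on the sphere. Here the singular value d is held fixed, whereas the talk integrates over it too, analytically, via Bessel functions; all names are mine.

```python
import numpy as np

def log_bayes_factor_mc(Y, d=3.0, draws=5000, seed=4):
    """Crude Monte Carlo estimate of log p(Y | rank 1) - log p(Y | rank 0)
    for Y = d*u*v^T + E with unit-variance Gaussian noise and u, v uniform
    on the sphere. For fixed u, v the log-likelihood ratio is d*u^T Y v - d^2/2."""
    rng = np.random.default_rng(seed)
    N = Y.shape[0]
    vals = np.empty(draws)
    for s in range(draws):
        u = rng.normal(size=N); u /= np.linalg.norm(u)
        v = rng.normal(size=N); v /= np.linalg.norm(v)
        vals[s] = d * (u @ Y @ v) - 0.5 * d**2
    m = vals.max()                        # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(vals - m)))
```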

FIGURE 10

I had knee surgery this summer, and I think this problem wouldn't have gotten done if I hadn't had the surgery. After the surgery I had to spend every day lying in my backyard with a pad of paper; I couldn't go out and do fun things, so I sat there and did integrals. The hard part is that you have to integrate the expression at the bottom of Figure 10. Basically, the probability of the data under model 1 can be written as the probability of the data under model 0 times a part that measures whether adding these vectors contributes to explaining the variation in Y. You have to integrate this over the uniform distribution on the sphere for U and V, and that is difficult; it involves some Bessel functions and things like that.

I am now going to show how this might actually work in practice. Those of you who have taken some multivariate data and said, ah, I have genes on the rows and tumors on the columns, and I want to see the effects going on here, maybe you have looked at the singular value decomposition of your data to sort out what is going on. If you have ever looked at plots like Figure 11: basically we decompose the matrix and look at the singular values, which say how much variation there is in the matrix in each dimension. If your data matrix Y were just equal to noise, and if you have played around with these things, you know the plot is going to look like a smoothly decaying graph.
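A quick way to see this contrast, reusing the Y simulated in the earlier sketch; the matplotlib usage is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

noise = np.random.default_rng(10).normal(size=(20, 20))
sv_noise = np.linalg.svd(noise, compute_uv=False)
sv_data = np.linalg.svd(Y, compute_uv=False)   # Y = rank-5 mean + noise, from the sketch above
plt.plot(sv_noise, "o-", label="pure noise: smooth decay")
plt.plot(sv_data, "s-", label="rank-5 mean + noise: gaps")
plt.xlabel("index"); plt.ylabel("singular value"); plt.legend(); plt.show()
```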

FIGURE 11

I simulated some data using some rank K. Does anybody want to hazard a guess as to what the rank of my mean matrix was? It probably wasn't zero. I have a vote for four; that seems to be popular. Why might it be four? If you have played around with these things, you know that you look for gaps in the magnitude from one singular value to the next. You know it is definitely more than zero. In Figure 11 you can see a big gap between 1 and 2, so it is at least 1, but then there is this big gap between 3 and 4, another little gap between 4 and 5, and after that it seems to decrease monotonically. So 4 is a really good guess; the actual answer is 5, as shown in Figure 12. This goes to show how good our eye is at these things.

FIGURE 12

Figure 13 shows the actual singular values that I used to generate the data matrix. Four was a fair guess, because the fifth non-zero singular value is smaller than the others. I used the Bayesian machinery: you can get a posterior distribution for the mean matrix, and then you can look at the posterior mean of all the singular values. If you add up all the blue, it measures the total amount of variation that the model picked up in the data. There is also a posterior distribution on the rank of the matrix. This should convince you that Bayesian methods are good, because you had an intuition about what the rank was, and the Bayesian estimate matched it to a great degree. The posterior distribution says there is a very good chance that there are four or more dimensions to this mean matrix; the two highest-probability ranks are four and five, as you might expect.

FIGURE 13

One other thing about doing Bayesian inference rather than just the singular value decomposition, and here is my little plug for Bayesian analysis: suppose I told you the rank was 5. What is your least-squares estimate of the mean matrix, given that you know the rank is 5? The least-squares estimate is the matrix built from the first five singular values and singular vectors, so you would just be taking the top of the green bars for the first five singular values there. If you take the true matrix that was used to generate the data, take all its entries, and plot them against the least-squares estimate, you get the green pattern shown on the right side of Figure 14.
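In code, the rank-5 least-squares estimate is just the truncated SVD; this continues the earlier simulation, with Y and M as before.

```python
import numpy as np

# Continuing with Y and M from the rank-5 simulation above.
U, d, Vt = np.linalg.svd(Y)
M_ls = U[:, :5] @ np.diag(d[:5]) @ Vt[:5, :]       # rank-5 least-squares estimate
print(np.mean((M_ls - M) ** 2))                    # squared error against the truth
print(M.min(), M.max(), M_ls.min(), M_ls.max())    # the truncated SVD overshoots the range
```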

FIGURE 14

If you take the Bayesian estimate, you get something with a much lower mean squared error. What is happening is that the singular value decomposition takes our data matrix Y and looks at its projections onto these subspaces. The data matrix Y is equal to the mean matrix plus error, so the singular value decomposition projects all the error onto the subspace as well. It is worth pointing out that the range of the truth is -3 to 3, while the range of the singular value decomposition estimate is essentially -4 to 4. It is not atypical of these problems that a least-squares estimate overestimates the variability in the matrix.

The reason I went through all that work is that I wanted to embed the singular value decomposition model into more complicated models for social networks. I wanted to incorporate this into very general models for all sorts of relational data: not just Gaussian data, not just binary data, but data of all sorts. The way I do that is with a generalized linear model. Figure 15 shows the general model for relational data. We have a sociomatrix Y; all of this generalizes to the case where Y is an n × m matrix, so it doesn't have to be square.
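As a preview of the generalized model spelled out next, here is a hedged sketch of the predictor: some link of the mean of y_{i,j} equals a regression part plus the SVD-style latent part. The function name and the choice of links are illustrative assumptions.

```python
import numpy as np

def predictor(Xd, beta, U, D, V):
    """theta_{i,j} = beta^T x_{i,j} + u_i^T D v_j, for all pairs at once.
    Xd has shape (n, m, p); U is n x K, V is m x K, D has length K."""
    return Xd @ beta + U @ np.diag(D) @ V.T

# For binary ties:  P(y_{i,j} = 1) = 1 / (1 + exp(-theta_{i,j}))   (logit link)
# For counts, e.g. conflicts:  E[y_{i,j}] = exp(theta_{i,j})       (log link)
```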

FIGURE 15

We have a set of explanatory variables X, and our model relating X to Y is as follows. We have what we call a predictor, θ, equal to a regression part plus a latent structure part, where the latent structure part is the singular value decomposition model we just talked about. In a generalized linear model it is not the observed data that equals Xβ + Z; rather, some function of the mean equals Xβ + Z. Some function of the mean of y_{i,j} is equal to the regression part plus the latent structure part, with the rest being lack of fit. In this framework you can fit binary data by modeling the log odds of the observations, you can model count data on the log scale, and for continuous data you can model on the additive scale using the identity link.

I should point out that a lot of the data people have been talking about are skewed. I wouldn't call them high-variability data so much as skewed data: it is not just the first and second moments, but the third and fourth moments that really represent that skewed structure. That can also be incorporated here, although with more difficulty; you have to decide on what scale you are making the measurements. Actually, you could handle skewed continuous data by using a log link.

The example in the next two slides is some international conflict data. The model for how these data are really generated is debatable, but it is a very nice data set because we can see the structure. Mark pointed this out in the last talk: it is good to work on

problems where you can tell whether the answer makes sense. I have some colleagues in political science who are interested in modeling conflicts between nations. The data are from Mike Ward and Xun Cao, and what they measure is the number of militarized disputes initiated by country i with target j over a 10-year period. This includes actual militarized disputes and conflicts as well as threats and a number of other recorded actions. They want to relate these to a set of covariates: they have theories about what causes nations to have conflicts with others and what might ameliorate conflicts. There is the concept of the democratic peace, that democratic nations don't have conflicts with each other; there is the idea that trade inhibits conflict; and so on. They like to evaluate these theories and then look at the structure left unexplained by the covariates. Figure 16 provides an overview.

FIGURE 16

We fit this with our model: we model the mean of each observation on the log scale as the regression part plus the latent structure. I was very happy to see the posterior distribution on the number of dimensions of the latent structure shown in Figure 17. Why was I happy? Because it is easy to plot 2-dimensional latent structure. You can plot 3-dimensional latent structure, but this k could have been anywhere between 0 and perhaps 130. If the dimension of the latent structure had been 73, I wouldn't have been able to give you a very good

picture at the end of what is going on, except perhaps by looking at the first few dimensions of the latent structure. I should also point out that if you play with these data enough, you know there is no way the mean structure has dimension less than two. There are lots of countries that are very active in conflict, and lots of countries that are attacked a lot, so you know you are going to have row and column effects, and that is a 2-dimensional structure. Therefore the dimension has to be at least two.

FIGURE 17

The right half of Figure 17 shows some confidence intervals on the regression parameters. There are some obvious things to point out. As distance increases, the amount of conflict decreases: countries have conflicts with countries that are near them. Population also plays a big role: the populations of countries involved in conflicts tend to be higher than average.

FIGURE 18

Figure 18 shows the latent structure, and I will finish by explaining this plot. Recall the model: each country has a 2-dimensional vector of characteristics as an attacker and a 2-dimensional vector of characteristics as a target. What I have done is plot the direction of each country's attacker vector in red, on the outer circle, with the size of its plotting character giving the magnitude of the vector in that direction. The inner circle gives the direction of each country's 2-dimensional latent vector as a target. Then I have drawn green lines indicating the number of conflicts between the countries. You see somewhat obvious things, because we know the names of the rows and the names of the columns; those are the main structures there.

Here is where I see whether I have made myself clear. If there were no latent structure, can anybody guess what this plot would look like, or what would be different about it? Where would the green lines be? If there were no 2-dimensional latent structure, there would be no relationship between the actual links, the attacks between countries, and their positions in this circle; all you would see is a bunch of green lines scattered all over the place. So you can see how these positions really do pick up the structure that is there.

To summarize, my point is that the singular value decomposition is a well-known tool in multivariate analysis that has been used for a long time, with a huge literature based on it, particularly in the additive case in the psychometric literature and in lots of other fields. It is something that is

well understood, although apparently the problem of identifying the dimension of the mean matrix had been left unsolved, at least in a Bayesian context, and actually in a non-Bayesian context as well. Using this framework, the idea is that we can take this natural singular value decomposition model that we are familiar with and incorporate it into different types of data structures, along with regression effects and the other things we use in statistical modeling. That is the main point of the talk.

QUESTIONS AND ANSWERS

QUESTION: One quick question. Could you just say a couple of words about what you are thinking here with the time-series extensions, the dynamic extensions? That is something that is natural to the topics we are going to be covering.

DR. HOFF: I have always known that I should spend more time on the dynamic aspect of networks, but every time I gave a talk on latent variable models, everybody always asked, what is the dimension? So I had to do the dimension first. Now that that is done, there are a number of ways you can do dynamic network inference with these things, or incorporate these ideas. One way is to have a time-series model for the latent characteristics, as a Markov model or something like that; David Banks and Eric Vance are working on things like that, and I have some other ideas about how to set these up in a general autoregressive structure. Their ideas are kind of interesting; they have a behavioral slant. They use ideas from social networks like: I am at this location in the social space, I form a tie with this person, and I should move in this other direction at the next time step. So there are these sorts of dynamical models, and people are starting to work on them with these ideas.

DR. BLEI: Your latent variables seem to model the connections between rows and columns that happen outside of the covariates. I can imagine, say, with countries, that two countries that don't get along, where the covariates explain that, would have latent positions that are far apart. Is there some way to deal with that?

DR. HOFF: In one sense, you could use this approach with a slight modification to let the latent variables model only the variation that is unexplained by those covariates. In a sense, that might be the goal.
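As a hedged illustration of the autoregressive idea mentioned in the first answer above (a toy formulation of my own, not a model from the talk), each node's latent vector could evolve as a stationary random walk over time:

```python
import numpy as np

rng = np.random.default_rng(5)
T, n, K, rho = 10, 30, 2, 0.9
U_t = np.zeros((T, n, K))
U_t[0] = rng.normal(size=(n, K))
for t in range(1, T):
    # AR(1) dynamics: positions drift over time, but the marginal
    # distribution of each latent vector stays stationary.
    U_t[t] = rho * U_t[t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=(n, K))
```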

DR. BLEI: You could actually model those jointly with the connection.

DR. HOFF: Yes; depending on what your goal is, you could do it in a variety of ways. For example, you might want to leave that variable out of the model, fit the latent characteristics, and then plot the latent characteristics against that variable. It depends on what you are trying to get at. One issue with including things in a model is that you are specifying the scale, or saying the effect is additive, and in a lot of cases it is not; it may be multiplicative. There can be an advantage to leaving it out and then plotting it against the latent characteristics afterward. Or you could do what I said earlier: the way I have the model set up, you can constrain the latent characteristics to be orthogonal to things, so you could restrict them to be orthogonal to the design structure of the regression and say, yes, this is the left-over variation. It depends on the issue you are trying to address at the moment.

DR. BLEI: One last question. A lot of these networks, like the Internet, are very sparsely connected, and I imagine that if you tried to fit a model that was just overrun with zeroes, say a logistic regression, you would end up just fitting the zeroes and might not get anything interesting out. Have you had that issue?

DR. HOFF: Yes and no. The networks I have been modeling, these conflict networks that my political science colleagues fit, are extremely sparse, so we actually do face that. One thing my colleague likes to show political science crowds is that if you fit this logistic regression without any latent variables and then make a prediction on where the conflicts are, your best bet is to just predict no conflict anywhere; you predict all zeroes, and the R² is very bad for these types of models. The benefit of doing something like this is that, yes, you still predict mostly zeroes, but you can identify the actors in the system that are active: they get big latent vectors, and you can see what is going on there. You still predict mostly zeroes for most of the countries, but you do pick up the parts that are not sparse, you pick up the activity, and the fit is a lot better.

DR. BANKS: Do you have any thoughts on how to robustify this kind of analysis?

DR. HOFF: Against what?

DR. BANKS: In this business you can wind up with one or two atypical actors that don't quite follow the model you specified for everybody else but, because of the interactivity, they influence so many other things that it gets really hard.

DR. HOFF: This type of approach can actually pick those things up pretty well, because it fits characteristics for each node. If there is a node that is an outlier, its characteristics will be estimated as being way out there; they are very large and

very active, without influencing the rest of the network too much. That sort of outlying node actually fits well with this type of structure. What is harder is when the outlier is not a single very active node but a couple of outlying pairs; then you might get into trouble. Because of the nature of the model, which fits parameters for each node, you are quite robust against outlying nodes.

REFERENCES

Hoff, P.D. 2005. "Bilinear Mixed-Effects Models for Dyadic Data." Journal of the American Statistical Association 100.

Hoff, P.D., A.E. Raftery, and M.S. Handcock. 2002. "Latent Space Approaches to Social Network Analysis." Journal of the American Statistical Association 97.

Nowicki, K., and T.A.B. Snijders. 2001. "Estimation and Prediction for Stochastic Blockstructures." Journal of the American Statistical Association 96.
