

Copyright © National Academy of Sciences. All rights reserved.



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.


OCR for page 11
James Schatz, "Introduction by Session Chair"

James Schatz is the chief of the Mathematics Research Group at the National Security Agency.

DR. SCHATZ: Thank you, Peter. We had a session in February down at the Rayburn Building in Washington to talk about homeland security that the American Mathematical Society sponsored, and I thought I would like, just as an introduction to our first session here, to give some of the remarks we made down there, which I think are relevant here. It is a wonderful privilege to be here today. In these brief remarks I would like to describe the critical role that mathematics plays at the National Security Agency and explain some of the immediate tangible connections between the technical health of mathematics in the United States and our national security. As you may know already, the National Security Agency is the largest employer of mathematicians in the world. Our internal mathematics community is a dynamic professional group that encompasses full-time agency employees, three world-class research centers at the Institute for Defense Analyses that work exclusively for NSA, and a network of hundreds of fully cleared academic consultants from our top universities. As the Chief of the Mathematics Research Group at NSA, and the executive of our mathematics hiring process, I

have been the agency's primary connection to the greater US mathematics community for the past 7 years. The support we have received from the mathematicians in our country has been phenomenal. Our concern for the technical health of mathematics in our country is paramount. Perhaps the most obvious connection between mathematics and intelligence is the science of cryptology. This breaks down into two disciplines: cryptography, the making of codes, and cryptanalysis, the breaking of codes. All modern methods of encryption are based on very sophisticated mathematical ideas. Coping with complex encryption algorithms requires at the outset a working knowledge of the most advanced mathematics being taught at our leading universities and, at the higher levels, a command of the latest ideas at the frontiers of research. Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research, areas of research that are directly related to the mission of the National Security Agency. While advances in telecommunications science and computer science have produced the engines of the information age, that is, the ability to move massive

amounts of digital data around the world in seconds, any attempt to analyze this extraordinary volume of data to extract patterns, to predict behavior of the system or recognize anomalies quickly gives rise to profound new mathematical problems. If you could visit the National Security Agency on a typical work day you would see many, many groups of mathematicians engaged in lively discussions at blackboards, teaching and attending classified seminars on the latest advances in cryptologic mathematics, arguing, exchanging and analyzing wild new ideas, mentoring young talent and, most importantly, pooling their knowledge to attack the most challenging technical problems ever seen in the agency's history. You would hear conversations on number theory, abstract algebra, probability theory, statistics, combinatorics, coding theory, graph theory, logic and Fourier analysis. It would probably be hard to imagine that out of this chaotic flurry of activity and professional camaraderie anything useful could emerge. However, there is a serious sense of urgency underlying every project, and you would soon realize that the mathematicians of NSA are relentless in their pursuit

of tangible, practical solutions that deliver critical intelligence to our nation's leadership. The mathematicians of NSA, the Institute for Defense Analyses and our academic partners are the fighter pilots in a war that takes place in the information and knowledge layer of cyberspace. As Americans you would be very proud of their achievements in the war on terrorism. The National Security Agency's need for mathematicians is extreme right now. Although we hire approximately 50 highly qualified mathematicians per year, we actually require more than that, but the talent pool will not support more. Over 60 percent of our hires have a doctorate in mathematics, about 20 percent a master's and 20 percent a bachelor's degree. We are very proud of the fact that 40 percent of our mathematics hires are women and that 15 percent are from underrepresented minority groups. Of course, the agency depends solely on the greater US mathematics community to educate each new generation of students, but we also depend on the professors at universities across the country to advance the state of mathematics research.

If the US math community is not healthy the National Security Agency is not healthy, and I always like to use an occasion like this to thank everybody here for all they have done for math in this country because our agency benefits so greatly.

Okay, this first session here is on data mining, unsupervised learning and pattern recognition. This is a very exciting, very active area of research for my office and for the agency at large. We attended just recently the SIAM Conference on Data Mining about, when was that, just about a week ago here in Washington, and had a great presence there. It is a wonderful topic. There is absolutely nothing going on in this conference that isn't immediately relevant to NSA and homeland security for us, and this first topic is an area of research that I think we had a bit of a head start on. We have been out there doing this for a few years, but there is a whole lot to learn. It is a young science. So, let me without further ado bring up our first speaker for this session, and that is Professor Jerry Friedman from Stanford University.

Introduction by Session Chair
James Schatz

Perhaps the most obvious connection between mathematics and intelligence is the science of cryptology. This breaks down into two disciplines: cryptography, the making of codes, and cryptanalysis, the breaking of codes. All modern methods of encryption are based on very sophisticated mathematical ideas. Coping with complex encryption algorithms requires at the outset a working knowledge of the most advanced mathematics being taught at our leading universities and, at the higher levels, a command of the latest ideas at the frontiers of research.

Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research. While advances in telecommunications science and computer science have produced the engines of the information age (that is, the ability to move massive amounts of digital data around the world in seconds), any attempt to analyze this extraordinary volume of data to extract patterns, to predict behavior of the system, or to recognize anomalies quickly gives rise to profound new mathematical problems.

Although the National Security Agency hires approximately 50 highly qualified mathematicians per year, it actually requires more than that, but the talent pool will not support more. Of course, the agency depends on the greater U.S. mathematics community to educate each new generation of students, and it also depends on the professors at universities across the country to advance the state of mathematics research. If the U.S. math community is not healthy, the National Security Agency is not healthy.

Jerry Friedman, "Role of Data Mining in Homeland Defense"

Jerry Friedman is a professor in the Statistics Department at Stanford University and at the Stanford Linear Accelerator Center.

PROF. FRIEDMAN: Jim asked me to talk about the role of data mining in homeland defense, and so in a weak moment I agreed to do so, and in looking it over I discovered that, unlike any other areas of the mathematical sciences, there is a well-perceived need among decision makers for data mining on homeland defense. So, I have a few examples.

Here is an excerpt from a recent speech by Vice President Cheney, and he said, "Another objective of homeland defense is to find connections with huge volumes of seemingly disparate information. This can only be done with computers and only then with the very latest in data linkage analysis." So, that was in a recent speech by Vice President Cheney.

Here is a slightly higher decision maker. This is from the President's Office on Homeland Security Presidential Directive 2, and it is a section on the use of the best opportunities for data sharing and enforcement efforts. It says, "Recommend ways in which existing government databases can best be utilized to maximize the ability of the government to identify and locate and apprehend terrorists in the United States. The utility of advanced data-mining software should be addressed."

Here is the trade journal, the Journal of Homeland Security. Technologies such as data mining, as

well as regular statistical analysis can be used at the back end of biodefense communication networks to help analyze implied signatures for covert terrorist events. In addition data mining applications might look for financial transactions occurring as terrorists prepare for their attacks.

Okay, here is the popular press. Computer databases are designed to become a prime component of homeland defense. Once the databases merge the really interesting software kicks in, data mining programs that can dig up needles in gargantuan haystacks.

Okay, as many of you know, DARPA has set up an information awareness office and they were charged among other things to look into biometric speech recognition and machine translation tools, data sharing among the agencies for quick decisions, and knowledge discovery technology (knowledge discovery is another code word for data mining) that uncovers and displays links among people, content and topics.

Here is my favorite. It is not quite germane, but this is a comment by Peter W. Huber, not Peter Huber the statistician but the engineer from MIT, and he said that in this new era of terrorism it will be their sons versus our silicon, a rather startling point, but I think part of our

silicon will be data mining algorithms running on our computers, and the data mining bureau, Interpol and the DARCY(?) coined the phrase MacInt for machine intelligence. It sounds like something that might come from either Apple computer or a hamburger chain, but we need a MacInt or machine intelligence capabilities to provide cuing or early warning from data pertaining to national security.

So, there doesn't seem to be a need to convince decision makers that data mining is relevant to national security issues. Many think it is central for national security issues. So, I think the problem here is not in convincing decision makers of the need for data mining technology but to live up to the current expectations of its capabilities, and that I think is going to be a big job.

Now, what is data mining? Well, data mining is about data and about mining. Okay, let us talk about data. What are the kinds of data we are going to see in homeland security applications? Well, there would be real-time high-volume data streams, massive amounts of archived data, distributed data that may or may not be centrally warehoused (hopefully it will be centrally warehoused, but you can't centrally warehouse it all) and of course many different data types that will have to be merged and

trying to put out products and academic folks need to be tenured and government agencies keep what they are doing secret, and it is all for very valid reasons. If people know how AT&T is going to detect phone fraud, they will get around that. So, it makes it difficult to exchange the science sometimes, I think, because of all the proprietary information, and that is a difficult situation here, but at the same time if we don't keep our sources and methods quiet when we need to they won't be effective either.

DR. STUETZLE: Getting around things is only one problem, and once you know how the system works then you can also flood the system, so you have both options, that you can flood it or get around it, you know. So, that is I guess why the airlines don't want to tell you exactly what they are looking for when they profile you at the gate.

PARTICIPANT: The reason I asked the question is because even with many false negatives you feel the positives are still useful. With many false positives it is untenable and the terrorists may force us to remove the system entirely, and that is why I asked the question.

DR. SCHATZ: Good point.

PARTICIPANT: I don't think it is a question. It is more of a comment. It is not so hard to find a needle in

a haystack. The point is that you often find sometimes the thing that you are looking for is a special feature that occurs in a population, not necessarily for a single individual. The project that we had with fraudulent access to a computer system, sometimes you can just ignore the data and look for some movements or command that is not typical. Very much to what Werner mentioned, your suspicions of the sixties and seventies looking for a robust method, sort of trying hard to avoid, how to evolve more for extreme events, sort of reverse thinking in trying to find these things in the bulk of the data, and then from there on you can sort of try to find the individual. In fact, even in a synthetic model, if you look at Diane's data, if you take Diane's data and say, "Can I detect fraudulent usage of phones?" having the phone bills of individuals for say a year or two, taking just random streams will give you this phone and my phone and somebody else's phone bill but just for a week, you can observe for a week there is ongoing data. Could you find fraud in that type of data, and the answer is probably you could because, like Werner said, if you see calls to Nigeria that may be a good start.

DR. LAMBERT: Maybe I should say something. I wasn't actually trying to say that you can use these for terrorists. This is far beyond what we ever try to do.

The other thing is maybe we are focusing too much on trying to accomplish the final goals, whereas it might be useful just to give people a filtered set of information so that they have less than actually puts that by hand, which is, you know, it is not that we are trying to accumulate analysts. We are so far from that; we are not trying to do that at all. Another thing is that even, you know, actually in detecting fraud you don't have long histories on people, because if people are going to commit fraud they don't go into the system that they have long distance service with, for example. They make a call and access somebody else's system where you have no history whatsoever. So, you are right, being able to handle people is very important, and I will have to defer to the comment about earthquakes. That is just some math that I had which had little symbols on it. I actually don't know. It could have been that they developed the signals and used the signal extraction from 20 years after the original application. You do have to take all the information you have. The trick is to figure out how do you handle it.

DR. CHAYES: Just as a summary, I think I am not someone who knows about any of this, but what I am hoping is that what we are going to get out of all of these sessions are some questions that mathematicians can approach, and so

I have just been writing down some more mathematical questions that have been coming up. That is also one of the things that we want the final report to do, to come out with a list of questions that mathematicians can look at, and I guess the one that has been coming up the most is how do we focus on extreme events, and what I heard from everybody is that we really have to know how to model extreme events properly. So, I am not sure how much of that is the mathematical question.

DR. AGRAWAL: That assumes that you know something.

DR. CHAYES: That assumes that you know something. So, on a general level how do you get extreme events, and it sounds like we are very far away from that. Another one that Jim mentioned was if you have a lot of data how do you visualize the data, and I know that there are people working on this. I am certainly not one of them. I am not sure if there is anyone here who can speak to the question of how do you visualize data.

PARTICIPANT: Not just visualize.

DR. CHAYES: Yes, I mean in a metaphorical sense how do you visualize data, and then there is also the question that seems to me the one that we are furthest along on, which a number of people talked about, which is how

do we randomize data to try to ensure privacy along with security. However, being further along doesn't mean that we are very far. So, it struck me that those are three areas that could set a mathematical agenda, and if anybody has any comments on any of those?

PARTICIPANT: I have one comment. If we are talking about addressing terrorism, are we talking about preventing a small number of events and data mining to prevent these events to make sure that you have got every single individual, and on the other hand there are a number of organizations like Al Qaeda, and maybe we could concentrate more on the structures of these organizations, and then you are not talking about identifying every individual, identifying every possible conspiracy, but identifying plans of the organization.

DR. KARR: I am Alan Karr from the National Institute of Statistical Sciences and I would just like to point out that there is a wealth of techniques associated with preserving privacy in data other than randomization. Randomization has some well-recognized shortcomings in other cases, but I think this point is a lot broader and there is a whole area of statistical disclosure that ought to be brought into this.

DR. LASKEY: With these issues that have been brought up I would like to add one more, which is combining (by the way, I am Kathy Laskey from George Mason University) human expertise with statistical data, and that does in fact have mathematical issues associated with it because of methods where you represent the human knowledge and ability distributions to combine them to data, and there are lots of important innovations in that area. I would also like to point out on the varied events, the importance of outliers of rare events have been mentioned a lot, but the importance of multivariate outliers, data points that are not particularly unusual on any one feature. It is in combination that they are unusual, and in fact in the events leading up to September 11, these people blended in with the society, but if you look at the configuration of their behaviors, if somebody had actually been able to home in on those individuals and say, "Okay, you know, they paid cash for things, plus they were taking flying lessons, plus, they were from the Middle East, plus, this, plus," and then you discover that an Al Qaeda cell was planning to use airplanes as bombs, there were enough pieces that could have been put together ahead of time, not that I am saying that it would be easy, but pieces were not

individually significant enough to set off anybody's warning system. It was the combination that was the issue.

DR. KAFADAR: I am Karen Kafadar from University of Denver. I think I heard someone from the FAA say that actually the airlines did identify something unusual about at least one of those. The response was to recheck the check level, rescreen the check level. There was another variable there. They didn't know that.

DR. SCHATZ: I don't want to cut off any discussions although it looks like we are getting into the lunch break. We will take a couple more quick ones and then I am sure there will be lots of time later to talk.

PARTICIPANT: I am trying to put together a couple of things that seem related. One is that we classify this and we can see this and this and this, and that ought to be intuitively meaningful and it is information that ought to be the model used, but I also have a sense that we are looking for the kinds of things you can see looking back but not forward so easily. In retrospect every newscaster would know what was coming. So, what is the potential for these folks who are analysts, who are in the business of knowing how the targets are changing? Are we talking about being experts in real time participating in the development of a system that might have

to change in time as well, and what are the odds that the system can say, "Is this interesting?" and have them say, "Yes," or "No," and then the system from what the analysts thought of it with the perspective of the analyst looking at it Tuesday of this week instead of a month ago and with all the complexity that you are not going to be able to deal in rules no matter how careful you are; so, is that, I know analysts are probably overworked like everybody else, but maybe you could participate in something like this.

DR. SCHATZ: We do a fair amount of analyzing the analysts if that is what you are asking. I get in trouble at our agency when I talk about rebuilding the analysts because they don't like that, but we do; a lot of our activities and algorithms have to do with on the one hand helping them prioritize data for them that we think they are interested in based on what they have been doing, try to predict things they should have looked at that they are not getting time to get to, but modeling analyst behavior is something that we do all the time and will be more and more important for us, absolutely.

PARTICIPANT: The third time that a rep came to us and said, "Bush is linked to the White House," you know the system should be one because the analyst knows well that that is not interesting.

DR. SCHATZ: Yes, there certainly is; for us, again, we enjoy a population of people to study in that regard that other people don't have access to, but certainly when we do have access to it, and we do, a lot of what we do is studying analyst behavior and trying to correlate: did they pull the document; did they look at a document; did they act on a document, and try to maximize our advantage there, because at the end of the day no matter how many individuals you have it is a minuscule epsilon number compared to the data size. So, what they actually do and act on is critically important. One more, Rakesh?

DR. AGRAWAL: It is not a question. Many times I like to go and look at things, but sometimes I think I wish there was more computational aspect to it. So, in decisions and so on the interesting thing is like the combination of things. There is a lot of very interesting work happening and it is interesting for somebody in this Committee to understand what happened and to look at it, to understand the computational people, and something I very strongly believe, that we don't have hope for doing some of the massive common warehouse kind of things that somebody would pay for. I don't have the experience to look at the kind of data you have; they are critical for commercial testing

in the field and they can be done. So, how would you solve all the complications that you have which essentially assume that there is one data source, but think how would you do all the computations you wanted to do where you have these data sources which are kind of ready to share something through a mode of computation, and these are some of the kinds of data points here which would be useful.

DR. SCHATZ: Very relevant, absolutely. Okay, Andrew?

DR. ODLYZKO: If you look at the broad technology, what we have to know in the next few years is storage capacity.

DR. SCHATZ: Good wrap up. Thanks, everybody. Thanks to the speakers for the morning.

(Applause.)

DR. SCHATZ: Twelve-thirty, here.

(Thereupon, at 11:50 a.m., a recess was taken until 12:40 p.m., the same day.)

Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition
Werner Stuetzle

There appear to be unrealistically high expectations for the usefulness of data mining for homeland security. When a Presidential Directive refers to "using existing government databases to detect, identify, locate, and apprehend potential terrorists," that is certainly an extremely ambitious goal. For example, pinpointing the financial transactions that occur as terrorists prepare for their attack is difficult given that it doesn't take a lot of money to commit terrorist acts.

Using data-mining systems to combat terrorism is more difficult than applying data mining in the commercial arena. For example, to flag people who may be committing calling card fraud, a long-distance company has extensive records of usage. As a result, there are profiles of all users. However, such convenient data are nonexistent when detecting people who might be terrorists. In addition, errors and oversights in the commercial arena are, in general, not terribly costly, whereas charging innocent people with suspected terrorism is unacceptable.

Biometrics will have to be a crucial part of any strategy in order to combat attempted identity theft.