H
Data Mining and Information Fusion

This appendix addresses the science and technology of data mining and information fusion and their utility in a counterterrorism context. The use of these techniques for counterterrorist purposes has substantial implications for personal privacy and freedom. While technical and procedural measures offer some opportunities for reducing the negative impacts, there is a real tension between the use of data mining for this purpose and the resulting impact on personal privacy, as well as other consequences from false positive identification. These privacy implications are primarily addressed in other parts of this report.

H.1
THE NEED FOR AUTOMATED TECHNIQUES FOR DATA ANALYSIS

In the past 20 years, the amount of data retained by both business and government has grown to an extraordinary extent, mainly due to the recent, rapid increase in the availability of electronic storage and in computer processing speed, as well as the opportunities and competitiveness that access to information provides. Moreover, the concept of data or information has also broadened. Information that is retained for analytic purposes is no longer confined to quantitative measurements, but also includes (digitized) photographs, telephone call and e-mail content, and representations of web travels. This new view of what constitutes information that one would like to retain is inherently linked to a broader set of questions to which mathematical modeling has now been profitably applied.




For example, handwritten text can now be considered to be data, and progress in automatic interpretation of handwritten text has already reached the point that over 80 percent of handwritten addresses are automatically read and sorted by the U.S. Postal Service every day. A problem of another type on which substantial progress has also been made is how to represent the information in a photograph efficiently in digital form, since every photograph has considerable redundancy in terms of information content. It is now possible to automatically detect and locate faces in digital images and, in some restricted cases, to identify the face by matching it against a database.

This new world of greatly increased data collection and novel approaches to data representation and mathematical modeling has been accompanied by the development of powerful database technologies that provide easier access to these massive amounts of collected data. These include technologies for dealing with various nonstandard data structures, including representing networks between units of interest, and tools for handling the newer forms of information touched on above. A question not addressed here—but of considerable importance and a difficult challenge for the agencies responsible for counterterrorism in the United States—is how best to represent massive amounts of very disparate kinds of data in linked databases so that all relevant data elements that relate to a specific query can be easily and simultaneously accessed, contrasted, and compared.

Even with these new database management tools, the retention of data is still outpacing its effective use in many areas of application. The common concern expressed is that people are "drowning in data but starving for knowledge" (Fayyad and Uthurusamy[1] refer to this phenomenon as "data tombs"). This might be the result of several disconnects, such as collecting the wrong data, collecting data with insufficient quality, not framing the problem correctly, not developing the proper mathematical models, or not having or using an effective database management and query system. Although these problems do arise, in general, more and more areas of application are discovering novel ways in which mathematical modeling, using large amounts and new kinds of information, can address difficult problems.

Various related fields, referred to as knowledge discovery in databases (KDD), data mining, pattern recognition, machine learning, and information or data fusion (and their various synonyms, such as knowledge extraction and information discovery), are under rapid development and providing new and newly modified tools, such as neural networks, support vector machines, genetic algorithms, classification and regression trees, Bayesian networks, and hidden Markov models, to make better use of this explosion of information.

[1] U. Fayyad and R. Uthurusamy, "Evolving data mining into solutions for insights," Communications of the ACM 45(3):28-31, 2002.

While there has been some overstatement of the gains in certain applications, these techniques have enjoyed impressive successes in many different areas.[2] Data mining and related analytical tools are now used extensively to expand existing business and identify new business opportunities, to identify and prevent customer churn, to identify prospective customers, to spot trends and patterns for managing supply and demand, to identify communications and information systems faults, and to optimize business operations and performance. Some specific examples include:

• In image classification, SKICAT outperformed humans and traditional computational techniques in classifying images from sky surveys comprising 3 terabytes (10^12 bytes) of image data.
• In marketing, American Express reported a 10-15 percent increase in credit card use through the application of marketing using data mining techniques.
• In investment, LBS Capital Management uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million, outperforming the broad stock market.
• In fraud detection, PRISM systems are used for monitoring credit card fraud; more generally, data mining techniques have been dramatically successful in preventing billions of dollars of losses from credit card and telecommunications fraud.
• In manufacturing, CASSIOPEE diagnosed and predicted problems for the Boeing 737, receiving the European first prize for innovative application.
• In telecommunications, TASA uses a novel framework for locating frequently occurring alarm episodes from the alarm stream, improving the ability to prune, group, and develop new rules.
• In the area of data cleaning, the MERGE-PURGE system was successfully applied to the identification of welfare claims for the State of Washington.
• In the area of Internet search, data mining tools have been used to improve search tools that assist in locating items of interest based on a user profile.

[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI Magazine 17(3):37-54, 1996.

Under their broadest definitions, data mining techniques include a diverse set of tools for mathematical modeling, going by such names as knowledge discovery, machine learning, pattern recognition, and information fusion. The data on which these techniques operate may or may not be personally identifiable information, and indeed they may not be associated with individuals at all, although of course privacy issues are implicated when such information is or can be linked to individuals.

Knowledge discovery is a term, somewhat broader than data mining, that denotes the entire process of using unprocessed data to generate information that is easy to use in a decision-making context. Machine learning is the study of the computer algorithms that often form the core of data mining applications. Pattern recognition refers to a class of data mining approaches that are often applied to sensor data, such as digital photographs, radiological images, and sonar data. Finally, data and information fusion are data mining methods that combine information from disparate sources (often so disparate that it is difficult to define a formal probabilistic model to assist in summarizing the information). Information fusion seeks to increase the value of disparate but related information above and beyond the value of the individual pieces of information ("obtaining reliable indications from unreliable indicators").

Because data mining has been useful to decision making in many diverse problem domains, it is natural and important to consider the extent to which such methodologies have utility in counterterrorism efforts, even if there is considerable uncertainty regarding the problems to which data mining can be productively applied.

One issue is whether and to what extent data mining can be effectively used to identify people (or events) that are suspicious with respect to possible engagement in activities related to terrorism; that is, whether various data sources can be used with various data mining algorithms to help select people or events that intelligence agents working in counterterrorism would be interested in investigating further. Data mining algorithms are proposed as being able to rank people and events in order of interest, with the potential to dramatically reduce the number of cases that intelligence agents have to examine.

Of course, human beings would still be required both to set the thresholds that delineate which people would receive further review and which would not (presumably dependent on available resources) and to check the cases that were selected for further inspection prior to any actions. That is, human experts would still decide, probably on an individual basis, which cases were worthy of further investigation.

A second issue is the possibility that data mining has additional uses beyond identifying and ranking candidate people and events for intelligence agents. Specifically, data mining algorithms might also be used as components of a data-supported counterterrorist system, helping to perform specific functions that intelligence agents find useful, such as detecting aliases, combining all records concerning a given individual and his or her network of associates, clustering events by certain patterns of interest, or logging all investigations into an individual's activity history. Data mining could even help with such tasks as screening baggage or containers. Such tools may not specifically rank people as being of interest or not of interest, but they could contribute to those assessments as part of a human-computer system. This appendix considers these possible roles in an examination of what is currently known about data mining and its potential for contributing to the counterterrorism effort.

An important related question is the issue of evaluating candidate techniques to judge their effectiveness prior to use. Evaluation is essential, first, because it can help to identify which among several contending methods should be implemented and whether they are sufficiently accurate to warrant deployment. Second, it is also useful to continually assess methods after they have been fielded, to reflect external dynamics and to enable the methods to be tuned to optimize performance. Also, assuming that these new techniques can provide important benefits in counterterrorist applications, it is important to ask about the extent to which their application might have negative effects on privacy and civil liberties and how such negative effects might be ameliorated. This topic is the focus of Appendix L.

H.2
PREPARING THE DATA TO BE MINED

It is well known by those engaged in implementing data mining methods that a large fraction of the energy expended in using these methods goes into the initial treatment of the various input data files so that the data are in a form consistent with the intended use (data correction and cleaning, as described in Section C.1.2). The goal here is not to provide a comprehensive list of the issues that arise in these efforts, but simply to mention some of the common hurdles that arise prior to the use of data mining techniques so that the entire process is better understood.

The following discussion focuses on databases containing personal information (information about many specific individuals), but much of it applies to more general databases as well. Several common data deficiencies need prior treatment:

• Reliable linkages. Often several databases can be used to provide information on overlapping sets of individuals, and in these cases it is extremely useful to identify which data entries are for the same individuals across the various databases. This is a surprisingly difficult and error-prone process, due to a variety of complications: (1) identification numbers (e.g., Social Security numbers, SSNs) are infrequently represented in databases, and when they are, they are sometimes incorrect (SSNs in particular have deficiencies as a matching tool, since in some cases more than one person has the same SSN, in other cases people have more than one SSN, and some data files attribute the wrong SSNs to people). (2) There are often several ways of representing names, addresses, and other characteristics (e.g., use of nicknames and maiden names). (3) Errors are made in representing names and other characteristics (e.g., misspelled names, switching first and last names). (4) Matching on a small number of characteristics, such as name and birth date, may not uniquely identify individuals. (5) People's characteristics can change over time (e.g., people get married, move, and get new jobs). Furthermore, deduplication—that is, identifying when people have been represented more than once in the same database—is hampered by the same deficiencies that complicate record linkage.

Herzog et al. catalog the myriad challenges faced in conducting record linkage,[3] noting that the ability to correctly link records is surprisingly low given the difficulties listed above (especially for people with common names). The prevalence of errors in names, addresses, and other characteristics in public and commercial data files greatly increases the chances of records either being improperly linked or improperly left unlinked. Furthermore, given the size of the files in question, record linkage generally makes use of blocking variables to reduce the population in which matches are sought; errors in such blocking variables can therefore result in two records for the same individual never being compared. Given that data mining algorithms use as a fundamental input whether the joint activities of an individual or group of individuals are of interest or not, the possibility that these joint activities actually belong to different people (or that activities that are joint are not recognized as joint because the individuals are considered to be separate people) is a crucial limitation to the analysis. (A minimal sketch of blocked record linkage appears after this list.)

[3] T.N. Herzog, F.J. Scheuren, and W.E. Winkler, Data Quality and Record Linkage Techniques, Springer Science+Business Media, New York, N.Y., 2007.

• Appropriate database structure. The use of appropriate database management tools can greatly expedite various data mining methods. For example, the search for all telephone numbers that have either called a particular number or been called by that number can be carried out orders of magnitude faster when the database has been structured to facilitate such a search. The choice of the appropriate database framework can therefore be crucially important. Included in this is the ability to link relevant data entries, to "drill down" to subsets of the data using various characteristics, and to answer various preidentified queries of interest.

• Treatment of missing data. Nonresponse (not to mention undercoverage) is a ubiquitous feature of large databases. Missing characteristics can also result from the application of editing routines that search for joint values of variables that are extremely unlikely and that, if found, are therefore deleted. (A canonical example is a male who reports being pregnant.) Many data mining techniques either require or greatly benefit from the use of data sets with no missing values. To create a data file with the missing values filled in, imputation techniques are used, which collectively provide the resulting database with reasonable properties, under the assumption that the missing data are missing at random. (Missing at random means that the distribution of the missing information does not depend on unobserved characteristics; in other words, missing values have the same joint distribution as the nonmissing values, given the other nonmissing values available in the database.) If the missing data are not missing at random, the resulting bias in any subsequent analysis may be difficult to address. The generation of high-quality imputations is extremely involved for massive data sets, especially those with a complicated relational structure.

• Equating of variable definitions. Very often, when merging data from various disparate sources, one finds information for characteristics that are similar, but not identical, in terms of their definition. This can result from time dynamics (such as similar characteristics that have different reference periods), differences in local administration, geographic differences, and differences in the units of data collection. (An example of differences in variable definitions is different diagnostic codes for hospitals in different states.) Prior to any linkage or other combination of information, such differences have to be dealt with so that the characteristics are made comparable from one person or unit of data collection to the next.

• Overcoming different computing environments. Merging data from different computer platforms is a long-standing difficulty, since it is still common to find data files in substantially different formats (including some data not available electronically). While automatic translation from one format to another is becoming much more common, there remain incompatible formats that can greatly complicate the merging of databases.

• Data quality. Deficiencies in data quality are generally very difficult to overcome. Not only can there be nonresponse and data linkage problems, as indicated above, but there can also be misresponse due to a number of problems, including measurement error and dated responses. (For example, misdialing a phone number might cause one to become classified as a person of interest.) Sometimes the use of multiple sources of data can provide opportunities for verification of information and can be used to update information that is not current. Also, while not a data problem per se, sometimes data (that might be of high quality) have little predictive power for modeling the response of interest. For example, data on current news magazine subscriptions might be extremely accurate, but they might also provide little help in discriminating those engaged in terrorist activities.
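Because blocked record linkage recurs throughout this appendix, a minimal sketch may help make the mechanics concrete. This is an illustration only: the field names, similarity weights, and thresholds below are invented, and a production matcher would estimate its weights from data (e.g., in the Fellegi-Sunter framework mentioned in Section H.9).

```python
# Illustrative sketch of blocked record linkage. All field names, weights,
# and thresholds are hypothetical choices for exposition.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(rec: dict) -> str:
    # Blocking variable: first letter of surname plus birth year. An error
    # in either field means two records for one person are never compared.
    return rec["surname"][:1].upper() + rec["birth_date"][:4]

def link(records_a: list, records_b: list, threshold: float = 0.85):
    blocks = {}
    for rec in records_b:
        blocks.setdefault(block_key(rec), []).append(rec)
    matches = []
    for rec in records_a:
        for cand in blocks.get(block_key(rec), []):
            # Weighted combination of per-field similarities.
            score = (0.5 * similarity(rec["surname"], cand["surname"])
                     + 0.3 * similarity(rec["given"], cand["given"])
                     + 0.2 * (rec["birth_date"] == cand["birth_date"]))
            if score >= threshold:
                matches.append((rec, cand, round(score, 3)))
    return matches

a = [{"surname": "Smith", "given": "Jon", "birth_date": "1970-03-12"}]
b = [{"surname": "Smyth", "given": "John", "birth_date": "1970-03-12"}]
print(link(a, b))  # spelling and nickname variants still score as a likely match
```

The blocking key is what makes linkage tractable at scale, but it is also the fragility noted above: a record with a misspelled surname initial or an erroneous birth year lands in the wrong block, and a true match is silently never considered.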

H.3
SUBJECT-BASED DATA MINING AS AN EXTENSION OF STANDARD INVESTIGATIVE TECHNIQUES

This appendix primarily concerns the extent to which state-of-the-art data mining techniques, by combining information in relatively sophisticated ways, may be capable of helping police and intelligence officers reduce the threat from terrorism. However, it is useful to point out that there are applications of data mining, sometimes called subject-based data mining,[4] that are simply straightforward extensions of long-standing police and intelligence work. Through the benefits of automation, this work can be greatly expedited and broadened in comparison to former practices, thereby providing important assistance in the fight against terrorism. Although the extent to which these more routine uses of data have already been implemented is not fully known, there is evidence of widespread use both federally and in local police departments.

For example, once an individual is under strong suspicion of participating in some kind of terrorist activity, it is standard practice to examine that individual's financial dealings, social networks, and comings and goings in order to identify coconspirators, candidates for direct surveillance, and so on. Data mining can expedite much of this by providing such information as (1) the names of individuals who have been in e-mail and telephone contact with the person of interest in some recent time period, (2) alternate residences, (3) an individual's financial withdrawals and deposits, (4) people who have had financial dealings with that individual, and (5) recent places of travel.

Furthermore, the activity referred to as drilling down—that is, examining the subset of a dataset that satisfies certain constraints—can also be used to help with typical police and intelligence work. For example, knowing several characteristics of an individual of interest, such as a description of their automobile, a partial license plate, and/or partial fingerprints, might be used to produce a much smaller subset of possible suspects for further investigation.

[4] J. Jonas and J. Harper, "Effective counterterrorism and the limited role of predictive data mining," pp. 1-12 in Policy Analysis, No. 584, CATO Institute, Washington, D.C., December 11, 2006.
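To make drilling down concrete, here is a minimal sketch using pandas. The records, column names, and constraints are invented for illustration.

```python
# Illustrative drill-down: successively constraining a dataset of vehicle
# records. All data and column names are hypothetical.
import pandas as pd

vehicles = pd.DataFrame({
    "plate": ["ABC1234", "ABX7710", "QRX7718", "ABX7719"],
    "make":  ["Toyota", "Ford", "Ford", "Ford"],
    "color": ["blue", "green", "blue", "green"],
})

# Witness report: green Ford, plate ending in "771" plus one unknown digit.
subset = vehicles[
    (vehicles["make"] == "Ford")
    & (vehicles["color"] == "green")
    & vehicles["plate"].str.match(r".*771\d$")
]
print(subset)  # two candidate vehicles remain out of the full file
```

Each added constraint shrinks the candidate set; the analytic work lies in choosing constraints reliable enough not to exclude the true subject.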

The productivity and utility of a subject-based approach to data mining depend entirely on the rules used to make inferences about subjects of interest. For example, if the rules for examining the recent places to which an individual has traveled are unrelated to the rules for flagging the national origin of large financial transactions, inferences about activities being worthy of further investigation may be less useful than if these rules are related. Counterterrorism experts thus have the central role in determining the content of the applicable rules, and most experts can make up lists of patterns of behavior that they would find worrisome and therefore worthy of further investigation. For example, these might include the acquisition of such materials as toxins, biological agents, guns, or components of explosives (when the buyers' occupations do not involve their use) by a community of individuals in regular contact with each other. Implemented properly, rule-based systems could be very useful for reducing the workload of intelligence analysts by helping them to focus on subjects worthy of further investigation.
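A rule of the kind just described can be expressed directly in code. The sketch below is a toy rendering under invented data structures and thresholds; in practice, such rules would be authored and tuned by counterterrorism experts, and a triggered rule would prompt human review, not automatic action.

```python
# Illustrative expert-authored rule: flag a group of individuals in regular
# contact whose combined purchases of a restricted material are large, when
# no member's occupation explains the purchases. All names, fields, and
# thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    occupation: str
    purchases_kg: float  # purchases of some material of concern

EXPLAINING_OCCUPATIONS = {"chemist", "farmer", "mining engineer"}

def rule_group_acquisition(group: list, threshold_kg: float = 50.0) -> bool:
    """True if combined purchases exceed the threshold and no member's
    occupation plausibly explains them."""
    total = sum(p.purchases_kg for p in group)
    explained = any(p.occupation in EXPLAINING_OCCUPATIONS for p in group)
    return total >= threshold_kg and not explained

group = [Person("A", "student", 20.0),
         Person("B", "clerk", 25.0),
         Person("C", "driver", 10.0)]
print(rule_group_acquisition(group))  # True: 55 kg total, no explaining occupation
```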

The committee recognizes that when some of the variables in question refer to personal characteristics rather than behavior, issues of racial, religious, and other kinds of stereotyping immediately arise. The committee is silent on whether and under what circumstances personal characteristics do have predictive value, but even if they do, policy considerations may suggest that they not be used anyway. In such a situation, policy makers would have to decide whether the value for counterterrorism added by using them would be large enough to override the privacy and civil liberties interests that might be implicated through such use.

H.4
PATTERN-BASED DATA MINING TECHNIQUES AS ILLUSTRATIONS OF MORE SOPHISTICATED APPROACHES

Originating in various subdisciplines of computer science, statistics, and operations research, the class of data mining techniques relevant to counterterrorist applications includes (1) those that might be used to identify combinations of variables associated with terrorist activities and (2) those that might identify anomalous patterns that experts would anticipate to have a higher likelihood of being linked to terrorist activities. The identification of combinations of variables associated with terrorist activities essentially requires a training set—a set of data representing the characteristics of people (or other units) of interest and those not of interest, so that the patterns that best discriminate between these two groups can be discerned.[5] This use of a training set is referred to as supervised learning.

[5] There is a slightly different definition of a training set when the goal is estimation instead of classification.

The creation of a training set requires the existence of ground truth. That is, for a supervised learning application to learn to distinguish X (i.e., things or people or activities of interest) from not-X (things or people or activities not of interest), the training set must contain a significant number of examples of both X and not-X.

As an example, consider airport baggage inspections. Here, supervised learning techniques can provide an improvement over rule-based expert systems by using feedback loops on training sets to refine algorithms through continued use and evaluation. Machines that use various types of sensing to "look" inside baggage for weapons and explosives can be trained over time to discriminate between suspicious and nonsuspicious bags; given the large volume of training data that can be collected from many airports, such machines might eventually demonstrate greater proficiency than human inspectors. The inputs to such a procedure could include the types of bags, the arrangement of items inside the bags, the images recorded when the bags are sensed, and information about the traveler.

Useful training sets should be very easy to produce in this application, for two reasons. First, many people (sometimes inadvertently) pack forbidden items in carry-on luggage, thereby providing many varied instances of data from which the system could learn. Second, ground truth is available, in the sense that bags selected for further inspection can be objectively determined to contain forbidden items or not. (It would be useful, in such an application, to randomly select bags that were viewed as uninteresting for inspection, to measure the false negative rate.) Furthermore, if necessary, a larger number of examples of forbidden articles could be introduced artificially, increasing the number of examples from which an algorithm might learn to recognize such items.[6]

[6] However, performance would improve only with respect to information contained in images of the bag—because such seeding would necessarily be carried out by a nonrandom set of the population, it would not be possible to improve performance with respect to information about bag owners.
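As a deliberately simplified illustration of supervised learning from labeled ground truth, the sketch below trains a decision tree on invented bag features. The features, data, and model choice are assumptions for exposition, not a description of any fielded screening system.

```python
# Illustrative supervised learning on labeled "bag" records. Features and
# labels are fabricated; a real system would learn from sensor images and
# vastly more data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row: [metal_density, organic_density, item_count]; label 1 = forbidden item found.
X = [[0.9, 0.1, 12], [0.2, 0.8, 5], [0.8, 0.2, 15], [0.1, 0.9, 4],
     [0.7, 0.3, 14], [0.3, 0.7, 6], [0.85, 0.15, 13], [0.15, 0.85, 3]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Ground truth from physical inspection makes accuracy measurable; randomly
# inspecting "uninteresting" bags would estimate the false negative rate.
print("held-out accuracy:", clf.score(X_test, y_test))
print("prediction for a new bag:", clf.predict([[0.75, 0.25, 11]]))
```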

The requirement in supervised learning methods that a training set contain a significant number of labeled examples of both X and not-X places certain limitations on their use. In the context of distinguishing between terrorist and nonterrorist activity, the relative infrequency of terrorist activity means that only a few instances can be included in a training set, and thus learning to discriminate between normal activity and preterrorist activity through use of a labeled training set will be extremely challenging. Moreover, even a labeled training set can miss unprecedented types of attacks, since the ground truth it contains (whether or not an attack occurred) is historical rather than forward-looking.

By contrast, a search for anomalous patterns is an example of unsupervised learning, which is often based on examples for which no labels are available. The definition of anomalous behavior that is relevant to terrorist activity is rather fuzzy, although it can be separated into two distinct types. First, the behavior of an individual or household can be distinctly different from its own historical behavior, although such differences may not (indeed, most often will not) relate specifically to terrorist behavior. For example, credit card use or patterns of telephone calls can be distinctly different from those observed for the same individual or individuals in the past. This is referred to as signature-based anomaly detection. Second, behavior can be distinctly different cross-sectionally; that is, an individual or household's behavior can be distinctly different from that of other comparable individuals or households. Unsupervised learning seeks to identify anomalous patterns, some of which might indicate novel forms of terrorist activity. Candidate patterns must be checked against and validated by expert judgment.

As an example, consider the simultaneous booking of seats on an aircraft by a group of unrelated individuals from the same foreign country, none holding a return ticket. A statistical model could be developed to estimate how often this pattern would occur assuming no terrorism, and therefore how anomalous the circumstance was. If it turned out that such a pattern was extremely common, possibly no further action would be taken. However, if this were an extremely rare occurrence, and assuming that intelligence analysts viewed the pattern as suspicious, further investigation could be warranted.
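A minimal sketch of the signature-based variant follows: new activity is scored against the same individual's historical baseline. The data and the three-standard-deviation cutoff are illustrative assumptions; the cross-sectional variant would instead compare an individual against comparable individuals, and either way a flagged anomaly is only a candidate for expert review.

```python
# Illustrative signature-based anomaly detection: flag activity that departs
# sharply from the same individual's historical baseline. Data and the
# z-score cutoff are invented for illustration.
from statistics import mean, stdev

def anomaly_score(history: list, new_value: float) -> float:
    """Standardized distance of new_value from the individual's own history."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_value - mu) / sigma if sigma > 0 else 0.0

weekly_call_minutes = [42, 38, 51, 45, 40, 47, 44, 39]  # one person's baseline
this_week = 310.0

score = anomaly_score(weekly_call_minutes, this_week)
if score > 3.0:  # a conventional, arbitrary threshold
    print(f"anomalous (z = {score:.1f}); refer to an analyst, not to automatic action")
```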

A more recent class of data mining techniques, still under development, uses relational databases as input.[7] Relational databases represent linkages between units of analysis, and in a counterterrorism context the key example is social networks: people who regularly communicate with each other, for example by telephone or e-mail, and who might be acting in concert. Certainly, if one could produce a large relational database of individuals known to be in communication, it would be useful. One could then identify situations in which each member of a group acquired an uninteresting amount of some chemical but the total amount over all communicating individuals was large enough to warrant attention.

[7] E. Segal, D. Pe'er, A. Regev, D. Koller, and N. Friedman, "Learning module networks," Journal of Machine Learning Research 6(Apr):557-588, 2005.

Multiple and significant roles for expert judgment remain even with the best of data mining technologies. Over time, it may be that more of this expertise can be represented in the portfolio of techniques used in an automated way, but there will always be substantial deficiencies that will require expert oversight to address.

H.7
ISSUES CONCERNING THE DATA AVAILABLE FOR USE WITH DATA MINING AND THE IMPLICATIONS FOR COUNTERTERRORISM AND PRIVACY

It is generally the case that the effectiveness of a data mining algorithm is much more dependent on the predictive power of the data collected than on the precise form of the algorithm. For example, in discriminating between two populations, it typically does not matter much whether one uses logistic regression, a classification tree, a neural net, a support vector machine, or discriminant analysis. Priority should therefore be given to obtaining data of sufficient quality and in sufficient quantity to have predictive value in the fight against terrorism.
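The claim that the choice of algorithm matters less than the predictive power of the data can be checked directly. The sketch below cross-validates two quite different classifiers on the same synthetic data; the dataset and parameters are arbitrary choices for illustration.

```python
# Illustrative comparison: on the same features, quite different classifiers
# often achieve similar accuracy; better features help more than a better
# algorithm. Synthetic data; all parameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5, random_state=0)):
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(acc, 3))
```

On reasonably informative features, the two models typically land within a few points of each other; adding a genuinely predictive feature moves accuracy far more than swapping algorithms does.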

The first step is to ensure that the data are of high quality, especially when they are to be linked. Data derived from record linkages tend to assume the worst accuracies of the original data sets rather than the best. Inaccurate data, regardless of quantity, will not produce good or useful results in this counterterrorism context.

A second step is to ensure that the amount of data is adequate—although, as a general rule, the collection of more data on people's activities, movements, communications, financial dealings, and so on results in greater opportunities for a loss of privacy and the misuse of the information. Portions of the committee's framework provide best practices to minimize the damage done to privacy when information is collected on individuals, but ultimately a policy still needs to be identified that specifies how much additional data should be used for obtaining better results.

Insight into the specifics of the trade-off can be obtained through the use of synthetic data for the population at large (i.e., the haystack within which terrorist needles are hiding) without compromising privacy. At the outset, researchers would use as much synthetic data as they were able to generate in order to assess the effectiveness of a given data mining technique. Then, by removing databases one by one from the scope of the analysis, they would be able to determine the magnitude of the negative impact of each removal. With this analysis in hand, policy makers would have a basis on which to make decisions about the trade-off between accuracy and privacy.

H.8
DATA MINING COMPONENTS IN AN INFORMATION-BASED COUNTERTERRORIST SYSTEM

It is too limiting a perspective to view data mining algorithms only as stand-alone procedures rather than as potential components of a data-supported counterterrorist system. Consider, for example, that data mining techniques have played an essential role in various components of the algorithms that constitute an Internet search engine.

A search engine, at the user level, is not a data mining system but rather a database with a natural query language. However, the component processes of populating this database, ranking the results, and making the query language more robust are all carried out through the essential use of data mining algorithms. These component processes include (1) spell correction, (2) demoting Web sites that use various techniques to inflate their "page rank," (3) identifying Web sites with duplicate content, (4) clustering web pages by concept or similarity of central topic, (5) modifying ranking functions based on the history of users' click sequences, and (6) indexing images and video.

Without these and other features, implemented partially in response to efforts to game search engines, search results would be nearly useless compared with their current value. As these features have been added over the years, they have increased the value of search engines enormously over their initial implementations, and today search engines are an indispensable part of an individual's online experience.

In a somewhat similar way, one can imagine a search engine, in a general sense of the term, designed and optimized for counterterrorist applications. Such a system could, among other things: (a) generalize or specialize the detection of aliases and/or address the ambiguity in foreign names, (b) combine all records concerning a given individual and his or her network of associates, (c) cluster related events by certain patterns of interest and other topics (such as the acquisition of materials and expertise useful for the development of explosives, toxins, and biological agents), (d) log all investigations into an individual's activity history and develop ratings of people as to their degree of interest, and (e) index audio, images, and video from surveillance monitors. All of these are typical data mining applications that do not depend on the existence of training data, and they would seem to be critical components of any counterterrorism system designed to collect, organize, and make available for query information on individuals and other units of interest for possible further data collection, investigation, and analysis. Data mining might therefore provide many component processes of what would ideally be a large counterterrorism system, with human analysts and investigators playing an essential role alongside specific data mining tools.
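As a taste of component (a), the sketch below pairs a simplified phonetic key with edit distance to catch spelling variants of a name. Both algorithm choices and the example names are illustrative assumptions; resolving transliterated foreign names well is a much harder problem than this.

```python
# Illustrative alias detection: a crude phonetic key plus edit distance to
# catch spelling variants of the same name. Real transliteration handling
# (e.g., romanizations of Arabic names) is far harder than this sketch.

def phonetic_key(name: str) -> str:
    """Simplified Soundex-style key: first letter plus consonant class codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, last = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        last = code
    return (out + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def possible_alias(a: str, b: str) -> bool:
    return phonetic_key(a) == phonetic_key(b) or levenshtein(a.lower(), b.lower()) <= 2

print(possible_alias("Mohammed", "Muhammad"))  # True
print(possible_alias("Smith", "Jones"))        # False
```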

Over time, as more data are acquired, as different sources of data are found to be more or less useful, as attempts at gaming are continuously monitored and addressed, and as various additional unforeseen complexities arise and are addressed, a system could conceivably be developed that provides substantial assistance in reducing the risk from terrorism. Few of the necessary components of this idealized system currently exist, so this is not something that could be implemented quickly. However, in the committee's view, the threat from terrorism is very likely to persist, and the committee therefore supports a well-resourced research and development program with the goal of examining the potential effectiveness of such a system.

It is important to point out that each of the component applications listed above is quite nontrivial. For example, component (b), combining all records concerning a given individual and his or her network of associates, would be an extremely complicated tool to develop in a way that is easy to access and use.

It is also useful to point out that when data mining applications are viewed as part of a system, their role, and therefore their evaluation, changes. For example, consider a data mining algorithm that is extremely good at identifying patterns of behavior that are not indicative of terrorist activity but much less effective at identifying patterns that are. Such a component could be useful as a filter, reducing the workload of investigators and thereby freeing up resources to devote to a smaller group of individuals of potential interest. This algorithm would fail as a stand-alone tool, but as part of a system it might perform a useful function.

Development of such a system would certainly be extremely challenging, and success in reducing the threat from terrorism would be a significant achievement. Research and development of such an approach therefore requires the direct involvement of data mining experts of the first rank. What is needed is not simply the modification of commercial off-the-shelf techniques developed for various business applications, but a dedicated collaborative research effort involving both data miners and intelligence analysts, with the goal of developing techniques and tools that do not currently exist.

H.9
INFORMATION FUSION

Another class of data mining techniques, referred to as "information fusion," might be useful in counterterrorism. Information fusion refers to a class of methods for combining information from disparate sources in order to make inferences that may not be possible from any single source. One possible, more limited application to counterterrorism is matching people using a variety of sources of information, including address, name, and date of birth, as well as fingerprints, retinal scans, and other biometric information. A broader application of information fusion is identifying patterns that are jointly indicative of terrorist activity.

With respect to the narrower application of person matching, there are different ways of aggregating information to measure the degree to which the personal information matches. One can develop (a) distance metrics that sum distances computed on the measured quantities themselves, (b) sums of measures of the assessed degree of match for each characteristic, and (c) voting rules that aggregate over whether or not there is a match on each characteristic. There may be advantages, in different applications, to combining information at different levels of the decision process. (A common approach to joining information at level (a) is the Fellegi-Sunter algorithm.) The committee thinks that information fusion might prove helpful in this limited application. However, the problems mentioned above concerning the difficulties of record linkage will greatly reduce the effectiveness of many information fusion algorithms used to assist in person matching.
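To illustrate how the level of combination matters, the sketch below fuses the same per-characteristic evidence in two ways: summing graded match measures, as in level (b), and aggregating binary votes, as in level (c). The fields, scores, and thresholds are invented.

```python
# Illustrative fusion of per-characteristic evidence at two different levels.
# Graded match scores in [0, 1] for one candidate pair; values are invented.
field_scores = {"name": 0.7, "date_of_birth": 1.0, "address": 0.4,
                "fingerprint": 0.55}

# Level (b): combine the graded match measures themselves (here, an average).
score_sum = sum(field_scores.values()) / len(field_scores)
match_by_scores = score_sum >= 0.6

# Level (c): each characteristic first votes match/no-match; graded
# information is discarded before the votes are aggregated.
votes = [s >= 0.6 for s in field_scores.values()]
match_by_votes = sum(votes) > len(votes) / 2

print(f"average score {score_sum:.2f} -> match: {match_by_scores}")   # True
print(f"votes {sum(votes)}/{len(votes)} -> match: {match_by_votes}")  # False
# Near-threshold evidence (0.55, 0.4) counts partially at level (b) but is
# rounded away at level (c), so the two levels can disagree.
```

The two levels disagree precisely because voting discards graded evidence before fusion, which is why the choice of level is itself a design decision.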

Regarding the broader application, consider the problem of identifying whether there is a terrorist threat from the following disparate sources of information: recent meetings of known terrorists, greater than usual movement of funds from countries known to harbor terrorists, and greater than usual purchases of explosives in the United States. Information fusion uses such techniques as the Kalman filter and Bayesian networks to learn how to optimally join disparate pieces of information at different levels of the decision process, by either combining individual data elements or combining higher level assessments for the decision at hand, in order to make improved decisions in comparison to more informal use of the disparate information.

Clearly, information fusion directly addresses an obvious need that arises repeatedly in the attempt to use various data sources and types of data for counterterrorism. Intelligence agencies will have surveillance photographs, information on monetary transactions, information on the purchase of dangerous materials, communications of people with suspected terrorists, movements of suspected people into and out of the country, and so on, all of which will need to be combined in some way to make decisions as to whether to initiate further and more intrusive investigations.

To proceed, information fusion for these broader applications typically requires estimates of a number of parameters, such as conditional probabilities, that model how to link the evidence received at various levels of the decision process to the phenomenon of interest. An example might be the probability that a terrorist act is planned in country B in the next three months, given a monetary movement of more than X dollars from a bank in country A to one in country B in the last six months, the purchase in the last two months of more than the usual amounts of explosives of a certain type, and greater than usual air travel in the last two months by individuals from country A to country B. Clearly, a conditional probability like this would be enormously useful to have, but how could one estimate it? It is possible that this conditional probability could be expressed as an arithmetic function of simpler conditional probabilities under some conditional independence assumptions, but then there is the problem of validating those assumptions in order to link those more primitive conditional probabilities to the desired one.
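A toy version of the decomposition just mentioned: under an assumed conditional independence structure, the desired posterior factors into per-source likelihoods combined by Bayes' rule. Every number below is fabricated purely to show the arithmetic; the text's point is precisely that such numbers, and the independence assumption itself, are very hard to validate.

```python
# Illustrative naive-Bayes fusion of three evidence sources under an assumed
# conditional independence structure. All probabilities are fabricated.
prior = 0.001  # P(attack planned): a rare event

# (P(evidence | attack), P(evidence | no attack)) for each source.
likelihoods = {
    "large funds transfer":      (0.60, 0.02),
    "unusual explosives buys":   (0.40, 0.01),
    "unusual A-to-B air travel": (0.50, 0.05),
}

# Assume all three evidence streams were observed.
p_e_attack = p_e_noattack = 1.0
for p_given_a, p_given_n in likelihoods.values():
    p_e_attack *= p_given_a      # conditional independence given "attack"
    p_e_noattack *= p_given_n    # and given "no attack"

posterior = (p_e_attack * prior) / (p_e_attack * prior + p_e_noattack * (1 - prior))
print(f"P(attack | all three indicators) = {posterior:.3f}")
# A likelihood ratio of 12,000 overwhelms the 1-in-1,000 prior here; small
# changes in the assumed probabilities would move this answer dramatically.
```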

More fundamentally, information fusion for the broader problem of counterterrorism requires a structure that expresses the forms in which information is received and how it should be combined. At this time, especially given the great infrequency of terrorist events, it will be extremely difficult to validate either the above assumptions or the overall structure proposed for use. Therefore, while information fusion is likely to be useful for some limited problems, it does not currently seem likely to be productive for the broad problem of identifying people and events of interest.

H.10
AN OPERATIONAL NOTE

The success of any data mining enterprise depends on the availability of relevant data in the universe of data being mined and on the ability of the data mining algorithms being used to identify patterns of interest.

In the first instance (availability of data), the operational security skills of the would-be terrorists are the determining factor in whether the data are informative. For terrorists planning high-end attacks (e.g., nuclear explosions involving tens or hundreds of thousands of deaths), the means and planning needed for carrying out a successful attack are complex indeed. On one hand, almost by definition, a terrorist group that could carry out such an attack would have a considerable level of sophistication, and it would take great care to minimize its database tracks. Thus, for attacks at the high end, those intending to carry them out may be better able to reduce the evidence of their activities. On the other hand, the complicated planning necessary for these attacks might provide greater opportunity for data mining to succeed. The trade-off in this case is difficult to evaluate.

In the second instance, regarding the identification of patterns of interest against a noisy background, the primary issue is the fact that the means to carry out small-scale terrorist attacks (e.g., attacks that might result in a few to a few dozen deaths) are easily available. Though not a terrorist, the Virginia Tech shooter, for example, killed a few dozen individuals in 2007 with guns purchased over the counter at a gun store. Moreover, the planning needed to carry out such an attack is fairly minimal, especially if the terrorist is willing to die. Thus, those intending to carry out relatively small-scale attacks might in principle leave a relevant database track, but the difficult (and for practical purposes, probably insoluble) problem would be to identify that track and infer terrorist actions against a much larger background of innocuous activity.

For practical purposes, then, data mining tools may be most useful against the intermediate scale of terrorist attack (say, car or truck bombs using conventional explosives that might cause many tens or hundreds of deaths). Moreover, as a practical matter, terrorists must face the possibility of unknown leakages—telltale signs that a terrorist group may not know it is leaving, or human intelligence tips that cue counterterrorism authorities about what to look for (see Box H.2)—and the likelihood of such leakages can be increased by a comprehensive effort that aggressively seeks relevant intelligence information from all sources. This point further underscores the importance of seeing data mining as one element of a comprehensive counterterrorist effort.

BOX H.2
An Illustrative Compromise in Operational Security from a Terrorist Perspective

A conversation between a U.S. person and an unknown individual in Pakistan is intercepted. The call was initiated in the Detroit area from a pay phone using a prepaid phone card, and the conversation was conducted in Arabic. The initiator informs the recipient of the upcoming "marriage" of the initiator's brother in a few weeks, makes reference to the "marriage" of the "dead infidel" some years ago, and says this "marriage" will be "similar but bigger." The recipient cautions the initiator about talking on the telephone and terminates the call abruptly.

The intelligence analyst's interpretation of this conversation is that "marriage" is open code for martyrdom. Interrogation of another source indicates that the association of "marriage" and "dead infidel" is a reference to the Oklahoma City bombing. The analyst's assessment is that a major ANFO or ANNM attack on the continental United States is imminent. Red team analysis concludes that large quantities of ammonium nitrate can be untraceably acquired by making cash purchases that are geographically and temporally distributed. A "tip" such as this phone conversation might well trigger a major ad hoc data mining exercise through previously unsearched databases, such as those of home improvement and gardening suppliers.

H.11
ASSESSMENT OF DATA MINING FOR COUNTERTERRORISM

Past successes in applying data mining techniques in many diverse domains have interested various government agencies in exploring the extent to which data mining could play a useful role in counterterrorism. On one hand, this track record alone is not an unreasonable basis for interest in exploring, through research and development, the potential applicability of data mining for this purpose. On the other hand, the operational differences between the counterterrorism application and the other domains in which data mining has proven its value are significant, and the intellectual burden that researchers must surmount in order to demonstrate the utility of data mining for counterterrorism is high.

As an illustration of these differences, consider first the use of data mining for credit scoring. Credit scoring, as described by Hand and Henley and by Lambert,[10] makes use of the history of financial transactions, current debts, income, and accumulated wealth for a given individual, as well as for similar individuals, to develop models of the behavior of people who are likely to default on a loan and of those who are not. Such histories are extensive and have been collected for many years. Training sets are developed that contain the above information on people who were approved for loans and later paid in full, as well as on those who were approved for loans and later defaulted. Training sets are sometimes augmented by data on a sample of those who would not ordinarily have been approved for a loan but were granted one nonetheless, and on whether or not they later defaulted. Training sets in this application can be used to develop very predictive models that discriminate well between those for whom additional loans would be a good decision on the part of the credit-granting institution and those for whom they would be a bad one.

The utility of training sets in this application benefits from the prevalence of loan defaults: failures to repay are common enough to supply many labeled examples. And while there is great interest in reducing the number of bad loans to the extent possible, missing a small percentage of bad loans is not a catastrophe; false negatives are to be avoided, but a few bad loans are acceptable. While there is a substantial effort to game the process of awarding credit, it has been possible to discover ways to adjust the discriminating models so that they retain their utility. Finally, while applications for credit from those new to the database are problematic, it has also been possible to develop models for initial loan applicants that handle those without a credit history.[11]

[10] D.J. Hand and W.E. Henley, "Statistical classification methods in consumer credit scoring: A review," Journal of the Royal Statistical Society, Series A 160(3):523-541, 1997; also D. Lambert, "What Use is Statistics for Massive Data?," Bell Labs/Lucent Technologies, Murray Hill, N.J., unpublished paper, 2000.

[11] This description ignores some complexities. All loans are not of equal dollar amount, so the quality of a group of loan decisions is not well summarized by the number of mistakes made; the amount loaned in error is also useful to know. Furthermore, it may be profitable to let in some poor loans if more profit is made collectively through the group of loans. Also, there is a selection problem, in that typically it is not known for those rejected for a loan whether that decision was appropriate or not. Finally, external circumstances can change (for example, an economic recession can occur), which may affect the effectiveness of the models used.

By contrast, consider the problem of implementing a "no-fly" list. Although the details of actual programs remain secret, enough is known in the public domain to identify key differences between this problem and that of credit scoring. Some data on behavior relevant to potential terrorist activity (or, more likely, past activity) are available, but they are very incomplete, and the predictive power of the data collected and of the patterns viewed as being related to terrorist activity is quite low. (For example, it is known that having a name similar to that of a person on a terrorist watch list is cause for suspicion and additional screening.) Labeled training sets for supervised learning methods cannot be developed, because the number of people who have attempted to initiate attacks on aircraft and other terrorist activity is extremely small. Furthermore, gaming—for example, the use of aliases and false documentation, including passports—is difficult to adjust to. Finally, as in credit scoring, there is a need for a process to deal with individuals for whom no data are available, but in this application there seems to be much less value in "borrowing information" from other people.

Given these differences, it is not surprising that the base technologies in each example have compiled vastly different track records: data mining for credit scoring is widely acknowledged as an extremely successful application of data mining, while the various no-fly programs (e.g., CAPPS II) have been severely criticized for their high rate of false positives.[12] Box H.3 describes the largely unsuccessful German experience with counterterrorist profiling based on personal characteristics and backgrounds.

[12] Implementing the no-fly list also illustrates the importance of human intervention. In most cases, individuals flagged for further screening are ultimately allowed to board aircraft, although they may miss their flight or suffer further inconvenience or harm. They are allowed to board because, although the data mining technology has flagged them as likely risks, the additional (human-based) screening efforts, though time-consuming, have determined that the individual in question is not likely to be a risk.

BOX H.3
The German Experience with Profiling

In the aftermath of the September 11, 2001, terrorist attacks on the United States, German law enforcement authorities sought to explore the possibilities of using large-scale statistical profiling of entire sectors of the population for the purpose of identifying potential terrorists. An initial profile was developed, largely based on the social characteristics of the known perpetrators of 9/11 (male, 18-40 years old, current or former student, Islamic, legal resident in Germany, and originating from one of a list of 26 Muslim countries). This profile was scanned against the registers of residents' registration offices, universities, and the Central Foreigners' Register to identify individuals matching the defined profile—an exercise that resulted in approximately 32,000 entries.

Individuals in this database were then checked against another database of about 4 million individuals identified as possibly having the knowledge relevant to carrying out a terrorist attack, or as having familiarity with places that could constitute possible terrorist targets. This included, for example, individuals with a pilot's license (or attending a course to obtain one), members of sporting aviation associations, and employees of airports, nuclear power plants, chemical plants, the rail service, laboratories, and other research institutes, as well as students of the German language at the Goethe Institutes.

The comparison of these two databases yielded 1,689 individuals as potential "sleepers." These individuals were investigated at greater length by the German police, but after one year not one sleeper had been identified. Seven individuals suspected of being members of a terrorist cell in Hamburg were arrested, but they did not fit the statistical profile. In the entire profiling exercise, data were collected and analyzed on about 8.3 million individuals—with a null result to show for it. The exercise was terminated after about 18 months (in summer 2003) and the databases deleted. (In April 2006, the German Federal Constitutional Court declared the then-terminated exercise unconstitutional.)

SOURCE: Adapted from Giovanni Capoccia, "Institutional Change and Constitutional Tradition: Responses to 9/11 in Germany," in Martha Crenshaw (ed.), The Consequences of Counterterrorist Policies in Democracies, New York, Russell Sage, forthcoming.

At a minimum, subject-based data mining (Section H.3) is clearly relevant and useful. This type of data mining—for example, structured searches for identifying those in regular contact with known terrorists or identifying those, possibly as part of a group, who are collecting large quantities of toxins, biological agents, explosive material, or military equipment—might well identify individuals of interest who warrant further investigation, especially if their professional and personal lives indicate that they have no need for such material. (Such searches could also result in a large number of false positives that would require human judgment to dispose of.) Such searches are within the purview of law enforcement and intelligence analysts today, and it would be surprising if they were not already being conducted as extensions of standard investigative techniques.

These approaches have been criticized because they are relevant primarily to future events that have a nontrivial similarity to past events, thus providing little leverage in anticipating terrorist activities that are qualitatively different from those carried out in the past. But even if this criticism is valid (and only research and experience will provide such indications), there is definite and important benefit in being able to reduce the risk from known forms of terrorist activity. Forcing terrorists to use new approaches implies new training regimes, new operational difficulties, and new resource requirements—all of which complicate their planning and reduce the likelihood of successful execution.

The jury is still out on whether pattern-based data mining algorithms will be similarly useful, and in particular on whether such techniques could discover subtle, novel patterns of behavior indicative of the planning of a terrorist event that intelligence analysts would not have recognized a priori. Jonas and Harper refer to this kind of data mining as "pattern-based" data mining.[13] The distinction between subject-based and pattern-based data mining is important. Subject-based data mining is focused on terrorist activities that are either precedented (because analysts have some retrospective understanding of them) or anticipated (because analysts have some basis for understanding the precursors to such activities), while pattern-based data mining is focused on future terrorist activities that are unanticipated and unprecedented (that is, activities that analysts are not able to predict or anticipate).

[13] J. Jonas and J. Harper, "Effective counterterrorism and the limited role of predictive data mining," pp. 1-12 in Policy Analysis, No. 584, CATO Institute, Washington, D.C., December 11, 2006.

Subject-based techniques have the advantage of being based on strongly predictive models. For example, being a close associate of someone suspected of terrorist activity and having similar connections to persons or groups of interest are strong predictors that a given person will also be of interest for further investigation. By contrast, pattern-based techniques, in the absence of a training set, are likely to have substantially less predictive power than the subject-based patterns chosen by counterintelligence experts based on their experience—and consequently a very large false positive rate. (Indeed, one might expect such an outcome, since pattern-based techniques, by definition, seek to discover anomalous patterns that are not a priori associated with terrorist activity and therefore have no historical precedents to support them. Pattern-based techniques are also, at their roots, tools for identifying correlations, and as such they do not provide insight into why a particular pattern may arise.)
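The base-rate arithmetic behind the false positive concern is worth making explicit. The numbers below are invented for illustration, and deliberately optimistic; even so, a detector applied to a population in which people of genuine interest are vanishingly rare produces flags that are overwhelmingly false positives.

```python
# Illustrative base-rate arithmetic for a pattern-based detector. All numbers
# are hypothetical.
population = 300_000_000        # people screened
of_interest = 3_000             # assumed number of people of genuine interest
sensitivity = 0.99              # P(flag | of interest)
false_positive_rate = 0.001     # P(flag | not of interest): extremely optimistic

flagged_true = of_interest * sensitivity
flagged_false = (population - of_interest) * false_positive_rate

precision = flagged_true / (flagged_true + flagged_false)
print(f"flags raised: {flagged_true + flagged_false:,.0f}")
print(f"fraction of flags of genuine interest: {precision:.3f}")
# Roughly 300,000 flags, about 99 percent of them false positives, each
# requiring human review.
```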

Jonas and Harper identify three factors that are likely to have a bearing on the utility of data mining for counterterrorist purposes:

• the difficulty of identifying subtle and complex data patterns indicating likely terrorist activity,
• the difficulty of constructing training sets that facilitate the discovery of indicative patterns not previously recognized by intelligence analysts, and
• the high false positive rates that are likely to result from the first two problems.

A number of approaches might be taken to address these problems. For example, as mentioned above, it may be possible to develop training sets by broadening the definition of which patterns of behavior are of interest for further investigation, although doing so raises the false positive rate. Also, it may be possible to reduce the rate of false positives to a manageable level by using a judicious mix of human analysis and different automated tools, though this is likely to be very resource intensive. The committee does not know whether there are a large number of useful behavioral profiles or patterns that are indicative of terrorist activity.

In addition to these issues, a variety of practical considerations are relevant, including the paucity of data, the often-poor quality of primary data, and errors arising from linkage between records. (Section H.2 discusses these issues in more detail.)