Appendix F
Computerized Scoring of Polygraph Data

INTRODUCTION

A critical part of polygraph examination is the analysis and interpretation of the physiological data recorded on polygraph charts. Currently, polygraph examiners rely on their subjective global evaluation of the charts, various partly objective numerical scoring methods, computerized algorithms for chart scoring, or some combination of the three. Computerized systems have the potential to reduce bias in the reading of charts and eliminate problems of imperfect inter-rater variability that exist with human scoring. The extent to which they can improve accuracy depends on how one views the appropriateness of using other knowledge available to examiners, such as demographic information, historical background of the subject, and behavioral observations.1

Computerized systems have the potential to perform such tasks as polygraph scoring better and more consistently than human scorers. This appendix summarizes the committee’s review of existing approaches to such scoring systems. Specifically, it focuses on two systems: the Computerized Polygraph System (CPS) developed by Scientific Assessment Technologies based on research conducted at the psychology laboratory at the University of Utah, and the PolyScore® algorithms developed at Johns Hopkins University Applied Physics Laboratory. We also comment on the Axciton™ and Lafayette™ polygraph instruments that use the PolyScore algorithms.

The statistical methods used in classification models are well developed. Based on a set of data with predictor variables (features in the polygraph test) of known deceptive and nondeceptive subjects, one attempts to find a function of the predictor variables with high values for deceptive and low values for nondeceptive subjects. The conversion of continuous polygraph readings into a set of numeric predictor variables requires many steps and detailed decisions, which we outline below. In particular, we discuss aspects of choosing a small number of these predictors that together do the best job of predicting deception, and we consider the dangers of attempting to use too many variables when the test data set is relatively small.

We examined the two scoring systems with sufficient documentation to allow evaluation. The CPS system has been designed with the goal of automating what careful human scorers currently do and has focused from the outset on a relatively small set of data features; PolyScore has been developed from a much larger set of features, and it is more difficult to evaluate because details of development are lacking. Updates to these systems exist, but their details are proprietary and were not shared with us. The description here focuses on the PolyScore and CPS scoring algorithms since no information is publicly available on statistical methods utilized by these more recently developed algorithms, although the penultimate section includes a summary of the performance of five algorithms, based on Dollins, Krapohl, and Dutton (2000).2

Since the 1970s, papers in the polygraph literature have proffered evidence claiming to show that automated classification algorithms could accomplish the objective of minimizing both false positive and false negative error rates. Our own analyses, based on a set of several hundred actual polygraphs from criminal cases provided by the U.S.
Department of Defense Polygraph Institute (DoDPI), suggest that it is easy to develop algorithms that appear to achieve perfect separation of deceptive and nondeceptive individuals by using a large number of features or classifying variables selected by discriminant analysis, logistic regression, or a more complex data-mining technique. Statisticians have long recognized that such a process often leads to “overfitting” of the data, however, and to classifiers whose performance deteriorates badly under proper cross-validation assessment (see Hastie, Tibshirani, and Friedman [2001] for a general discussion of feature selection). Such overestimation still occurs whenever the same data are used both for fitting and for estimating accuracy even when the appropriate set of features is predetermined (see Copas and Corbett, 2002). Thus, on a new set of data, these complex algorithms often perform less effectively than alternatives based on a small set of simple features. In a recent comparison, various computer scoring systems performed similarly and with only modest accuracy on a common data set used for

validation (see Dollins, Krapohl, and Dutton, 2000). The committee believes that substantial improvements to current numerical scoring may be possible, but the ultimate potential of computerized scoring systems depends on the quality of the data available for system development and application and the uniformity of the examination formats with which the systems are designed to deal.

STATISTICAL MODELS FOR CLASSIFICATION AND PREDICTION

Before turning to the computer algorithms themselves, we provide some background on the statistical models that one might naturally use in settings such as automated polygraph scoring. The statistical methods for classification and prediction most often involve structures of the form:

    response variable = g(predictor variables, parameters, random noise).    (1)

For prediction, the response variable can be continuous or discrete; for classification, it is customary to represent it as an indicator variable, y, such that, in the polygraph setting, y = 1 if a subject is deceptive, and y = 0 if the subject is not. Some modern statistical approaches, such as discriminant analysis, can be viewed as predicting the classification variable y directly, while others, such as logistic regression, focus on estimating functions of it, such as Pr(y = 1). Typically, such estimation occurs conditional on the predictor variables, x, and the functional form, g. Thus, for linear logistic regression models, with k predictor variables, x = (x_1, x_2, x_3, x_4, ..., x_k), the function g is estimated in equation (1) using a linear combination of the k predictors:

    \mathrm{score}(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \cdots + \beta_k x_k,    (2)

and the “response” of interest is

    \Pr(y = 1 \mid x) = \frac{e^{\mathrm{score}(x)}}{1 + e^{\mathrm{score}(x)}}.    (3)

(This is technically similar to choosing g = score(x), except that the random noise in equation (1) is now associated with the probability distribution for y in equation (3), which is usually taken to be Bernoulli.)
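As a minimal sketch of how a linear score of the form in equation (2) is turned into a probability, the following Python fragment applies the standard logistic transform. The coefficient values, the feature values, and the 0.5 cutoff are hypothetical and chosen only for illustration; they are not taken from CPS or PolyScore.

```python
import math

def score(x, beta0, betas):
    """Linear score: beta0 + sum of beta_i * x_i over the k predictors."""
    if len(x) != len(betas):
        raise ValueError("x and betas must have the same length")
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def prob_deceptive(x, beta0, betas):
    """Logistic transform of the score, giving an estimate of Pr(y = 1 | x)."""
    s = score(x, beta0, betas)
    # Numerically stable evaluation of exp(s) / (1 + exp(s)).
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

# Hypothetical coefficients and feature values, for illustration only.
beta0, betas = -0.5, [1.2, -0.8, 0.3]
x = [0.9, 0.1, 0.4]

p = prob_deceptive(x, beta0, betas)
label = 1 if p >= 0.5 else 0  # the 0.5 cutoff is an assumption, not from the text
```

A score of zero maps to a probability of one half, so with this cutoff the sign of the score alone determines the predicted class; any other cutoff corresponds to shifting the separating hyperplane.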
The observations on the predictor variables here lie in a k-dimensional space and, in essence, we are using an estimate of the score equation (2) as a hyperplane to separate the observations into two groups, deceptives and

nondeceptives. The basic idea of separating the observations remains the same for nonlinear approaches as well. Model estimates do well (e.g., have low errors of misclassification) if there is real separation between the two groups.

Model development and estimation for such prediction/classification models involve a number of steps:

1. Specifying the list of possible predictor variables and features of the data to be used to assist in the classification model (1). Individual variables can often be used to construct multiple prediction terms or features.
2. Choosing the functional form g in model (1) and the link function to the classification variable, y, as in equation (3).
3. Selecting the actual features from the feature space to be used for classification.
4. Fitting the model to data to estimate empirically the prediction equation to be used in practice.
5. Validating the fitted model through classification of observations in a separate dataset or through some form of cross-validation.

Hastie, Tibshirani, and Friedman (2001) is a good source of classification/prediction models, cross-validation, related statistical methodologies and discussions that could be applied to the polygraph problem.

Recently, another algorithmic approach to prediction and classification problems has emerged from computer science, which is also called data mining. It focuses less on the specification of formal models and treats the function g in equation (1) more as a black box that produces predictions. Among the tools used to specify the black box are regression and classification trees, neural networks, and support vector machines. These still involve finding separators for the observations, and for any method one chooses to use, step 1 and algorithmically oriented analogues of steps 2-5 listed above still require considerable care.

Different methods of fitting and specification emphasize different features of the data.
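The fitting and validation steps listed above can be illustrated with a small k-fold cross-validation sketch in Python. The one-feature "midpoint" classifier, the function names, and the toy data are all assumptions made for illustration; they do not reproduce any procedure used by CPS or PolyScore.

```python
import random
from statistics import mean

def fit_midpoint(xs, ys):
    """Fit a one-feature rule: threshold halfway between the two class means."""
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    return (m0 + m1) / 2.0, m1 >= m0  # threshold, and whether high values mean y = 1

def predict(model, x):
    threshold, high_is_one = model
    return int((x > threshold) == high_is_one)

def k_fold_error(xs, ys, k=5, seed=0):
    """Misclassification rate estimated on cases not used for fitting."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training cases only, then score the held-out cases.
        model = fit_midpoint([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(predict(model, xs[i]) != ys[i] for i in held_out)
    return errors / len(xs)

# Toy data with real separation between the two groups (hypothetical values).
xs = [i / 10 for i in range(10)] + [2 + i / 10 for i in range(10)]
ys = [0] * 10 + [1] * 10
cv_error = k_fold_error(xs, ys, k=5)
```

Each case is classified by a model that never saw it during fitting, which is the property that makes the resulting error rate a fairer estimate than one computed on the fitting data. The sketch assumes every training fold contains cases from both classes.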
The standard linear discriminant analysis is developed under the assumption that the distributions of the predictors for both the deceptive group and the nondeceptive group are multivariate normal, with equal covariance matrices (an assumption that can be relaxed), which gives substantial weight to observations far from the region of concern for separating the observations into two groups. Logistic regression models, in contrast, make no assumptions about the distribution of the predictors, and the maximum likelihood methods typically used for their estimation put heavy emphasis on observations close to the boundary between the two sets of observations. Common experience with all prediction models

of the form (1) is that with a large number of predictor variables, one can fit a model to the data (using steps 1 through 4) that completely separates the two groups of observations. However, implementation of step 5 often shows that the achieved separation is illusory. Thus, many empirical approaches build cross-validation directly into the fitting process and set aside a separate part of the data for final testing.

The methods used to develop the two computer-based scoring algorithms, CPS and PolyScore, both fit within this general statistical framework. The CPS developers have relied on discriminant function models, and the PolyScore developers have largely used logistic regression models. But the biggest differences that we can discern between them are the data they use as input, their approaches to feature development and selection, and the efforts that they have made at model validation and assessment. The remainder of this appendix describes the methodologies associated with these algorithms and their theoretical and empirical basis.

DEVELOPMENT OF THE ALGORITHMS

A common goal for the development of computer-based algorithms for evaluating polygraph exams is accuracy in classification, but the devil is in the details. A proper evaluation requires an understanding of the statistical basis of classification methods used, the physiological data collected for assessment, and the data on which the methods have been developed and tested.

CPS builds heuristically on the Utah numerical manual scoring, which is similar in spirit to the Seven-Position Numerical Analysis Scale, a manual scoring system currently taught by DoDPI. PolyScore, in contrast, does not attempt to recreate the manual scoring process that the examiners use. Neither appears to rely on more fundamental research on information in the psychophysiological processes underlying the signals being recorded, except in a heuristic fashion.
CPS was developed by Scientific Assessment Technologies based on research conducted at the psychology laboratory at the University of Utah by John Kircher and David Raskin (1988) and their Computer Assisted Polygraph System developed in the 1980s. While the latter system was developed on data gathered in the laboratory using mock crime scenarios, the newer CPS versions have been developed using polygraph data from criminal cases provided by U.S. Secret Service Criminal Investigations (Kircher and Raskin, 2002). The CPS scoring algorithm is based on standard multivariate linear discriminant function analysis followed by a calculation that produces an estimate of the probability of truthfulness or equivalently, deception (Kircher and Raskin, 1988, 2002). The most recent version utilizes three features in calculating a discriminant score: skin

conductance amplitude, the amplitude of increase in the baseline of the cardiograph, and combined upper and lower respiration line-length (excursion) measurement (Kircher and Raskin, 2002).

PolyScore was developed by Johns Hopkins University Applied Physics Laboratory (JHU-APL), and version 5.1 is currently in use with the Axciton and Lafayette polygraph instruments. The algorithm has been developed on polygraph tests for actual criminal cases provided by the DoDPI. The input to PolyScore is the digitized polygraph signal, and the output is a probability of deception based either on a logistic regression or a neural network model. The PolyScore algorithm transforms these signals on galvanic skin response, blood pressure (cardio), and upper respiration into what its developers call “more fundamental” signals that they claim isolate portions of the signals that contain information relevant to deception. It is from these signals that the PolyScore developers extracted features for use, based on empirical performance rather than a priori psychophysiological assumptions.

The next sections describe how the two algorithms treat data used, signal processing, feature extraction, statistical analysis, and algorithm evaluation. These descriptions provide the basis for a discussion of possible future efforts at algorithm development and assessment. Since virtually all of the development and testing of algorithms has been done on specific-incident data, with highly varying formats and structures, some of the observations and comments on the algorithms may not always have as much relevance to highly structured screening polygraph tests, like the Test for Espionage and Sabotage (TES), but other problems, such as low base rates, do have salience for the TES. The final sections of this appendix on algorithm evaluation and summary describe some of these issues.

Data Used

Current polygraph machines typically record four signals during a polygraph examination: thoracic and abdominal respirations, a cardiovascular signal, and an electrodermal signal. Differences between specific analog and digital machines exist in the recording of the physiological measurements. Sampling rates may vary between different systems. Analog to digital conversion, filtering, and pen adjustments may also vary. One crucial difference lies in the recording of the electrodermal channel, which is believed by many polygraph researchers to be the most diagnostic (Kircher and Raskin, 2002). Stoelting (and CPS) records skin conductance; Lafayette appears to record skin resistance, a signal that requires further filtering in order to stabilize the baseline of the response; Axciton actually uses a hybrid of skin resistance and skin conductance

(Dollins, Krapohl, and Dutton, 2000) (see the discussion of the advantages and disadvantages of these two measures in Appendix D). Kircher and Raskin (2002) provide more details on the physiological recordings and conversion of analog to digital signal, although they focus mainly on the procedures used by CPS. These matters are, in effect, precursors to the development of automated scoring algorithms, which presume that the analyzed signals “accurately” reflect the psychophysiological phenomena that are capable of distinguishing deception and nondeception.

PolyScore® 3.0 was developed by analyzing polygraph data from 301 presumed nondeceptive and 323 presumed deceptive criminal incident polygraph examinations, with six Axciton instruments. The apparatus specifications for these cases are not available. “Truth” for these cases was obtained in three ways: confession or guilty plea, consensus on truthful subjects by two or more different examiners, or confirmed truthful. Version 5.1 of PolyScore used Zone Comparison Test (ZCT) and Modified General Question Test (MGQT) data from 1,411 real cases (J. Harris, personal communication, Johns Hopkins University Applied Physics Laboratory, 2001). Chapters 2 and 4 of this report describe many of the biases that can result from the use of field cases selected from a larger population on the basis of truth and point out that consensus among multiple examiners is not acceptable as a criterion of deceptive/nondeceptive status. In effect, the use of such data can be expected to produce exaggerated estimates of polygraph accuracy. Nonetheless, most of the discussion that follows sets these concerns aside.

Using field data, especially from criminal settings, to develop algorithms poses other difficulties. Actual criminal case polygraphs exhibit enormous variability in the subject of investigation, format, structure, administration, etc.
These data are hard to standardize for an individual and across individuals in order to develop generalizable statistical procedures.

We analyzed polygraph data from 149 criminal cases using the ZCT and MGQT test formats, data that overlapped with those used in the development of PolyScore. Besides differences in the nature of the crime under investigation, our analyses revealed diverse test structures, even for the same test format, such as ZCT. The questions varied greatly from test to test and were clearly semantically different from person to person, even within the same crime. The order of questions varied across charts for the same person. In our analyses, we found at least 15 different sequences for relevant and control questions, ignoring the positioning of the irrelevant questions. The number of relevant questions asked varied. Typically, there were three relevant questions. Accounting for irrelevant/control questions substantially increases the number of possible sequences. These types of differences across cases pose major problems for both within- and between-subject analyses, unless all the responses are averaged. Finally, in the cases we examined there was little or no information available to control for differences among examiners, examiner-examinee interactions, delays in the timing of questions, etc. Some of these problems can be overcome by careful systematic collection of polygraph field data, especially in a screening setting, and others cannot. Controlling for all possible dimensions of variation in a computer-scoring algorithm, however, is a daunting task unless one has a large database of cases.

The laboratory or mock crime studies so commonly found in the polygraph literature typically remedy many of these problems, but they have low stakes, lack realism, and do not replicate the intensity of the stimulus of the real situations. Laboratory test formats are more structured. The same sequence of questions is asked of all the subjects, making these exams more suitable for statistical analysis. For laboratory data, the experimental set-up predetermines a person’s deceptive and nondeceptive status, thus removing the problem of contaminated truth. Laboratory studies can have more control over the actual recording of the measurements and running of the examinations, as well as information on examiners, examinees, and their interactions. A major shortcoming of laboratory polygraph data for developing computer-based algorithms, however, is that they do not represent the formats that will be ultimately used in actual investigations or screening settings.
Similarly, laboratory subject populations differ in important ways from those to whom the algorithms will be applied.

Signal Processing

With modern digital polygraphs and computerized systems, the analog signals are digitized, and the raw digitized electrodermal (skin conductance), cardiovascular and respiratory (abdominal and thoracic) signals are used in the algorithm development. The analog-to-digital conversion process may vary across different polygraph instruments. We were unable to determine Axciton instrument specifications. Kircher and Raskin (1988) provide some procedures used by Stoelting’s polygraph instruments for CPS. Once the signals have been converted, the primary objective of signal processing is to reduce the noise-to-information ratio.

This traditionally involves editing of the data, e.g., to detect artifacts and outliers, some signal transformation, and standardization.

Artifact Detection and Removal

Artifacts indicate distortions in the signal that can be due to the movement of the examinee or some other unpredicted reactions that can modify the signal. Outliers account for both extreme relevant and control responses. The PolyScore algorithms include components for detecting artifacts and deciding if a signal is good or not. Kircher and Raskin (2002) report that they developed algorithms for artifact removal and detection in the 1980s, but they were not satisfied with their performance and did not use them as a part of CPS. Thus, examiners using CPS need to manually edit artifacts before the data are processed any further.

PolyScore tests each component of each question for artifacts and outliers. If any are detected, the algorithms remove those portions of the record from scoring, but examiners can review the charts and change the labeled artifacts, if they find it appropriate. Olsen et al. (1997) report that PolyScore labels a portion of a record as an extreme reaction (outlier) if it accounts for more than 89 percent of the variability among all the responses on the entire polygraph exam for a person; although the precise meaning of this is not totally clear, a portion of the individual’s data would probably need to be totally off the scale to account for so much of the variation. The committee was told that the PolyScore algorithms are proprietary and not available for evaluation. Thus, we were unable to examine the appropriateness of the procedures used in connection with artifact adjustment and the accuracy of any of the related claims.

Signal Transformation

A second step in data editing is signal transformation. Both CPS and PolyScore algorithms transform the raw digitized signals in different ways, but with a common goal of further signal enhancement.
PolyScore detrends the galvanic skin response and cardio signals by removing the “local mean,” based on 30-second intervals both before and after the point, from each point in the signal, thus removing long-term or gradual changes unrelated to a particular question. This removes pen adjustments caused by the examiner. After detrending, PolyScore separates the cardio signal through a digital filter into the high-frequency portion representing pulse and the low-frequency component corresponding to overall blood volume. The derivative of the detrended blood volume then measures the rate of change and uncovers the remnants of the

pulse in the blood volume signal, which are further eliminated by a second filter. The respiration signal, like the cardio signal, has two frequency components: a high frequency corresponding to each breath and a low frequency representing the residual lung volume. Baselining, achieved by matching each low point of exhalation between breaths to a common level, separates these frequencies and makes it easier to compare the relative heights of breaths (Harris et al., 1994).

CPS creates response curves (waveforms) for the digitized signals of skin conductance, thoracic respiration, and abdominal respiration by the sequence of stored poststimulus samples for a 20-second period following the onset of each question (Kircher and Raskin, 1988). To produce the blood pressure response waveform, CPS averages the systolic and diastolic levels for each second. Finger pulse amplitude is a second-by-second waveform like the blood pressure. However, this waveform is the difference of diastolic and systolic levels, not the average. Diastolic levels at 2 seconds prestimulus and 20 seconds poststimulus are subtracted from the corresponding systolic levels. Twenty poststimulus ratios are calculated by dividing each poststimulus amplitude by the average of the two pre-stimulus values. Each proportion is then subtracted from unity, reflecting the finger pulse amplitude waveform that rises with decrease in amplitude of finger pulse. Features are extracted from the times and levels of inflection points.

Signal Standardization

PolyScore performs signal standardization to standardize the extracted features; CPS does not. Harris et al. (1994) stress the importance of this step in the development of PolyScore. The goal of this step is to allow amplitude measurements across different charts or individuals to be scored by a common algorithm.
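To make the idea of standardization concrete, the following Python sketch centers and scales a series of amplitude measurements in two ways: with the mean and standard deviation, and with the median and interquartile range. The function names and the sample values are hypothetical, and the quartile convention (the default of `statistics.quantiles`) is an assumption; the exact computation inside PolyScore is not documented in the material available to us.

```python
from statistics import mean, median, pstdev, quantiles

def zscore(values):
    """Center on the mean and scale by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("values have zero spread")
    return [(v - m) / s for v in values]

def robust_scale(values):
    """Center on the median and scale by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)  # 25th, 50th, and 75th percentiles
    med, iqr = median(values), q3 - q1
    if iqr == 0:
        raise ValueError("values have zero interquartile range")
    return [(v - med) / iqr for v in values]

# Hypothetical amplitudes with one extreme reading at the end.
amplitudes = [1, 2, 3, 4, 5, 6, 7, 100]
z = zscore(amplitudes)
r = robust_scale(amplitudes)
```

With an extreme value present, the mean and standard deviation are both pulled toward it, so the ordinary readings are compressed into a narrow band of z-scores. The median and interquartile range are largely unaffected, so the ordinary readings keep their spread and the extreme reading stands out clearly.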
Typically, standardization is performed by subtracting the mean of the signal from each data point and dividing this difference by the standard deviation. JHU-APL points out that since the data contain outliers, this method is inaccurate, and thus PolyScore standardizes by subtracting the median from each data point and dividing the difference by the interquartile range (the 1st and 3rd quartiles are used, corresponding to the 25th and the 75th percentiles).

Feature Extraction

The discussion of general statistical methodology for prediction and classification at the beginning of this appendix noted the importance of feature development and selection. The goal is to obtain a set of features from the raw data that can have some relevance in modeling and classification of internal psychological states, such as deception. For polygraph data, a feature can be anything measured or computed that represents an emotional signal. The mapping between psychological and physiological states remains a substantial area of investigation in psychophysiology. Some commonly used features in the manual scoring are changes in amplitude in respiration, galvanic skin response and cardiovascular response, changes in baseline of respiration, duration of a galvanic skin response, and change in rate of cardiovascular activity. Computerized analysis of digitized signals offers a much larger pool of features, some of them not easily observable by visual inspection.

The general psychophysiological literature suggests describing the skin conductance response using such features as level, changes in the level, frequency of nonspecific responses, event-related response amplitude, latency, rise time, half recovery time, number of trials before habituation, and rate of change of event-related amplitude. Dawson, Schell, and Filion (2000) note that the rise time and half recovery time might be redundant measures and not as well understood as amplitude in association with psychophysiological responses. Similarly, cardiovascular activity is typically analyzed using heart rate and its derivatives, such as the heart rate variability or the difference of the maximum and minimum amplitudes. Brownley, Hurwitz, and Schneiderman (2000), however, state that reliability of heart rate variability as a measure is controversial, and they suggest the use of respiratory sinus arrhythmia, which represents the covariance between the respiratory and heart rate activity. This approach implies a need for frequency-domain analysis in addition to time-domain analysis of the biological signals. Harver and Lorig (2000) suggest looking at respiratory rate and breathing amplitude as possible features that describe respiratory responses.
They also point out that recording changes only of upper or only of lower respiration is not adequate to estimate relative breathing amplitude. In general, area measures (integrated activity over time) are less susceptible to high-frequency noise than peak measures, but amplitude measurements are more reliable than latency (Gratton, 2000). Early research focusing specifically on the detection of deception suggested that the area under the curve and amplitudes of both skin conductance and cardiovascular response can discriminate between deceptive and truthful subjects. Other features investigated included duration of rise to peak amplitude, recovery of the baseline, and the overall duration of the response. Kircher and Raskin (1988) report that line length, the sum of absolute differences between adjacent sample points, which captures some combination of rate and amplitude, is a good measure of respiration suppression. Harris (1996, personal communication) reports that the initial feature

[...]ated error rates. One can then choose the model with the smallest combination of error rates. While this strategy may be feasible when the number of features is small, even the preliminary list of 12 features used in the development of the CPS algorithm poses problems. According to Kircher and Raskin (2002), they performed all-possible-subset regression analysis, but they do not provide details on possible transformations considered or how they did cross-validation. When the number of features is larger, the exhaustive approach is clearly not feasible.

If one has a small training set of test data (and repeatedly uses the same test data) one can obtain features that are well suited for that particular training or test data but that do not constitute the best feature set in general. One also needs to be careful about the number of selected features. The larger the number of features or variables, the more likely they will overfit the particular training data and will perform poorly on new data.

The statistical and data-mining literatures are rife with descriptions of stepwise and other feature selection procedures (e.g., forward selection, backward elimination, etc.), but the multiplicity of models to be considered grows as one considers transformations of features (every transformation is like another feature) and interactions among features. All of these aspects are intertwined: the methodological literature fails to provide a simple and unique way to achieve the empirical objectives of identifying a subset of features in the context of a specific scoring model that has good behavior when used on a new data set. What most statisticians argue is that fewer relevant variables do better on cross-validation, but even this claim comes under challenge by those who argue for model-free, black-box approaches to prediction models (e.g., see Breiman, 2001).
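The danger of selecting features and judging accuracy on the same small data set can be illustrated with a short simulation. In the Python sketch below, every candidate feature is pure noise, so no rule can truly do better than chance. The sample sizes, the median-cut rule, and all names are hypothetical choices made for illustration only.

```python
import random

def rule_accuracy(feature, labels, cut, high_is_one):
    """Fraction of cases classified correctly by a single-cut rule."""
    hits = sum(((x >= cut) == high_is_one) == bool(y) for x, y in zip(feature, labels))
    return hits / len(labels)

def fit_rule(feature, labels):
    """Use a median cut and keep whichever direction scores best on these data."""
    cut = sorted(feature)[len(feature) // 2]
    acc = rule_accuracy(feature, labels, cut, True)
    return (cut, True, acc) if acc >= 0.5 else (cut, False, 1.0 - acc)

def select_best(features, labels):
    """Return (accuracy, index, cut, direction) of the best-looking feature."""
    fits = [(fit_rule(f, labels), j) for j, f in enumerate(features)]
    (cut, direction, acc), j = max(fits, key=lambda t: t[0][2])
    return acc, j, cut, direction

rng = random.Random(1)
n_cases, n_features = 40, 200
labels = [i % 2 for i in range(n_cases)]
# Pure-noise features: none carries any information about the labels.
features = [[rng.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_features)]

# Apparent accuracy: select and evaluate on the same 40 cases.
apparent, _, _, _ = select_best(features, labels)

# Holdout accuracy: select on the first 30 cases, evaluate on the other 10.
train, test = slice(0, 30), slice(30, 40)
_, j, cut, direction = select_best([f[train] for f in features], labels[train])
holdout = rule_accuracy(features[j][test], labels[test], cut, direction)
```

By construction the apparent accuracy can never fall below 0.5, and with many candidate features it typically sits well above chance even though the features are noise. The holdout accuracy carries no such upward bias, which is why the two estimates tend to diverge as the number of candidate features grows relative to the number of cases.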
For the polygraph, the number of cases used to develop and test models for the algorithms under review was sufficiently small that the apparent advantages of these data-mining approaches are difficult to realize. For the development of PolyScore, JHU-APL’s primary method of feature selection was a linear logistic regression model where “statistical significance” of the features was a primary aspect in the selection process. Harris (personal communication) claims that he and his colleagues primarily chose those features with higher occurrence rate across different iterations of model fitting (e.g., galvanic skin response). We were unable to determine the detailed algorithmic differences between the 3.0 and 5.1 logistic regression versions of PolyScore. For version 5.1, JHU-APL extracted a set of features from its feature space of 10,000 based on statistical significance and then checked their ability to classify by applying the estimated model to a random holdout test set involving 25 percent of the 1,488 cases in its database. This procedure yielded several good models with varying numbers of features, some subsets of others, some

overfitting, and some underfitting the data. Ultimately, JHU-APL claims to have chosen a model based on overall performance and not on the individual features themselves. There are natural concerns about claims for model selection and specification from 10,000 features using a database of only 1,488 cases, concerns that are only partially addressed by the random holdout validation strategy used by JHU-APL. None of the JHU-APL claims or statements has been directly verifiable because JHU-APL refused to make any details or documentation available to the committee, including the variables it ultimately chose for its algorithm. The only way to evaluate the performance of the algorithm is to apply it to a fresh set of data not used in any way in the model development and validation process and for which truth regarding deception is available from independent information.

Further Details on Statistical Modeling

In polygraph testing, the ultimate goal of classification is to assign individuals (cases) to classes in a way that minimizes the classification error (i.e., some combination of false positives and false negatives). As we noted above, CPS uses discriminant function analysis and PolyScore has algorithms based on logistic regression and neural networks. PolyScore’s logistic regression procedure can be thought of as having two parts (although the two are actually intertwined). First, the score is calculated as a linear combination of weighted features, with the weights estimated by maximum likelihood, for example:

$$\text{Score} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 \tag{6}$$

Table F-1 reports the values of the estimated logistic regression coefficients, or weights, for the five features presented by Harris et al. (1994). A positive sign for a weight indicates an increase in the probability of deception, while a negative sign denotes a decrease. The absolute value of a weight suggests something about the strength of the linear association with deception.
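As a minimal sketch of how a score of the form (6) could be turned into a decision, the code below applies the Table F-1 weights, the standard logistic transform, and the 0.95 and 0.05 cutoffs reported in this section. Several things here are assumptions rather than facts about PolyScore: the intercept (not reported in the source, set to zero), the dictionary keys, the scaling of the inputs, the label used for cases between the cutoffs, and the example feature values, which are invented for illustration.

```python
import math

# Weights from Table F-1 (PolyScore 3.0, as presented by Harris et al., 1994).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "gsr_range": 5.5095,
    "blood_volume_derivative_p75": 3.0643,
    "upper_respiration_p80": -2.5954,
    "pulse_line_length": -2.0866,
    "pulse_p55": 2.1633,
}
INTERCEPT = 0.0  # assumption: the source does not report an intercept


def score(features):
    """Linear combination of weighted features, as in equation (6)."""
    return INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def prob_deception(s):
    """Logistic transform of the score into an estimated probability."""
    return 1.0 / (1.0 + math.exp(-s))


def classify(p, upper=0.95, lower=0.05):
    """Apply the two cutoffs; cases in between are labeled inconclusive here."""
    if p > upper:
        return "deceptive"
    if p < lower:
        return "nondeceptive"
    return "inconclusive"


# Invented, already-standardized feature values for one examination.
example = {
    "gsr_range": 0.8,
    "blood_volume_derivative_p75": 0.2,
    "upper_respiration_p80": -0.1,
    "pulse_line_length": 0.0,
    "pulse_p55": 0.1,
}
p = prob_deception(score(example))
print(round(p, 4), classify(p))
```

Because the electrodermal weight is by far the largest, a moderate value on that single feature is enough in this sketch to push the estimated probability past the upper cutoff.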
These results agree with the general results of CPS, whose developers also report that skin conductance is the strongest measure and assign it the most weight, while the respiration measure is negatively correlated with deception. Second, one can estimate the probability of deception from the logistic regression:

$$\hat{P}(\text{deception}) = \frac{e^{\text{Score}}}{1 + e^{\text{Score}}} \tag{7}$$

and then choose the cutoffs for the estimated probabilities (7), with values above the upper cutoff being labeled as deceptive and those below the lower cutoff as nondeceptive. The currently used cutoffs are 0.95 and 0.05, respectively. Different methods can be used to produce the scoring equation (6), and there is a lack of clarity as to precisely what method was used for the final PolyScore algorithm.

TABLE F-1 Features Implemented in Version 3.0 of PolyScore with Their Estimated Coefficients

Features                                         Weights
x1  GSR Range                                    +5.5095
x2  Blood Volume Derivative 75th Percentile      +3.0643
x3  Upper Respiration 80th Percentile            -2.5954
x4  Pulse Line Length                            -2.0866
x5  Pulse 55th Percentile                        +2.1633

The CPS algorithm relies on the result of a multivariate discriminant analysis, which is known to be less robust than logistic regression with respect to departures from assumptions and which gives more weight to extreme cases in building a classifier. Kircher and Raskin (1988) report that they used all-possible-subsets regression analysis on the 12 feature differences of scores to choose the best model and retained the five features listed in Table F-2.

TABLE F-2 Features Implemented in CPS (reported by Kircher and Raskin, 1988) and Their Estimated Coefficients

Features                      Weights
x1  SC Amplitude              +0.77
x2  SC Full Recovery Time     +0.27
x3  EBF                       +0.28
x4  BP Amplitude              +0.22
x5  Respiration Length        -0.40

However, Kircher and Raskin’s (2002) most recent model relies on only three features: skin conductance amplitude, the amplitude of increases in the baseline of the cardiograph, and the respiration length. Kircher and Raskin’s discriminant analysis provided “optimal” maximum likelihood weights for these variables to be used in a classification equation of the form (6) to produce a score for each subject in the two

groups. Note that these coefficients are on a different scale from those of the PolyScore logistic regression model. They need to be converted into estimates of the probabilities of observing the scores given deception and nondeception by means of the normal probability density function. Kircher and Raskin allow these probability functions to have different variances:

$$\hat{f}(\text{Score} \mid ND) = \frac{1}{\hat{\sigma}_{ND}\sqrt{2\pi}} \exp\left(-\frac{(\text{Score} - \hat{\mu}_{ND})^2}{2\hat{\sigma}_{ND}^2}\right) \tag{8}$$

$$\hat{f}(\text{Score} \mid D) = \frac{1}{\hat{\sigma}_{D}\sqrt{2\pi}} \exp\left(-\frac{(\text{Score} - \hat{\mu}_{D})^2}{2\hat{\sigma}_{D}^2}\right) \tag{9}$$

where $\hat{\mu}_{ND}$ and $\hat{\sigma}_{ND}$ are the estimates of the mean and standard deviation, respectively, of the discriminant scores from the nondeceptive subjects, and $\hat{\mu}_{D}$ and $\hat{\sigma}_{D}$ are the estimates of the mean and standard deviation, respectively, of the discriminant scores from the deceptive subjects.3 Finally, one can convert these estimated values into estimated probabilities of deception through Bayes’ theorem:

$$\hat{P}(ND \mid \text{Score}) = \frac{P(ND)\,\hat{f}(\text{Score} \mid ND)}{P(ND)\,\hat{f}(\text{Score} \mid ND) + P(D)\,\hat{f}(\text{Score} \mid D)} \tag{10}$$

where P(ND) and P(D) are the prior probabilities of being nondeceptive (ND) and deceptive (D), respectively. Kircher and Raskin take these prior probabilities to be equal to 0.5. Despite the use of Bayes’ theorem in this final step, this is not a proper Bayesian approach to producing a classification rule. Kircher and Raskin (1988) report that if $\hat{P}(ND \mid \text{Score})$ based on three charts is greater than 0.70, they classify that person as nondeceptive, and if $\hat{P}(ND \mid \text{Score})$ is less than 0.30, the person is classified as deceptive. For those whose estimated probability is between these two cutoff points, they calculate a new discriminant score based on five charts, recalculate $\hat{P}(ND \mid \text{Score})$, and use the same cutoff points. At that point, they label the test for subjects whose scores fall between 0.30 and 0.70 as inconclusive.

Both PolyScore and CPS seem to rely on the presumption of equal base rates for deceptive and nondeceptive cases, and they have been “evaluated” on databases with roughly equal sized groups. The performance of the algorithms in new instances or with differently structured “populations” of examinees is conjectural, and appropriate prior probabilities and operational cutoff points for algorithms for use in security screening are unclear.

Algorithm Evaluation

We lack detailed information from the developers on independent evaluations of the PolyScore and CPS algorithms. We do have limited information on a type of cross-validation and a jackknife procedure used to evaluate PolyScore® 3.0, neither of which provides a truly independent assessment of algorithm performance in light of the repeated reanalyses of the same limited sets of cases. Kircher and Raskin (2002) report the results of 8 selected studies of the CPS algorithm, none involving more than 100 cases, and most of which are deeply flawed according to the criteria articulated in Chapter 4. Moreover, only one of the two field studies described includes comparative data for deceptive and nondeceptive individuals. They report false negative rates ranging from 0 to 14 percent, based on exclusion of inconclusives. If inconclusives are included as errors, the false negative rates range from 10 to 36 percent. Similarly, they report false positive rates ranging from 0 to 19 percent, based on exclusion of inconclusives. If inconclusives are included in the calculation of error rates, as for example in the calculation of ROC (receiver operating characteristic) curves, then the false positive rates range from 8 to 37 percent. It would be a mistake to treat these values as illustrative of the validity of the CPS computer scoring algorithm. Kircher and Raskin also list a ninth study (Dollins, Krapohl, and Dutton, 2000) that, as best we have been able to determine, is the only one that attempts independent algorithm evaluation. The values for false positive and false negative error rates that it reports appear to be highly exaggerated, however, because of the selection bias associated with the cases used.
Dollins and colleagues (Dollins, Krapohl, and Dutton, 2000) compared the performance of five different computer-based classification algorithms in late 1997: CPS, PolyScore, AXCON, Chart Analysis, and Identifi. Each developer was sent a set of 97 charts collected with Axciton instruments for “confirmed” criminal cases and used the versions of their software available at the time. Test formats included both ZCT and MGQT. None of the developers at the time of scoring knew the truth, confirmed by a confession or from indisputable corroborating evidence. An examination was labeled as nondeceptive if someone else confessed to the crime. The data contained 56 deceptive and 41 nondeceptive cases and came from a mix of federal and nonfederal agencies. All of the computer programs were able to read the Axciton proprietary format except the CPS program,

and Axciton Systems, Inc., provided the CPS developers with a text-formatted version of the data (see below). Dollins and associates (Dollins, Krapohl, and Dutton, 2000) report that there were no statistically significant differences in the classification powers of the algorithms. All programs agreed in correctly classifying 36 deceptive and 16 nondeceptive cases, and all incorrectly classified the same three nondeceptive cases, but there was not a single case that all algorithms scored as inconclusive. CPS had the greatest number of inconclusive cases and the least difference between the false positive and false negative rates. The four other algorithms all showed tendencies toward misclassifying a greater number of innocent subjects. The results, summarized in Table F-3, show false negative rates ranging from 10 to 27 percent and false positive rates of 31 to 46 percent (if inconclusives are included as incorrect decisions).

As Dollins and colleagues (Dollins, Krapohl, and Dutton, 2000) point out, there are a number of problems with their study. The most obvious is a sampling or selection bias associated with the cases chosen for evaluation. The data were submitted by various federal and nonfederal agencies to the DoDPI, and most of these cases were correctly classified by the original examiner and are supported by confessions. This database is therefore not representative of any standard populations of interest. If the analyzed cases correspond, as one might hypothesize given that they were “correctly” classified by the original examiner, to the easily classifiable tests, then one should expect all algorithms to do better on the test cases than in uncontrolled settings. Because all algorithms produce relatively high rates of inconclusive tests even in such favorable circumstances, performance with more difficult cases is likely to degrade.
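The error rates quoted above can be recomputed directly from the counts in Table F-3. The short sketch below does so under both conventions, excluding inconclusives from the denominator or counting them as errors; the data structure and function name are illustrative choices, not part of the original study.

```python
# Counts from Table F-3 as (correct, incorrect, inconclusive),
# for deceptive (n = 56) and nondeceptive (n = 41) cases.
TABLE_F3 = {
    "CPS":            {"deceptive": (41, 4, 11), "nondeceptive": (28, 3, 10)},
    "PolyScore":      {"deceptive": (49, 1, 6),  "nondeceptive": (26, 7, 8)},
    "AXCON":          {"deceptive": (50, 1, 5),  "nondeceptive": (24, 9, 8)},
    "Chart Analysis": {"deceptive": (49, 2, 5),  "nondeceptive": (22, 8, 11)},
    "Identifi":       {"deceptive": (49, 1, 6),  "nondeceptive": (22, 8, 11)},
}


def error_rate(counts, inconclusive_as_error):
    """Error rate for one group of cases under either convention."""
    correct, incorrect, inconclusive = counts
    if inconclusive_as_error:
        return (incorrect + inconclusive) / (correct + incorrect + inconclusive)
    return incorrect / (correct + incorrect)


for name, rows in TABLE_F3.items():
    # Errors on deceptive cases are false negatives; on nondeceptive cases, false positives.
    fn = error_rate(rows["deceptive"], inconclusive_as_error=True)
    fp = error_rate(rows["nondeceptive"], inconclusive_as_error=True)
    print(f"{name:15s} false negative {fn:.1%}  false positive {fp:.1%}")
```

Treating inconclusives as errors, the false negative rates run from roughly 11 percent (AXCON) to 27 percent (CPS) and the false positive rates from roughly 32 percent (CPS) to 46 percent (Chart Analysis and Identifi), in line with the ranges reported in the text.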
TABLE F-3 Number of Correct, Incorrect, and Inconclusive Decisions by Subject’s Truth

                  Deceptive (n = 56)                   Nondeceptive (n = 41)
Algorithm         Correct  Incorrect  Inconclusive     Correct  Incorrect  Inconclusive
CPS               41       4          11               28       3          10
PolyScore         49       1          6                26       7          8
AXCON             50       1          5                24       9          8
Chart Analysis    49       2          5                22       8          11
Identifi          49       1          6                22       8          11

SOURCE: Dollins, Krapohl, and Dutton (2000:239).

There was no control over the procedures that the algorithm developers used to classify these cases, and they might have used additional editing and manual

examination of the data, as well as modifications to the software for classification cutoffs. The instrumentation used was also a possible problem in this study, particularly for the CPS algorithm. Data were collected with the Axciton instrument, which records a hybrid of skin conductance and skin resistance. The CPS algorithm relies on true skin conductance and on data recorded with the Stoelting instrument. The CPS algorithm was unable to process the Axciton proprietary data and was provided with the text format, in which there was also a possibility of error in rounding the onsets of the questions, with a further negative effect on CPS performance. The other algorithms performed very similarly, which is not surprising because they were developed on data collected with Axciton instruments and in most cases with very similar databases.

IMPLICATIONS FOR TES

JHU-APL is currently working on a beta-test version of PolyScore 5.2 that has prototype algorithms for scoring screening test formats such as TES and relevant/irrelevant formats. The current version of the TES-format algorithm uses the same features as the ZCT/MGQT-format algorithm, but this may change. Polygraph examiners review each chart in a TES separately; PolyScore analyzes them together. We are not aware of other scoring algorithms for the TES format. Table F-4 reports very preliminary results of the TES algorithm provided to us by JHU-APL. The current difficulty in developing this algorithm is the overall small number of deceptive cases. As a result, the developers are giving up power to detect deception (that is, keeping the sensitivity of the test at lower levels) in order to keep the false positive rates lower, in effect changing the base rate assumptions. These data indicate that sensitivity of 70 percent may be attained in conjunction with 99 percent specificity (1 percent false positive rate).
JHU-APL believes these numbers can be improved. Of about 2,100 cases, one-third have been used strictly for training, one-third for training and testing, and one-third have been withheld for independent validation, a step that has not yet occurred. A major problem with this database is independent determination of truth.

TABLE F-4 Preliminary TES Results

Type of Analysis   Total Number   Inc   Corr   TN    FP   FN   TP
Binary^a           716            0     707    692   4    5    15
Ternary            524            192   520    510   3    1    10

NOTES: Inc, inconclusive; Corr, correct; TN, true negative; FP, false positive; FN, false negative; TP, true positive.
^a Inconclusives forced to deceptive, nondeceptive.

SUMMARY

The PolyScore and CPS computerized scoring algorithms take the digitized polygraph signals as inputs and produce estimated probabilities of deception as outputs. They both assume, a priori, equal probabilities of being truthful and deceptive. PolyScore was developed on real criminal cases, and the Computer Assisted Polygraph System (CAPS) (the precursor to CPS) was developed on mock crimes. CAPS truth came solely from independent blind evaluations, while PolyScore relied on a mix of blind evaluations and confessions. The more recent CPS versions seem to rely on actual criminal cases as well, although we have no details.

Both algorithms do some initial data transformation of the raw signals. CPS keeps these to a minimum and tries to retain as much of the raw signal as possible. PolyScore uses more initial data editing tools such as detrending, filtering, and baselining. PolyScore and CPS standardize signals, using different procedures and on different levels. They extract different features, and they seem to use different criteria to find where the maximal amounts of discriminatory information lie. Both, however, give the most weight to the electrodermal channel. PolyScore combines all three charts into one single examination record and considers reactivities across all possible pairs of control and relevant questions. CAPS compares adjacent control and relevant questions as is done in manual scoring, but it also uses the difference of averaged standardized responses on the control and relevant questions to discriminate between guilty and nonguilty people.
CPS does not have an automatic procedure for the detection of artifacts, but it allows examiners to edit the charts themselves before the algorithm calculates the probability of truthfulness. PolyScore has algorithms for artifact and outlier detection and removal, but JHU-APL treats the specific details as proprietary and will not reveal them. While PolyScore uses logistic regression or neural networks to estimate the probability of deception from an examination, CPS uses standard discriminant analysis and a naïve Bayesian probability calculation to estimate the probability of deception.4 Overall, the PolyScore developers claim that it does as well as experienced examiners on detecting deceptive subjects and better on detecting truthful subjects. The CPS developers claim that it performs as well as experienced evaluators and equally well on detection of both deceptive and nondeceptive people. Computerized systems clearly have the potential to reduce the variability that comes from bias

and inexperience of examiners and chart interpreters, but the evidence that they have achieved this potential is meager. Porges and colleagues (1996) evaluated PolyScore and critiqued the methodology it used as unscientific and flawed. Notwithstanding the adversarial tone taken by Porges and colleagues, many of the flaws they identified apply equally to CPS, such as the lack of adequate evaluation.5 Dollins and associates (Dollins, Krapohl, and Dutton, 2000) compared the performance of these two algorithms with three other algorithms on an independent set of 97 selected confirmed criminal cases. CPS performed equally well on detection of both innocent and guilty subjects, while the other algorithms were better at detecting deceptive examinees than clearing nondeceptive ones. Unfortunately, the method of selecting these cases makes it difficult to interpret the reported rates of misclassification.

One could argue that computerized algorithms should be able to analyze the data better than human scorers because they incorporate potentially useful analytic steps that are difficult even for trained human scorers to perform (e.g., filtering and other transformations, calculation of signal derivatives), look at more information, and do not restrict comparisons to adjacent questions. Moreover, computer systems never get careless or tired. The success of both numerical and computerized systems, however, still depends heavily on the pretest phase of the examination. How well examiners formulate the questions inevitably affects the quality of information recorded.

JHU-APL is currently working on PolyScore algorithms for scoring the screening data coming from TES and relevant/irrelevant tests. An a priori base rate might be introduced in these algorithms to increase accuracy and to account for the low number of deceptive cases.
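To illustrate why the choice of base rate matters, the following minimal sketch implements a CPS-style calculation: normal densities for the discriminant score in each group, Bayes' theorem with a stated prior, and the 0.70 and 0.30 cutoffs reported by Kircher and Raskin (1988). The group means and standard deviations, the sign convention of the score, and the example score are invented for illustration; they are not estimates from any CPS data set.

```python
import math


def normal_pdf(x, mean, sd):
    """Normal probability density, as used in equations (8) and (9)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_nondeceptive(score, mean_nd, sd_nd, mean_d, sd_d, prior_nd=0.5):
    """Bayes' theorem, as in equation (10), with an explicit prior P(ND)."""
    f_nd = normal_pdf(score, mean_nd, sd_nd)
    f_d = normal_pdf(score, mean_d, sd_d)
    numerator = prior_nd * f_nd
    return numerator / (numerator + (1.0 - prior_nd) * f_d)


def classify(p_nd, upper=0.70, lower=0.30):
    """Cutoffs on the estimated probability of being nondeceptive."""
    if p_nd > upper:
        return "nondeceptive"
    if p_nd < lower:
        return "deceptive"
    return "inconclusive"


# Hypothetical group parameters: nondeceptive scores centered at +1, deceptive at -1.
params = dict(mean_nd=1.0, sd_nd=1.0, mean_d=-1.0, sd_d=1.0)

for prior_nd in (0.5, 0.99):
    p = posterior_nondeceptive(-1.0, prior_nd=prior_nd, **params)
    print(prior_nd, round(p, 3), classify(p))
```

In this sketch the same discriminant score is classified as deceptive under equal priors but as nondeceptive when 99 percent of examinees are assumed to be truthful, which is the kind of shift a screening population would introduce.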
There has yet to be a proper independent evaluation of computer scoring algorithms on a suitably selected set of cases, for either specific incidents or security screening, which would allow one to accurately assess the validity and accuracy of these algorithms.

NOTES

1. Some computerized systems store biographical information such as examinee’s name, social security number, age, sex, education, ethnicity, marital status, subject’s health, use of drugs, alcohol, and prior polygraph history (e.g., see www.stoelting.com), but it is unclear how this type of information would be appropriately used to improve the diagnostic accuracy of a computer scoring system.

2. Matte (1996) and Kircher and Raskin (2002) provide more details on the actual polygraph instruments and hardware issues and some of the history of the development of computerized algorithms.

3. Under the assumption of unequal variance for the two groups, which Kircher and

Raskin say they are using, a more statistically accepted procedure is to calculate a score using a quadratic discriminant function.

4. A proper Bayesian calculation would be far more elaborate and might produce markedly different results.

5. The distinctions made regarding the logistic regression and discriminant analysis methods used by the two systems are not especially cogent for present purposes.

REFERENCES

Breiman, L.
2001 Statistical modeling: The two cultures (with discussion). Statistical Science 16:199-231.

Brownley, K.A., B.E. Hurwitz, and N. Schneiderman
2000 Cardiovascular psychophysiology. Chapter 9, pp. 224-264, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.

Copas, J.B., and P. Corbett
2002 Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika 89:315-331.

Dawson, M., A.M. Schell, and D.L. Filion
2000 The electrodermal system. Chapter 8, pp. 200-223, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.

Dollins, A.B., D.J. Krapohl, and D.W. Dutton
2000 A comparison of computer programs designed to evaluate psychophysiological detection of deception examinations: Bakeoff 1. Polygraph 29(3):237-257.

Gratton, G.
2000 Biosignal processing. Chapter 33, pp. 900-923, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.

Harris, J.
1996 Real Crime Validation of the PolyScore® 3.0 Zone Comparison Scoring Algorithm. Unpublished paper. The Johns Hopkins University Applied Physics Laboratory.

Harris, J., et al.
1994 Polygraph Automated Scoring System. U.S. Patent Document. Patent Number: 5,327,899.

Harver, A., and T.S. Lorig
2000 Respiration. Chapter 10, pp. 265-293, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.

Hastie, T., R. Tibshirani, and J. Friedman
2001 The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag.

Kircher, J.C., and D.C. Raskin
1988 Human versus computerized evaluations of polygraph data in a laboratory setting. Journal of Applied Psychology 73:291-302.
2002 Computer methods for the psychophysiological detection of deception. Chapter 11, pp. 287-326, in Handbook of Polygraph Testing, M. Kleiner, ed. London: Academic Press.

Matte, J.A.
1996 Forensic Psychophysiology Using Polygraph–Scientific Truth Verification Lie Detection. Williamsville, NY: J.A.M. Publications.

Olsen, D.E., J.C. Harris, M.H. Capps, and N. Ansley
1997 Computerized Polygraph Scoring System. Journal of Forensic Sciences 42(1):61-71.

Podlesny, J.A., and J.C. Kircher
1999 The Finapres (volume clamp) recording method in psychophysiological detection of deception examinations: Experimental comparison with the cardiograph method. Forensic Science Communications 1(3):1-17.

Porges, S.W., R.A. Johnson, J.C. Kircher, and R.A. Stern
1996 Unpublished Report of Peer Review of Johns Hopkins University/Applied Physics Laboratory to the Central Intelligence Agency.