The Polygraph and Lie Detection

2
Validity and Its Measurement

In this chapter we first define some terms needed to clarify what our study did and did not cover. We then discuss concepts of validity and the empirical measurement of the accuracy of polygraph testing. We discuss methods for measuring accuracy and present our rationale for our chosen method of measurement. We conclude by discussing two difficult issues in assessing polygraph validity: (1) distinguishing the validity of the polygraph as an indicator of deception from its utility for such purposes as deterring security threats and eliciting admissions, and (2) defining the appropriate baseline against which to draw inferences about accuracy.

RELIABILITY, ACCURACY, AND VALIDITY

Psychophysiological testing, like all diagnostic activities, involves using specific observations to ascertain underlying, less readily observable, characteristics. Polygraph testing, for example, is used as a direct measure of physiological responses and as an indirect indicator of whether an examinee is telling the truth. Claims about the quantity or attribute being measured are scientifically justified to the degree that the measures are reliable and valid with respect to the target quantities or attributes.

Reliability

The term reliability is generally used to indicate repeatability across different times, places, subjects, and experimental conditions. Test-retest



reliability is the extent to which the same measurement procedure (with the polygraph, this includes the examiner, the test format, and the equipment) used to examine the same subject for the same purpose yields the same result on repetition.1 Inter-rater reliability is the extent to which different examiners would draw the same conclusions about a given subject at a given time for a given examination. In practice and in the literature we have considered, discussions of inter-rater reliability have focused almost exclusively on the repeatability of chart scoring across human or computer raters. Inter-rater reliability has been a critical issue in some celebrated practical uses of the polygraph. (Appendix C describes the use of the polygraph in investigations of Wen Ho Lee for espionage or other security violations; part of the story concerns differing interpretations of the results of a 1998 polygraph ordered by the U.S. Department of Energy.)

There is also potentially large variability in ways an examination is conducted: which questions are asked, how they are asked, and the general atmosphere of the examination. This variability can in principle seriously threaten test-retest reliability to the extent that polygraph examiners have latitude in asking questions.2

Reliability across examinees is another important component of overall test reliability. For example, two examinees may have engaged in the same behaviors and may give the same answers to the same test questions, but due to different interpretations of a question, may have differing beliefs about the truthfulness of their responses and so produce different polygraph readings.

Internal consistency is another aspect of reliability. For example, a polygraph test may be judged to indicate deception mainly because of a strong physiological response to a single relevant question. If the examinee shows similar responses to other relevant questions about the same event or piece of information, the test is internally consistent.

Reliability is usually defined as a property of a measure as used on a particular population of people or events being measured. If the polygraph is to be applied in standard ways across a range of people and situations, it is desirable that measures be reliable across the range of people and situations being measured—whether subjects and examiners are calm or nervous, alert or sleepy, relaxed or under time pressure, male or female, from the same or different cultural backgrounds, in the laboratory or in the field, etc.

Accuracy and Validity

Scientific inference requires measures that exhibit strong reliability. However, a highly reliable test has little use if it is measuring something
different from its intended target. A measurement process is considered valid if it measures what it is supposed to measure. As with reliability, there are several aspects to validity. It is particularly important for the committee’s work to distinguish between the empirical concept of criterion validity, or accuracy, and the theoretical concept of construct validity.

Criterion Validity (Accuracy)

Criterion validity refers to how well a measure, such as the classification of polygraph test results as indicating deception or nondeception, matches a phenomenon that the test is intended to capture, such as the actual deceptiveness or truthfulness of examinees on the relevant questions in the test. When the test precedes the criterion event, the term predictive validity is used; criterion validity is the more general term that applies even when the criterion event precedes the test, as it normally does with the polygraph. The term “accuracy” is often used as a nontechnical synonym for criterion validity, and it is used in that way in this report. Polygraph accuracy is the extent to which test results correspond to truth with actual examinees. The proportion of correct judgments made by a polygraph examiner is a commonly used measure of accuracy for the polygraph test. (We discuss the shortcomings of this measure of accuracy and propose a more appropriate one below.)

Individual polygraph validation studies typically include accuracy measures that apply to the specific population that was tested. Evidence of accuracy becomes more general to the extent that test results are strongly and distinctively associated with truthfulness or deception in a variety of populations. Populations of interest include those containing high proportions of individuals who can be presumed to be deceptive on the critical questions (e.g., criminal suspects); those with low proportions of such people (e.g., nuclear scientists, intelligence agents); special populations that may be likely to show false negative results (e.g., people who want to deceive the examiner and who use countermeasures to try to “beat” the test); and populations that may be likely to show false positive results (e.g., truthful people who are highly anxious about the test). The same is true for test situations. Evidence of accuracy becomes more general as test results correspond with actual truthfulness or deceptiveness across situations (e.g., in criminal investigations, in employee security screening, and so forth). It is possible for a test such as the polygraph to be more accurate in some situations (e.g., criminal investigations) than in others (e.g., employee screening).

Construct Validity

Accuracy, or criterion validity, is essential for the overall validity of a test: no test that lacks it can be accepted as valid. However, it is not sufficient: additional evidence of validity is needed to give confidence that the test will work well with kinds of examinees and in examination settings that have not yet been tested. Thus, another critical element of validity is the presence of a theory of how and why the test works and of evidence supporting that theory. Construct validity refers to how well explanatory theories and concepts account for performance of a test. Users can have greater confidence in a test when evidence of its accuracy is supported by evidence of construct validity, that is, when there is a chain of plausible mechanisms that explain the empirical findings and evidence that each mechanism operates as the theory prescribes.

In the case of lie detection by polygraph, one theory invokes the following presumed chain of mechanisms. Lying leads to psychological arousal, which in turn creates physiological arousal. The polygraph measures physiological responses that correspond to this arousal: galvanic skin response, respiration, heart rate, and relative blood pressure. The measurements taken by the polygraph machine are processed, combined, and then scored to compute an overall index, which is used to make a judgment about the examinee’s truthfulness. The validity of psychophysiological detection of deception by the polygraph depends on validity all along this chain. Important threats to construct validity for this theory come from the fact that the physiological correlates of psychological arousal vary considerably across individuals, from the lack of scientific evidence to support the claim that deception has a consistent psychological significance for all individuals, and from the fact that psychological arousal is associated with states other than deception. We discuss these issues further in Chapter 3.

As just noted, evidence supporting the construct validity of the test is important to give confidence in its validity in settings where criterion validity has not yet been established. It is also important for refining theory and practice over time: according to the theory mentioned, better measures of psychological arousal should make a more valid test. And it is important for anticipating and defeating countermeasures: knowing the strengths and weaknesses of the theory tells practitioners which possible countermeasures to the test are likely to fail and which ones to worry about.

The strongest scientific basis for a test’s validity comes from evidence of both criterion validity and construct validity. Nevertheless, it may be possible to demonstrate that an appropriately selected set of physiological measures has sufficient accuracy in certain settings to have practical
value in those settings, despite lack of strong support for the underlying theory and even in spite of threats to construct validity.

A useful analogy for understanding the issues of reliability, accuracy, and validity is the use of X-ray equipment in airport security screening. The X-ray examination is reliable if the same items are detected on repeated passes of a piece of luggage through the detection machine (test-retest reliability), if the same items are detected by different operators looking at the same image (inter-rater reliability), and if the same items are detected when the test is conducted in different ways, for example, by turning the luggage on different sides (internal consistency). The examination is accurate at detection if, in a series of tests, the X-ray image allows the examiner to correctly identify both the dangerous objects that are the targets of screening and the innocuous objects. Confidence in the validity of the test is further increased by evidence supporting the theory of X-ray screening, which includes an understanding of how the properties of various materials are registered in X-ray images. Such an understanding would increase confidence that the X-ray machine could detect not only ordinary dangerous objects, but also objects that might be concealed or altered in particular ways to avoid detection—including ways that have not yet been used in any test runs with the equipment.

For X-ray detection, as for the polygraph, reliability and validity depend both on the measuring equipment and on the capabilities and training of the operators. Validity depends on the ability of the equipment and the operators to identify target objects or conditions even when they appear in unusual ways or when efforts have been made to make them less detectable. Successful countermeasures to X-ray detection would diminish the validity of the screening. It is important to note that successful countermeasures would only decrease the test’s accuracy if they were used frequently in particular trial runs—accuracy might look quite impressive if such countermeasures had not yet been tested. This is one reason that evidence of accuracy, though necessary, is not sufficient to demonstrate test validity. X-ray screening is not presumed to have perfect validity: this is why objects deemed suspicious by X-rays are checked by direct inspection, thus reducing the number of false positive results on the X-ray examination. There is no corrective, however, for false negative X-ray results that allow dangerous objects on an aircraft.

Measuring Accuracy

Because of the many elements that contribute to construct validity, it is difficult to represent the construct validity of a test by any single numerical indicator. This section therefore focuses on criterion validity, or accuracy, which can be measured on a single scale.

To measure criterion validity, it is necessary to have a clearly defined criterion. The appropriate criterion depends on whether the polygraph is being used for event-specific investigation, employee screening, or preemployment screening.

For event-specific investigation, the polygraph is intended to measure the examinee’s truthfulness about a specific incident. The accuracy of the polygraph test is the correspondence of the test outcome with actual truthfulness, which in this context is easy to define (although not necessarily to ascertain). Thus, measurement of accuracy in the specific-event case is straightforward in principle. It can be difficult in practice, however, if there is no way of independently determining what actually occurred.

Measuring accuracy in the employee screening polygraph setting raises more difficult issues. The Test of Espionage and Sabotage (TES) polygraph examination commonly used for screening at the U.S. Department of Energy weapons laboratories is intended to test whether an individual has committed espionage, engaged in sabotage, provided classified information to an unauthorized person, or had unauthorized contact with a foreign national. The examination asks whether the examinee intends to answer the security questions truthfully and whether he or she has engaged in any of the target behaviors. Accuracy of this screening polygraph might be defined as the extent to which the polygraph scoring corresponds to actual truthfulness of responses to these target questions. It might also be defined for a multi-issue polygraph screening test as the extent to which the test results correctly identify which of the target behaviors an examinee may have engaged in.

These seem straightforward criteria at first glance. However, there often is a large class of events that may be relevant to the examination, and it may not be clear to the examinee which of these is intended to be covered. For example, if asked whether one has ever provided classified information to an unauthorized person, one employee might have an emotional reaction brought on by remembering an incident in which he or she failed to properly wrap a classified report for a one-minute trip outside a secured area. Another employee might not have such a reaction. Such an event is a security violation, but individuals may differ about how serious it is and how relevant it is to the test question. The U.S. Department of Energy (DOE) has developed guidelines regarding the behaviors that are and are not covered by TES questions, which probably resolve many ambiguities for examinees (a detailed description of how the terms espionage and sabotage are explained to examinees in research uses of the TES appears in Dollins [1997]). However, there appear to be ambiguous, even inconsistent definitions of the target of the TES for examiners. Agency officials repeatedly told the committee that the counterintelligence program at DOE is intended to identify serious
breaches of security, not minor security infractions (such as leaving a secure computer on when leaving one’s office briefly or what examiners call “pillow talk”). Yet, we were also told that all examinees who showed “significant response” results, requiring additional charts or repeat tests, were “cleared” after admitting such minor infractions. We were told that there were 85 such cases among the first 2,000 tested in the DOE polygraph security screening program. Under the assumption that the TES is intended to find serious problems, these 85 are false positives—tests that give positive results even though the target violations did not occur (assuming, of course, that there were no unadmitted major infractions). However, in discussions with the committee, DOE polygraph examiners seemed to indicate that an instance of “pillow talk” revealed in response to follow-up questions triggered by a polygraph chart indicating “significant response” was regarded as a true positive, suggesting that the target of the screening was any security infraction, regardless of severity. Under this broader target, the same minor infraction in an individual who showed “no significant response” should be regarded as a false negative, whereas the DOE polygraph examiners seemed to indicate that it would be counted as a true negative, suggesting a switch to the narrower definition of target.

Assessing the polygraph’s accuracy for screening cannot be done without agreement on the criterion—what it is supposed to be accurate about. The committee has seen no indication of a clear and stable agreement on what the criterion is, either in general or within any particular organization that uses polygraph screening. In addition to an agreed definition of the criterion, an appropriate point of comparison is necessary to assess accuracy.
Some representatives of the DOE polygraph screening program believe that the program is highly accurate because all 85 employees whose polygraphs indicated deception eventually admitted to a minor security infraction. If detecting minor security violations is the target of a security polygraph screening test, then these 85 are all true positives and there are no false positives. However, the significance of these admissions for accuracy cannot be evaluated in the absence of data from an appropriate comparison group. Such a group might consist of examinees who were interrogated as if the polygraph test indicated deception, even though it did not. We have been told on numerous occasions that almost everyone who has held a security clearance has committed at least one minor security infraction. If this is true, the suggested interrogation of a comparison group whose polygraph tests did not indicate deception might have uncovered a large number of minor infractions that the polygraph did not detect. Such members of the comparison group would be false negatives. Thus, the high accuracy suggested by the lack of false positives would be undercut by the presence of perhaps many false negatives.

All these considerations make it obvious that evaluating the accuracy of the employee screening polygraph is a nontrivial task. It requires more care in defining the criterion than is evident in current practice; it also requires great care in analyzing the evidence.

When the polygraph is used for preemployment screening, defining and measuring accuracy poses additional challenges. In this setting, the polygraph test is being used, in effect, to predict particular aspects of future job performance, such as the likelihood that the examinee, if employed, will commit security violations in the future.3 As is the case for employee screening, defining accuracy requires a clear statement of which specific aspects of future job performance constitute the appropriate criterion. Given such a statement, one way to measure the accuracy of a preemployment polygraph test would be to compare those aspects of job performance among people who are scored as deceptive with the same aspects of performance for people who are scored as nondeceptive. This is impractical if people who score as deceptive are not hired and therefore do not get the chance to demonstrate their job performance. It would be practical, however, to compare the job performance of employees whose scores on the preemployment polygraph varied across the range of scores observed among those hired. In particular, it would be useful to examine the extent to which a person’s score on a preemployment screening polygraph correlated with later instances of target behaviors, such as security violations, that came to the attention of management. We know of no such studies.
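The correlational study design just described can be sketched in a few lines. The data below are entirely hypothetical (the committee knows of no such studies); the sketch only illustrates what the proposed analysis would compute: the correlation between preemployment polygraph scores among those hired and later observed target behaviors.

```python
# Hypothetical sketch of the proposed preemployment-screening study:
# correlate polygraph scores at hire (higher = stronger indication of
# deception) with whether a security violation later came to the
# attention of management (1 = yes, 0 = no). All numbers are invented.

def pearson_r(xs, ys):
    """Pearson correlation; with a binary outcome this is the point-biserial r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

scores = [0.2, 0.5, 1.1, 1.4, 2.0, 2.3, 2.9, 3.5]   # scores among those hired
later_violations = [0, 0, 0, 1, 0, 1, 1, 1]          # later target behaviors

r = pearson_r(scores, later_violations)
print(f"score-violation correlation: r = {r:.2f}")
```

A positive correlation on data like these would be evidence that the preemployment test carries predictive information about the chosen criterion; a correlation near zero would not.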
Another difficulty in measuring the accuracy of preemployment polygraph tests is that adverse personnel decisions made on the basis of preemployment polygraph examinations are not necessarily due to readings on the polygraph chart.4 For instance, we were told at the FBI that applicants might be rejected for employment for any of the following reasons: they make admissions during the polygraph examination that specifically exclude them from eligibility for employment (e.g., admitting a felony); they provide information during the polygraph interview that is not itself a bar to employment but that leads the applicant to be judged deceptive (e.g., admitting past activities that were not disclosed on the job application); their behavior during the polygraph interview leads to the conclusion that they are trying to evade detection (e.g., the examiner concludes that the applicant is using countermeasures); or
the scoring of the polygraph chart supports an assessment that the applicant is deceptive.

Only the last of these reasons is unambiguously a function of the physiological responses measured by the polygraph.5 For the other reasons, the chart itself is only one input to the decision-making process. The relative importance of physiological responses, interrogation technique, and astute observation by an examiner is difficult to determine and is rarely explored in research. These distinctions may not be considered important for judging the usefulness or utility of polygraph examinations as screening tools, but they are critical if the personnel decisions made on the basis of the polygraph examination are to be used for measuring accuracy.

There are difficulties with using polygraphs (or other tests) for preemployment screening that go beyond accuracy. Perhaps most critical, it is necessary to make inferences about future behavior on the basis of polygraph evidence about past behaviors that may be quite different in kind. The construct validity of such inferences depends on specifying and testing a plausible theory that links evidence of past behavior, such as illegal drug use, to future behavior of a different kind, such as revealing classified information. We have not found either any explicit statement of a plausible theory of this sort in the polygraph literature or any appropriate evidence of construct validity.

A CONSISTENT APPROACH TO MEASURING ACCURACY

For choosing appropriate measures of accuracy it is helpful to consider the polygraph as a diagnostic test of truthfulness or deception and the criterion as consisting of independent indicators of what actually occurred. In this respect, the polygraph is similar to other diagnostic tests; the scientific work that has gone into measuring the accuracy of such tests can be applied to measuring the accuracy of the polygraph. This section draws on this scientific work and explains the measure of accuracy we have chosen for this study. It introduces a number of technical terms that are needed for understanding our measure of accuracy.

Diagnostic tests generally result in a binary judgment—yes or no—concerning whether or not some condition is present. The tests themselves, however, usually give more than two values. For example, cholesterol tests give a range of values that are typically collapsed into two or three categories for purposes of medical decision: high risk, justifying medical intervention; low risk, leading to no intervention; and an intermediate category, justifying watchful waiting or low-risk changes in diet and life-style, but not medical intervention. Polygraph tests similarly
give a range of values that are typically collapsed into a few categories for decision purposes, such as “significant response,” “no significant response,” and an intermediate category called “inconclusive.”

There are two distinct aspects to accuracy. One is sensitivity. A perfectly sensitive indicator of deception is one that shows positive whenever deception is in fact present: it is a test that gives a positive result for all the positive (deceptive) cases; that is, it produces no false negative results. The greater the proportion of deceptive examinees that appear as deceptive in the test, the more sensitive the test. Thus, a test that shows negative when an examinee who is being deceptive uses certain countermeasures is not sensitive to deception.

The other aspect of accuracy is specificity. An indicator that is perfectly specific to deception is one that always shows negative when deception is absent (is positive only when deception is present). It produces no false positive results. The greater the proportion of truthful examinees who appear truthful on the test, the more specific the test. Thus, a test that shows positive when a truthful examinee is highly anxious because of a fear of being falsely accused is not specific to deception because it also indicates fear.

Box 2-1 gives precise definitions of sensitivity, specificity, and other key terms relevant to measuring the accuracy of polygraph testing. It also shows the quantitative relationships among the terms. The false positive index (FPI) and the positive predictive value (PPV) are two closely related measures of test performance that are critical to polygraph screening decisions.6 The FPI is the ratio of false positives to true positives and thus indicates how many innocent examinees will be falsely implicated for each spy, terrorist, or other major security threat correctly identified. The PPV gives the probability that an individual with a deceptive polygraph result is in fact being deceptive. The two are inversely related: PPV = 1/(1 + FPI); the lower the PPV, the higher the FPI.

Much research on diagnostic accuracy draws on a general theory of signal detection that treats the discrimination between signals and noise. Signals are “positive” conditions—the polygraph test readings of respondents who are being deceptive, for example. Noise is any “negative” event that may mimic and be difficult to distinguish from a signal—such as the polygraph test readings of respondents who are not being deceptive (Peterson, Birdsall, and Fox, 1954; Green and Swets, 1966). Developed for radar and sonar devices during and following World War II, signal detection theory has since been applied extensively in clinical medicine (now upward of 1,000 articles per year) and also in nondestructive testing, information retrieval, aptitude testing, weather forecasting, cockpit warning systems, product inspection, survey research, clinical psychology, and other settings (see Swets, 1996). In the model of diagnosis that is provided by the theory, a diagnosis

BOX 2-1
Terms Relevant to Measuring the Accuracy of Polygraph Testing

The table below shows the four possible combinations of actual truthfulness and polygraph test results. The text under the table defines terms that are used to describe the quantitative relationships among these outcomes.

                                          True Condition
  Test Result                   Positive                  Negative                  Total
                                (truly deceptive)         (truly truthful)
  Positive (testing deceptive)  a (true positive)         b (false positive)        a + b
  Negative (testing truthful)   c (false negative)        d (true negative)         c + d
  Total (n)                     a + c                     b + d                     a + b + c + d

Sensitivity—The proportion of truly positive (deceptive) cases that give positive results on the test (a/[a + c]). This is also known as the conditional probability of a true-positive test or the true-positive proportion.

False negative probability—The proportion of truly positive cases that give negative results on the test (c/[a + c]). This quantity is the conditional probability of a false-negative test and is the complement of sensitivity (that is, the difference between sensitivity and 100 percent).

Specificity—The proportion of truly negative (truthful) cases that give negative results on the test (d/[b + d]). This quantity is also known as the conditional probability of a true-negative test.

False positive probability—The proportion of truly negative cases that give positive results on the test (b/[b + d]). This quantity is the conditional probability of a false-positive test and is the complement of specificity.

Three terms use test results as a reference point and reveal how well the test results indicate the true conditions (see text for further discussion).

Positive predictive value—The predictive value of a positive test, that is, the percentage of positive tests that are correct (a/[a + b]).

Negative predictive value—The predictive value of a negative test, that is, the percentage of negative tests that are correct (d/[c + d]).

False positive index—Number of false positives for each true positive (b/a). This is another way of conveying the information described by positive predictive value, in order to make clearer the tradeoffs between false positives and true positives.
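As a concrete illustration, the Box 2-1 quantities can be computed from a single 2 x 2 table. The counts below are invented for the sketch (they are not from any polygraph study); the point is the arithmetic relationship among the terms, including the PPV = 1/(1 + FPI) identity noted in the text.

```python
# Box 2-1 quantities for an invented 2x2 table of test results vs. truth.
# Counts a, b, c, d are hypothetical; only the formulas come from Box 2-1.

a, b = 8, 40     # true positives, false positives
c, d = 2, 950    # false negatives, true negatives

sensitivity = a / (a + c)                  # true-positive proportion: 0.80
specificity = d / (b + d)                  # true-negative proportion
false_negative_prob = c / (a + c)          # complement of sensitivity
false_positive_prob = b / (b + d)          # complement of specificity
ppv = a / (a + b)                          # positive predictive value
npv = d / (c + d)                          # negative predictive value
fpi = b / a                                # false positives per true positive: 5.0

# PPV and FPI carry the same information: PPV = 1 / (1 + FPI).
assert abs(ppv - 1 / (1 + fpi)) < 1e-12

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} FPI={fpi:.1f}")
```

Note how, with truly deceptive cases rare in this hypothetical population (10 of 1,000), even a test with high sensitivity and specificity produces five false positives for every true positive.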

U.S. armed services, for example, the introduction of random and frequent drug testing has been associated with lower levels of drug use. Deterrence effects depend on beliefs about the polygraph, which are logically distinct from the validity of the polygraph. The deterrent value of polygraph testing is likely to be greater for individuals who believe in its validity for detecting deception than for those who do not.

It is worth noting that deterrence has costs as well as benefits for an organization that uses polygraph testing. The threat of polygraph testing may lead desirable job candidates to forgo applying or good employees to resign for fear of suffering the consequences of a false positive polygraph result. The more accurate people believe the test to be—independent of its actual validity—the greater the benefits of deterrence relative to the costs. This is because a test that is believed to be highly accurate in discriminating deception from truthfulness will be more deterring to people whose actions might require deception and more reassuring to others who would be truthful than a test that is believed to be only moderately accurate.

It is also worth emphasizing that validity and utility for deterrence, while logically separable, are related in practice. The utility of the polygraph depends on the beliefs about validity and about how results will be used among those who may be subject to testing. Utility increases to the extent that people believe the polygraph is a valid measure of deception and that deceptive readings will have severe negative consequences. To the extent people hold these beliefs, they are deterred from engaging in behaviors they believe the polygraph might detect. If people came to have an equal or greater level of faith in some other technique for the physiological detection of deception, it would acquire a deterrent value equal to or greater than that now pertaining to polygraph testing.
Eliciting Admissions and Confessions

Polygraph testing is used to facilitate interrogation (Davis, 1961). Polygraph proponents believe that individuals are more likely to disclose information about behaviors that will lead to their punishment or loss of a valued outcome if they believe that any attempts to conceal the information will fail. As part of the polygraph pretest interview, examinees are encouraged to disclose any such information so that they will “pass” the examination. It can be important to security organizations to have their employees admit to past or current transgressions that might not disqualify them from employment but that might be used against them, for example, by an enemy who might use the threat of reporting the transgression to blackmail the employee into spying. Anecdotes suggest that the polygraph context is effective for securing such admissions. As reported by the U.S. Department of Defense (DoD) Polygraph Program (2000:4 of 14) on the cases in which significant information was uncovered during DoD counterintelligence-scope polygraph examinations covered in the report:

It should be noted that all these individuals had been interviewed previously by security professionals and investigated by other means without any discovery of the information obtained by the polygraph examination procedure. In most cases, the information was elicited from the subject in discussion with the examiner [italics added].

There is no scientific evidence on the ability of the polygraph to elicit admissions and confessions in the field. However, anecdotal reports of the ability of the polygraph to elicit confessions are consistent with research on the “bogus pipeline” technique (Jones and Sigall, 1971; Quigley-Fernandez and Tedeschi, 1978; Tourangeau, Smith, and Rasinski, 1997). In bogus pipeline experiments, examinees are connected to a series of wires that are in turn connected to a machine that is described as a lie detector but that is in fact nonfunctional. The examinees are more likely to admit embarrassing beliefs and facts than similar examinees not connected to the bogus lie detector. For example, in one study in which student research subjects were given information in advance on how to respond to a classroom test, 13 of 20 (65 percent) admitted receiving this information when connected to the bogus pipeline, compared to only 1 of 20 (5 percent) who admitted it when questioned without being connected (Quigley-Fernandez and Tedeschi, 1978).

Admissions during polygraph testing of acts that had not previously been disclosed are often presented as evidence of the utility and validity of polygraph testing. However, the bogus pipeline research demonstrates that whatever such admissions contribute to utility, they are not necessarily evidence of the validity of the polygraph.
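The admission rates in the Quigley-Fernandez and Tedeschi study can be checked with a few lines of arithmetic. The sample odds ratio computed below is our own illustrative summary of the contrast, not a statistic reported in the study itself.

```python
# Admission counts in the bogus pipeline experiment
# (Quigley-Fernandez and Tedeschi, 1978): 13 of 20 admitted when wired
# to the bogus machine, versus 1 of 20 when questioned without it.
pipeline_admit, pipeline_n = 13, 20
control_admit, control_n = 1, 20

rate_pipeline = pipeline_admit / pipeline_n   # 0.65 (65 percent)
rate_control = control_admit / control_n      # 0.05 (5 percent)

# Sample odds ratio for admitting under the bogus pipeline condition
# (an illustrative statistic, not one from the original report).
odds_ratio = (pipeline_admit / (pipeline_n - pipeline_admit)) / (
    control_admit / (control_n - control_admit))

print(f"bogus pipeline: {rate_pipeline:.0%} admitted")
print(f"control:        {rate_control:.0%} admitted")
print(f"sample odds ratio: {odds_ratio:.1f}")  # about 35
```

Even with these small samples, the contrast is stark: the mere belief that a machine can detect deception multiplied the odds of admission roughly 35-fold.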
Many admissions do not depend on validity, but rather on examinees’ beliefs that the polygraph will reveal any deceptions. All admissions that occur during the pretest interview probably fall into this category. The only admissions that can clearly be attributed to the validity of the polygraph are those that occur in the posttest interview in response to the examiner’s probing questions about segments of the polygraph record that correctly indicated deception. We know of no data that would allow us to estimate what proportion of admissions in field situations fall within this category.

Even admissions in response to questions about a polygraph chart may sometimes be attributable to factors other than accurate psychophysiological detection of deception. For example, an examiner may probe a significant response to a question about one act, such as revealing classified information to an unauthorized person, and secure an admission of a different act investigated by the polygraph test, such as having undisclosed contact with a foreign national. Although the polygraph test may have been instrumental in securing the admission, the admission’s relevance to test validity is questionable. To count the admission as evidence of validity would require an empirically supported theory that could explain why the polygraph record indicated deception to the question on which the examinee was apparently nondeceptive, but not to the question on which there was deception.

There is also a possibility that some of the admissions and confessions elicited by interrogation concerning deceptive-looking polygraph responses are false. False confessions are more common than sometimes believed, and standard interrogation techniques designed to elicit confessions, including the use of false claims that the investigators have definitive evidence of the examinee’s guilt, do elicit false confessions (Kassin, 1997, 1998). There is some evidence that interrogation focused on a false-positive polygraph response can lead to false confessions. In one study, 17 percent of respondents who were shown their strong response on a bogus polygraph to a question about a minor theft they did not commit subsequently admitted the theft (Meyer and Youngjohn, 1991).

As with deterrence, the value of the polygraph in eliciting true admissions and confessions is largely a function of an examinee’s belief that attempts to deceive will be detected and will have high costs. It likely also depends on an examinee’s belief about what will be done with a “deceptive” test result in the absence of an admission. Such beliefs are not necessarily dependent on the validity of the test. Thus, admissions and confessions in the polygraph examination, as important as they can be to investigators, provide support for claims of validity only in very limited circumstances.
Admissions can even adversely affect the assessment of validity in field settings, because in such settings an admission is typically the end of assessment of the polygraph, even if interrogation and investigation continue: the polygraph examination is concluded to have been productive. In our efforts to secure data from federal agencies about the specific circumstances of admissions secured during security screening polygraph examinations, we have learned that agencies do not classify admissions according to when in the examination those admissions occurred. This practice makes it impossible to assess the validity of federal polygraph screening programs from the data those programs provide. Polygraph examinations that yield admissions may well have utility, but they cannot provide evidence of validity unless the circumstances of the admission are taken into account and unless the veracity of the admission itself is independently confirmed. Using the polygraph record to confirm an admission that was elicited because of the polygraph record does not count as independent confirmation.

Fostering Public Confidence

Another purpose of the polygraph is to foster public confidence in national security. Public trust is obviously challenged by the revelation that agents acting on behalf of foreign interests occupy sensitive positions in the U.S. government. Counterintelligence necessarily includes programs that are secret. Because these programs’ responses to revelations of spying cannot be made public, they do little to reassure the public of the integrity of U.S. national security procedures. Calls for increased polygraph testing appear to us to be intended in part to reassure the public that all that can be done is being done to protect national security interests. To the extent that the public believes in the polygraph, attribution theory (Jones, 1991) suggests it may serve this function. We know of no scientific evidence to assess the net effect of polygraph screening policies on public confidence in national security or security organizations. We note that, as with the value of the polygraph for deterrence and for eliciting admissions and confessions, its value for building confidence depends on people’s beliefs about its validity and only indirectly on its proven validity.

Public confidence in the polygraph that goes beyond what is justified by evidence of its validity may be destructive to public purposes. An erroneously high degree of belief in validity can create a false sense of security among policy makers, among employees in sensitive positions, and in the general public. This false sense of security can in turn lead to inappropriate relaxation of other methods of ensuring security. In particular, the committee has heard suggestions that employees may be less vigilant about potential security violations by coworkers in facilities in which all employees must take polygraph tests.
Some agencies permit new hires who have passed a polygraph examination, but whose background investigation is not yet complete, to have the same access to classified material as other employees, with no additional security precautions.

Implications for Assessing Validity of Polygraph Testing

The detection of deception from demeanor, deterrence, and effects on public confidence may all contribute to the utility of polygraph testing. These effects do not, however, provide evidence of the validity of the polygraph for the physiological detection of deception. Rather, those effects depend on people’s beliefs about validity. Admissions and confessions, as noted above, provide evidence supportive of the validity of polygraph tests only under very restricted conditions, and the federal agencies that use the polygraph for screening do not collect data on admissions and confessions in a form that allows these field tests to be used to assess polygraph validity. Moreover, even with data on when in the examination admissions or confessions occurred and on whether the admitted acts corresponded to significant responses to relevant questions about those specific acts, information from current field screening examinations would have limited value for assessing validity because of the need for independent validation of the admissions and confessions.

There is in fact no direct scientific evidence assessing the value of the polygraph as a deterrent, as a way to elicit admissions and confessions, or as a means of supporting public confidence. What indirect scientific evidence exists does, however, support the plausibility of these uses. This evidence implies that for the polygraph or any other physiological technique to achieve maximal utility, examinees and the public must perceive that there is a high likelihood of deception being detected and that the costs of being judged deceptive are substantial. If people do not have these beliefs, then the value of the technique as a deterrent, as an aid to interrogation, and for building public confidence is greatly diminished. Indeed, if the public does not believe a technique such as the polygraph is valid, using it to help reinstate public trust after a highly visible security breach may be counterproductive. Regardless of people’s current beliefs about validity, if polygraph testing is not in fact highly accurate in distinguishing truthful from deceptive responses, the argument for utility diminishes in force.
Convincing arguments could then be made that (a) polygraphs provide a false sense of security, (b) the time and resources spent on the polygraph would be better spent developing alternative procedures, (c) competent or highly skilled individuals would be, or are being, lost due to suspicions cast on them by erroneous decisions based on polygraph tests, (d) agencies that use polygraphs are infringing civil liberties for insufficient benefits to the national security, and (e) utility will decrease rapidly over time as people come to appreciate the low validity of polygraph testing. Polygraph opponents already make such arguments.

The utility benefits claimed for the polygraph, even though many of them are logically independent of its validity, depend indirectly on the polygraph being a highly valid indicator of deception. In the long run, evidence that supports validity can only increase the polygraph test’s utility, and evidence against validity can only decrease utility. The scientific evidence for the ability of the polygraph test to detect deception is therefore crucial to the test’s usefulness. The evidence on validity is discussed in Chapters 3, 4, and 5.

CRITERION VALIDITY AS VALUE ADDED

For the polygraph test to be considered a valid indicator of deception, it must perform better against an appropriate criterion of truth than do indicators that have no validity. That is, it must add predictive value. It is therefore necessary to define the nonvalid indicators that serve as points of comparison (see note 13).

One possible reference point is the level of performance that would be achieved by random guessing about the examinee’s truthfulness or deceptiveness on the relevant questions. In this comparison, the predictive validity of the polygraph test is the difference between its predictive value and that of random guessing. This reference point provides a minimal comparison that we consider too lenient for most practical uses, and particularly for employee screening applications. For the polygraph to have sufficient validity to be of more than academic interest, it must do considerably better than random guessing.

A second possible reference point is the extent to which deception is accurately detected by other techniques normally used in the same investigations as the polygraph (background checks, questionnaires, etc.). Comparisons of the incremental validity (Fiedler, Schmid, and Stahl, in press) of the polygraph consider the improvement provided by the polygraph over those other methods of investigation. We consider this reference point to be important for making policy decisions about whether to use the polygraph (see Chapter 7), but not for judging validity. The scientific validity of the polygraph is unaffected by whether or not other techniques provide the same information.

A third possible reference point for the validity of polygraph testing is a comparison condition that differs from the polygraph examination only in the absence of the chart data, which is purportedly the source of the valid physiological detection of deception in the polygraph examination. This logic implies a comparison similar to the placebo control condition in medical research.
The reference point is an experimental treatment condition that is exactly the same as the one being investigated, except for its active ingredient. For the polygraph, that would mean a test that both the examiner and examinee believed yielded valid detection of deception, but that in fact did not. Polygraph research does not normally use such comparisons, but it could. Doing so would help determine the extent to which the effectiveness of the polygraph is attributable to its validity, as distinct from other features of the polygraph examination, such as beliefs about its validity.

Bogus pipeline research illustrates what might be involved in assessing validity of the polygraph using an experimental condition analogous to a placebo. An actual polygraph test might be compared with a bogus pipeline test in which the examinee is connected to polygraph equipment that, unbeknownst both to examiners and examinees, produced charts that were not the examinee’s (perhaps the chart of a second examinee whose actual polygraph is being read as the comparison to the bogus one). The polygraph’s validity would be indicated by the degree to which it uncovered truth more accurately than the bogus pipeline comparison. Such a comparison might be particularly useful for examining issues of utility, such as the claimed ability of the polygraph to elicit admissions and confessions. These admissions and confessions might be appropriately attributed to the validity of the polygraph if it produced more true admissions and confessions than a bogus pipeline comparison condition. However, if similar proportions of deceptive individuals could be induced to admit transgressions when connected to an inert machine as when connected to a polygraph, their admissions could not be counted as evidence of the validity of the polygraph.

We believe that such a comparison condition is an appropriate reference point for judging the validity of polygraph testing, especially as that validity contributes to admissions and confessions during the polygraph interview. However, we have found no research attempting to assess polygraph validity by making this kind of comparison. This gap in knowledge may not present a serious threat to the quality of laboratory-based polygraph research, in which examinees normally do not admit their mock crimes, but it is important for making judgments about whether research on polygraph use under field conditions provides convincing evidence of criterion validity.

CONCLUSIONS

Validity and Utility

The appropriate criteria for judging the validity of a polygraph test are different for event-specific and for employee or preemployment screening applications. The practical value of a polygraph testing and scoring system with any given level of accuracy also depends on the application, because in these different applications false positive and false negative errors differ both in frequency and in cost.
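The point that error frequencies depend on the application can be made concrete with a base-rate calculation. The numbers below (a 1-in-1,000 base rate of deceptive examinees and illustrative sensitivity and false positive rates) are hypothetical, chosen only to show the arithmetic; they are not drawn from any polygraph study.

```python
# Hypothetical screening scenario: all parameters are assumptions for
# illustration, not estimates of actual polygraph performance.
population = 10_000
base_rate = 0.001           # 1 in 1,000 examinees is actually deceptive
sensitivity = 0.80          # P(test reads "deceptive" | deceptive)
false_positive_rate = 0.10  # P(test reads "deceptive" | truthful)

deceptive = population * base_rate                # ~10 people
truthful = population - deceptive                 # ~9,990 people

true_positives = deceptive * sensitivity          # ~8
false_positives = truthful * false_positive_rate  # ~999

# Fraction of "deceptive" readings that are correct
# (the positive predictive value).
ppv = true_positives / (true_positives + false_positives)
print(f"positive results: {true_positives + false_positives:.0f}")
print(f"positive predictive value: {ppv:.3f}")  # about 0.008
```

Under these assumed numbers, fewer than 1 percent of "deceptive" readings would come from actually deceptive examinees: the same test that looks accurate in an event-specific investigation generates overwhelmingly false positives when the target condition is rare, which is why screening applications must be judged by different criteria.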
No clear consensus exists on what polygraphs are intended to measure in the context of federal employee security screening. Evidence of the utility of polygraph testing, such as its possible effects in deterring potential spies from seeking employment or in increasing the frequency of admissions of target activities, is relevant to polygraph validity only under very restricted circumstances. This is true in part because any technique that examinees believe to be a valid test of deception is likely to produce deterrence and admissions, whether or not it is in fact valid.

The federal agencies that use the polygraph for screening do not collect data on admissions and confessions in a form that allows these field tests to be used to assess polygraph validity.

There is no direct scientific evidence assessing the value of the polygraph as a deterrent, as a way to elicit admissions and confessions, or as a means of supporting public confidence. The limited scientific evidence does support the idea that these effects will occur when examinees (and the public) perceive that there is a high likelihood of deception being detected and that the costs of being judged deceptive are substantial.

Measurement of Accuracy

For the purposes of assessing accuracy, or criterion validity, it is appropriate to treat the polygraph as a diagnostic test and to apply scientific methods, based on the theory of signal detection, that have been developed for measuring the accuracy of such tests. Diagnostic test performance depends on both the accuracy of the test, which is an attribute of the test itself, and the threshold value selected for declaring a test result positive.

There is little awareness in the polygraph literature, and less in U.S. polygraph practice, of the concept that false positives can be traded off against false negatives by adjusting the threshold for declaring that a chart indicates deception. We have seen indications that practitioners implicitly adjust thresholds to reflect perceived organizational priorities, but they may not be fully aware of doing so. Explicit awareness of the concept of the threshold, and appropriate policies for adjusting it to reflect the costs of different kinds of error, would eliminate a major source of uncontrolled variation in polygraph test results.

The accuracy of the polygraph is appropriately summarized by the accuracy index A, as defined in the theory of signal detection.
To estimate the accuracy of the polygraph, it is appropriate to calculate values for this index for the validation studies that meet standards of scientific acceptability and to consider whether these values are systematically related to other factors, such as populations of examinees, characteristics of individual examinees or examiners, relationships established in the interview, testing methods, and the use of countermeasures.
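The accuracy index A has a simple operational reading (made explicit in note 8 below): it is the probability that a randomly chosen deceptive examinee scores higher on the test than a randomly chosen truthful one. The sketch below estimates A that way from simulated chart scores, and then shows the threshold trade-off: raising the cutoff lowers false positives at the cost of more false negatives. The Gaussian score model and all numbers are our own illustrative assumptions, not polygraph data.

```python
import random

random.seed(7)

# Illustrative assumption: chart scores are Gaussian, with deceptive
# examinees' scores shifted upward relative to truthful examinees'.
truthful = [random.gauss(0.0, 1.0) for _ in range(5000)]
deceptive = [random.gauss(1.0, 1.0) for _ in range(5000)]

# A = P(random deceptive score > random truthful score); ties count 1/2.
pairs = [(d, t) for d in deceptive[:200] for t in truthful[:200]]
a_index = sum(1.0 if d > t else 0.5 if d == t else 0.0
              for d, t in pairs) / len(pairs)
print(f"estimated A = {a_index:.2f}")  # theory gives about 0.76 for this model

# Moving the decision threshold trades false positives against false negatives.
for cutoff in (0.0, 0.5, 1.0):
    fp = sum(s > cutoff for s in truthful) / len(truthful)    # false positive rate
    fn = sum(s <= cutoff for s in deceptive) / len(deceptive) # false negative rate
    print(f"cutoff {cutoff:+.1f}: false positive rate {fp:.2f}, "
          f"false negative rate {fn:.2f}")
```

Note that A stays fixed as the cutoff moves: accuracy is a property of the score distributions, while the balance of the two error types is a policy choice about the threshold.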

NOTES

1. In practice, test-retest reliability can be affected by memory effects, the effects of the experience of testing on the examinee, the effects of the experience on the examiner, or all of these effects.

2. In most applications of the comparison question technique, for example, examiners select comparison questions on the basis of information gained in the pretest interview that they believe will produce a desired level of physiological responsiveness in examinees who are not being deceptive on the relevant questions. It is plausible that tests using different comparison questions (for example, tests by different examiners with the same examinee) might yield different test results, compromising test-retest reliability. Little research has been done on the test-retest reliability of comparison question polygraph tests. Some forms of the comparison question test, notably the Test of Espionage and Sabotage used in the U.S. Department of Energy’s security screening program, offer examiners a very limited selection of possible relevant and comparison questions in an attempt to reduce variability, in a way that can reasonably be expected to benefit test-retest reliability in comparison with test formats that allow an examiner more latitude.

3. The polygraph examination for preemployment or preclearance screening may have purposes other than the diagnostic purpose served by the test. For example, an employer may want to gain knowledge of information about the applicant’s past or current situations that might be used to “blackmail” the individual into committing security violations such as espionage, but that could not be used in this way if the employer already had the information.

4. Policies for use of the polygraph in preemployment screening vary considerably among federal agencies.

5. We were told that the FBI administered approximately 27,000 preemployment polygraph examinations between 1994 and 2001. More than 5,800 of these tests (21 percent) led to the decision that the examinee was being deceptive. Of these, almost 4,000 tests (approximately 69 percent of “failures”) involved obtaining direct admissions of information that disqualified applicants from employment (about 2,300 tests) or of information not previously disclosed in the application process that led to a judgment of deceptiveness (about 1,700 tests). More than 1,800 individuals who did not provide direct admissions also were judged deceptive; the proportion of these attributed to detected or suspected countermeasures is not known. Thus, only the remainder of those judged deceptive (fewer than 1,800) resulted from the direct and unambiguous reading of the polygraph chart.

6. The false positive index is not commonly used in research on medical diagnosis but seems useful for considering polygraph test accuracy.

7. Many statistics other than the ROC accuracy index (A) might have been used, but they have drawbacks relative to A as a measure of diagnostic accuracy. One class of measures of association assumes that the variances of the distributions of the two diagnostic alternatives are equal. These include the d′ of signal detection theory (also known as Cohen’s d). These measures are adequate when the empirical ROC is symmetrical about the negative diagonal of the ROC graph, but empirical curves often deviate from the symmetrical form. The measures A and d′ are equivalent in the special case of symmetric ROCs, but even then A has the conceptual advantage of being bounded, by 0.5 and 1.0, while d′ is unbounded. Some measures of association, such as the log-odds ratio and Yule’s Q, depend only on the internal four cells of the 2-by-2 contingency table of test results and true conditions (e.g., their cross product) and are independent of the table’s marginal totals. Although they make no assumptions about equal variances per se, as measures of accuracy they share the same “symmetric” features as d′. A second class of standard measures of association, which do depend on marginal totals, are functions of the correlation coefficient; they include Cohen’s kappa and measures derived from the chi-square coefficient, such as the phi (four-fold point) coefficient. Like the “percentage correct” index, these measures vary with the base rate of positive cases in the study sample and with the diagnostician’s decision threshold, in a way that is evident only when their ROCs are derived. Their ROCs are not widely known, inasmuch as the measures were designed for single 2-by-2 or 2-by-3 tables rather than for the 2-by-n table that represents the multiple possible thresholds used in estimating an ROC. However, these measures can be shown to predict an ROC of irregular form: one that is not concave downward, or that intersects the ROC axes at places other than the (0.0, 0.0) and (1.0, 1.0) corners. Moreover, some of these latter measures were developed to determine statistical significance relative to hypotheses of no relationship, and they lack cogency for assessing degree of accuracy or effect size. Several of these alternative statistics have been analyzed and their theoretical ROCs compared with a broad sample of observed ROCs (Swets, 1986a, 1986b); the two classes of association statistics are discussed by Bishop, Fienberg, and Holland (1975).

8. The accuracy index (A) is equal to the proportion of correct signal identifications that would be made by a diagnostician confronted repeatedly by pairs of random test results, one of which was drawn from the signal category and one from the noise category. For example, a decision maker repeatedly faced with two examinees, one of whom is truthful, will make the correct choice 8 out of 10 times by using a test with A = 0.8. In other situations, A does not translate easily to percent correct. Under a great many assumptions about test situations that are realistic in certain applications, the percent correct is quite different from A, as is illustrated in Table 2-1. (The measure A is applied to diagnostic performance in several fields; see Swets, 1988, 1996: Chapter 4.)

9. A conventional way of representing decision thresholds quantitatively is as the slope of the tangent to the ROC curve drawn at the cutoff point that defines the threshold. It can be shown that this slope is equal to the ratio of the height of the signal distribution to the height of the noise distribution (the “likelihood ratio”) at that threshold (see representations in Figure 2-1). At point F in Figure 2-2, this slope is 2; at point B it is 1; and at point S it is 1/2 (Swets, 1992, 1996: Ch. 5).

10. Computer software exists to give maximum-likelihood fits to empirical ROC points (e.g., Metz, 1986, 1989, 2002; Swets, 1996). There are two common approaches: to draw straight line segments interpolating between estimated ROC points and the lower left and upper right corners of the plotting square; or to assume a curved form that follows from underlying distributions of the measure of evidence that are normal (Gaussian), often with arbitrary variances but sometimes with these assumed equal, and to use maximum likelihood estimation. In either case, A is determined as the area under the estimated ROC; standard errors and confidence bounds for A may also be computed. These methods have technical limitations when used on relatively small samples, but they are adequate to the level of accuracy needed here.

11. A different distinction between validity and utility is made in some writings on diagnostic testing (Cronbach and Gleser, 1965; Schmidt et al., 1979). That distinction concerns the practical value of a test with a given degree of accuracy in particular decision-making contexts, such as screening populations with low base rates of the target condition. We address these issues in this report (particularly in Chapter 7), but do not apply the term “utility” in that context. Our usage of “utility” in discussing the polygraph follows the usage of the term by polygraph practitioners.

12. Using computers or “blind” scoring may not completely remove the effects of demeanor, because cues in the examinee’s demeanor can alter the way the examination is given, and this may in turn affect the examinee’s physiological responses on the test.

13. We found many polygraph validation studies in which assessment was done only by tests of statistical significance, without any attempt to estimate effect size or strength of association. We were unable to use these in our quantitative assessment of accuracy because they did not provide the raw data needed to calculate the accuracy index.
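The first approach described in note 10, drawing straight line segments between empirical ROC points and taking A as the area under them, can be sketched in a few lines. The ROC points below stand in for hit and false-alarm rates read off a 2-by-n table of scored charts at successively laxer thresholds; they are made up for illustration, not taken from any polygraph study.

```python
# Hypothetical (false positive rate, true positive rate) pairs at
# successively laxer decision thresholds, ordered from strictest to laxest.
# The (0, 0) and (1, 1) corners anchor the curve, as described in note 10.
roc_points = [(0.00, 0.00), (0.10, 0.45), (0.25, 0.70),
              (0.50, 0.90), (1.00, 1.00)]

def area_under_roc(points):
    """Trapezoidal area under straight-line segments joining ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

a = area_under_roc(roc_points)
print(f"A = {a:.3f}")  # between 0.5 (chance) and 1.0 (perfect discrimination)
```

Because the interpolation is by straight lines, this estimate is a lower bound on the area under any concave-downward curve through the same points, which is one reason maximum-likelihood curve fitting is the other common approach.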