Copyright © 2009. National Academy of Sciences. All rights reserved.
Appendix B
Performance Metrics for ASPs and PVTs
“Far better an approximate answer to the right question, which is often vague, than
an exact answer to the wrong question, which can always be made precise.”
– John W. Tukey (1962), “The Future of Data Analysis,” Annals of Mathematical
Statistics 33(1), p.1–67. (The citation appears on p.12.)
When evaluating the performance of instruments to identify the system best suited
to a given task, one needs to consider the correct metric for making the comparison. In the case
of systems such as the Advanced Spectroscopic Portals (ASPs), conventional measures such as
sensitivity and specificity provide useful information, but do not directly assess test performance
in actual field operation. The metrics of interest concern the probabilities of making incorrect
calls; i.e., the probability that the cargo actually contained dangerous material when the test
system allowed it to pass (a false negative call), and the probability that the cargo actually
contained benign material when the test system alarmed (a false positive call). In some
contexts, the false negative call probability (FNCP) has been called the “false non-discovery
rate” and the false positive call probability (FPCP) has been called the “false discovery rate” (see
Note 1). This appendix describes the calculations leading to estimates of these probabilities, the
uncertainties in these values, and how these estimated probabilities can be used to compare two
systems under consideration.
Test system performance usually is characterized in terms of detection probabilities. The
notation for these probabilities comes from the literature for comparing medical diagnostic tests,
and we use the same notation here for radiation detection systems:
• Sensitivity (S) = probability that the test system alarms, given that the underlying cargo
truly contains special nuclear material (SNM)
• Specificity (T) = probability that the test system does not alarm, given that the underlying
cargo truly contained benign material (non-SNM)
• Prevalence (p) = probability that cargo contains SNM
• Positive predictive value (PPV) = probability that the underlying cargo truly contains
SNM, given that the test system alarms
• Negative predictive value (NPV) = probability that the underlying cargo truly contains
non-SNM, given that the test system did not alarm.
Because the definitions of sensitivity (S) and specificity (T) rely on true knowledge of the
cargo contents, we can estimate a system’s sensitivity (S) and specificity (T) only from a
designed experiment. The experimenters insert into the cargo either SNM (true SNM) or benign
material (true non-SNM), and then run the cargo through the test systems; the proportion of
(true-SNM) runs that properly set off the test system alarm is an estimate of the test’s sensitivity,
and the proportion of (true non-SNM) runs that properly pass the test system is an estimate of the
test’s specificity.
In real life, however, we do not know the cargo contents. We see only the result of the
test system: either the test system alarmed, or it did not alarm. Operationally, if the system
alarms, SNM is suspected; if the system does not alarm, the cargo is allowed to pass. We are
Prepublication Copy
concerned especially with this question: Given that the test system did not alarm, what is the
probability that the cargo contained SNM? That is, what risk do we take by allowing a “no-
alarm” cargo to pass? From the standpoint of practical operational effectiveness, this probability
(the probability that the cargo contains SNM, given that the test system did not alarm) has grave
consequences. As shown below by Bayes’ Theorem, it is a function of sensitivity (S) and
specificity (T), as well as of prevalence (p) (i.e., how likely a positive – here, a cargo containing
SNM – is to occur), but a comparison between two test systems on the same scenario (i.e.,
the same threat) involves the same prevalence, so prevalence does not enter into the comparison
of effectiveness for the two test systems. So accurate estimation of sensitivity (S) and specificity
(T) is important, in that it allows us to compare accurately the performance of two test systems
using the relevant, practically meaningful metric.
The probability of making a false negative call (FNCP) is the probability that the cargo
truly contains SNM, given that the test system did not alarm; it is exactly the same as 1 – NPV (see footnote 47).
Unfortunately, we cannot estimate NPV from real life runs of the radiation test system, because
in real life, we don’t know the true state of the cargo. We can, however, estimate S and T from
designed studies, such as those conducted at the Nevada test site, because we know the cargo
contents in the tests. We also can derive confidence limits on S and T from such designed
experiments, and hence we can estimate (1–NPV) and associated confidence intervals. More
importantly, we can compare the two systems via a ratio, say (1-NPV1)/(1-NPV2); a ratio whose
lower confidence limit exceeds 1 indicates preference for test system 2, while a ratio whose
upper confidence limit falls below 1 indicates preference for test system 1. Note that these ratios
may differ for different scenarios; a table of these ratios may suggest strategies for associating
the ratios with the threat levels presented by different scenarios.
Notice also that the probability of making a false positive call (FPCP) is likewise of
interest for purposes of evaluating costs and benefits: too many false positive calls can also be
costly (e.g., slowing down commerce, diverting CBP personnel from potential threats as they
spend time investigating benign cargo, etc.). Two detection systems that have exactly the same
probability of a false negative call (1–NPV) for a given scenario, but substantially different
values of the probability of making a false positive call, may indicate a preference for one system
over the other. The probability of making a false positive call equals 1–PPV.
We illustrate these calculations from hypothetical data below. Suppose we have 24
trucks, into 12 of which we place SNM and leave only benign material in the remaining 12
trucks. We run all 24 trucks through two test systems, and observe the following results:
                        Test System 1                    Test System 2
                   Alarm  No Alarm  Total Runs      Alarm  No Alarm  Total Runs
SNM in cargo         10       2         12            11       1         12
Non-SNM in cargo      4       8         12             2      10         12
Total                14      10         24            13      11         24
Footnote 47: The literature (see references) refers to “false discovery rate” and “false non-discovery rate” which are related
to (1–PPV) and (1–NPV), respectively, but their definitions are slightly different (see Note 1).
60 EVALUATING TESTING, COSTS, & BENEFITS OF ASPs: INTERIM REPORT
Sensitivity is the probability that the system alarmed, given the presence of SNM in the
cargo: among the 12 trucks that contained SNM, 10 alarmed for test system 1 (estimated
sensitivity S1 = 10/12) and 11 alarmed for test system 2 (estimated S2 = 11/12). Similarly, we
estimate specificity for the two test systems as 8/12 and 10/12, respectively (number of “no
alarm” results out of the 12 non-SNM trucks). Because we specified the number of runs in each
condition (n1 = 12 for SNM runs and n2 = 12 for non-SNM runs), we can estimate the
uncertainties in these probabilities using the conventional binomial distribution. In this case,
lower 95% confidence bounds determined from the binomial distribution based on n1 = n2 = 12
are:
                              Test System 1      Test System 2
Estimated Sensitivity         0.833 (10/12)      0.917 (11/12)
  95% confidence interval     (0.562, 1.000)     (0.661, 1.000)
Estimated Specificity         0.667 (8/12)       0.833 (10/12)
  95% confidence interval     (0.391, 1.000)     (0.562, 1.000)
(The wide intervals result from the small sample sizes.)
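The lower bounds above can be approximated with a short calculation. The sketch below assumes the bounds are one-sided exact (Clopper-Pearson) binomial bounds, which is an assumption about the method rather than something the text states; the function names `binom_upper_tail` and `lower_bound` are illustrative only.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def lower_bound(k, n, conf=0.95, tol=1e-10):
    """One-sided exact lower confidence bound for a proportion (assumed method).

    Finds, by bisection, the p at which observing k or more successes
    in n trials has probability 1 - conf.
    """
    if k == 0:
        return 0.0
    alpha = 1.0 - conf
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) < alpha:
            lo = mid   # tail probability too small, so the bound lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical truck data from the table: successes out of 12 runs.
for label, k in [("S1", 10), ("S2", 11), ("T1", 8), ("T2", 10)]:
    print(f"{label}: estimate {k / 12:.3f}, lower 95% bound {lower_bound(k, 12):.3f}")
```

Under this assumption the computed bounds should land close to the tabulated lower limits; the upper limit of each interval is simply 1 because the bound is one-sided.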
More importantly, the negative predictive value (NPV, the probability that the truck truly
did not contain SNM, given that the alarm did not sound) is 8/10 for test system 1 and 10/11 for
test system 2, and hence we estimate the probability of making a false negative call for the two
systems as
• proportion of cases where test system 1 did not alarm (10 cases) but actually
contained SNM cargo (2 cases) = 2/10 = 0.20
• proportion of cases where test system 2 did not alarm (11 cases) but actually
contained SNM cargo (1 case) = 1/11 = 0.09
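The same proportions can be read directly off the 2×2 counts. The following is a minimal sketch using the hypothetical truck data; the function name `call_probabilities` and its argument names are illustrative, not part of the original analysis.

```python
def call_probabilities(tp, fn, fp, tn):
    """Summaries of a 2x2 table of test outcomes.

    tp: SNM in cargo, alarm        fn: SNM in cargo, no alarm
    fp: non-SNM in cargo, alarm    tn: non-SNM in cargo, no alarm
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "NPV": tn / (tn + fn),
        "PPV": tp / (tp + fp),
        "FNCP": fn / (tn + fn),   # 1 - NPV
        "FPCP": fp / (tp + fp),   # 1 - PPV
    }

system1 = call_probabilities(tp=10, fn=2, fp=4, tn=8)
system2 = call_probabilities(tp=11, fn=1, fp=2, tn=10)

for name, result in [("Test system 1", system1), ("Test system 2", system2)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

These empirical proportions reflect the 12-to-12 split of SNM and non-SNM trucks fixed by the design, so they describe the experiment rather than field conditions, where the prevalence p is far smaller.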
Clearly, test system 1 appears to be less reliable than test system 2. The calculation of the
lower bounds on these estimated probabilities is not as straightforward as using the binomial
distribution, as was done for sensitivity and specificity, because the denominator (10 in the
outcome of the performance tests of system 1 and 11 in the outcome of the performance tests on
system 2) arose from the test results, not from the number of trials set by the study design. That
is, the denominator “10” for test system 1 (and “11” for test system 2) is the sum of two numbers
that might differ if the test were re-run. Confidence bounds can be obtained as a function of
sensitivity (S) and specificity (T) (see Note 2).
To estimate the probability of a false negative call from estimates of
sensitivities and specificities, we use the following notation. Let A and B denote two events, say
A = cargo contains SNM
B = Test system alarms
Ac = The complement of A, cargo contains no SNM (benign)
Bc = The complement of B, test system does not alarm
The FNCP is the probability that event A occurs (cargo truly contains SNM), given that
event Bc occurred (test system does not alarm). We write this probability as P{A|Bc}. (The event
after the vertical bar “|” is the event on which the probability is conditioned; i.e., the event that
is known to have occurred.)
Bayes’ rule (Navidi, 2006) states:
$$P\{A \mid B^c\} = \frac{P\{B^c \mid A\}\,P\{A\}}{P\{B^c \mid A\}\,P\{A\} + P\{B^c \mid A^c\}\,P\{A^c\}} \tag{1}$$
where
P{A|Bc} = probability that event A occurs, given confirmation that event Bc has occurred
(here, P{cargo contains SNM | test system does not alarm} = 1 – NPV)
P{Bc|A} = probability that event Bc occurs, given confirmation that event A has occurred
(here, P{test system does not alarm | cargo contains SNM} = 1 – S)
P{Bc|Ac} = probability that event Bc occurs, given confirmation that event Ac has occurred
(here, P{test system does not alarm | cargo contains no SNM} = T).
Recall that sensitivity is the probability that the test system alarms, given SNM was in the
cargo; i.e., P{B|A} = sensitivity (S). Both S and T can be estimated from the experimental test
runs (where we know what the cargo contained). Denoting by p the probability that cargo
contains SNM, we have:
$$\mathrm{FNCP} = \frac{(1-S)\,p}{(1-S)\,p + T\,(1-p)} = \frac{1}{1+y}, \tag{2}$$
where y = [T/(1−S)]×[(1−p)/p]. We prefer systems with lower values of this probability; i.e., with
higher values of y.
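Equation (2) can be checked numerically. The sketch below evaluates both forms of the expression; the function names `fncp` and `fncp_via_y` are illustrative. As a consistency check it uses the hypothetical truck data with p = 0.5, which is only the share of SNM trucks in that designed experiment and not a realistic prevalence.

```python
def fncp(S, T, p):
    """False negative call probability, first form of equation (2)."""
    return (1 - S) * p / ((1 - S) * p + T * (1 - p))

def fncp_via_y(S, T, p):
    """Equivalent form 1 / (1 + y), valid when S < 1 and 0 < p < 1."""
    y = (T / (1 - S)) * ((1 - p) / p)
    return 1 / (1 + y)

# Test system 1 in the hypothetical experiment: S = 10/12, T = 8/12,
# and half of the trucks carried SNM, so p = 0.5 within the experiment.
print(fncp(10 / 12, 8 / 12, 0.5))        # agrees with the empirical 2/10
print(fncp_via_y(10 / 12, 8 / 12, 0.5))  # same value from the 1/(1+y) form

# With a much smaller prevalence the probability shrinks accordingly.
print(fncp(10 / 12, 8 / 12, 0.001))
```

The agreement with the empirical value 2/10 shows that the direct proportion and the Bayes formula coincide when p is set to the experimental mix; the formula is what allows other values of p to be substituted.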
Denoting by S1, T1, S2, T2 the sensitivities and specificities of systems 1 and 2,
respectively, we prefer system 1 to system 2 if FNCP1 < FNCP2; i.e., if
$$y_1 > y_2,$$
i.e., if
$$\left(\frac{T_1}{1-S_1}\right)\left(\frac{1-p}{p}\right) > \left(\frac{T_2}{1-S_2}\right)\left(\frac{1-p}{p}\right),$$
which is the same as either
$$\frac{T_1}{1-S_1} > \frac{T_2}{1-S_2} \tag{3}$$
or
$$\frac{T_1}{T_2} > \frac{1-S_1}{1-S_2}. \tag{4}$$
That is, a comparison of FNCP for test system 1 (FNCP1) with that for test system 2
(FNCP2) reduces to a comparison of [(1-sensitivity)/(specificity)] for the two systems. We can
estimate uncertainties on our estimates of sensitivity and specificity (based on the binomial
distribution; see above discussion). Hence, we can approximate the uncertainty in [(1 – S)/(T)],
and ultimately the uncertainty in the ratio of false negative call probabilities (see Note 2) —
which does not involve assumptions on p (likelihood of the threat). Notice that test system 1 is
always preferred if T1 ≥ T2 and S1 ≥ S2, because T1 ≥ T2 implies that the left-hand side of (4)
exceeds or equals 1, and S1 ≥ S2 implies that the right-hand side of (4) is less than or equal to 1;
hence (4) is satisfied. (If T1 = T2 and S1 = S2, then the test systems are equivalent, in terms of
sensitivity, specificity, and false negative call probability, so either can be selected.) In real
situations, however, one test system may have a higher sensitivity but a lower
specificity. For example, if T1 = 0.70 and T2 = 0.80 (test system 2 is more likely to remain silent
on truly benign cargo than test system 1), but S1 = 0.950 and S2 = 0.930 (test system 1 is slightly
more likely to alarm if the cargo truly contains SNM), then (4) says that test system 1 is
preferred, because T1/T2 = 0.875 and (1-S1)/(1-S2) = 0.05/0.07 = 0.714. The FNCP for the two
systems are
$$\mathrm{FNCP}_1 = \frac{1}{1 + \left(\frac{T_1}{1-S_1}\right)\left(\frac{1-p}{p}\right)} = \frac{1}{1 + 14.00\,\frac{1-p}{p}}$$
$$\mathrm{FNCP}_2 = \frac{1}{1 + \left(\frac{T_2}{1-S_2}\right)\left(\frac{1-p}{p}\right)} = \frac{1}{1 + 11.43\,\frac{1-p}{p}}$$
so clearly FNCP1 < FNCP2.
• p = 0.0001: FNCP1 = 0.1429×10^-4 and FNCP2 = 0.7778×10^-4 (ratio = 0.18369).
Here, even with a higher specificity, the increase in sensitivity from 0.3 (test 2) to 0.9
(test 1) results in a five-fold decrease in the FNCP. With either test, the FNCP is small, even
when the threat level is 0.01 (1 in 100 trucks carries threatening cargo).
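For the earlier example with T1 = 0.70, S1 = 0.95, T2 = 0.80, and S2 = 0.93, the comparison in (4) and the resulting FNCP values can be sketched as follows. The helper names `y_factor` and `fncp` are illustrative, and the prevalence values in the loop are chosen here only for demonstration.

```python
def y_factor(S, T):
    """The prevalence-free part of y in equation (2): T / (1 - S)."""
    return T / (1 - S)

def fncp(S, T, p):
    """False negative call probability written as 1 / (1 + y)."""
    return 1 / (1 + y_factor(S, T) * (1 - p) / p)

S1, T1 = 0.95, 0.70
S2, T2 = 0.93, 0.80

# Coefficients that multiply (1 - p)/p in the two FNCP expressions.
print(round(y_factor(S1, T1), 2), round(y_factor(S2, T2), 2))

# Criterion (4): system 1 is preferred when T1/T2 > (1 - S1)/(1 - S2).
prefers_system1 = T1 / T2 > (1 - S1) / (1 - S2)
print("System 1 preferred on FNCP:", prefers_system1)

# The ordering holds at every prevalence; only the magnitudes change.
for p in (0.10, 0.01, 0.001, 0.0001):
    f1, f2 = fncp(S1, T1, p), fncp(S2, T2, p)
    print(f"p = {p}: FNCP1 = {f1:.3e}, FNCP2 = {f2:.3e}, ratio = {f1 / f2:.4f}")
```

Because the factor (1 − p)/p is common to both systems, the preference does not depend on p, although the ratio of the two probabilities changes slightly with p.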
Calculations for the probability of a false positive call (FPCP, 1-PPV) are similar. Again
from Bayes’ Theorem:
$$P\{A^c \mid B\} = \frac{P\{B \mid A^c\}\,P\{A^c\}}{P\{B \mid A^c\}\,P\{A^c\} + P\{B \mid A\}\,P\{A\}} \tag{5}$$
where
Ac = complement of A = event that cargo does not contain SNM
Bc = complement of B = event that test system does not alarm
P{Ac|B} = probability that event Ac occurs even though B occurred (here, P{cargo
contains no SNM | test system alarms} = 1 – PPV)
P{Bc|Ac} = probability that event Bc occurs, given confirmation that event Ac has occurred
(here, P{test system does not alarm | cargo contains no SNM} = T, so that P{B|Ac} = 1 – T).
P{Bc|A} = probability that event Bc occurs, given confirmation that event A has occurred
(here, P{test system does not alarm | cargo contains SNM} = 1 – S, so that P{B|A} = S)
FPCP = (1−T)(1−p)/[(1−T)(1−p)+Sp] = 1/(1+z), where z = [S/(1−T)]×[p/(1−p)].
So test system 1 would be preferred, in these terms, over system 2, if
$$\left(\frac{S_1}{1-T_1}\right)\left(\frac{p}{1-p}\right) > \left(\frac{S_2}{1-T_2}\right)\left(\frac{p}{1-p}\right),$$
i.e., if
$$\frac{1-T_1}{S_1} < \frac{1-T_2}{S_2}.$$
To calculate the magnitude of FPCP (not just the ratio of the probabilities for the two
systems), consider that p is likely small and that S1 (or S2) may not be orders of magnitude larger
than (1-T1) (or (1-T2)). In this case, the “1 +” in the denominator does matter for the absolute
magnitude of this FPCP. For the example above, where S1 = 0.95, S2 = 0.93, T1 = 0.70, T2 = 0.80,
the corresponding FPCP for p=0.10, p= 0.05, p = 0.01, p = 0.001, p = 0.0001 are:
• p = 0.10: FPCP1 = 1/[1 + 0.31579(1/9)] = 0.96610, FPCP2 = 0.97666 (ratio = 0.9892)
• p = 0.05: FPCP1 = 0.98365, FPCP2 = 0.98881 (ratio = 0.99478)
• p = 0.01: FPCP1 = 0.99682, FPCP2 = 0.99783 (ratio = 0.99899)
• p = 0.001: FPCP1 = 0.99968, FPCP2 = 0.99978 (ratio = 0.99990)
• p = 0.0001: FPCP1 = 0.99997, FPCP2 = 0.99998 (ratio = 0.99999).
For these examples, the chance of having to re-inspect every sounded alarm, only to find
benign material, is virtually identical in both systems (and very close to 1 for both). The same is
true when S1 = 0.90, S2 = 0.30, T1 = 0.60, T2 = 0.80:
• p = 0.10: FPCP1 = 0.95294, FPCP2 = 0.93103 (ratio = 1.02353)
• p = 0.05: FPCP1 = 0.97714, FPCP2 = 0.96610 (ratio = 1.01143)
• p = 0.01: FPCP1 = 0.99553, FPCP2 = 0.99331 (ratio = 1.00223)
• p = 0.001: FPCP1 = 0.99956, FPCP2 = 0.99933 (ratio = 1.00022)
• p = 0.0001: FPCP1 = 0.99996, FPCP2 = 0.99993 (ratio = 1.00002).
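The FPCP expression can be evaluated in the same way as the FNCP. The sketch below follows the definition z = [S/(1−T)]×[p/(1−p)] given above; the function names are illustrative. Note that the p = 0.10 entry of the first list shows the factor 0.31579, which equals (1−T1)/S1 rather than S1/(1−T1), so this sketch is not expected to reproduce the listed figures digit for digit, although both sets of values approach 1 as p becomes small.

```python
def fpcp(S, T, p):
    """False positive call probability, P{no SNM | alarm}."""
    return (1 - T) * (1 - p) / ((1 - T) * (1 - p) + S * p)

def fpcp_via_z(S, T, p):
    """Equivalent form 1 / (1 + z), valid when T < 1 and 0 < p < 1."""
    z = (S / (1 - T)) * (p / (1 - p))
    return 1 / (1 + z)

def prefers_system1(S1, T1, S2, T2):
    """System 1 has the lower FPCP when (1 - T1)/S1 < (1 - T2)/S2."""
    return (1 - T1) / S1 < (1 - T2) / S2

S1, T1 = 0.95, 0.70
S2, T2 = 0.93, 0.80

print("System 1 preferred on FPCP:", prefers_system1(S1, T1, S2, T2))
for p in (0.10, 0.05, 0.01, 0.001, 0.0001):
    f1, f2 = fpcp(S1, T1, p), fpcp(S2, T2, p)
    print(f"p = {p}: FPCP1 = {f1:.5f}, FPCP2 = {f2:.5f}, ratio = {f1 / f2:.5f}")
```

For small p both probabilities are close to 1 whichever system is used, which is the qualitative point of the examples: nearly every alarm is a false positive when threats are rare.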
The DNDO criteria for “significant improvement in operational effectiveness” involve
comparisons of sensitivity and specificity. As noted above, a test system that has higher
sensitivity and higher specificity will have a lower false negative rate. But the above calculations
also demonstrate that “nearly equal” sensitivities and specificities result in nearly equivalent
systems, and hence offer rather limited benefit for the cost. For completeness, we re-write the
DNDO criteria for “significant improvement in operational testing” (see Box 2, pp 40–41) using
the S, T notation (for sensitivity and specificity).
Let $S_A^{(1)}(\mathrm{SNM}, \mathrm{noNORM})$ denote the sensitivity of the ASP system in primary (1)
screening when the cargo truly contains SNM and no NORM; i.e., $S_A^{(1)}(\mathrm{SNM}, \mathrm{noNORM})$ =
P{ASP alarms | cargo contains SNM, no NORM}. Likewise, let $S_P^{(1)}(\mathrm{SNM}, \mathrm{noNORM})$ denote
the sensitivity of the current (PVT+RIID) system in primary (1) screening when the cargo truly
contains SNM and no NORM; i.e., $S_P^{(1)}(\mathrm{SNM}, \mathrm{noNORM})$ = P{PVT alarms in primary screening |
cargo contains SNM, no NORM}. Using T to denote specificity, let $T_P^{(2)}(\mathrm{SNM}, \mathrm{noNORM})$ =
P{PVT/RIID does not alarm in secondary screening | cargo contains no SNM, but possibly
NORM} (specificity).
Denote by $S_A^{(1)}$ and $S_P^{(1)}$ the sensitivities of ASP and PVT+RIID combination, respectively,
in primary screening, and $T_A^{(1)}$ and $T_P^{(1)}$ the specificities of ASP and PVT+RIID, respectively;
superscript (2) indicates secondary screening. DNDO has specified its criteria for “operational
effectiveness” as follows:
1. $S_A^{(1)}(\mathrm{SNM}, \mathrm{noNORM}) \ge S_P^{(1)}(\mathrm{SNM}, \mathrm{noNORM})$
2. $S_A^{(1)}(\mathrm{SNM} + \mathrm{NORM}) \ge S_P^{(1)}(\mathrm{SNM} + \mathrm{NORM})$ (different version of criterion 1
above)
3. $T_A^{(1)}(\text{MI-Iso}) \ge T_P^{(1)}(\text{MI-Iso})$ (where “MI-Iso” indicates “licensable medical or
industrial isotopes”).
4. $1 - T_A^{(1)}(\mathrm{NORM}) \le 0.20\,[1 - T_P^{(1)}(\mathrm{NORM})]$
⇒ $0.8 \le T_A^{(1)}(\mathrm{NORM}) - 0.2\,T_P^{(1)}(\mathrm{NORM})$.
5. $1 - S_A^{(2)}(\mathrm{SNM}) \le 0.5\,[1 - S_P^{(1)}(\mathrm{SNM})]$ ⇒ $0.5 \le S_A^{(2)}(\mathrm{SNM}) - 0.5\,S_P^{(1)}(\mathrm{SNM})$.
6. Time in secondary for ASP ≤ time in secondary for RIID (no connection to
sensitivity/specificity).
Since criterion 4 is more stringent than criterion 3 and criterion 5 is more stringent than
criterion 1, we concentrate on values of sensitivity and specificity that satisfy criteria 4 and 5.
When these two conditions are satisfied (i.e., TA ≥ 0.8 + 0.2TP and SA ≥ 0.5 + 0.5SP), the ratio of
false negative call probabilities (ASP to PVT+RIID) can be as small as 1:900, almost 1000 times smaller.
For such improvements, the ratio of both the sensitivities and the specificities must be on the
order of 0.99/0.10 or 0.95/0.10; in such cases, the false negative call probabilities are on the
order of 10^-8 to 10^-5. Tables of values of the probabilities of both false negative calls and false
positive calls were calculated when TA, SA, TP, and SP were set equal to 0.1, 0.2, ..., 0.8, 0.9,
0.95, 0.99; of the 11^4 = 14,641 combinations, only 858 satisfied criteria 4 and 5. These 858
combinations were combined with 5 different values of p: 0.01 (threat cargo is present in 1 of 100
trucks), 0.001, 0.0001, 0.00001, and 0.000001 (1 in 1,000,000 trucks). A plot of the larger false
negative call probability (denoted FNCP2 in the figure) versus the smaller one (denoted FNCP1) is
shown in Figure B.1 (the red dashed line corresponds to the line where the two false negative
call probabilities are equal). The upper left corner shows the cases where the FNCPs are most
different (0.00112 < FNCP1/FNCP2 < 0.00311), which occurred in 26 of the 858 cases (26 × 5
points are shown, corresponding to the 5 values of p). More frequently, the ratio is less dramatic
(0.00317 < FNCP1/FNCP2 < 0.03161 for 257 of the 858 cases;
0.0316 < FNCP1/FNCP2 < 0.3162 for 535 of the 858 cases; 0.3165 < FNCP1/FNCP2 < 0.4819
for 40 of the 858 cases). In each case, the absolute magnitudes of the false negative call
probabilities are quite small, and the ratios of the false positive call probabilities are almost 1.
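The count of admissible combinations can be reproduced by enumerating the grid. The sketch below uses exact fractions so that boundary cases such as TA = 0.9 with TP = 0.5 are not lost to floating-point rounding; the function names are illustrative, and the ratio uses the small-p approximation [(1−SA)/TA]/[(1−SP)/TP] from Note 2 rather than the exact FNCP ratio at each p.

```python
from fractions import Fraction
from itertools import product

# Grid values 0.1, 0.2, ..., 0.9, 0.95, 0.99 as exact fractions.
values = [Fraction(k, 100) for k in (10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99)]

def satisfies_criteria(TA, SA, TP, SP):
    """Criteria 4 and 5 in the form TA >= 0.8 + 0.2*TP and SA >= 0.5 + 0.5*SP."""
    return (TA >= Fraction(4, 5) + Fraction(1, 5) * TP
            and SA >= Fraction(1, 2) + Fraction(1, 2) * SP)

def fncp_ratio_small_p(TA, SA, TP, SP):
    """Approximate FNCP(ASP)/FNCP(PVT) for tiny p."""
    return ((1 - SA) / TA) / ((1 - SP) / TP)

grid = list(product(values, repeat=4))            # (TA, SA, TP, SP) tuples
kept = [combo for combo in grid if satisfies_criteria(*combo)]
smallest = min(fncp_ratio_small_p(*combo) for combo in kept)

print(len(grid), len(kept))
print(smallest, float(smallest))
```

The smallest approximate ratio on this grid is attained at SA = TA = 0.99 and SP = TP = 0.1, which corresponds to the "as small as 1:900" figure quoted in the text.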
Figure B.1: Plot of FNCP2 versus FNCP1 for cases satisfying the criteria TA ≥ 0.8 + 0.2TP and
SA ≥ 0.5 + 0.5SP, for different levels of p (1 × 10^-2, 1 × 10^-3, 1 × 10^-4, 1 × 10^-5, 1 × 10^-6). The red
dashed line corresponds to FNCP1 = FNCP2. The results are stratified by magnitude of the ratio
FNCP1/FNCP2 (specifically, rounded values of the common logarithm of the ratio: –3, –2, –1,
0, respectively, for the four plots).
Note 1: A comment on notation
We denoted by FNCP the probability of making a false negative call and by FPCP the
probability of making a false positive call; i.e.,
FNCP = P{ true + | test calls “–” }
FPCP = P{ true – | test calls “+” } .
We related these probabilities to the following generic two-way table of test outcomes (notation
from Benjamini and Hochberg 1995, p.291, is in parentheses):
Truth            Test calls “Positive”   Test calls “Negative”   Total Tests
True POSITIVE    N++ (V)                 N+− (U)                 N+ ≡ m − m0
True NEGATIVE    N−+ (S)                 N−− (T)                 N− ≡ m0
Total calls      R                       m − R                   m
We estimated the false negative call probability via the proportion of negative-call tests (m−R) that
were in fact positive (N+−), or U/(m−R) in BH95 notation. Similarly, we estimated the false
positive call probability via the proportion of positive-call tests (R) that were in fact negative
(N−+), or V/R in BH95 notation. BH95 address the situation known as “multiple testing,” where
one is conducting many hypothesis tests (e.g., hundreds or thousands of tests as occurs in gene
expression experiments), and wants to control the frequency with which one declares as
“significant” (e.g., “positive”) tests which in fact are negative. Hence Benjamini and Hochberg
(1995) define the expected proportion of false positive calls, E(V/R), as the “false discovery
rate,” or FDR. They provide a procedure based on the m p-values from the m tests so that one has
assurance that, on average, the proportion of "declared significant" tests that in fact are not
significant remains below a pre-set threshold (e.g., 0.05). If we estimate the FPCP as V/R, we can
think of this estimated FPCP as an estimate of Benjamini and Hochberg’s FDR. In analogy with
E(V/R)=FDR, some have termed E(U/(m−R)) the “false non-discovery rate.”
Our situation differs from the multiple testing situation in two ways. First, our two-way table
arises from a designed experiment where values of m0 and m are set by design. Second, our
bigger concern lies not with false positive calls but rather with false negative calls; i.e., with the
probability that a cargo declared “safe” (negative) actually is dangerous (true positive). The table
suggests that we can estimate FNCP as U/(m−R). Some authors have called the expected value of
this ratio, E(U/(m−R)), the “false non-discovery rate” (see Genovese and Wasserman 2004; Sarkar
2006). But with both FNCP and FPCP, one needs further information about the frequency of true
“positives” and true “negatives” (in the form of p = probability that cargo contains SNM or other
threatening material) beyond the m tests given in the design. In fact, as further tests are
conducted, better estimates of FNCP and FPCP can be obtained by incorporating better estimates
of sensitivity and specificity, as well as p, into the formulas for FNCP and FPCP. For that reason,
we have chosen to derive the relevant probabilities using Bayes’ formula, rather than using the
terms “false discovery rate” and “false non-discovery rate,” which often are estimated from only
the table of outcomes from multiple tests. For further information, see the references below.
Note 2: Uncertainty in the ratio FNCP1/FNCP2
The uncertainty in the ratio FNCP1/FNCP2 ≈ [(1-S1)/T1]/[(1-S2)/T2] = [(1-S1)/(1-S2)][T2/T1] can
be approximated using propagation of error formulas. Let ratio = N/D denote a generic ratio (N
= Numerator, D = Denominator).
$$\mathrm{SE}(\text{ratio}) = \mathrm{SE}(N/D) \approx \text{ratio} \times \sqrt{\frac{\mathrm{Var}(N)}{N^2} + \frac{\mathrm{Var}(D)}{D^2}}$$
When T and S have binomial distributions, Var(T1)=T1(1-T1)/n1, Var(S1)=S1(1-S1)/n1 and likewise
for Var(T2) and Var(S2), where n1 [n2] is the number of trials on which S1 and T1 [S2 and T2] are
estimated (in experimental runs at Nevada Test Site, n1≈n2≈12 or 24). Hence, the standard error
(square root of the variance) of (1-S1)/T1 is approximately
$$\frac{1-S_1}{T_1} \times \sqrt{\frac{S_1}{n_1(1-S_1)} + \frac{1-T_1}{n_1 T_1}}$$
so the standard error of the ratio of false negative call probabilities (when p is tiny) is
approximately
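$$\mathrm{SE}\left(\frac{\mathrm{FNCP}_1}{\mathrm{FNCP}_2}\right) \approx \left(\frac{\mathrm{FNCP}_1}{\mathrm{FNCP}_2}\right)\sqrt{\frac{\mathrm{Var}(\mathrm{FNCP}_1)}{\mathrm{FNCP}_1^2} + \frac{\mathrm{Var}(\mathrm{FNCP}_2)}{\mathrm{FNCP}_2^2}}.$$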
So,
$$\mathrm{SE}(\mathrm{FNCP}_1/\mathrm{FNCP}_2) \approx \frac{T_2(1-S_1)}{T_1(1-S_2)} \sqrt{\left[\left(\frac{S_1}{1-S_1} + \frac{1-T_1}{T_1}\right)\Big/ n_1\right] + \left[\left(\frac{S_2}{1-S_2} + \frac{1-T_2}{T_2}\right)\Big/ n_2\right]}.$$
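As an illustration of the propagation-of-error approximation, the sketch below evaluates the ratio and its approximate standard error for the hypothetical truck data (S1 = 10/12, T1 = 8/12, S2 = 11/12, T2 = 10/12, n1 = n2 = 12). The function names are illustrative, and the calculation assumes p is tiny and that the estimates of S and T are independent binomial proportions.

```python
from math import sqrt

def relative_variance(S, T, n):
    """Approximate Var(FNCP)/FNCP^2 when FNCP is taken as (1 - S)/T (tiny p)."""
    return (S / (1 - S) + (1 - T) / T) / n

def fncp_ratio_and_se(S1, T1, n1, S2, T2, n2):
    """Ratio [(1-S1)/T1]/[(1-S2)/T2] and its approximate standard error."""
    ratio = ((1 - S1) / T1) / ((1 - S2) / T2)
    se = ratio * sqrt(relative_variance(S1, T1, n1) + relative_variance(S2, T2, n2))
    return ratio, se

ratio, se = fncp_ratio_and_se(10 / 12, 8 / 12, 12, 11 / 12, 10 / 12, 12)
print(f"ratio = {ratio:.3f}, approximate SE = {se:.3f}")
```

With only 12 runs per condition the approximate standard error is of the same order as the ratio itself, which illustrates why small designed experiments cannot distinguish the two systems reliably.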
References
1. Benjamini, Y.; Hochberg, Y. (1995), Controlling the false discovery rate: A practical and
powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B 57:
289–300.
2. Genovese, Christopher R.; Wasserman, Larry (2004), Controlling the false discovery rate:
Understanding and extending the Benjamini-Hochberg Method,
http://www.stat.cmu.edu/genovese/talks/pitt-11-01.pdf.
3. Genovese, Christopher R.; Wasserman, Larry (2004), A stochastic process approach to
false discovery control, Annals of Statistics 32(3), 1035–1061.
4. Ku, H.H. (1966), Notes on the Use of Propagation of Errors Formulae, Journal of Research
of the National Bureau of Standards 70C(4), p.269.
5. Navidi, W. (2006), Statistics for Engineers and Scientists, McGraw-Hill.
6. Pawitan, Yudi; Michels, Stefan; Koscielny, Serge; Gusnanto, Arief; Ploner, Alexander
(2005), False discovery rate, sensitivity, and sample size in microarray studies,
Bioinformatics 21(13), 3017–3024.
7. Sarkar, Sanat K. (2006), False discovery and false non-discovery rates in single-step
multiple testing procedures, Annals of Statistics 34(1), 394–415.
8. Vardeman, S.B. (1994), Statistics for Engineering Problem Solving, PWS Publishing,
Boston, Massachusetts, p.257.