Appendix C
The Value of Factorial Experiments

Factorial experiments are extremely useful designs when outcomes are needed for a variety of test conditions. For example, consider the following factors that could affect test performance (e.g., probability of an alarm, or probability of no alarm):

• Masking (absent, present)
• Shielding (absent, present)
• Mask location (front, middle)
• Mask height (front, middle)
• Shield location (front, middle)
• Shield height (front, middle)
• SNM (none, some)
• NORM (none, some)

More than 8 factors could be envisioned (e.g., cargo density, ambient temperature, ambient humidity, background radiation level), and more than 2 levels for each factor could be considered. For example, the masking and shielding factors could have levels labeled "absent," "front," and "middle," and the SNM and NORM factors could have four levels labeled "none," "small," "medium," and "large," resulting in a 3x3x4x4 design (a total of 144 test conditions). Simply for ease of illustration, this appendix uses the 8 two-level factors listed above to show the value of factorial designs (and a way to reduce the number of test conditions). The same concepts apply to more complex designs.

Even with only these 8 factors at 2 levels each, running just the 2 x 8 = 16 single-factor tests would not be informative. For example, what happens if a cargo contains some SNM and some NORM with much shielding and some masking placed in the middle of the truck? None of the 16 test runs would answer this question. One might also want to know whether the probability of detecting SNM is affected by the combined presence of masking and shielding of different magnitudes, a question that likewise would not be answered by any of the 16 runs.
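The counting argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys are shorthand labels invented here, and the "16 single-factor tests" are modeled as holding every factor at its first listed level while one factor at a time is set to each of its two levels (the text does not spell out the baseline).

```python
from itertools import product
from math import prod

# Two-level factors from the list above; the keys are shorthand labels (assumed).
FACTORS = {
    "masking": ("absent", "present"),
    "shielding": ("absent", "present"),
    "mask_location": ("front", "middle"),
    "mask_height": ("front", "middle"),
    "shield_location": ("front", "middle"),
    "shield_height": ("front", "middle"),
    "snm": ("none", "some"),
    "norm": ("none", "some"),
}

# Assumed baseline: every factor at its first listed level.
BASELINE = {name: levels[0] for name, levels in FACTORS.items()}


def full_factorial(factors):
    """Every combination of factor levels, one dict per test condition."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def single_factor_tests(factors, baseline):
    """One test per (factor, level) pair, all other factors at baseline."""
    return [
        {**baseline, name: level}
        for name, levels in factors.items()
        for level in levels
    ]


full = full_factorial(FACTORS)
single = single_factor_tests(FACTORS, BASELINE)

# Size of the alternative 3x3x4x4 design mentioned in the text.
n_multi_level = prod((3, 3, 4, 4))

# A combined scenario: SNM and NORM present, with both masking and shielding.
combined = {**BASELINE, "masking": "present", "shielding": "present",
            "snm": "some", "norm": "some"}

print("full factorial conditions:", len(full))
print("single-factor tests:", len(single))
print("3x3x4x4 conditions:", n_multi_level)
print("combined scenario covered by single-factor tests:", combined in single)
print("combined scenario covered by full factorial:", combined in full)
```

Under these assumptions the combined scenario appears in the full factorial but in none of the single-factor tests, which is the gap the paragraph above describes.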
The benefits of running test combinations can be seen already with the following (simpler) test design with these hypothetical results:

                         Shielding
                     present    absent
Masking   present     0.20       0.95
          absent      0.80       0.99

If one tested only "masking present" and "masking absent" in the absence of shielding, one might conclude that masking has some effect on SNM detection (0.95 vs. 0.99), but not as great as the effect of shielding in the absence of masking (0.80 vs. 0.99). Three runs were needed to reach this conclusion. But with only one more run (masking and shielding both present), one sees that their combined effect is devastating to the probability of detection (0.20), far lower than with either factor singly. The effect of different combinations of factors can be especially illuminating; hence the value of experimental designs with combinations of factors, or "factorial designs."

EVALUATING TESTING, COSTS, & BENEFITS OF ASPs: INTERIM REPORT

Unfortunately, testing all 2x2x2x2x2x2x2x2 = $2^8$ = 256 combinations would be infeasible, especially since the outcome of each test is a "probability of detection"; i.e., (number of runs that sounded alarm)/(total number of runs). To minimize the uncertainty in this estimated probability, several runs must be conducted at each test scenario. With only n = 6 or n = 12 runs per scenario, one would have to conduct 256 x 6 = 1536 or 256 x 12 = 3072 test runs, and, even then, the uncertainty in the estimated probability could be as high as 30% to 40% (95% confidence). For example, a perfect result (n correct actions in n runs, such as 6/6) would yield an approximate 95% confidence interval for the true probability of detection of $[(1 - 0.95)^{1/n}, 1]$, which is (0.61, 1.00) if n = 6 and (0.78, 1.00) if n = 12. Clearly some reduction in the number of test scenarios is needed.

Fractional factorial experiments are factorial experiments with only a fraction of the total number of runs. Consider, for ease of illustration, only 4 factors, denoted A, B, C, D, each at 2 levels ("present", "absent"). Sixteen test scenarios would cover all combinations, as follows:

              Factor levels     Product (Mod 2)
Scenario     A    B    C    D        ABCD
    1        1    1    1    1          1
    2        1    1    1    0          0
    3        1    1    0    1          0
    4        1    1    0    0          1
    5        1    0    1    1          0
    6        1    0    1    0          1
    7        1    0    0    1          1
    8        1    0    0    0          0
    9        0    1    1    1          0
   10        0    1    1    0          1
   11        0    1    0    1          1
   12        0    1    0    0          0
   13        0    0    1    1          1
   14        0    0    1    0          0
   15        0    0    0    1          0
   16        0    0    0    0          1

"1" = "present", "0" = "absent"; "Product (Mod 2)" = 1 with an even number of 1's, 0 with an odd number of 1's.

Consider the rows whose last column value is 1:

Run #     A    B    C    D
   1      1    1    1    1
   4      1    1    0    0
   6      1    0    1    0
   7      1    0    0    1
  10      0    1    1    0
  11      0    1    0    1
  13      0    0    1    1
  16      0    0    0    0
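The selection rule used for the 8-run table can be reproduced programmatically. The following is a minimal sketch, not the procedure used to build any actual test plan: it enumerates the 16 scenarios in the same order as the table, keeps those whose count of "present" factors is even, and tallies how evenly the level combinations of every subset of factors are represented. The function names are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations, product

FACTORS = ("A", "B", "C", "D")

# All 16 scenarios, ordered as in the table: (1,1,1,1) first, (0,0,0,0) last.
scenarios = list(product((1, 0), repeat=len(FACTORS)))


def parity_indicator(levels):
    """Return 1 when an even number of factors is present (level 1), else 0."""
    return 1 if sum(levels) % 2 == 0 else 0


# Half fraction: scenario number -> levels, keeping indicator == 1.
half_fraction = {
    number: levels
    for number, levels in enumerate(scenarios, start=1)
    if parity_indicator(levels) == 1
}


def is_balanced(runs, k):
    """True if, for every set of k factors, each of the 2**k level
    combinations occurs equally often among the runs."""
    expected = len(runs) // 2 ** k
    for idx in combinations(range(len(FACTORS)), k):
        counts = Counter(tuple(run[i] for i in idx) for run in runs)
        if len(counts) != 2 ** k or set(counts.values()) != {expected}:
            return False
    return True


runs = list(half_fraction.values())
print("selected scenarios:", list(half_fraction))
for k in (1, 2, 3, 4):
    print(f"all sets of {k} factor(s) balanced:", is_balanced(runs, k))
```

The check succeeds for sets of 1, 2, and 3 factors and fails for all 4 factors together, since only 8 of the 16 four-factor combinations are run.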
Notice that exactly 4 runs have A absent and 4 runs have A present; the same is true of B, C, and D. Moreover, when A is present (first 4 runs), exactly 2 of the 4 runs have B present and 2 have B absent; the same is true for C and D, and for any two of the four factors (A and C, A and D, etc.). In fact, all 8 level combinations of any 3 factors (A, B, C; A, B, D; A, C, D; B, C, D) are included. So this design allows us to evaluate:

• The effect of A (present vs. absent)
• The effect of B
• The effect of C
• The effect of D
• The effect of A and B together
• The effect of A and C together
• The effect of A and D together
• The effect of B and C together
• The effect of B and D together
• The effect of C and D together
• The effect of A, B, and C together
• The effect of A, B, and D together
• The effect of A, C, and D together
• The effect of B, C, and D together

The only effect that we cannot assess is the 4-way interaction, ABCD. One caveat applies: with only 8 runs, these effects cannot all be separated from one another, because each is confounded (aliased) with its complement among the four factors; for example, the A-and-B effect cannot be distinguished from the C-and-D effect. But we have reduced the number of runs from 16 to 8, a big savings.

The same principle applies with 8 factors. If resources allow us to run only 64 scenarios, then we sacrifice the ability to estimate the interactions that involve 5 or more factors at once (e.g., ABCDEFGH and all 7-factor interactions, ABCDEFG, ..., BCDEFGH), but we can estimate all main effects and 2-way, 3-way, and 4-way interactions. (Usually interactions involving 4 or more factors are hard to interpret anyway.) If we can run only 32 scenarios, we sacrifice not only the ability to estimate these high-order interactions but also the ability to resolve some two-factor interactions; we can still assess the main effects (A alone, ..., H alone) and most two-factor interactions (AB, ..., GH), all with just 32 runs, a huge savings.

The designs that NIST provided to DNDO for their test runs followed this principle. The only limiting factors are n, the number of test runs, and the inability to conduct the "SNM present" tests as blind tests.
The former can be improved by increasing n; the latter can be addressed by hiring "actors" to play the role of security agents, with only DNDO personnel aware of the true SNM test scenarios. The effect of bias when tests are run unblinded has been documented extensively in the medical literature; unblinded tests must be viewed with great caution and even skepticism.
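How much a larger n would help with the first limitation can be gauged from the bound quoted earlier for a perfect score. The sketch below assumes that all n runs at a scenario produce the correct action, in which case the lower 95% limit is (1 - 0.95)^(1/n). The function names and the values of n beyond 6 and 12 are illustrative choices, not figures from the report.

```python
def lower_bound_all_successes(n, confidence=0.95):
    """Lower confidence limit for the detection probability when all n
    runs succeed: the value p that solves p**n = 1 - confidence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (1.0 - confidence) ** (1.0 / n)


def total_runs(n_scenarios, runs_per_scenario):
    """Total test runs needed to repeat every scenario the same number of times."""
    return n_scenarios * runs_per_scenario


# n = 6 and n = 12 come from the text; 30 and 60 are illustrative only.
for n in (6, 12, 30, 60):
    bound = lower_bound_all_successes(n)
    print(
        f"n = {n:2d}: lower limit = {bound:.2f}, "
        f"runs for 256 scenarios = {total_runs(256, n)}, "
        f"runs for 32 scenarios = {total_runs(32, n)}"
    )
```

The interval narrows slowly as n grows while the total run count grows in proportion to n, which is why cutting the number of scenarios with a fractional design and increasing n per scenario work together.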