Reference Guide on Statistics


David H. Kaye, M.A., J.D., is Distinguished Professor of Law and Weiss Family Scholar, The Pennsylvania State University, University Park, and Regents’ Professor Emeritus, Arizona State University Sandra Day O’Connor College of Law and School of Life Sciences, Tempe.

David A. Freedman, Ph.D., was Professor of Statistics, University of California, Berkeley.

[Editor’s Note: Sadly, Professor Freedman passed away during the production of this manual.]


   I. Introduction

A. Admissibility and Weight of Statistical Studies

B. Varieties and Limits of Statistical Expertise

C. Procedures That Enhance Statistical Testimony

1. Maintaining professional autonomy

2. Disclosing other analyses

3. Disclosing data and analytical methods before trial

  II. How Have the Data Been Collected?

A. Is the Study Designed to Investigate Causation?

1. Types of studies

2. Randomized controlled experiments

3. Observational studies

4. Can the results be generalized?

B. Descriptive Surveys and Censuses

1. What method is used to select the units?

2. Of the units selected, which are measured?

C. Individual Measurements

1. Is the measurement process reliable?

2. Is the measurement process valid?

3. Are the measurements recorded correctly?

D. What Is Random?

 III. How Have the Data Been Presented?

A. Are Rates or Percentages Properly Interpreted?

1. Have appropriate benchmarks been provided?

2. Have the data collection procedures changed?

3. Are the categories appropriate?

4. How big is the base of a percentage?

5. What comparisons are made?

B. Is an Appropriate Measure of Association Used?

Copyright © National Academy of Sciences. All rights reserved.

C. Does a Graph Portray Data Fairly?

1. How are trends displayed?

2. How are distributions displayed?

D. Is an Appropriate Measure Used for the Center of a Distribution?

E. Is an Appropriate Measure of Variability Used?

  IV. What Inferences Can Be Drawn from the Data?

A. Estimation

1. What estimator should be used?

2. What is the standard error? The confidence interval?

3. How big should the sample be?

4. What are the technical difficulties?

B. Significance Levels and Hypothesis Tests

1. What is the p-value?

2. Is a difference statistically significant?

3. Tests or interval estimates?

4. Is the sample statistically significant?

C. Evaluating Hypothesis Tests

1. What is the power of the test?

2. What about small samples?

3. One tail or two?

4. How many tests have been done?

5. What are the rival hypotheses?

D. Posterior Probabilities

   V. Correlation and Regression

A. Scatter Diagrams

B. Correlation Coefficients

1. Is the association linear?

2. Do outliers influence the correlation coefficient?

3. Does a confounding variable influence the coefficient?

C. Regression Lines

1. What are the slope and intercept?

2. What is the unit of analysis?

D. Statistical Models

Appendix

A. Frequentists and Bayesians

B. The Spock Jury: Technical Details

C. The Nixon Papers: Technical Details

D. A Social Science Example of Regression: Gender Discrimination in Salaries

1. The regression model

2. Standard errors, t-statistics, and statistical significance

Glossary of Terms

References on Statistics

I. Introduction

Statistical assessments are prominent in many kinds of legal cases, including antitrust, employment discrimination, toxic torts, and voting rights cases.1 This reference guide describes the elements of statistical reasoning. We hope the explanations will help judges and lawyers to understand statistical terminology, to see the strengths and weaknesses of statistical arguments, and to apply relevant legal doctrine. The guide is organized as follows:

• Section I provides an overview of the field, discusses the admissibility of statistical studies, and offers some suggestions about procedures that encourage the best use of statistical evidence.

• Section II addresses data collection and explains why the design of a study is the most important determinant of its quality. This section compares experiments with observational studies and surveys with censuses, indicating when the various kinds of study are likely to provide useful results.

• Section III discusses the art of summarizing data. This section considers the mean, median, and standard deviation. These are basic descriptive statistics, and most statistical analyses use them as building blocks. This section also discusses patterns in data that are brought out by graphs, percentages, and tables.

• Section IV describes the logic of statistical inference, emphasizing foundations and disclosing limitations. This section covers estimation, standard errors and confidence intervals, p-values, and hypothesis tests.

• Section V shows how associations can be described by scatter diagrams, correlation coefficients, and regression lines. Regression is often used to infer causation from association. This section explains the technique, indicating the circumstances under which it and other statistical models are likely to succeed—or fail.

• An appendix provides some technical details.

• The glossary defines statistical terms that may be encountered in litigation.

1. See generally Statistical Science in the Courtroom (Joseph L. Gastwirth ed., 2000); Statistics and the Law (Morris H. DeGroot et al. eds., 1986); National Research Council, The Evolving Role of Statistical Assessments as Evidence in the Courts (Stephen E. Fienberg ed., 1989) [hereinafter The Evolving Role of Statistical Assessments as Evidence in the Courts]; Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers (2d ed. 2001); 1 & 2 Joseph L. Gastwirth, Statistical Reasoning in Law and Public Policy (1988); Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and Litigation (1997).
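As a taste of the descriptive statistics that Section III covers, the mean, median, and standard deviation of a small batch of numbers can be computed directly. The figures below are invented for illustration and do not come from the guide; they are chosen so that one outlier separates the mean from the median:

```python
import statistics

# Hypothetical data: annual incomes, in thousands of dollars,
# for a small sample. The last value is a deliberate outlier.
incomes = [32, 35, 38, 41, 44, 47, 120]

mean = statistics.mean(incomes)      # pulled upward by the outlier
median = statistics.median(incomes)  # resistant to the outlier
sd = statistics.stdev(incomes)       # spread around the mean

print(f"mean={mean:.1f} median={median} sd={sd:.1f}")
# The mean (51.0) sits well above the median (41), a hint that
# the "center" of a distribution can be measured in more than one way.
```

The gap between the mean and the median previews the question posed later in the guide: is an appropriate measure used for the center of a distribution?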

A. Admissibility and Weight of Statistical Studies

Statistical studies suitably designed to address a material issue generally will be admissible under the Federal Rules of Evidence. The hearsay rule rarely is a serious barrier to the presentation of statistical studies, because such studies may be offered to explain the basis for an expert’s opinion or may be admissible under the learned treatise exception to the hearsay rule.2 Because most statistical methods relied on in court are described in textbooks or journal articles and are capable of producing useful results when properly applied, these methods generally satisfy important aspects of the “scientific knowledge” requirement in Daubert v. Merrell Dow Pharmaceuticals, Inc.3 Of course, a particular study may use a method that is entirely appropriate but that is so poorly executed that it should be inadmissible under Federal Rules of Evidence 403 and 702.4 Or, the method may be inappropriate for the problem at hand and thus lack the “fit” spoken of in Daubert.5 Or the study might rest on data of the type not reasonably relied on by statisticians or substantive experts and hence run afoul of Federal Rule of Evidence 703. Often, however, the battle over statistical evidence concerns weight or sufficiency rather than admissibility.

B. Varieties and Limits of Statistical Expertise

For convenience, the field of statistics may be divided into three subfields: probability theory, theoretical statistics, and applied statistics. Probability theory is the mathematical study of outcomes that are governed, at least in part, by chance. Theoretical statistics is about the properties of statistical procedures, including error rates; probability theory plays a key role in this endeavor. Applied statistics draws on both of these fields to develop techniques for collecting or analyzing particular types of data.

2. See generally 2 McCormick on Evidence §§ 321, 324.3 (Kenneth S. Broun ed., 6th ed. 2006). Studies published by government agencies also may be admissible as public records. Id. § 296.

3. 509 U.S. 579, 589–90 (1993).

4. See Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999) (suggesting that the trial court should “make certain that an expert, whether basing testimony upon professional studies or personal experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.”); Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 558, 562–63 (S.D.N.Y. 2007) (“While errors in a survey’s methodology usually go to the weight accorded to the conclusions rather than its admissibility, . . . ‘there will be occasions when the proffered survey is so flawed as to be completely unhelpful to the trier of fact.’”) (quoting AHP Subsidiary Holding Co. v. Stuart Hale Co., 1 F.3d 611, 618 (7th Cir. 1993)).

5. Daubert, 509 U.S. at 591; Anderson v. Westinghouse Savannah River Co., 406 F.3d 248 (4th Cir. 2005) (motion to exclude statistical analysis that compared black and white employees without adequately taking into account differences in their job titles or positions was properly granted under Daubert); Malletier, 525 F. Supp. 2d at 569 (excluding a consumer survey for “a lack of fit between the survey’s questions and the law of dilution” and errors in the execution of the survey).

Statistical expertise is not confined to those with degrees in statistics. Because statistical reasoning underlies many kinds of empirical research, scholars in a variety of fields—including biology, economics, epidemiology, political science, and psychology—are exposed to statistical ideas, with an emphasis on the methods most important to the discipline.

Experts who specialize in using statistical methods, and whose professional careers demonstrate this orientation, are most likely to use appropriate procedures and correctly interpret the results. By contrast, forensic scientists often lack basic information about the studies underlying their testimony. State v. Garrison6 illustrates the problem. In this murder prosecution involving bite mark evidence, a dentist was allowed to testify that “the probability factor of two sets of teeth being identical in a case similar to this is, approximately, eight in one million,” even though “he was unaware of the formula utilized to arrive at that figure other than that it was ‘computerized.’”7

At the same time, the choice of which data to examine, or how best to model a particular process, could require subject matter expertise that a statistician lacks. As a result, cases involving statistical evidence frequently are (or should be) “two expert” cases of interlocking testimony. A labor economist, for example, may supply a definition of the relevant labor market from which an employer draws its employees; the statistical expert may then compare the race of new hires to the racial composition of the labor market. Naturally, the value of the statistical analysis depends on the substantive knowledge that informs it.8

C. Procedures That Enhance Statistical Testimony

1. Maintaining professional autonomy

Ideally, experts who conduct research in the context of litigation should proceed with the same objectivity that would be required in other contexts. Thus, experts who testify (or who supply results used in testimony) should conduct the analysis required to address in a professionally responsible fashion the issues posed by the litigation.9 Questions about the freedom of inquiry accorded to testifying experts, as well as the scope and depth of their investigations, may reveal some of the limitations to the testimony.

2. Disclosing other analyses

Statisticians analyze data using a variety of methods. There is much to be said for looking at the data in several ways. To permit a fair evaluation of the analysis that is eventually settled on, however, the testifying expert can be asked to explain how that approach was developed. According to some commentators, counsel who know of analyses that do not support the client’s position should reveal them, rather than presenting only favorable results.10

3. Disclosing data and analytical methods before trial

The collection of data often is expensive and subject to errors and omissions. Moreover, careful exploration of the data can be time-consuming. To minimize debates at trial over the accuracy of data and the choice of analytical techniques, pretrial discovery procedures should be used, particularly with respect to the quality of the data and the method of analysis.11

II. How Have the Data Been Collected?

The interpretation of data often depends on understanding “study design”—the plan for a statistical study and its implementation.12 Different designs are suited to answering different questions. Also, flaws in the data can undermine any statistical analysis, and data quality is often determined by study design.

In many cases, statistical studies are used to show causation. Do food additives cause cancer? Does capital punishment deter crime? Would additional disclosures in a securities prospectus cause investors to behave differently? The design of studies to investigate causation is the first topic of this section.13

Sample data can be used to describe a population. The population is the whole class of units that are of interest; the sample is the set of units chosen for detailed study. Inferences from the part to the whole are justified when the sample is representative. Sampling is the second topic of this section.

Finally, the accuracy of the data will be considered. Because making and recording measurements is an error-prone activity, error rates should be assessed and the likely impact of errors considered. Data quality is the third topic of this section.

A. Is the Study Designed to Investigate Causation?

1. Types of studies

When causation is the issue, anecdotal evidence can be brought to bear. So can observational studies or controlled experiments. Anecdotal reports may be of value, but they are ordinarily more helpful in generating lines of inquiry than in proving causation.14 Observational studies can establish that one factor is associated with another, but work is needed to bridge the gap between association and causation. Randomized controlled experiments are ideally suited for demonstrating causation.

Anecdotal evidence usually amounts to reports that events of one kind are followed by events of another kind. Typically, the reports are not even sufficient to show association, because there is no comparison group. For example, some children who live near power lines develop leukemia. Does exposure to electrical and magnetic fields cause this disease? The anecdotal evidence is not compelling because leukemia also occurs among children without exposure.15 It is necessary to compare disease rates among those who are exposed and those who are not. If exposure causes the disease, the rate should be higher among the exposed and lower among the unexposed. That would be association.

The next issue is crucial: Exposed and unexposed people may differ in ways other than the exposure they have experienced. For example, children who live near power lines could come from poorer families and be more at risk from other environmental hazards. Such differences can create the appearance of a cause-and-effect relationship. Other differences can mask a real relationship. Cause-and-effect relationships often are quite subtle, and carefully designed studies are needed to draw valid conclusions.

An epidemiological classic makes the point. At one time, it was thought that lung cancer was caused by fumes from tarring the roads, because many lung cancer patients lived near roads that recently had been tarred. This is anecdotal evidence. But the argument is incomplete. For one thing, most people—whether exposed to asphalt fumes or unexposed—did not develop lung cancer. A comparison of rates was needed. The epidemiologists found that exposed persons and unexposed persons suffered from lung cancer at similar rates: Tar was probably not the causal agent. Exposure to cigarette smoke, however, turned out to be strongly associated with lung cancer. This study, in combination with later ones, made a compelling case that smoking cigarettes is the main cause of lung cancer.16

A good study design compares outcomes for subjects who are exposed to some factor (the treatment group) with outcomes for other subjects who are not exposed (the control group). Now there is another important distinction to be made—that between controlled experiments and observational studies. In a controlled experiment, the investigators decide which subjects will be exposed and which subjects will go into the control group. In observational studies, by contrast, the subjects themselves choose their exposures. Because of self-selection, the treatment and control groups are likely to differ with respect to influential factors other than the one of primary interest. (These other factors are called lurking variables or confounding variables.)17 With the health effects of power lines, family background is a possible confounder; so is exposure to other hazards. Many confounders have been proposed to explain the association between smoking and lung cancer, but careful epidemiological studies have ruled them out, one after the other.

Confounding remains a problem to reckon with, even for the best observational research. For example, women with herpes are more likely to develop cervical cancer than other women. Some investigators concluded that herpes caused cancer: In other words, they thought the association was causal. Later research showed that the primary cause of cervical cancer was human papilloma virus (HPV). Herpes was a marker of sexual activity. Women who had multiple sexual partners were more likely to be exposed not only to herpes but also to HPV. The association between herpes and cervical cancer was due to other variables.18

What are “variables”? In statistics, a variable is a characteristic of units in a study. With a study of people, the unit of analysis is the person. Typical variables include income (dollars per year) and educational level (years of schooling completed): These variables describe people. With a study of school districts, the unit of analysis is the district. Typical variables include average family income of district residents and average test scores of students in the district: These variables describe school districts.

When investigating a cause-and-effect relationship, the variable that represents the effect is called the dependent variable, because it depends on the causes. The variables that represent the causes are called independent variables. With a study of smoking and lung cancer, the independent variable would be smoking (e.g., number of cigarettes per day), and the dependent variable would mark the presence or absence of lung cancer. Dependent variables also are called outcome variables or response variables. Synonyms for independent variables are risk factors, predictors, and explanatory variables.

6. 585 P.2d 563 (Ariz. 1978).

7. Id. at 566, 568. For other examples, see David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence § 12.2 (2d ed. 2011).

8. In Vuyanich v. Republic National Bank, 505 F. Supp. 224, 319 (N.D. Tex. 1980), vacated, 723 F.2d 1195 (5th Cir. 1984), defendant’s statistical expert criticized the plaintiffs’ statistical model for an implicit, but restrictive, assumption about male and female salaries. The district court trying the case accepted the model because the plaintiffs’ expert had a “very strong guess” about the assumption, and her expertise included labor economics as well as statistics. Id. It is doubtful, however, that economic knowledge sheds much light on the assumption, and it would have been simple to perform a less restrictive analysis.

9. See The Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at 164 (recommending that the expert be free to consult with colleagues who have not been retained by any party to the litigation and that the expert receive a letter of engagement providing for these and other safeguards).

10. Id. at 167; cf. William W. Schwarzer, In Defense of “Automatic Disclosure in Discovery,” 27 Ga. L. Rev. 655, 658–59 (1993) (“[T]he lawyer owes a duty to the court to make disclosure of core information.”). The National Research Council also recommends that “if a party gives statistical data to different experts for competing analyses, that fact be disclosed to the testifying expert, if any.” The Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at 167.

11. See The Special Comm. on Empirical Data in Legal Decision Making, Recommendations on Pretrial Proceedings in Cases with Voluminous Data, reprinted in The Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, app. F; see also David H. Kaye, Improving Legal Statistics, 24 Law & Soc’y Rev. 1255 (1990).

12. For introductory treatments of data collection, see, for example, David Freedman et al., Statistics (4th ed. 2007); Darrell Huff, How to Lie with Statistics (1993); David S. Moore & William I. Notz, Statistics: Concepts and Controversies (6th ed. 2005); Hans Zeisel, Say It with Figures (6th ed. 1985); Zeisel & Kaye, supra note 1.

13. See also Michael D. Green et al., Reference Guide on Epidemiology, Section V, in this manual; Joseph Rodricks, Reference Guide on Exposure Science, Section E, in this manual.

14. In medicine, evidence from clinical practice can be the starting point for discovery of cause-and-effect relationships. For examples, see David A. Freedman, On Types of Scientific Enquiry, in The Oxford Handbook of Political Methodology 300 (Janet M. Box-Steffensmeier et al. eds., 2008). Anecdotal evidence is rarely definitive, and some courts have suggested that attempts to infer causation from anecdotal reports are inadmissible as unsound methodology under Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). See, e.g., McClain v. Metabolife Int’l, Inc., 401 F.3d 1233, 1244 (11th Cir. 2005) (“simply because a person takes drugs and then suffers an injury does not show causation. Drawing such a conclusion from temporal relationships leads to the blunder of the post hoc ergo propter hoc fallacy.”); In re Baycol Prods. Litig., 532 F. Supp. 2d 1029, 1039–40 (D. Minn. 2007) (excluding a meta-analysis based on reports to the Food and Drug Administration of adverse events); Leblanc v. Chevron USA Inc., 513 F. Supp. 2d 641, 650 (E.D. La. 2007) (excluding plaintiffs’ experts’ opinions that benzene causes myelofibrosis because the causal hypothesis “that has been generated by case reports . . . has not been confirmed by the vast majority of epidemiologic studies of workers being exposed to benzene and more generally, petroleum products.”), vacated, 275 Fed. App’x. 319 (5th Cir. 2008) (remanding for consideration of newer government report on health effects of benzene); cf. Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1321 (2011) (concluding that adverse event reports combined with other information could be of concern to a reasonable investor and therefore subject to a requirement of disclosure under SEC Rule 10b-5, but stating that “the mere existence of reports of adverse events . . . says nothing in and of itself about whether the drug is causing the adverse events”). Other courts are more open to “differential diagnoses” based primarily on timing. E.g., Best v. Lowe’s Home Ctrs., Inc., 563 F.3d 171 (6th Cir. 2009) (reversing the exclusion of a physician’s opinion that exposure to propenyl chloride caused a man to lose his sense of smell because of the timing in this one case and the physician’s inability to attribute the change to anything else); Kaye et al., supra note 7, §§ 8.7.2 & 12.5.1. See also Matrixx Initiatives, supra, at 1322 (listing “a temporal relationship” in a single patient as one indication of “a reliable causal link”).

15. See National Research Council, Committee on the Possible Effects of Electromagnetic Fields on Biologic Systems (1997); Zeisel & Kaye, supra note 1, at 66–67. There are problems in measuring exposure to electromagnetic fields, and results are inconsistent from one study to another. For such reasons, the epidemiological evidence for an effect on health is inconclusive. National Research Council, supra; Zeisel & Kaye, supra; Edward W. Campion, Power Lines, Cancer, and Fear, 337 New Eng. J. Med. 44 (1997) (editorial); Martha S. Linet et al., Residential Exposure to Magnetic Fields and Acute Lymphoblastic Leukemia in Children, 337 New Eng. J. Med. 1 (1997); Gary Taubes, Magnetic Field-Cancer Link: Will It Rest in Peace?, 277 Science 29 (1997) (quoting various epidemiologists).

16. Richard Doll & A. Bradford Hill, A Study of the Aetiology of Carcinoma of the Lung, 2 Brit. Med. J. 1271 (1952). This was a matched case-control study. Cohort studies soon followed. See Green et al., supra note 13. For a review of the evidence on causation, see 38 International Agency for Research on Cancer (IARC), World Health Org., IARC Monographs on the Evaluation of the Carcinogenic Risk of Chemicals to Humans: Tobacco Smoking (1986).

17. For example, a confounding variable may be correlated with the independent variable and act causally on the dependent variable. If the units being studied differ on the independent variable, they are also likely to differ on the confounder. The confounder—not the independent variable—could therefore be responsible for differences seen on the dependent variable.

18. For additional examples and further discussion, see Freedman et al., supra note 12, at 12–28, 150–52; David A. Freedman, From Association to Causation: Some Remarks on the History of Statistics, 14 Stat. Sci. 243 (1999). Some studies find that herpes is a “cofactor,” which increases risk among women who are also exposed to HPV. Only certain strains of HPV are carcinogenic.
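The logic of confounding described above can be made concrete with a small simulation. This sketch is not from the guide, and every probability in it is invented: a lurking variable z (in the spirit of the herpes/HPV example, an unmeasured common cause) raises the chance of both the "exposure" and the "disease," while the exposure itself has no causal effect. A naive comparison of rates still shows a strong association; comparing rates separately within each level of z makes the spurious association disappear.

```python
import random

random.seed(0)

# Toy simulation of confounding (all probabilities invented).
# z drives both exposure and disease; exposure has NO effect of its own.
n = 200_000
tallies = {}  # (z, exposed) -> [cases, count]
for _ in range(n):
    z = random.random() < 0.3                           # confounder in 30%
    exposed = random.random() < (0.6 if z else 0.1)     # z raises exposure
    diseased = random.random() < (0.25 if z else 0.05)  # driven by z only
    cell = tallies.setdefault((z, exposed), [0, 0])
    cell[0] += diseased
    cell[1] += 1

def rate(cells):
    """Disease rate pooled over the given (z, exposed) cells."""
    cases = sum(tallies[c][0] for c in cells)
    count = sum(tallies[c][1] for c in cells)
    return cases / count

# Crude comparison, ignoring z: the exposed look much worse off.
r_exposed = rate([(True, True), (False, True)])
r_unexposed = rate([(True, False), (False, False)])
print(f"crude rates: exposed {r_exposed:.3f} vs unexposed {r_unexposed:.3f}")
print(f"crude relative risk: {r_exposed / r_unexposed:.2f}")  # well above 1

# Stratified comparison: within each level of z, the rates are alike,
# revealing that the crude association was created by the confounder.
for z in (False, True):
    print(f"stratum z={z}: exposed {rate([(z, True)]):.3f} "
          f"vs unexposed {rate([(z, False)]):.3f}")
```

The stratified comparison is one simple version of "controlling for" a confounder; it only works, of course, when the confounder has been measured.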

OCR for page 211
Reference Manual on Scientific Evidence 2. Randomized controlled experiments In randomized controlled experiments, investigators assign subjects to treatment or control groups at random. The groups are therefore likely to be comparable, except for the treatment. This minimizes the role of confounding. Minor imbal- ances will remain, due to the play of random chance; the likely effect on study results can be assessed by statistical techniques.19 The bottom line is that causal inferences based on well-executed randomized experiments are generally more secure than inferences based on well-executed observational studies. The following example should help bring the discussion together. Today, we know that taking aspirin helps prevent heart attacks. But initially, there was some controversy. People who take aspirin rarely have heart attacks. This is anecdotal evidence for a protective effect, but it proves almost nothing. After all, few people have frequent heart attacks, whether or not they take aspirin regularly. A good study compares heart attack rates for two groups: people who take aspirin (the treatment group) and people who do not (the controls). An observational study would be easy to do, but in such a study the aspirin-takers are likely to be dif- ferent from the controls. Indeed, they are likely to be sicker—that is why they are taking aspirin. The study would be biased against finding a protective effect. Randomized experiments are harder to do, but they provide better evidence. It is the experiments that demonstrate a protective effect.20 In summary, data from a treatment group without a control group generally reveal very little and can be misleading. Comparisons are essential. If subjects are assigned to treatment and control groups at random, a difference in the outcomes between the two groups can usually be accepted, within the limits of statistical error (infra Section IV), as a good measure of the treatment effect. 
However, if the groups are created in any other way, differences that existed before treatment may contribute to differences in the outcomes or mask differences that otherwise would become manifest. Observational studies succeed to the extent that the treat- ment and control groups are comparable—apart from the treatment. 3. Observational studies The bulk of the statistical studies seen in court are observational, not experi- mental. Take the question of whether capital punishment deters murder. To conduct a randomized controlled experiment, people would need to be assigned randomly to a treatment group or a control group. People in the treatment group would know they were subject to the death penalty for murder; the 19. Randomization of subjects to treatment or control groups puts statistical tests of significance on a secure footing. Freedman et al., supra note 12, at 503–22, 545–63; see infra Section IV. 20. In other instances, experiments have banished strongly held beliefs. E.g., Scott M. Lippman et al., Effect of Selenium and Vitamin E on Risk of Prostate Cancer and Other Cancers: The Selenium and Vitamin E Cancer Prevention Trial (SELECT), 301 JAMA 39 (2009). 220

controls would know that they were exempt. Conducting such an experiment is not possible. Many studies of the deterrent effect of the death penalty have been conducted, all observational, and some have attracted judicial attention. Researchers have catalogued differences in the incidence of murder in states with and without the death penalty and have analyzed changes in homicide rates and execution rates over the years. When reporting on such observational studies, investigators may speak of “control groups” (e.g., the states without capital punishment) or claim they are “controlling for” confounding variables by statistical methods.21 However, association is not causation. The causal inferences that can be drawn from analysis of observational data—no matter how complex the statistical technique—usually rest on a foundation that is less secure than that provided by randomized controlled experiments.

That said, observational studies can be very useful. For example, there is strong observational evidence that smoking causes lung cancer (supra Section II.A.1). Generally, observational studies provide good evidence in the following circumstances:

• The association is seen in studies with different designs, on different kinds of subjects, and done by different research groups.22 That reduces the chance that the association is due to a defect in one type of study, a peculiarity in one group of subjects, or the idiosyncrasies of one research group.

• The association holds when effects of confounding variables are taken into account by appropriate methods, for example, comparing smaller groups that are relatively homogeneous with respect to the confounders.23

• There is a plausible explanation for the effect of the independent variable; alternative explanations in terms of confounding should be less plausible than the proposed causal link.24

21.
A procedure often used to control for confounding in observational studies is regression analysis. The underlying logic is described infra Section V.D and in Daniel L. Rubinfeld, Reference Guide on Multiple Regression, Section II, in this manual. But see Richard A. Berk, Regression Analysis: A Constructive Critique (2004); Rethinking Social Inquiry: Diverse Tools, Shared Standards (Henry E. Brady & David Collier eds., 2004); David A. Freedman, Statistical Models: Theory and Practice (2005); David A. Freedman, Oasis or Mirage, Chance, Spring 2008, at 59.

22. For example, case-control studies are designed one way and cohort studies another, with many variations. See, e.g., Leon Gordis, Epidemiology (4th ed. 2008); supra note 16.

23. The idea is to control for the influence of a confounder by stratification—making comparisons separately within groups for which the confounding variable is nearly constant and therefore has little influence over the variables of primary interest. For example, smokers are more likely to get lung cancer than nonsmokers. Age, gender, social class, and region of residence are all confounders, but controlling for such variables does not materially change the relationship between smoking and cancer rates. Furthermore, many different studies—of different types and on different populations—confirm the causal link. That is why most experts believe that smoking causes lung cancer and many other diseases. For a review of the literature, see International Agency for Research on Cancer, supra note 16.

24. A. Bradford Hill, The Environment and Disease: Association or Causation?, 58 Proc. Royal Soc’y Med. 295 (1965); Alfred S. Evans, Causation and Disease: A Chronological Journey 187 (1993). Plausibility, however, is a function of time and circumstances.
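The contrast between observational and randomized designs discussed above can be illustrated with a small simulation. This is a hypothetical sketch: the baseline risks, the assumed true treatment effect of −0.05, and the self-selection rule (sicker people are more likely to seek treatment) are all invented for illustration.

```python
import random
import statistics

random.seed(12345)

TRUE_EFFECT = -0.05   # hypothetical protective effect of the treatment

def outcome(sickness, treated):
    """One subject's chance of a bad outcome rises with sickness and
    falls by TRUE_EFFECT if the subject is treated."""
    risk = 0.10 + 0.20 * sickness + (TRUE_EFFECT if treated else 0.0)
    return 1 if random.random() < risk else 0

def diff_in_means(rows):
    treated = [y for t, y in rows if t]
    controls = [y for t, y in rows if not t]
    return statistics.mean(treated) - statistics.mean(controls)

def observational(n):
    # Self-selection: the sicker a subject, the more likely to be treated.
    # Sickness is a confounder, so the comparison is biased.
    rows = []
    for _ in range(n):
        sickness = random.random()                 # unobserved confounder
        treated = random.random() < sickness       # self-selection
        rows.append((treated, outcome(sickness, treated)))
    return diff_in_means(rows)

def randomized(n):
    # A coin flip decides treatment, independent of sickness,
    # so the two groups are comparable apart from the treatment.
    rows = []
    for _ in range(n):
        sickness = random.random()
        treated = random.random() < 0.5            # randomization
        rows.append((treated, outcome(sickness, treated)))
    return diff_in_means(rows)

obs = observational(100_000)   # biased upward: the treated group is sicker
rct = randomized(100_000)      # close to TRUE_EFFECT
```

In this setup the observational comparison comes out positive (the treatment looks harmful) even though the true effect is protective; the randomized comparison recovers something close to −0.05.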

than 1%, the result is highly significant. The p-value is also called the observed significance level. See significance test; statistical hypothesis.

parameter. A numerical characteristic of a population or a model. See probability model.

percentile. To get the percentiles of a dataset, array the data from the smallest value to the largest. Take the 90th percentile by way of example: 90% of the values fall below the 90th percentile, and 10% are above. (To be very precise: At least 90% of the data are at the 90th percentile or below; at least 10% of the data are at the 90th percentile or above.) The 50th percentile is the median: 50% of the values fall below the median, and 50% are above. On the LSAT, a score of 152 places a test taker at the 50th percentile; a score of 164 is at the 90th percentile; a score of 172 is at the 99th percentile. Compare mean; median; quartile.

placebo. See double-blind experiment.

point estimate. An estimate of the value of a quantity expressed as a single number. See estimator. Compare confidence interval; interval estimate.

Poisson distribution. A limiting case of the binomial distribution, when the number of trials is large and the common probability is small. The parameter of the approximating Poisson distribution is the number of trials times the common probability, which is the expected number of events. When this number is large, the Poisson distribution may be approximated by a normal distribution.

population. Also, universe. All the units of interest to the researcher. Compare sample; sampling frame.

population size. Also, size of population. Number of units in the population.

posterior probability. See Bayes’ rule.

power. The probability that a statistical test will reject the null hypothesis. To compute power, one has to fix the size of the test and specify parameter values outside the range given by the null hypothesis.
A powerful test has a good chance of detecting an effect when there is an effect to be detected. See beta; significance test. Compare alpha; size; p-value.

practical significance. Substantive importance. Statistical significance does not necessarily establish practical significance. With large samples, small differences can be statistically significant. See significance test.

practice effects. Changes in test scores that result from taking the same test twice in succession, or taking two similar tests one after the other.

predicted value. See residual.

predictive validity. A skills test has predictive validity to the extent that test scores are well correlated with later performance, or more generally with outcomes that the test is intended to predict. See validity. Compare reliability.
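The percentile convention in the entry above (at least 90% of the data at or below the 90th percentile, at least 10% at or above) can be sketched as a small function. This is an illustration of the definition, not the manual's own procedure; the function name is invented.

```python
import math

def percentile(data, p):
    """Smallest data value v such that at least p% of the data are <= v."""
    xs = sorted(data)
    k = math.ceil(p / 100 * len(xs))   # at least this many values must be <= v
    return xs[max(k, 1) - 1]

scores = list(range(1, 101))           # the values 1, 2, ..., 100
percentile(scores, 90)                 # 90: at least 90% of the values are <= 90
percentile(scores, 50)                 # 50: the median under this convention
```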

predictor. See independent variable.

prior probability. See Bayes’ rule.

probability. Chance, on a scale from 0 to 1. Impossibility is represented by 0, certainty by 1. Equivalently, chances may be quoted in percent; 100% corresponds to 1, 5% corresponds to .05, and so forth.

probability density. Describes the probability distribution of a random variable. The chance that the random variable falls in an interval equals the area below the density and above the interval. (However, not all random variables have densities.) See probability distribution; random variable.

probability distribution. Gives probabilities for possible values or ranges of values of a random variable. Often, the distribution is described in terms of a density. See probability density.

probability histogram. See histogram.

probability model. Relates probabilities of outcomes to parameters; also, statistical model. The latter connotes unknown parameters.

probability sample. A sample drawn from a sampling frame by some objective chance mechanism; each unit has a known probability of being sampled. Such samples minimize selection bias, but can be expensive to draw.

psychometrics. The study of psychological measurement and testing.

qualitative variable; quantitative variable. Describes qualitative features of subjects in a study (e.g., marital status—never-married, married, widowed, divorced, separated). A quantitative variable describes numerical features of the subjects (e.g., height, weight, income). This is not a hard-and-fast distinction, because qualitative features may be given numerical codes, as with a dummy variable. Quantitative variables may be classified as discrete or continuous. Concepts such as the mean and the standard deviation apply only to quantitative variables. Compare continuous variable; discrete variable; dummy variable. See variable.

quartile. The 25th or 75th percentile. See percentile. Compare median.

R-squared (R2).
Measures how well a regression equation fits the data. R-squared varies between 0 (no fit) and 1 (perfect fit). R-squared does not measure the extent to which underlying assumptions are justified. See regression model. Compare multiple correlation coefficient; standard error of regression.

random error. Sources of error that are random in their effect, like draws made at random from a box. These are reflected in the error term of a statistical model. Some authors refer to random error as chance error or sampling error. See regression model.

random variable. A variable whose possible values occur according to some probability mechanism. For example, if a pair of dice are thrown, the total number of spots is a random variable. The chance of two spots is 1/36, the

chance of three spots is 2/36, and so forth; the most likely number is 7, with chance 6/36. The expected value of a random variable is the weighted average of the possible values; the weights are the probabilities. In our example, the expected value is

(1/36) × 2 + (2/36) × 3 + (3/36) × 4 + (4/36) × 5 + (5/36) × 6 + (6/36) × 7 + (5/36) × 8 + (4/36) × 9 + (3/36) × 10 + (2/36) × 11 + (1/36) × 12 = 7

In many problems, the weighted average is computed with respect to the density; then sums must be replaced by integrals. The expected value need not be a possible value for the random variable. Generally, a random variable will be somewhere around its expected value, but will be off (in either direction) by something like a standard error (SE) or so. If the random variable has a more or less normal distribution, there is about a 68% chance for it to fall in the range expected value – SE to expected value + SE. See normal curve; standard error.

randomization. See controlled experiment; randomized controlled experiment.

randomized controlled experiment. A controlled experiment in which subjects are placed into the treatment and control groups at random—as if by a lottery. See controlled experiment. Compare observational study.

range. The difference between the biggest and the smallest values in a batch of numbers.

rate. In an epidemiological study, the number of events, divided by the size of the population; often cross-classified by age and gender. For example, the death rate from heart disease among American men ages 55–64 in 2004 was about three per thousand. Among men ages 65–74, the rate was about seven per thousand. Among women, the rate was about half that for men. Rates adjust for differences in sizes of populations or subpopulations. Often, rates are computed per unit of time, e.g., per thousand persons per year. Data source: Statistical Abstract of the United States tbl. 115 (2008).

regression coefficient. The coefficient of a variable in a regression equation.
See regression model.

regression diagnostics. Procedures intended to check whether the assumptions of a regression model are appropriate.

regression equation. See regression model.

regression line. The graph of a (simple) regression equation.

regression model. A regression model attempts to combine the values of certain variables (the independent or explanatory variables) in order to get expected values for another variable (the dependent variable). Sometimes, the phrase

“regression model” refers to a probability model for the data; if no qualifications are made, the model will generally be linear, and errors will be assumed independent across observations, with common variance. The coefficients in the linear combination are called regression coefficients; these are parameters. At times, “regression model” refers to an equation (“the regression equation”) estimated from data, typically by least squares. For example, in a regression study of salary differences between men and women in a firm, the analyst may include a dummy variable for gender, as well as statistical controls such as education and experience to adjust for productivity differences between men and women. The dummy variable would be defined as 1 for the men and 0 for the women. Salary would be the dependent variable; education, experience, and the dummy would be the independent variables. See least squares; multiple regression; random error; variance. Compare general linear model.

relative frequency. See frequency.

relative risk. A measure of association used in epidemiology. For example, if 10% of all people exposed to a chemical develop a disease, compared to 5% of people who are not exposed, then the disease occurs twice as frequently among the exposed people: The relative risk is 10%/5% = 2. A relative risk of 1 indicates no association. For more details, see Leon Gordis, Epidemiology (4th ed. 2008). Compare odds ratio.

reliability. The extent to which a measurement process gives the same results on repeated measurement of the same thing. Compare validity.

representative sample. Not a well-defined technical term. A sample judged to fairly represent the population, or a sample drawn by a process likely to give samples that fairly represent the population, for example, a large probability sample.

resampling. See bootstrap.

residual. The difference between an actual and a predicted value.
The predicted value comes typically from a regression equation, and is better called the fitted value, because there is no real prediction going on. See regression model; independent variable.

response variable. See independent variable.

risk. Expected loss. “Expected” means on average, over the various datasets that could be generated by the statistical model under examination. Usually, risk cannot be computed exactly but has to be estimated, because the parameters in the statistical model are unknown and must be estimated. See loss function; random variable.

risk factor. See independent variable.

robust. A statistic or procedure that does not change much when data or assumptions are modified slightly.
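The relative-risk arithmetic in the glossary entry above (a 10% risk among the exposed versus a 5% risk among the unexposed gives a relative risk of 2) is simple enough to compute directly. The function name here is just for illustration.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

rr = relative_risk(10, 100, 5, 100)   # 10% / 5% = 2.0
```

A relative risk of 1 would mean the two risks are equal, i.e., no association.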

sample. A set of units collected for study. Compare population.

sample size. Also, size of sample. The number of units in a sample.

sample weights. See stratified random sample.

sampling distribution. The distribution of the values of a statistic, over all possible samples from a population. For example, suppose a random sample is drawn. Some values of the sample mean are more likely; others are less likely. The sampling distribution specifies the chance that the sample mean will fall in one interval rather than another.

sampling error. A sample is part of a population. When a sample is used to estimate a numerical characteristic of the population, the estimate is likely to differ from the population value because the sample is not a perfect microcosm of the whole. If the estimate is unbiased, the difference between the estimate and the exact value is sampling error. More generally,

estimate = true value + bias + sampling error

Sampling error is also called chance error or random error. See standard error. Compare bias; nonsampling error.

sampling frame. A list of units designed to represent the entire population as completely as possible. The sample is drawn from the frame.

sampling interval. See systematic sample.

scatter diagram. Also, scatterplot; scattergram. A graph showing the relationship between two variables in a study. Each dot represents one subject. One variable is plotted along the horizontal axis, the other variable is plotted along the vertical axis. A scatter diagram is homoscedastic when the spread is more or less the same inside any vertical strip. If the spread changes from one strip to another, the diagram is heteroscedastic.

selection bias. Systematic error due to nonrandom selection of subjects for study.

sensitivity. In clinical medicine, the probability that a test for a disease will give a positive result given that the patient has the disease.
Sensitivity is analogous to the power of a statistical test. Compare specificity.

sensitivity analysis. Analyzing data in different ways to see how results depend on methods or assumptions.

sign test. A statistical test based on counting and the binomial distribution. For example, a Finnish study of twins found 22 monozygotic twin pairs where 1 twin smoked, 1 did not, and at least 1 of the twins had died. That sets up a race to death. In 17 cases, the smoker died first; in 5 cases, the nonsmoker died first. The null hypothesis is that smoking does not affect time to death, so the chances are 50-50 for the smoker to die first. On the null hypothesis, the chance that the smoker will win the race 17 or more times out of 22 is

8/1000. That is the p-value. The p-value can be computed from the binomial distribution. For additional detail, see Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers 339–41 (2d ed. 2001); David A. Freedman et al., Statistics 262–63 (4th ed. 2007).

significance level. See fixed significance level; p-value.

significance test. Also, statistical test; hypothesis test; test of significance. A significance test involves formulating a statistical hypothesis and a test statistic, computing a p-value, and comparing p to some preestablished value (α) to decide if the test statistic is significant. The idea is to see whether the data conform to the predictions of the null hypothesis. Generally, a large test statistic goes with a small p-value; and small p-values would undermine the null hypothesis. For example, suppose that a random sample of male and female employees were given a skills test and the mean scores of the men and women were different—in the sample. To judge whether the difference is due to sampling error, a statistician might consider the implications of competing hypotheses about the difference in the population. The null hypothesis would say that on average, in the population, men and women have the same scores: The difference observed in the data is then just due to sampling error. A one-sided alternative hypothesis would be that on average, in the population, men score higher than women. The one-sided test would reject the null hypothesis if the sample men score substantially higher than the women—so much so that the difference is hard to explain on the basis of sampling error. In contrast, the null hypothesis could be tested against the two-sided alternative that on average, in the population, men score differently than women—higher or lower. The corresponding two-sided test would reject the null hypothesis if the sample men score substantially higher or substantially lower than the women.
The one-sided and two-sided tests would both be based on the same data, and use the same t-statistic. However, if the men in the sample score higher than the women, the one-sided test would give a p-value only half as large as the two-sided test; that is, the one-sided test would appear to give stronger evidence against the null hypothesis. (“One-sided” and “one-tailed” are synonymous; so are “two-sided” and “two-tailed.”) See p-value; statistical hypothesis; t-statistic.

significant. See p-value; practical significance; significance test.

simple random sample. A random sample in which each unit in the sampling frame has the same chance of being sampled. The investigators take a unit at random (as if by lottery), set it aside, take another at random from what is left, and so forth.

simple regression. A regression equation that includes only one independent variable. Compare multiple regression.

size. A synonym for alpha (α).
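The 8/1000 p-value in the sign-test entry above (the chance of the smoker dying first in 17 or more of the 22 twin pairs, under 50-50 odds) can be reproduced directly from the binomial distribution:

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """P(X >= k) when X counts successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_value = binomial_upper_tail(22, 17)   # about 0.008, i.e., 8/1000
```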

skip factor. See systematic sample.

specificity. In clinical medicine, the probability that a test for a disease will give a negative result given that the patient does not have the disease. Specificity is analogous to 1 – α, where α is the significance level of a statistical test. Compare sensitivity.

spurious correlation. When two variables are correlated, one is not necessarily the cause of the other. The vocabulary and shoe size of children in elementary school, for example, are correlated—but learning more words will not make the feet grow. Such noncausal correlations are said to be spurious. (Originally, the term seems to have been applied to the correlation between two rates with the same denominator: Even if the numerators are unrelated, the common denominator will create some association.) Compare confounding variable.

standard deviation (SD). Indicates how far a typical element deviates from the average. For example, in round numbers, the average height of women age 18 and over in the United States is 5 feet 4 inches. However, few women are exactly average; most will deviate from average, at least by a little. The SD is sort of an average deviation from average. For the height distribution, the SD is 3 inches. The height of a typical woman is around 5 feet 4 inches, but is off that average value by something like 3 inches. For distributions that follow the normal curve, about 68% of the elements are in the range from 1 SD below the average to 1 SD above the average. Thus, about 68% of women have heights in the range 5 feet 1 inch to 5 feet 7 inches. Deviations from the average that exceed 3 or 4 SDs are extremely unusual. Many authors use standard deviation to also mean standard error. See standard error.

standard error (SE). Indicates the likely size of the sampling error in an estimate. Many authors use the term standard deviation instead of standard error. Compare expected value; standard deviation.
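The 68% figure in the standard deviation entry is a property of the normal curve: the area within k SDs of the mean equals erf(k/√2). It can be checked with the error function from the standard library (a sketch; the function name is invented for illustration):

```python
from math import erf, sqrt

def area_within_k_sd(k):
    """Area under the normal curve from k SDs below the mean to k SDs above."""
    return erf(k / sqrt(2))

area_within_k_sd(1)   # about 0.68
area_within_k_sd(3)   # about 0.997: deviations beyond 3 SDs are rare
```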
standard error of regression. Indicates how actual values differ (in some average sense) from the fitted values in a regression model. See regression model; residual. Compare R-squared.

standard normal. See normal distribution.

standardization. See standardized variable.

standardized variable. Transformed to have mean zero and variance one. This involves two steps: (1) subtract the mean; (2) divide by the standard deviation.

statistic. A number that summarizes data. A statistic refers to a sample; a parameter or a true value refers to a population or a probability model.

statistical controls. Procedures that try to filter out the effects of confounding variables on non-experimental data, for example, by adjusting through statistical procedures such as multiple regression. Variables in a multiple regression

equation. See multiple regression; confounding variable; observational study. Compare controlled experiment.

statistical dependence. See dependence.

statistical hypothesis. Generally, a statement about parameters in a probability model for the data. The null hypothesis may assert that certain parameters have specified values or fall in specified ranges; the alternative hypothesis would specify other values or ranges. The null hypothesis is tested against the data with a test statistic; the null hypothesis may be rejected if there is a statistically significant difference between the data and the predictions of the null hypothesis. Typically, the investigator seeks to demonstrate the alternative hypothesis; the null hypothesis would explain the findings as a result of mere chance, and the investigator uses a significance test to rule out that possibility. See significance test.

statistical independence. See independence.

statistical model. See probability model.

statistical test. See significance test.

statistical significance. See p-value.

stratified random sample. A type of probability sample. The researcher divides the population into relatively homogeneous groups called “strata,” and draws a random sample separately from each stratum. Dividing the population into strata is called “stratification.” Often the sampling fraction will vary from stratum to stratum. Then sampling weights should be used to extrapolate from the sample to the population. For example, if 1 unit in 10 is sampled from stratum A while 1 unit in 100 is sampled from stratum B, then each unit drawn from A counts as 10, and each unit drawn from B counts as 100. The first kind of unit has weight 10; the second has weight 100. See Freedman et al., Statistics 401 (4th ed. 2007).

stratification. See independent variable; stratified random sample.

study validity. See validity.

subjectivist. See Bayesian.

systematic error. See bias.

systematic sample. Also, list sample.
The elements of the population are numbered consecutively as 1, 2, 3, . . . . The investigators choose a starting point and a “sampling interval” or “skip factor” k. Then, every kth element is selected into the sample. If the starting point is 1 and k = 10, for example, the sample would consist of items 1, 11, 21, . . . . Sometimes the starting point is chosen at random from 1 to k: this is a random-start systematic sample.

t-statistic. A test statistic, used to make the t-test. The t-statistic indicates how far away an estimate is from its expected value, relative to the standard error. The expected value is computed using the null hypothesis that is being tested.

Some authors refer to the t-statistic, others to the z-statistic, especially when the sample is large. With a large sample, a t-statistic larger than 2 or 3 in absolute value makes the null hypothesis rather implausible—the estimate is too many standard errors away from its expected value. See statistical hypothesis; significance test; t-test.

t-test. A statistical test based on the t-statistic. Large t-statistics are beyond the usual range of sampling error. For example, if t is bigger than 2, or smaller than –2, then the estimate is statistically significant at the 5% level; such values of t are hard to explain on the basis of sampling error. The scale for t-statistics is tied to areas under the normal curve. For example, a t-statistic of 1.5 is not very striking, because 13% = 13/100 of the area under the normal curve is outside the range from –1.5 to 1.5. On the other hand, t = 3 is remarkable: Only 3/1000 of the area lies outside the range from –3 to 3. This discussion is predicated on having a reasonably large sample; in that context, many authors refer to the z-test rather than the t-test.

Consider testing the null hypothesis that the average of a population equals a given value; the population is known to be normal. For small samples, the t-statistic follows Student’s t-distribution (when the null hypothesis holds) rather than the normal curve; larger values of t are required to achieve significance. The relevant t-distribution depends on the number of degrees of freedom, which in this context equals the sample size minus one. A t-test is not appropriate for small samples drawn from a population that is not normal. See p-value; significance test; statistical hypothesis.

test statistic. A statistic used to judge whether data conform to the null hypothesis.
The parameters of a probability model determine expected values for the data; differences between expected values and observed values are measured by a test statistic. Such test statistics include the chi-squared statistic (χ2) and the t-statistic. Generally, small values of the test statistic are consistent with the null hypothesis; large values lead to rejection. See p-value; statistical hypothesis; t-statistic.

time series. A series of data collected over time, for example, the Gross National Product of the United States from 1945 to 2005.

treatment group. See controlled experiment.

two-sided hypothesis; two-tailed hypothesis. An alternative hypothesis asserting that the values of a parameter are different from—either greater than or less than—the value asserted in the null hypothesis. A two-sided alternative hypothesis suggests a two-sided (or two-tailed) test. See significance test; statistical hypothesis. Compare one-sided hypothesis.

two-sided test; two-tailed test. See two-sided hypothesis.

Type I error. A statistical test makes a Type I error when (1) the null hypothesis is true and (2) the test rejects the null hypothesis, i.e., there is a false positive. For example, a study of two groups may show some difference between samples from each group, even when there is no difference in the population. When a statistical test deems the difference to be significant in this situation, it makes a Type I error. See significance test; statistical hypothesis. Compare alpha; Type II error.

Type II error. A statistical test makes a Type II error when (1) the null hypothesis is false and (2) the test fails to reject the null hypothesis, i.e., there is a false negative. For example, there may not be a significant difference between samples from two groups when, in fact, the groups are different. See significance test; statistical hypothesis. Compare beta; Type I error.

unbiased estimator. An estimator that is correct on average, over the possible datasets. The estimates have no systematic tendency to be high or low. Compare bias.

uniform distribution. For example, a whole number picked at random from 1 to 100 has the uniform distribution: All values are equally likely. Similarly, a uniform distribution is obtained by picking a real number at random between 0.75 and 3.25: The chance of landing in an interval is proportional to the length of the interval.

validity. Measurement validity is the extent to which an instrument measures what it is supposed to, rather than something else. The validity of a standardized test is often indicated by the correlation coefficient between the test scores and some outcome measure (the criterion variable). See content validity; differential validity; predictive validity. Compare reliability. Study validity is the extent to which results from a study can be relied upon. Study validity has two aspects, internal and external. A study has high internal validity when its conclusions hold under the particular circumstances of the study. A study has high external validity when its results are generalizable.
For example, a well-executed randomized controlled double-blind experiment performed on an unusual study population will have high internal validity because the design is good; but its external validity will be debatable because the study population is unusual. Validity is used also in its ordinary sense: assumptions are valid when they hold true for the situation at hand.

variable. A property of units in a study, which varies from one unit to another, for example, in a study of households, household income; in a study of people, employment status (employed, unemployed, not in labor force).

variance. The square of the standard deviation. Compare standard error; covariance.

weights. See stratified random sample.

within-observer variability. Differences that occur when an observer measures the same thing twice, or measures two things that are virtually the same. Compare between-observer variability.
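The normal-curve tail areas quoted in the t-test entry (about 13% of the area outside the range –1.5 to 1.5, and about 3/1000 outside –3 to 3) can be verified with the complementary error function, since the two-sided tail area outside ±t equals erfc(t/√2):

```python
from math import erfc, sqrt

def two_sided_tail_area(t):
    """Area under the normal curve outside the range -t to t."""
    return erfc(t / sqrt(2))

two_sided_tail_area(1.5)   # about 0.13, i.e., 13%
two_sided_tail_area(3.0)   # about 0.003, i.e., 3/1000
```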

z-statistic. See t-statistic.

z-test. See t-test.

References on Statistics

General Surveys

David Freedman et al., Statistics (4th ed. 2007).
Darrell Huff, How to Lie with Statistics (1993).
Gregory A. Kimble, How to Use (and Misuse) Statistics (1978).
David S. Moore & William I. Notz, Statistics: Concepts and Controversies (2005).
Michael Oakes, Statistical Inference: A Commentary for the Social and Behavioral Sciences (1986).
Statistics: A Guide to the Unknown (Roxy Peck et al. eds., 4th ed. 2005).
Hans Zeisel, Say It with Figures (6th ed. 1985).

Reference Works for Lawyers and Judges

David C. Baldus & James W.L. Cole, Statistical Proof of Discrimination (1980 & Supp. 1987) (continued as Ramona L. Paetzold & Steven L. Willborn, The Statistics of Discrimination: Using Statistical Evidence in Discrimination Cases (1994)) (updated annually).
David W. Barnes & John M. Conley, Statistical Evidence in Litigation: Methodology, Procedure, and Practice (1986 & Supp. 1989).
James Brooks, A Lawyer’s Guide to Probability and Statistics (1990).
Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers (2d ed. 2001).
Modern Scientific Evidence: The Law and Science of Expert Testimony (David L. Faigman et al. eds., Volumes 1 and 2, 2d ed. 2002) (updated annually).
David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence § 12 (2d ed. 2011) (updated annually).
National Research Council, The Evolving Role of Statistical Assessments as Evidence in the Courts (Stephen E. Fienberg ed., 1989).
Statistical Methods in Discrimination Litigation (David H. Kaye & Mikel Aickin eds., 1986).
Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and Litigation (1997).

General Reference

Encyclopedia of Statistical Sciences (Samuel Kotz et al. eds., 2d ed. 2005).