Reference Guide on Multiple Regression

*Daniel L. Rubinfeld, Ph.D., is Robert L. Bridges Professor of Law and Professor of Economics Emeritus, University of California, Berkeley, and Visiting Professor of Law at New York University Law School.*

CONTENTS

I. Introduction and Overview

II. Research Design: Model Specification

A. What Is the Specific Question That Is Under Investigation by the Expert?

B. What Model Should Be Used to Evaluate the Question at Issue?

1. Choosing the dependent variable

2. Choosing the explanatory variable that is relevant to the question at issue

3. Choosing the additional explanatory variables

4. Choosing the functional form of the multiple regression model

5. Choosing multiple regression as a method of analysis

III. Interpreting Multiple Regression Results

A. What Is the Practical, as Opposed to the Statistical, Significance of Regression Results?

1. When should statistical tests be used?

2. What is the appropriate level of statistical significance?

3. Should statistical tests be one-tailed or two-tailed?

B. Are the Regression Results Robust?

1. What evidence exists that the explanatory variable causes changes in the dependent variable?

2. To what extent are the explanatory variables correlated with each other?

3. To what extent are individual errors in the regression model independent?

4. To what extent are the regression results sensitive to individual data points?

5. To what extent are the data subject to measurement error?

Reference Manual on Scientific Evidence
IV. The Expert

A. Who Should Be Qualified as an Expert?

B. Should the Court Appoint a Neutral Expert?

V. Presentation of Statistical Evidence

A. What Disagreements Exist Regarding Data on Which the Analysis Is Based?

B. Which Database Information and Analytical Procedures Will Aid in Resolving Disputes over Statistical Studies?

Appendix: The Basics of Multiple Regression

A. Introduction

B. Linear Regression Model

1. Specifying the regression model

2. Regression line

C. Interpreting Regression Results

D. Determining the Precision of the Regression Results

1. Standard errors of the coefficients and t-statistics

2. Goodness-of-fit

3. Sensitivity of least squares regression results

E. Reading Multiple Regression Computer Output

F. Forecasting

G. A Hypothetical Example

Glossary of Terms

References on Multiple Regression

I. Introduction and Overview
Multiple regression analysis is a statistical tool used to understand the relationship between or among two or more variables.1 Multiple regression involves a variable to be explained (called the dependent variable) and additional explanatory variables that are thought to produce or be associated with changes in the dependent variable.2 For example, a multiple regression analysis might estimate the effect of the number of years of work on salary. Salary would be the dependent variable to be explained; the years of experience would be the explanatory variable.

Multiple regression analysis is sometimes well suited to the analysis of data about competing theories for which there are several possible explanations for the relationships among a number of explanatory variables.3 Multiple regression typically uses a single dependent variable and several explanatory variables to assess the statistical data pertinent to these theories. In a case alleging sex discrimination in salaries, for example, a multiple regression analysis would examine not only sex, but also other explanatory variables of interest, such as education and experience.4 The employer-defendant might use multiple regression to argue that salary is a function of the employee’s education and experience, and the employee-plaintiff might argue that salary is also a function of the individual’s sex. Alternatively, in an antitrust cartel damages case, the plaintiff’s expert might utilize multiple regression to evaluate the extent to which the price of a product increased during the period in which the cartel was effective, after accounting for costs and other variables unrelated to the cartel. The defendant’s expert might use multiple regression to suggest that the plaintiff’s expert had omitted a number of price-determining variables.

1. A variable is anything that can take on two or more values (e.g., the daily temperature in Chicago or the salaries of workers at a factory).
2. Explanatory variables in the context of a statistical study are sometimes called independent variables. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section II.A.1, in this manual. The guide also offers a brief discussion of multiple regression analysis. Id., Section V.
3. Multiple regression is one type of statistical analysis involving several variables. Other types include matching analysis, stratification, analysis of variance, probit analysis, logit analysis, discriminant analysis, and factor analysis.
4. Thus, in Ottaviani v. State University of New York, 875 F.2d 365, 367 (2d Cir. 1989) (citations omitted), cert. denied, 493 U.S. 1021 (1990), the court stated:
In disparate treatment cases involving claims of gender discrimination, plaintiffs typically use multiple regression analysis to isolate the influence of gender on employment decisions relating to a particular job or job benefit, such as salary.
The first step in such a regression analysis is to specify all of the possible “legitimate” (i.e., non-discriminatory) factors that are likely to significantly affect the dependent variable and which could account for disparities in the treatment of male and female employees. By identifying those legitimate criteria that affect the decisionmaking process, individual plaintiffs can make predictions about what job or job benefits similarly situated employees should ideally receive, and then can measure the difference between the predicted treatment and the actual treatment of those employees. If there is a disparity between the predicted and actual outcomes for female employees, plaintiffs in a disparate treatment case can argue that the net “residual” difference represents the unlawful effect of discriminatory animus on the allocation of jobs or job benefits.
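The sex discrimination example in the text can be made concrete with a small numerical sketch. The Python code below is illustrative only: the six employees, their characteristics, and the salary rule are invented for this purpose (assumptions, not data from any case), and the salaries are constructed without random error so that the least squares estimates can be checked against the rule that generated them. The estimates are computed with NumPy's `numpy.linalg.lstsq` routine.

```python
import numpy as np

# Hypothetical, made-up data for six employees (not drawn from any case).
education = np.array([12.0, 16.0, 12.0, 18.0, 16.0, 14.0])  # years of schooling
experience = np.array([2.0, 5.0, 10.0, 3.0, 8.0, 6.0])      # years of work
female = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])           # 1 = female, 0 = male

# Salaries follow an assumed rule so that the fit can be checked:
# 20,000 + 2,000 per year of education + 1,500 per year of experience
# - 3,000 if the employee is female.
salary = 20000 + 2000 * education + 1500 * experience - 3000 * female

# Design matrix: a column of ones (the intercept) plus the explanatory variables.
X = np.column_stack([np.ones_like(education), education, experience, female])

# Ordinary least squares estimates of the regression coefficients.
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, b_education, b_experience, b_female = coef

print(f"education: {b_education:.1f}")
print(f"experience: {b_experience:.1f}")
print(f"female: {b_female:.1f}")
```

In this constructed example, the coefficient on the sex variable recovers the 3,000 difference built into the data, holding education and experience constant. Real salary data contain unexplained variation, so estimated coefficients would come with standard errors rather than matching any underlying rule exactly.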
More generally, multiple regression may be useful (1) in determining whether
a particular effect is present; (2) in measuring the magnitude of a particular effect;
and (3) in forecasting what a particular effect would be, but for an intervening
event. In a patent infringement case, for example, a multiple regression analysis
could be used to determine (1) whether the behavior of the alleged infringer
affected the price of the patented product, (2) the size of the effect, and (3) what
the price of the product would have been had the alleged infringement not
occurred.
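The third use, forecasting a "but for" outcome, might be sketched as follows. This is a minimal illustration under stated assumptions: the cost and price figures are hypothetical, price is assumed to depend only on a single cost measure, and the relationship is fitted on a period before the alleged infringement and then projected into the period at issue.

```python
import numpy as np

# Hypothetical data from a period before the alleged infringement.
cost_before = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
price_before = 5.0 + 2.0 * cost_before   # constructed so that price = 5 + 2 * cost

# Fit price on cost using only the "before" period.
X_before = np.column_stack([np.ones_like(cost_before), cost_before])
coef, *_ = np.linalg.lstsq(X_before, price_before, rcond=None)

# Hypothetical costs and actual prices during the period at issue.
cost_during = np.array([15.0, 16.0])
price_actual = np.array([33.0, 35.0])

# Forecast the price that would have prevailed "but for" the alleged
# infringement, then compare it with the price actually observed.
X_during = np.column_stack([np.ones_like(cost_during), cost_during])
price_but_for = X_during @ coef
price_effect = price_actual - price_but_for

print("but-for prices:", price_but_for)
print("estimated effect on price:", price_effect)
```

A negative difference here would be read as the estimated amount by which the price fell below its forecast level. Whether that difference can be attributed to the alleged infringement depends on whether the fitted model accounts for the other important determinants of price.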
Over the past several decades, the use of multiple regression analysis in court has grown widely. Regression analysis has been used most frequently in cases of sex and race discrimination,5 antitrust violations,6 and cases involving class certification (under Rule 23).7 However, there is a range of other applications, including census undercounts,8 voting rights,9 the study of the deterrent effect of the death penalty,10 rate regulation,11 and intellectual property.12

5. Discrimination cases using multiple regression analysis are legion. See, e.g., Bazemore v. Friday, 478 U.S. 385 (1986), on remand, 848 F.2d 476 (4th Cir. 1988); Csicseri v. Bowsher, 862 F. Supp. 547 (D.D.C. 1994) (age discrimination), aff’d, 67 F.3d 972 (D.C. Cir. 1995); EEOC v. General Tel. Co., 885 F.2d 575 (9th Cir. 1989), cert. denied, 498 U.S. 950 (1990); Bridgeport Guardians, Inc. v. City of Bridgeport, 735 F. Supp. 1126 (D. Conn. 1990), aff’d, 933 F.2d 1140 (2d Cir.), cert. denied, 502 U.S. 924 (1991); Bickerstaff v. Vassar College, 196 F.3d 435, 448–49 (2d Cir. 1999) (sex discrimination); McReynolds v. Sodexho Marriott, 349 F. Supp. 2d 1 (D.D.C. 2004) (race discrimination); Hnot v. Willis Group Holdings Ltd., 228 F.R.D. 476 (S.D.N.Y. 2005) (gender discrimination); Carpenter v. Boeing Co., 456 F.3d 1183 (10th Cir. 2006) (sex discrimination); Coward v. ADT Security Systems, Inc., 140 F.3d 271, 274–75 (D.C. Cir. 1998); Smith v. Virginia Commonwealth Univ., 84 F.3d 672 (4th Cir. 1996) (en banc); Hemmings v. Tidyman’s Inc., 285 F.3d 1174, 1184–86 (9th Cir. 2002); Mehus v. Emporia State University, 222 F.R.D. 455 (D. Kan. 2004) (sex discrimination); Gutierrez v. Johnson & Johnson, 2006 WL 3246605 (D.N.J. Nov. 6, 2006) (race discrimination); Morgan v. United Parcel Service, 380 F.3d 459 (8th Cir. 2004) (racial discrimination). See also Keith N. Hylton & Vincent D. Rougeau, Lending Discrimination: Economic Theory, Econometric Evidence, and the Community Reinvestment Act, 85 Geo. L.J. 237, 238 (1996) (“regression analysis is probably the best empirical tool for uncovering discrimination”).
6. E.g., United States v. Brown Univ., 805 F. Supp. 288 (E.D. Pa. 1992) (price fixing of college scholarships), rev’d, 5 F.3d 658 (3d Cir. 1993); Petruzzi’s IGA Supermarkets, Inc. v. Darling-Delaware Co., 998 F.2d 1224 (3d Cir.), cert. denied, 510 U.S. 994 (1993); Ohio v. Louis Trauth Dairy, Inc., 925 F. Supp. 1247 (S.D. Ohio 1996); In re Chicken Antitrust Litig., 560 F. Supp. 963, 993 (N.D. Ga. 1980); New York v. Kraft Gen. Foods, Inc., 926 F. Supp. 321 (S.D.N.Y. 1995); Freeland v. AT&T, 238 F.R.D. 130 (S.D.N.Y. 2006); In re Pressure Sensitive Labelstock Antitrust Litig., 2007 U.S. Dist. LEXIS 85466 (M.D. Pa. Nov. 19, 2007); In re Linerboard Antitrust Litig., 497 F. Supp. 2d 666 (E.D. Pa. 2007) (price fixing by manufacturers of corrugated boards and boxes); In re Polypropylene Carpet Antitrust Litig., 93 F. Supp. 2d 1348 (N.D. Ga. 2000); In re OSB Antitrust Litig., 2007 WL 2253418 (E.D. Pa. Aug. 3, 2007) (price fixing of Oriented Strand Board, also known as “waferboard”); In re TFT-LCD (Flat Panel) Antitrust Litig., 267 F.R.D. 583 (N.D. Cal. 2010).
For a broad overview of the use of regression methods in antitrust, see ABA Antitrust Section, Econometrics: Legal, Practical and Technical Issues (John Harkrider & Daniel Rubinfeld, eds. 2005). See also Jerry Hausman et al., Competitive Analysis with Differenciated Products, 34 Annales D’Économie et de Statistique 159 (1994); Gregory J. Werden, Simulating the Effects of Differentiated Products Mergers: A Practical Alternative to Structural Merger Policy, 5 Geo. Mason L. Rev. 363 (1997).
7. In antitrust, the circuits are currently split as to the extent to which plaintiffs must prove that common elements predominate over individual elements. E.g., compare In re Hydrogen Peroxide Litig., 552 F.3d 305 (3d Cir. 2008) with In re Cardizem CD Antitrust Litig., 391 F.3d 812 (6th Cir. 2004). For a discussion of use of multiple regression in evaluating class certification, see Bret M. Dickey & Daniel L. Rubinfeld, Antitrust Class Certification: Towards an Economic Framework, 66 N.Y.U. Ann. Surv. Am. L. 459 (2010) and John H. Johnson & Gregory K. Leonard, Economics and the Rigorous Analysis of Class Certification in Antitrust Cases, 3 J. Competition L. & Econ. 341 (2007).
8. See, e.g., City of New York v. U.S. Dep’t of Commerce, 822 F. Supp. 906 (E.D.N.Y. 1993)
(decision of Secretary of Commerce not to adjust the 1990 census was not arbitrary and capricious),
vacated, 34 F.3d 1114 (2d Cir. 1994) (applying heightened scrutiny), rev’d sub nom. Wisconsin v. City of
New York, 517 U.S. 565 (1996); Carey v. Klutznick, 508 F. Supp. 420, 432–33 (S.D.N.Y. 1980) (use
of reasonable and scientifically valid statistical survey or sampling procedures to adjust census figures
for the differential undercount is constitutionally permissible), stay granted, 449 U.S. 1068 (1980), rev’d
on other grounds, 653 F.2d 732 (2d Cir. 1981), cert. denied, 455 U.S. 999 (1982); Young v. Klutznick,
497 F. Supp. 1318, 1331 (E.D. Mich. 1980), rev’d on other grounds, 652 F.2d 617 (6th Cir. 1981), cert.
denied, 455 U.S. 939 (1982).
9. Multiple regression analysis was used in suits charging that at-large areawide voting was
instituted to neutralize black voting strength, in violation of section 2 of the Voting Rights Act, 42
U.S.C. § 1973 (1988). Multiple regression demonstrated that the race of the candidates and that of
the electorate were determinants of voting. See Williams v. Brown, 446 U.S. 236 (1980); Rodriguez
v. Pataki, 308 F. Supp. 2d 346, 414 (S.D.N.Y. 2004); United States v. Vill. of Port Chester, 2008
U.S. Dist. LEXIS 4914 (S.D.N.Y. Jan. 17, 2008); Meza v. Galvin, 322 F. Supp. 2d 52 (D. Mass.
2004) (violation of VRA with regard to Hispanic voters in Boston); Bone Shirt v. Hazeltine, 336
F. Supp. 2d 976 (D.S.D. 2004) (violations of VRA with regard to Native American voters in South
Dakota); Georgia v. Ashcroft, 195 F. Supp. 2d 25 (D.D.C. 2002) (redistricting of Georgia’s state and
federal legislative districts); Benavidez v. City of Irving, 638 F. Supp. 2d 709 (N.D. Tex. 2009) (challenge
of city’s at-large voting scheme). For commentary on statistical issues in voting rights cases, see,
e.g., Statistical and Demographic Issues Underlying Voting Rights Cases, 15 Evaluation Rev. 659 (1991);
Stephen P. Klein et al., Ecological Regression Versus the Secret Ballot, 31 Jurimetrics J. 393 (1991); James
W. Loewen & Bernard Grofman, Recent Developments in Methods Used in Vote Dilution Litigation, 21
Urb. Law. 589 (1989); Arthur Lupia & Kenneth McCue, Why the 1980s Measures of Racially Polarized
Voting Are Inadequate for the 1990s, 12 Law & Pol’y 353 (1990).
10. See, e.g., Gregg v. Georgia, 428 U.S. 153, 184–86 (1976). For critiques of the validity of
the deterrence analysis, see National Research Council, Deterrence and Incapacitation: Estimating
the Effects of Criminal Sanctions on Crime Rates (Alfred Blumstein et al. eds., 1978); Richard O.
Lempert, Desert and Deterrence: An Assessment of the Moral Bases of the Case for Capital Punishment, 79
Mich. L. Rev. 1177 (1981); Hans Zeisel, The Deterrent Effect of the Death Penalty: Facts v. Faith, 1976
Sup. Ct. Rev. 317; and John Donohue & Justin Wolfers, Uses and Abuses of Statistical Evidence in the
Death Penalty Debate, 58 Stan. L. Rev. 787 (2005).
11. See, e.g., Time Warner Entertainment Co. v. FCC, 56 F.3d 151 (D.C. Cir. 1995) (challenge
to FCC’s application of multiple regression analysis to set cable rates), cert. denied, 516 U.S.
1112 (1996); Appalachian Power Co. v. EPA, 135 F.3d 791 (D.C. Cir. 1998) (challenging the EPA’s
application of regression analysis to set nitrous oxide emission limits); Consumers Util. Rate Advocacy
Div. v. Ark. PSC, 99 Ark. App. 228 (Ark. Ct. App. 2007) (challenging an increase in nongas rates).
12. See Polaroid Corp. v. Eastman Kodak Co., No. 76-1634-MA, 1990 WL 324105, at *29, *62–63 (D. Mass. Oct. 12, 1990) (damages awarded because of patent infringement), amended by No. 76-1634-MA, 1991 WL 4087 (D. Mass. Jan. 11, 1991); Estate of Vane v. The Fair, Inc., 849 F.2d 186, 188 (5th Cir. 1988) (lost profits were the result of copyright infringement), cert. denied, 488 U.S. 1008 (1989); Louis Vuitton Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 576, 664 (S.D.N.Y. 2007) (trademark infringement and unfair competition suit). The use of multiple regression analysis to estimate damages has been contemplated in a wide variety of contexts. See, e.g., David Baldus et al., Improving Judicial Oversight of Jury Damages Assessments: A Proposal for the Comparative Additur/Remittitur Review of Awards for Nonpecuniary Harms and Punitive Damages, 80 Iowa L. Rev. 1109 (1995); Talcott J. Franklin, Calculating Damages for Loss of Parental Nurture Through Multiple Regression Analysis, 52 Wash. & Lee L. Rev. 271 (1997); Roger D. Blair & Amanda Kay Esquibel, Yardstick Damages in Lost Profit Cases: An Econometric Approach, 72 Denv. U. L. Rev. 113 (1994); Daniel Rubinfeld, Quantitative Methods in Antitrust, in 1 Issues in Competition Law and Policy 723 (2008).

Multiple regression analysis can be a source of valuable scientific testimony in litigation. However, when inappropriately used, regression analysis can confuse important issues while having little, if any, probative value. In EEOC v. Sears, Roebuck & Co.,13 in which Sears was charged with discrimination against women in hiring practices, the Seventh Circuit acknowledged that “[m]ultiple regression analyses, designed to determine the effect of several independent variables on a dependent variable, which in this case is hiring, are an accepted and common method of proving disparate treatment claims.”14 However, the court affirmed the district court’s findings that the “E.E.O.C.’s regression analyses did not ‘accurately reflect Sears’ complex, nondiscriminatory decision-making processes’” and that the “‘E.E.O.C.’s statistical analyses [were] so flawed that they lack[ed] any persuasive value.’”15 Serious questions also have been raised about the use of multiple regression analysis in census undercount cases and in death penalty cases.16

The Supreme Court’s rulings in Daubert and Kumho Tire have encouraged parties to raise questions about the admissibility of multiple regression analyses.17 Because multiple regression is a well-accepted scientific methodology, courts have frequently admitted testimony based on multiple regression studies, in some cases over the strong objection of one of the parties.18 However, on some occasions courts have excluded expert testimony because of a failure to utilize a multiple regression methodology.19 On other occasions, courts have rejected regression studies that did not have an adequate foundation or research design with respect to the issues at hand.20

13. 839 F.2d 302 (7th Cir. 1988).
14. Id. at 324 n.22.
15. Id. at 348, 351 (quoting EEOC v. Sears, Roebuck & Co., 628 F. Supp. 1264, 1342, 1352 (N.D. Ill. 1986)). The district court commented specifically on the “severe limits of regression analysis in evaluating complex decision-making processes.” 628 F. Supp. at 1350.
16. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Sections II.A.3, B.1, in this manual.
17. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993); Kumho Tire Co. v. Carmichael, 526 U.S. 137, 147 (1999) (expanding Daubert’s application to nonscientific expert testimony).
18. See Newport Ltd. v. Sears, Roebuck & Co., 1995 U.S. Dist. LEXIS 7652 (E.D. La. May 26, 1995). See also Petruzzi’s IGA Supermarkets, supra note 6, 998 F.2d at 1240, 1247 (finding that the district court abused its discretion in excluding multiple regression-based testimony and reversing the grant of summary judgment to two defendants).
19. See, e.g., In re Executive Telecard Ltd. Sec. Litig., 979 F. Supp. 1021 (S.D.N.Y. 1997).
In interpreting the results of a multiple regression analysis, it is important to distinguish between correlation and causality. Two variables are correlated (that is, associated with each other) when the events associated with the variables occur more frequently together than one would expect by chance. For example, if higher salaries are associated with a greater number of years of work experience, and lower salaries are associated with fewer years of experience, there is a positive correlation between salary and number of years of work experience. However, if higher salaries are associated with less experience, and lower salaries are associated with more experience, there is a negative correlation between the two variables.

A correlation between two variables does not imply that one event causes the second. Therefore, in making causal inferences, it is important to avoid spurious correlation.21 Spurious correlation arises when two variables are closely related but bear no causal relationship because they are both caused by a third, unexamined variable. For example, there might be a negative correlation between the age of certain skilled employees of a computer company and their salaries. One should not conclude from this correlation that the employer has necessarily discriminated against the employees on the basis of their age. A third, unexamined variable, such as the level of the employees’ technological skills, could explain differences in productivity and, consequently, differences in salary.22 Or, consider a patent infringement case in which increased sales of an allegedly infringing product are associated with a lower price of the patented product.23 This correlation would be spurious if the two products have their own noncompetitive market niches and the lower price is the result of a decline in the production costs of the patented product.
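The age and salary example can be illustrated with simulated data. The sketch below is hypothetical throughout: the sample size, the skill scores, and the numerical relationships are assumptions chosen so that salary depends on technological skills and not on age, while skills happen to be lower among older employees. It compares a regression of salary on age alone with a regression that also includes the skills variable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Assumed data-generating process (for illustration only):
# older employees tend to have lower technological skill scores,
# and salary depends on skills, not on age.
age = rng.uniform(25, 60, n)
skills = 100 - age + rng.normal(0, 5, n)
salary = 30000 + 500 * skills + rng.normal(0, 1000, n)

# Simple correlation between age and salary.
corr_age_salary = np.corrcoef(age, salary)[0, 1]

# Regression of salary on age alone (skills left unexamined).
X_age = np.column_stack([np.ones(n), age])
coef_age_only, *_ = np.linalg.lstsq(X_age, salary, rcond=None)

# Regression of salary on age and skills together.
X_full = np.column_stack([np.ones(n), age, skills])
coef_full, *_ = np.linalg.lstsq(X_full, salary, rcond=None)

print(f"correlation of age and salary: {corr_age_salary:.2f}")
print(f"age coefficient, age only:     {coef_age_only[1]:.1f}")
print(f"age coefficient, with skills:  {coef_full[1]:.1f}")
```

Under these assumptions, age and salary are strongly negatively correlated, yet the estimated age coefficient shrinks toward zero once skills are included. The sketch shows only how a third variable can produce such a pattern; it does not establish which explanation is correct in any actual dispute.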
Pointing to the possibility of a spurious correlation will typically not be enough to dispose of a statistical argument. It may be appropriate to give little weight to such an argument absent a showing that the correlation is relevant. For example, a statistical showing of a relationship between technological skills and worker productivity might be required in the age discrimination example, above.24

20. See City of Tuscaloosa v. Harcros Chemicals, Inc., 158 F.3d 548 (11th Cir. 1998), in which the court ruled plaintiffs’ regression-based expert testimony inadmissible and granted summary judgment to the defendants. See also American Booksellers Ass’n v. Barnes & Noble, Inc., 135 F. Supp. 2d 1031, 1041 (N.D. Cal. 2001), in which a model was said to contain “too many assumptions and simplifications that are not supported by real-world evidence,” and Obrey v. Johnson, 400 F.3d 691 (9th Cir. 2005).
21. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section V.B.3, in this manual.
22. See, e.g., Sheehan v. Daily Racing Form Inc., 104 F.3d 940, 942 (7th Cir.) (rejecting plaintiff’s age discrimination claim because statistical study showing correlation between age and retention ignored the “more than remote possibility that age was correlated with a legitimate job-related qualification”), cert. denied, 521 U.S. 1104 (1997).
23. In some particular cases, there are statistical tests that allow one to reject claims of causality. For a brief description of these tests, which were developed by Jerry Hausman, see Robert S. Pindyck & Daniel L. Rubinfeld, Econometric Models and Economic Forecasts § 7.5 (4th ed. 1997).
Causality cannot be inferred by data analysis alone; rather, one must infer that
a causal relationship exists on the basis of an underlying causal theory that explains
the relationship between the two variables. Even when an appropriate theory has
been identified, causality can never be inferred directly. One must also look for
empirical evidence that there is a causal relationship. Conversely, the fact that two
variables are correlated does not guarantee the existence of a relationship; it could
be that the model—a characterization of the underlying causal theory—does not
reflect the correct interplay among the explanatory variables. In fact, the absence
of correlation does not guarantee that a causal relationship does not exist. Lack of
correlation could occur if (1) there are insufficient data, (2) the data are measured
inaccurately, (3) the data do not allow multiple causal relationships to be sorted
out, or (4) the model is specified wrongly because of the omission of a variable
or variables that are related to the variable of interest.
There is a tension between any attempt to reach conclusions with near certainty and the inherently uncertain nature of multiple regression analysis. In general, the statistical analysis associated with multiple regression allows for the expression of uncertainty in terms of probabilities. The reality that statistical analysis generates probabilities concerning relationships rather than certainty should not be seen in itself as an argument against the use of statistical evidence, or worse, as a reason to not admit that there is uncertainty at all. The only alternative might be to use less reliable anecdotal evidence.
This reference guide addresses a number of procedural and methodological issues that are relevant in considering the admissibility of, and weight to be accorded to, the findings of multiple regression analyses. It also suggests some standards of reporting and analysis that an expert presenting multiple regression analyses might be expected to meet. Section II discusses research design: how the multiple regression framework can be used to sort out alternative theories about a case. The guide discusses the importance of choosing the appropriate specification of the multiple regression model and raises the issue of whether multiple regression is appropriate for the case at issue. Section III accepts the regression framework and concentrates on the interpretation of the multiple regression results from both a statistical and a practical point of view. It emphasizes the distinction between regression results that are statistically significant and results that are meaningful to the trier of fact. It also points to the importance of evaluating the robustness of regression analyses, i.e., seeing the extent to which the results are sensitive to changes in the underlying assumptions of the regression model. Section IV briefly discusses the qualifications of experts and suggests a potentially useful role for court-appointed neutral experts. Section V emphasizes procedural aspects associated with use of the data underlying regression analyses. It encourages greater pretrial efforts by the parties to attempt to resolve disputes over statistical studies.

24. See, e.g., Allen v. Seidman, 881 F.2d 375 (7th Cir. 1989) (judicial skepticism was raised when the defendant did not submit a logistic regression incorporating an omitted variable (the possession of a higher degree or special education); defendant’s attack on statistical comparisons must also include an analysis that demonstrates that comparisons are flawed). The appropriate requirements for the defendant’s showing of spurious correlation could, in general, depend on the discovery process. See, e.g., Boykin v. Georgia Pac. Co., 706 F.2d 1384 (1983) (criticism of a plaintiff’s analysis for not including omitted factors, when plaintiff considered all information on an application form, was inadequate).
Throughout the main body of this guide, hypothetical examples are used as
illustrations. Moreover, the basic “mathematics” of multiple regression has been
kept to a bare minimum. To achieve that goal, the more formal description of the
multiple regression framework has been placed in the Appendix. The Appendix is
self-contained and can be read before or after the text. The Appendix also includes
further details with respect to the examples used in the body of this guide.
II. Research Design: Model Specification
Multiple regression allows the testifying economist or other expert to choose
among alternative theories or hypotheses and assists the expert in distinguishing
correlations between variables that are plainly spurious from those that may reflect
valid relationships.
A. What Is the Specific Question That Is Under Investigation by the Expert?
Research begins with a clear formulation of a research question. The data to be collected and analyzed must relate directly to this question; otherwise, appropriate inferences cannot be drawn from the statistical analysis. For example, if the question at issue in a patent infringement case is what the price of the plaintiff’s product would have been but for the sale of the defendant’s infringing product, sufficient data must be available to allow the expert to account statistically for the important factors that determine the price of the product.
B. What Model Should Be Used to Evaluate the Question at Issue?
Model specification involves several steps, each of which is fundamental to the success of the research effort. Ideally, a multiple regression analysis builds on a theory that describes the variables to be included in the study. A typical regression model will include one or more dependent variables, each of which is believed to be causally related to a series of explanatory variables. Because we cannot be certain that the explanatory variables are themselves unaffected by, or independent of, the influence of the dependent variable (at least at the point of initial study), the explanatory variables are often termed covariates. Covariates are known to have an association with the dependent or outcome variable, but causality remains an open question.
For example, the theory of labor markets might lead one to expect salaries in
an industry to be related to workers’ experience and the productivity of workers’
jobs. A belief that there is job discrimination would lead one to create a model
in which the dependent variable was a measure of workers’ salaries and the list of
covariates included a variable reflecting discrimination in addition to measures
of job training and experience.
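One way such a model might be written, offered only as a sketch, is shown below. The symbols are assumptions introduced here for illustration: $\beta_0$ through $\beta_3$ denote parameters to be estimated, $D_i$ is a hypothetical variable equal to 1 if employee $i$ belongs to the group alleged to have been discriminated against and 0 otherwise, and $\varepsilon_i$ is an error term capturing influences on salary that the model leaves out.

```latex
\[
\text{Salary}_i = \beta_0 + \beta_1\,\text{Experience}_i + \beta_2\,\text{Training}_i
                + \beta_3\,D_i + \varepsilon_i
\]
```

Under this specification, $\beta_3$ would measure the difference in salary associated with group membership after accounting for experience and training. Whether that difference reflects discrimination depends on whether the other covariates adequately capture the legitimate determinants of salary.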
In a perfect world, the analysis of the job discrimination (or any other) issue
might be accomplished through a controlled “natural experiment,” in which
employees would be randomly assigned to a variety of employers in an industry
under study and asked to fill positions requiring identical experience and skills. In
such a controlled experiment, where the only difference in salaries could be a result of
discrimination, it would be possible to draw clear and direct inferences from an
analysis of salary data. Unfortunately, the opportunity to conduct controlled
experiments of this kind is rarely available to experts in the context of legal proceedings.
In the real world, experts must do their best to interpret the results of real-world
“quasi-experiments,” in which it is impossible to control all factors that might affect
worker salaries or other variables of interest.25
Models are often characterized in terms of parameters—numerical character-
istics of the model. In the labor market discrimination example, one parameter
might reflect the increase in salary associated with each additional year of prior
job experience. Another parameter might reflect the reduction in salary associated
with a lack of current on-the-job experience. Multiple regression uses a sample,
or a selection of data, from the population (all the units of interest) to obtain esti-
mates of the values of the parameters of the model. An estimate associated with a
particular explanatory variable is an estimated regression coefficient.
Failure to develop the proper theory, failure to choose the appropriate vari-
ables, or failure to choose the correct form of the model can substantially bias the
statistical results—that is, create a systematic tendency for an estimate of a model
parameter to be too high or too low.
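The bias that results from leaving out a relevant variable can be illustrated with a minimal sketch. The data below are hypothetical and noise-free, and the helper name `ols_slope` is an assumption made for illustration; nothing here comes from a real case. Salary is constructed to depend on both experience and skill, and the regression that omits skill attributes part of the skill effect to experience.

```python
def ols_slope(x, y):
    """Least-squares slope from regressing y on a constant and a single x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: skill is correlated with experience.
experience = [1, 2, 3, 4, 5]
skill = [2, 1, 4, 3, 6]

TRUE_EXPERIENCE_EFFECT = 2.0   # assumed "true" parameters for the illustration
TRUE_SKILL_EFFECT = 3.0
salary = [
    10 + TRUE_EXPERIENCE_EFFECT * e + TRUE_SKILL_EFFECT * s
    for e, s in zip(experience, skill)
]

# Regressing salary on experience alone omits skill from the model.
short_slope = ols_slope(experience, salary)
bias = short_slope - TRUE_EXPERIENCE_EFFECT
print(f"estimated experience effect: {short_slope:.2f}, bias: {bias:.2f}")
```

In this constructed example the estimated experience effect is systematically too high, because experience partly stands in for the omitted skill variable.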
1. Choosing the dependent variable
The variable to be explained, the dependent variable, should be the appropriate
variable for analyzing the question at issue.26 Suppose, for example, that pay
discrimination among hourly workers is a concern. One choice for the dependent
variable is the hourly wage rate of the employees; another choice is the annual
salary. The distinction is important, because annual salary differences may in part
result from differences in hours worked. If the number of hours worked is the
product of worker preferences and not discrimination, the hourly wage is a good
choice. If the number of hours worked is related to the alleged discrimination,
annual salary is the more appropriate dependent variable to choose.27
2. Choosing the explanatory variable that is relevant to the question at issue
The explanatory variable that allows the evaluation of alternative hypotheses must
be chosen appropriately. Thus, in a discrimination case, the variable of interest
may be the race or sex of the individual. In an antitrust case, it may be a variable
that takes on the value 1 to reflect the presence of the alleged anticompetitive
behavior and the value 0 otherwise.28
3. Choosing the additional explanatory variables
An attempt should be made to identify additional known or hypothesized explanatory
variables, some of which are measurable and may support alternative substantive
hypotheses that can be accounted for by the regression analysis. Thus, in a
discrimination case, a measure of the skills of the workers may provide an alternative
explanation: lower salaries may have been the result of inadequate skills.29
25. In the literature on natural and quasi-experiments, the explanatory variables are characterized
as “treatments” and the dependent variable as the “outcome.” For a review of natural experiments
in the criminal justice arena, see David P. Farrington, A Short History of Randomized Experiments in
Criminology, 27 Evaluation Rev. 218–27 (2003).
26. In multiple regression analysis, the dependent variable is usually a continuous variable that
takes on a range of numerical values. When the dependent variable is categorical, taking on only two
or three values, modified forms of multiple regression, such as probit analysis or logit analysis, are
appropriate. For an example of the use of the latter, see EEOC v. Sears, Roebuck & Co., 839 F.2d 302,
325 (7th Cir. 1988) (EEOC used logit analysis to measure the impact of variables, such as age,
education, job-type experience, and product-line experience, on the female percentage of commission hires).
27. In job systems in which annual salaries are tied to grade or step levels, the annual salary
corresponding to the job position could be more appropriate.
28. Explanatory variables may vary by type, which will affect the interpretation of the regression
results. Thus, some variables may be continuous and others may be categorical.
29. In James v. Stockham Valves, 559 F.2d 310 (5th Cir. 1977), the Court of Appeals rejected
the employer’s claim that skill level rather than race determined assignment and wage levels, noting
the circularity of defendant’s argument. In Ottaviani v. State University of New York, 679 F. Supp. 288,
306–08 (S.D.N.Y. 1988), aff’d, 875 F.2d 365 (2d Cir. 1989), cert. denied, 493 U.S. 1021 (1990), the
court ruled (in the liability phase of the trial) that the university showed that there was no
discrimination in either placement into initial rank or promotions between ranks, and so rank was a proper
variable in multiple regression analysis to determine whether women faculty members were treated
differently than men.
However, in Trout v. Garrett, 780 F. Supp. 1396, 1414 (D.D.C. 1991), the court ruled (in the
damage phase of the trial) that the extent of civilian employees’ prehire work experience was not
an appropriate variable in a regression analysis to compute back pay in employment discrimination.
According to the court, including the prehire level would have resulted in a finding of no sex
discrimination, despite a contrary conclusion in the liability phase of the action. Id. See also Stuart v. Roache,
951 F.2d 446 (1st Cir. 1991) (allowing only 3 years of seniority to be considered as the result of prior

The R² of 0.556 indicates that 55.6% of the variation in salaries is explained
by the regression variables, X1, X2, and X3. Finally, the F-test is a test of the null
hypothesis that all regression coefficients (except the intercept) are jointly equal
to 0, that is, that there is no linear association between the dependent variable and any of the
explanatory variables. This is equivalent to the null hypothesis that R² is equal to 0. In
this case, the F-ratio of 174.71 is sufficiently high that the expert can reject the null
hypothesis with a very high degree of confidence (i.e., with a 1% level of significance).
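The link between the overall F-test and R-squared can be made concrete with the standard identity for a regression that includes an intercept: F = (R²/k) / ((1 − R²)/(N − k − 1)), where k is the number of explanatory variables and N the number of observations. The sketch below is illustrative only. The function name and the sample values are assumptions, and the sample size behind the salary regression is not stated in this passage, so the example does not attempt to reproduce the reported F-ratio.

```python
def f_statistic_from_r2(r2, k, n):
    """Overall F-ratio implied by R-squared for a regression with an intercept.

    r2: R-squared of the regression (0 <= r2 < 1)
    k:  number of explanatory variables, excluding the intercept
    n:  number of observations
    """
    if not 0 <= r2 < 1:
        raise ValueError("r2 must lie in [0, 1)")
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical values chosen only to show the arithmetic.
print(f_statistic_from_r2(r2=0.5, k=2, n=103))
# An R-squared of exactly 0 always implies an F-ratio of 0.
print(f_statistic_from_r2(r2=0.0, k=2, n=103))
```

The identity shows why the two null hypotheses are equivalent: the F-ratio is 0 exactly when R² is 0, and it grows as R² rises for a fixed sample size.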
F. Forecasting
In general, a forecast is a prediction made about the values of the dependent vari-
able using information about the explanatory variables. Often, ex ante forecasts
are performed; in this situation, values of the dependent variable are predicted
beyond the sample (e.g., beyond the time period in which the model has been
estimated). However, ex post forecasts are frequently used in damage analyses.90
An ex post forecast has a forecast period such that all values of the dependent and
explanatory variables are known; ex post forecasts can be checked against existing
data and provide a direct means of evaluation.
For example, to calculate the forecast for the salary regression discussed above,
the expert uses the estimated salary equation
\[ \hat{Y} = \$14{,}085 + \$2323\,X_1 + \$1675\,X_2 - \$36\,X_3. \tag{14} \]
To predict the salary of a man with 2 years’ experience, the expert calculates
\[ \hat{Y}(2) = \$14{,}085 + (\$2323 \cdot 2) + \$1675 - (\$36 \cdot 2^2) = \$20{,}262. \tag{15} \]
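The forecast arithmetic can be checked with a short sketch. Two assumptions are made here because the variable definitions are not restated in this passage: X2 is taken to be an indicator equal to 1 for a man, and X3 is taken to be the square of years of experience. Under those assumptions the coefficients of equation (14) reproduce the $20,262 forecast. The function name is hypothetical.

```python
def forecast_salary(experience_years, male):
    """Evaluate the estimated salary equation (14) for one employee."""
    x1 = experience_years
    x2 = 1 if male else 0            # assumption: X2 = 1 for a man, 0 otherwise
    x3 = experience_years ** 2       # assumption: X3 is experience squared
    return 14085 + 2323 * x1 + 1675 * x2 - 36 * x3


# A man with 2 years of experience, as in equation (15).
print(forecast_salary(2, male=True))
```

If X3 were instead entered as plain years of experience, the same coefficients would give $20,334 rather than $20,262, which is why the squared term is assumed here.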
The degree of accuracy of both ex ante and ex post forecasts can be calculated
provided that the model specification is correct and the errors are normally dis-
tributed and independent. The statistic is known as the standard error of forecast
(SEF). The SEF measures the standard deviation of the forecast error that is made
within a sample in which the explanatory variables are known with certainty.91 The
SEF can be used to determine how accurate a given forecast is. In equation (15),
the SEF associated with the forecast of $20,262 is approximately $5000. If a large
sample size is used, the probability is roughly 95% that the actual salary will be
within 1.96 standard errors of the forecasted value. In this case, the appropriate
95% interval for the prediction is $10,462 to $30,062 (that is, $20,262 ± 1.96 × $5000).
Because the estimated model does not explain salaries effectively, the SEF is large,
as is the 95% interval. A more complete model with additional explanatory variables
would result in a lower SEF and a smaller 95% interval for the prediction.
90. Frequently, in cases involving damages, the question arises, what the world would have been
like had a certain event not taken place. For example, in a price-fixing antitrust case, the expert can
ask what the price of a product would have been had a certain event associated with the price-fixing
agreement not occurred. If prices would have been lower, the evidence suggests impact. If the expert
can predict how much lower they would have been, the data can help the expert develop a numerical
estimate of the amount of damages.
91. There are actually two sources of error implicit in the SEF. The first source arises because
the estimated parameters of the regression model may not be exactly equal to the true regression
parameters. The second source is the error term itself; when forecasting, the expert typically sets the
error equal to 0 when a turn of events not taken into account in the regression model may make it
appropriate to make the error positive or negative.
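The 95% interval is simply the forecast plus or minus 1.96 standard errors of forecast. The sketch below carries out that arithmetic for the forecast of $20,262 and an SEF of roughly $5000; the function name is an assumption, and the 1.96 multiplier relies on the large-sample normal approximation mentioned in the text.

```python
def prediction_interval(forecast, sef, z=1.96):
    """Return (low, high) bounds of forecast +/- z standard errors of forecast."""
    half_width = z * sef
    return forecast - half_width, forecast + half_width


low, high = prediction_interval(forecast=20262, sef=5000)
print(f"95% interval: ${round(low):,} to ${round(high):,}")
```

Halving the SEF would halve the width of the interval, which is the sense in which a better-specified model yields a tighter prediction.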
A danger exists when using the SEF, which applies to the standard errors of
the estimated coefficients as well. The SEF is calculated on the assumption that the
model includes the correct set of explanatory variables and the correct functional
form. If the choice of variables or the functional form is wrong, the estimated fore-
cast error may be misleading. In some instances, it may be smaller, perhaps substan-
tially smaller, than the true SEF; in other instances, it may be larger, for example, if
the wrong variables happen to capture the effects of the correct variables.
The difference between the SEF and the SER is shown in Figure 9. The SER
measures deviations within the sample. The SEF is more general, because it cal-
culates deviations within or without the sample period. In general, the difference
between the SEF and the SER increases as the values of the explanatory variables
increase in distance from the mean values. Figure 9 shows the 95% prediction
interval created by the measurement of two SEFs about the regression line.
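For the special case of a regression with a single explanatory variable, the widening gap between the SEF and the SER can be sketched with the textbook formula SEF = SER × sqrt(1 + 1/N + (x₀ − x̄)² / Σ(xᵢ − x̄)²). The data, the SER value, and the function name below are hypothetical, chosen only to show that the SEF exceeds the SER and grows as the forecast point moves away from the sample mean.

```python
import math


def standard_error_of_forecast(ser, x_sample, x_new):
    """SEF for a one-variable regression, given the SER and the sample x values."""
    n = len(x_sample)
    x_bar = sum(x_sample) / n
    sxx = sum((x - x_bar) ** 2 for x in x_sample)
    return ser * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)


experience = [1, 3, 5, 7, 9]   # hypothetical sample; mean experience is 5 years
ser = 4000.0                   # hypothetical standard error of the regression

for x_new in (5, 9, 15):       # at the mean, at the sample edge, outside the sample
    sef = standard_error_of_forecast(ser, experience, x_new)
    print(f"experience = {x_new:2d}: SEF = {sef:,.0f} (SER = {ser:,.0f})")
```

The multiple regression case follows the same logic with matrix expressions in place of the single distance term.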
Figure 9. Standard error of forecast.
[Figure not reproduced. Axes: Salary (Y) versus Experience (X1); labeled bands: 2 SERs and 2 SEFs.]

G. A Hypothetical Example
Jane Thompson filed suit in federal court alleging that officials in the police
department discriminated against her and a class of other female police officers in
violation of Title VII of the Civil Rights Act of 1964, as amended. On behalf of
the class, Ms. Thompson alleged that she was paid less than male police officers
with equivalent skills and experience. Both plaintiff and defendant used expert
economists with econometric expertise to present statistical evidence to the court
in support of their positions.
Plaintiff’s expert pointed out that the mean salary of the 40 female officers was
$30,604, whereas the mean salary of the 60 male officers was $43,077. To show
that this difference was statistically significant, the expert put forward a regression
of salary (SALARY) on a constant term and a dummy indicator variable (FEM)
equal to 1 for each female and 0 for each male. The results were as follows:
SALARY         = $43,077   − $12,473*FEM
Standard Error   ($1528)     ($2416)
p-value          <.01        <.01
R² = .22
The −$12,473 coefficient on the FEM variable measures the mean difference
between male and female salaries. Because the standard error is approximately one-
fifth of the value of the coefficient, this difference is statistically significant at the 5%
(and indeed at the 1%) level. If this is an appropriate regression model (in terms of its
implicit characterization of salary determination), one can conclude that it is highly
unlikely that the difference in salaries between men and women is due to chance.
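When salary is regressed on a constant and a single 0/1 indicator, least squares returns the mean of the reference group as the intercept and the difference between the two group means as the coefficient on the indicator. With the group means quoted above, that difference is $43,077 − $30,604 = $12,473. The sketch below demonstrates the property on a tiny hypothetical data set; the salaries and the function name are invented for illustration and are not the officers' data.

```python
def ols_with_dummy(dummy, y):
    """Regress y on a constant and one 0/1 dummy; return (intercept, slope)."""
    n = len(y)
    d_bar = sum(dummy) / n
    y_bar = sum(y) / n
    sdy = sum((d - d_bar) * (v - y_bar) for d, v in zip(dummy, y))
    sdd = sum((d - d_bar) ** 2 for d in dummy)
    slope = sdy / sdd
    intercept = y_bar - slope * d_bar
    return intercept, slope


# Hypothetical salaries: three male officers (FEM = 0), two female officers (FEM = 1).
fem = [0, 0, 0, 1, 1]
salary = [42000, 44000, 43000, 30000, 31000]

intercept, slope = ols_with_dummy(fem, salary)
male_mean = sum(s for f, s in zip(fem, salary) if f == 0) / fem.count(0)
female_mean = sum(s for f, s in zip(fem, salary) if f == 1) / fem.count(1)

print(f"intercept = {intercept:,.0f}, male mean = {male_mean:,.0f}")
print(f"FEM coefficient = {slope:,.0f}, female minus male mean = {female_mean - male_mean:,.0f}")
```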
The defendant’s expert testified that the regression model put forward was the
wrong model because it failed to account for the fact that males (on average) had
substantially more experience than females. The relatively low R² was an indication
that there was substantial unexplained variation in the salaries of male and
female officers. An examination of data relating to years spent on the job showed
that the average male experience was 8.2 years, whereas the average for females
was only 3.5 years. The defense expert then presented a regression analysis that
added an additional explanatory variable (i.e., a covariate), the years of experience
of each police officer (EXP). The new regression results were as follows:
SALARY         = $28,049   − $3860*FEM   + $1833*EXP
Standard Error   ($2513)     ($2347)       ($265)
p-value          <.01        <.11          <.01
R² = .47
Experience is itself a statistically significant explanatory variable, with a
p-value of less than .01. Moreover, the difference between male and female
salaries, holding experience constant, is only $3860, and this difference is not sta-
tistically significant at the 5% level. The defense expert was able to testify on this
basis that the court could not rule out alternative explanations for the difference
in salaries other than the plaintiff’s claim of discrimination.
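The significance statements for the defense expert's regression can be checked from the reported coefficients and standard errors. The sketch below computes each t-ratio and an approximate two-tailed p-value using the normal distribution; this is an assumption made for simplicity, and with a sample of 100 officers the exact t distribution would give slightly larger p-values. The function names are hypothetical.

```python
import math


def t_ratio(coefficient, standard_error):
    """Ratio of an estimated coefficient to its standard error."""
    return coefficient / standard_error


def two_tailed_p_normal(t):
    """Approximate two-tailed p-value using the standard normal distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# Coefficients and standard errors reported for the defense expert's regression.
t_fem = t_ratio(-3860, 2347)
t_exp = t_ratio(1833, 265)

p_fem = two_tailed_p_normal(t_fem)
p_exp = two_tailed_p_normal(t_exp)

print(f"FEM: t = {t_fem:.2f}, approximate p = {p_fem:.3f}")
print(f"EXP: t = {t_exp:.2f}, approximate p = {p_exp:.2g}")
```

The FEM t-ratio falls short of the 1.96 threshold associated with the 5% level, while the EXP t-ratio far exceeds it, which matches the pattern of p-values in the table.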
The debate did not end here. On rebuttal, the plaintiff’s expert made three
distinct points. First, whether $3860 was statistically significant or not, it was prac-
tically significant, representing a salary difference of more than 10% of the mean
female officers’ salaries. Second, although the result was not statistically significant at
the 5% level, it was significant at the 11% level. If the regression model were valid,
there would be approximately an 11% probability that one would err by concluding
that the mean salary difference between men and women was a result of chance.
Third, and most importantly, the expert testified that the regression model
was not correctly specified. Further analysis by the expert showed that the value of
an additional year of experience was $2333 for males on average, but only $1521
for females. Based on supporting testimonial experience, the expert testified that
one could not rule out the possibility that the mechanism by which the police
department discriminated against females was by rewarding males more for their
experience than females. The expert made this point clear by running an addi-
tional regression in which a further covariate was added to the model. The new
variable was an interaction variable, INT, measured as the product of the FEM
and EXP variables. The regression results were as follows:
SALARY         = $35,122   − $5250*FEM   + $2333*EXP   − $812*FEM*EXP
Standard Error   ($2825)     ($347)        ($265)        ($185)
p-value          <.01        <.11          <.01          <.04
R² = .65
The plaintiff’s expert noted that for all males in the sample, FEM = 0, in which
case the regression results are given by the equation
SALARY = $35,122 + $2333*EXP
However, for females, FEM = 1, in which case the corresponding equation is
SALARY = $29,872 + $1521*EXP
It appears, therefore, that females are discriminated against not only when hired
(i.e., when EXP = 0), but also in the reward they get as they accumulate more
and more experience.
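The separate male and female equations follow directly from substituting FEM = 0 or FEM = 1 into the interaction model. A minimal sketch of that substitution, using the reported coefficients and hypothetical function names:

```python
def predicted_salary(exp_years, female):
    """Evaluate the interaction model for one officer."""
    fem = 1 if female else 0
    return 35122 - 5250 * fem + 2333 * exp_years - 812 * fem * exp_years


def implied_line(female):
    """Return (intercept, slope in EXP) implied by the model for one group."""
    intercept = predicted_salary(0, female)
    slope = predicted_salary(1, female) - intercept
    return intercept, slope


print("males:  ", implied_line(female=False))
print("females:", implied_line(female=True))

# The predicted gap widens with experience because of the interaction term.
for years in (0, 5, 10):
    gap = predicted_salary(years, female=False) - predicted_salary(years, female=True)
    print(f"EXP = {years:2d}: predicted male-female gap = ${gap:,}")
```

The gap at EXP = 0 equals the FEM coefficient, and each additional year of experience adds the size of the interaction coefficient to it.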
The debate between the experts continued, focusing less on the statistical
interpretation of any one regression model and more on the choice of model
itself, and addressing practical as well as statistical significance.
Glossary
The following terms and definitions are adapted from a variety of sources, includ-
ing A Dictionary of Epidemiology (John M. Last et al., eds., 4th ed. 2000) and
Robert S. Pindyck & Daniel L. Rubinfeld, Econometric Models and Economic
Forecasts (4th ed. 1998).
alternative hypothesis. See hypothesis test.
association. The degree of statistical dependence between two or more events or
variables. Events are said to be associated when they occur more frequently
together than one would expect by chance.
bias. Any effect at any stage of investigation or inference tending to produce
results that depart systematically from the true values (i.e., the results are
either too high or too low). A biased estimator of a parameter differs on
average from the true parameter.
coefficient. An estimated regression parameter.
confidence interval. An interval that contains a true regression parameter with
a given degree of confidence.
consistent estimator. An estimator that tends to become more and more accu-
rate as the sample size grows.
correlation. A statistical means of measuring the linear association between vari-
ables. Two variables are correlated positively if, on average, they move in the
same direction; two variables are correlated negatively if, on average, they
move in opposite directions.
covariate. A variable that is possibly predictive of an outcome under study; an
explanatory variable.
cross-sectional analysis. A type of multiple regression analysis in which each
data point is associated with a different unit of observation (e.g., an individual
or a firm) measured at a particular point in time.
degrees of freedom (DF). The number of observations in a sample minus the
number of estimated parameters in a regression model. A useful statistic in
hypothesis testing.
dependent variable. The variable to be explained or predicted in a multiple
regression model.
dummy variable. A variable that takes on only two values, usually 0 and 1, with
one value indicating the presence of a characteristic, attribute, or effect (1),
and the other value indicating its absence (0).
efficient estimator. An estimator of a parameter that produces the greatest pre-
cision possible.
error term. A variable in a multiple regression model that represents the cumula-
tive effect of a number of sources of modeling error.
estimate. The calculated value of a parameter based on the use of a particular
sample.
estimator. The sample statistic that estimates the value of a population parameter
(e.g., a regression parameter); its values vary from sample to sample.
ex ante forecast. A prediction about the values of the dependent variable that go
beyond the sample; consequently, the forecast must be based on predictions
for the values of the explanatory variables in the regression model.
explanatory variable. A variable that is associated with changes in a dependent
variable.
ex post forecast. A prediction about the values of the dependent variable made
during a period in which all values of the explanatory and dependent variables
are known. Ex post forecasts provide a useful means of evaluating the fit of
a regression model.
F-test. A statistical test (based on an F-ratio) of the null hypothesis that the
coefficients of a group of explanatory variables are jointly equal to 0. When applied
to all the explanatory variables in a multiple regression model, the F-test becomes
a test of the null hypothesis that R² equals 0.
feedback. When changes in an explanatory variable affect the values of the
dependent variable, and changes in the dependent variable also affect the
explanatory variable. When both effects occur at the same time, the two
variables are described as being determined simultaneously.
fitted value. The estimated value for the dependent variable; in a linear regres-
sion, this value is calculated as the intercept plus a weighted average of the
values of the explanatory variables, with the estimated parameters used as
weights.
heteroscedasticity. When the error associated with a multiple regression model
has a nonconstant variance; that is, the error values associated with some
observations are typically high, while the values associated with other obser-
vations are typically low.
hypothesis test. A statement about the parameters in a multiple regression model.
The null hypothesis may assert that certain parameters have specified values
or ranges; the alternative hypothesis would specify other values or ranges.
independence. When two variables are not correlated with each other (in the
population).
independent variable. An explanatory variable that affects the dependent vari-
able but that is not affected by the dependent variable.
influential data point. A data point whose deletion from a regression sample
causes one or more estimated regression parameters to change substantially.
interaction variable. The product of two explanatory variables in a regression
model. Used in a particular form of nonlinear model.
intercept. The value of the dependent variable when each of the explanatory
variables takes on the value of 0 in a regression equation.
least squares. A common method for estimating regression parameters. Least
squares minimizes the sum of the squared differences between the actual
values of the dependent variable and the values predicted by the regression
equation.
linear regression model. A regression model in which the effect of a change in
each of the explanatory variables on the dependent variable is the same, no
matter what the values of those explanatory variables.
mean (sample). An average of the outcomes associated with a probability dis-
tribution, where the outcomes are weighted by the probability that each will
occur.
mean squared error (MSE). The estimated variance of the regression error,
calculated as the average of the sum of the squares of the regression residuals.
model. A representation of an actual situation.
multicollinearity. When two or more variables are highly correlated in a mul-
tiple regression analysis. Substantial multicollinearity can cause regression
parameters to be estimated imprecisely, as reflected in relatively high standard
errors.
multiple regression analysis. A statistical tool for understanding the relationship
between two or more variables.
nonlinear regression model. A model having the property that changes in
explanatory variables will have differential effects on the dependent variable
as the values of the explanatory variables change.
normal distribution. A bell-shaped probability distribution having the property
that about 95% of the distribution lies within 2 standard deviations of the
mean.
null hypothesis. In regression analysis the null hypothesis states that the results
observed in a study with respect to a particular variable are no different from
what might have occurred by chance, independent of the effect of that vari-
able. See hypothesis test.
one-tailed test. A hypothesis test in which the alternative to the null hypothesis
that a parameter is equal to 0 is for the parameter to be either positive or
negative, but not both.
outlier. A data point that is more than some appropriate distance from a regres-
sion line that is estimated using all the other data points in the sample.
p-value. The significance level in a statistical test; the probability, if the null
hypothesis is true, of getting a test statistic as extreme or more extreme than the
observed value. The larger the p-value, the weaker the evidence against the null hypothesis.
parameter. A numerical characteristic of a population or a model.
perfect collinearity. When two or more explanatory variables are correlated
perfectly.
population. All the units of interest to the researcher; also, universe.
practical significance. Substantive importance. Statistical significance does not
ensure practical significance, because, with large samples, small differences
can be statistically significant.
probability distribution. The process that generates the values of a random vari-
able. A probability distribution lists all possible outcomes and the probability
that each will occur.
probability sampling. A process by which a sample of a population is chosen
so that each unit of observation has a known probability of being selected.
quasi-experiment (or natural experiment). A naturally occurring instance
of observable phenomena that yield data that approximate a controlled
experiment.
R-squared (R²). A statistic that measures the percentage of the variation in the
dependent variable that is accounted for by all of the explanatory variables in
a regression model. R-squared is the most commonly used measure of goodness
of fit of a regression model.
random error term. A term in a regression model that reflects random error
(sampling error) that is the result of chance. As a consequence, the result
obtained in the sample differs from the result that would be obtained if the
entire population were studied.
regression coefficient. Also, regression parameter. The estimate of a population
parameter obtained from a regression equation that is based on a particular
sample.
regression residual. The difference between the actual value of a dependent
variable and the value predicted by the regression equation.
robust estimation. An alternative to least squares estimation that is less sensitive
to outliers.
robustness. A statistic or procedure that does not change much when data or
assumptions are slightly modified is robust.
sample. A selection of data chosen for a study; a subset of a population.
sampling error. A measure of the difference between the sample estimate of a
parameter and the population parameter.
scatterplot. A graph showing the relationship between two variables in a study;
each dot represents one subject. One variable is plotted along the horizontal
axis; the other variable is plotted along the vertical axis.
serial correlation. The correlation of the values of regression errors over time.
slope. The change in the dependent variable associated with a one-unit change
in an explanatory variable.
spurious correlation. When two variables are correlated, but one is not the
cause of the other.
standard deviation. The square root of the variance of a random variable. The
variance is a measure of the spread of a probability distribution about its mean;
it is calculated as a weighted average of the squares of the deviations of the
outcomes of a random variable from its mean.
standard error of forecast (SEF). An estimate of the standard deviation of the
forecast error; it is based on forecasts made within a sample in which the values
of the explanatory variables are known with certainty.
standard error of the coefficient; standard error (SE). A measure of the
variation of a parameter estimate or coefficient about the true parameter. The
standard error is a standard deviation that is calculated from the probability
distribution of estimated parameters.
standard error of the regression (SER). An estimate of the standard deviation
of the regression error; it is calculated as the square root of the average of the
squares of the residuals associated with a particular multiple regression analysis.
statistical significance. A test used to evaluate the degree of association between
a dependent variable and one or more explanatory variables. If the calculated
p-value is smaller than 5%, the result is said to be statistically significant (at
the 5% level). If p is greater than 5%, the result is statistically insignificant
(at the 5% level).
t-statistic. A test statistic that describes how far an estimate of a parameter is from
its hypothesized value (i.e., given a null hypothesis). If a t-statistic is suffi-
ciently large (in absolute magnitude), an expert can reject the null hypothesis.
t-test. A test of the null hypothesis that a regression parameter takes on a particu-
lar value, usually 0. The test is based on the t-statistic.
time-series analysis. A type of multiple regression analysis in which each data
point is associated with a particular unit of observation (e.g., an individual or
a firm) measured at different points in time.
two-tailed test. A hypothesis test in which the alternative to the null hypothesis
that a parameter is equal to 0 is for the parameter to be either positive or
negative, or both.
variable. Any attribute, phenomenon, condition, or event that can have two or
more values.
variable of interest. The explanatory variable that is the focal point of a par-
ticular study or legal issue.