| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
APPENDIX A
A Technical Discussion of the Process of Rating
and Ranking Programs in a Field.
This appendix explains in detail how the various parts of the rating and ranking process for
graduate programs fit together and how the process is carried out. Figure A-1 provides a
graphical overview of the entire process and forms the basis for this appendix. We address each
of the boxes in Figure A-1 separately, starting at the top and generally working downward and to
the right. The topics in this appendix include:
• a summary of the sources of data used in the rating and ranking process,
• the direct weights, the regression-based weights, and the methods used to calculate
the regression-based weights,
• the simulation of the uncertainty in the weights by random-halves sampling,
• the construction of the combined weights using an optimal fraction to combine
the simulated values of the direct and regression-based weights,
• the elimination of variables with nonsignificant combined weights,
• the simulation of the uncertainty in the values of the program variables,
• the combination of the simulated combined weights for the significant program
variables with the simulated standardized values of the program variables to
obtain simulated rankings, and
• the resulting inter-quartile ranges of rankings that are the primary rating and
ranking quantities we report.
PREPUBLICATION COPY—UNEDITED PROOFS
Figure A-1 A graphical summary of the NRC’s approach to rating and thereby ranking graduate programs.
The three sets of data: X, R and P.
• X = the collection of the faculty importance measures. A complete array with an importance
value for every program variable by every responding faculty member.
• R = the collection of ratings of programs by the faculty raters. An incomplete array, with
ratings only for the sampled programs and rated only by those faculty members who were sampled
to rate a given sampled program.
• P = the collection of the values of the program variables. A complete array with a value for
every program (that satisfies the inclusion criteria for rating and ranking) in a field, on every
program variable.
(1) Random halves sampling of faculty in X.
(1a) Results in one random half of X, denoted by $\tilde{X}$.
(1b) Average $\tilde{X}$ over faculty to get the direct weights, $\bar{x}$. The sum of direct weights = 1.0.
(2) Random halves sampling of raters in R.
(2a) Results in one random half of R, denoted by $\tilde{R}$.
(2b) Average $\tilde{R}$ over raters to get average ratings for sampled programs, $\bar{r}$. This is the
dependent variable in the regressions.
(3) Standardize program variables to Mean = 0, and SD = 1. Denote result by P*. These are the
independent variables in the regressions.
The regressions
(4) (a) Transform original program variables to principal components (PCs).
(b) Perform backwards stepwise regression to obtain a stable fitted equation predicting average
ratings from the remaining PCs.
(c) Transform resulting coefficients back to the original program variables to get the
regression-based weights, $\hat{m}$, and make their absolute sum = 1.0.
The combined weights
(5) Select policy weight, w = ½.
(6) Combine $\bar{x}$, $\hat{m}$ and w = ½ using the optimal fraction to form the combined weights, $f_0$.
Eliminating non-significant variables
(7) Repeat the steps from (1) to (6) 500 times. Use the resulting 500 samples of $f_0$ to eliminate
program variables in X and P* having non-significant combined weights. Repeat this until there
are no non-significant program variables. Final output is last 500 replications of $f_0$ with zero
entries for all non-significant variables.
(8) Random perturbation of the values in P.
(8a) Results in one randomly perturbed version of P, denoted by $\tilde{P}$.
(8b) Standardize $\tilde{P}$ to get $\tilde{P}^*$.
The Inter-quartile ranges of Rankings
(9) Repeat the steps (8) to (8b) to get 500 replications of $\tilde{P}^*$, and combine them with the final
500 replications of $f_0$ to get 500 Ratings for each program. Rank the programs for each set of 500
ratings. This results in 500 Rankings for each program. Use these 500 Rankings to get the
Inter-quartile range of the Rankings for each program.
THE THREE DATA SETS
The empirical basis of the NRC ratings and rankings is the three data sets indicated in
the three unlabeled boxes at the top of Figure A-1. The first, denoted by X, is the collection of
faculty importance measures that were derived from data that were collected in the faculty
questionnaire. The data in X are used to derive the direct weights discussed more extensively
below. The second, denoted by R, is the collection of ratings of programs by faculty raters.
These ratings were made separately from the faculty questionnaire and involved only a sample of
programs from each field and only a sample of faculty raters from that field. This sample of
faculty ratings plays a crucial role in the derivation of the regression-based weights, discussed
more extensively below. The third data set, denoted by P, is the collection of the values of the 20
program variables that were collected from various sources for each program. The data in P are
used in the final ratings and rankings of the programs and are discussed in greater detail below.
More details about these three data sets are also available in Section 2 of this report.
BOX (1b): THE DIRECT WEIGHTS FROM THE FACULTY QUESTIONNAIRE [36]
We turn first to the direct weights in box (1b) in Figure A-1, leaving boxes (1) and (1a) to
our later discussion of how we simulated the uncertainty in these data.
The faculty questionnaire asks each graduate-program faculty respondent to indicate how
important each of 21 characteristics is to the quality of a program in his or her field of study. [37]
This information is then used to derive the direct weights for each surveyed faculty member, as
described below.
The original 21 program characteristics listed on the faculty questionnaire are shown in
Table A-1, and they were divided into three categories—faculty, student, and program
characteristics. Of the original 21, there are 20 for which adequate data were deemed to be
available to use in the rating process, and these 20 data values for each program became the 20
program variables used in this study to which we repeatedly refer.
Faculty respondents were first asked to indicate up to four characteristics in each category that
they thought were “most important” to program quality. Each characteristic that was listed
received an initial score of 1 for that faculty respondent. These preferences were then narrowed
by asking the faculty members to further identify a maximum of two characteristics in each
category that they thought were the most important. Each of these selected characteristics
received an additional point, resulting in a score of 2. Given this approach, at most 12 of the
program characteristics can have a non-zero value for any given faculty member; of these 12,
at most 6 will have a score of 2, and the rest will have a score of 1. At least 8 program
characteristics will have a score of 0 for each faculty respondent; more than 8 will be zero if
the respondent selected fewer than 4 as the “important” or fewer than 2 as the “most important”
characteristics. A final question asked faculty respondents to indicate the relative importance of
each of the three categories by assigning them values that summed to 100 over the three
categories. [38] For each faculty respondent, his or her importance measure for each program
characteristic was calculated as the product of the score that it received and the relative
importance value assigned to its category. Finally, the 20 importance measures for each faculty
respondent were transformed by dividing each one by the sum of his or her importance measures
across the 20 program variables.
[36] The importance of program attributes to program quality is surveyed in Section G of the faculty questionnaire.
[37] The number of student publications and presentations was not used because consistent data on it were unavailable.
The direct and regression-based weights were calculated without it.
[38] The faculty task can be thought of as asking faculty how many percentage points should be assigned to each
category. The percentage point weights sum to 100.
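The scoring rule described above (selection score times category points, then division by the respondent's total) can be sketched as follows. This is a minimal illustration, not the committee's code: the function name, the variable names, and the six-variable toy data are all hypothetical, and the study itself used 20 variables in three categories.

```python
def importance_measures(scores, category_of, category_points):
    """Return one respondent's normalized importance measures.

    scores          -- dict: variable name -> selection score (0, 1, or 2)
    category_of     -- dict: variable name -> category name
    category_points -- dict: category name -> points (summing to 100)
    """
    # Raw measure = selection score x relative importance of the category.
    raw = {k: s * category_points[category_of[k]] for k, s in scores.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("respondent gave no non-zero importance scores")
    # Divide by the respondent's total so the measures sum to 1.0.
    return {k: v / total for k, v in raw.items()}


# Hypothetical toy data: six variables instead of the 20 used in the study.
category_of = {
    "pubs": "faculty", "grants": "faculty",
    "gre": "student", "support": "student",
    "phds": "program", "time_to_degree": "program",
}
scores = {"pubs": 2, "grants": 1, "gre": 1, "support": 0,
          "phds": 2, "time_to_degree": 0}
category_points = {"faculty": 50, "student": 30, "program": 20}

x_i = importance_measures(scores, category_of, category_points)
print(x_i)
```

In this toy case the raw measures are 100, 50, 30, 0, 40 and 0, so the total is 220 and the largest normalized measure is 100/220.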
Table A-1 The 21 Program Characteristics Listed in the Faculty Questionnaire.
Faculty characteristics
i. Number of publications per faculty member
ii. Number of citations per publication (for non-humanities fields)
iii. Percent of faculty holding grants
iv. Involvement in interdisciplinary work
v. Racial/ethnic diversity of program faculty
vi. Gender diversity of program faculty
vii. Reception by peers of a faculty member’s work as measured by honors and awards
Student characteristics
i. Median GRE scores of entering students
ii. Percentage of students receiving full financial support
iii. Percentage of students with portable fellowships
iv. Number of student publications and presentations (not used)
v. Racial/ethnic diversity of the student population
vi. Gender diversity of the student population
vii. A high percentage of international students
Program characteristics
i. Average number of Ph.D.’s granted in last five years
ii. Percentage of entering students who complete a doctoral degree in a given time (6
years for non-humanities, 8 years for humanities).
iii. Time to degree
iv. Placement of students after graduation (percent in either positions or postdoctoral
fellowships in academia)
v. Percentage of students with individual work space
vi. Percentage of health insurance premiums covered by institution or program
vii. Number of student support activities provided by the institution or program
We will use the following notation consistently: i for a faculty respondent, j for a
program in a field, and k for one of the 20 program variables. Thus, $x_{ik}$ denotes the measure of
importance placed on program variable k by faculty respondent i. The values $x_{ik}$ are non-negative
and, over k, sum to 1.0 for each faculty respondent i. The importance measure vector
for faculty respondent i is the collection of these 20 values,
$x_i = (x_{i1}, x_{i2}, \ldots, x_{i20})$. (1)
Denote the vector of average importance weights, averaged across the entire set of faculty
respondents in a field, by
$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{20})$. (2)
The mean value, $\bar{x}_k$, is the average importance given to the kth program variable by
all the surveyed faculty respondents in the field. The averages, {$\bar{x}_k$}, are the direct weights of the
faculty respondents because they directly give the average relative importance of each program
variable, as indicated by the faculty questionnaire responses in the field of study. Because the
20 importance measures for each faculty respondent are non-negative and sum to 1.0, the same is
true of the direct weights.
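As a small illustration of Equation 2, the direct weights are simply the column means of the array X. The sketch below uses a hypothetical three-variable, three-respondent array (the study used 20 variables); the function name is an assumption made for the example.

```python
def direct_weights(X):
    """Average the per-respondent importance vectors (rows of X) column by column."""
    n_respondents = len(X)
    n_vars = len(X[0])
    return [sum(row[k] for row in X) / n_respondents for k in range(n_vars)]


# Hypothetical importance vectors; each row is non-negative and sums to 1.0.
X = [
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.8, 0.0, 0.2],
]
x_bar = direct_weights(X)
print(x_bar, sum(x_bar))
```

Because every row sums to 1.0, the averaged vector also sums to 1.0 (up to floating-point rounding).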
BOXES (2b), (3) AND (4): THE REGRESSION-BASED WEIGHTS
We next consider the processes in boxes (2b), (3) and (4) in Figure A-1 that lead to the
regression-based weights. Again, we leave boxes (2) and (2a) to our later discussion of how we
simulated the uncertainty in these data.
The regression-based weights represent our attempt to ascertain how much weight is
implicitly given to each program variable by faculty members when they rate programs by using
their own perceived quality of the programs they are rating. We used linear regression to predict
average faculty ratings from the 20 program variables and interpreted the resulting regression
coefficients as indicating the implicit importance of each program variable for faculty ratings.
This is different from the direct weights that were just described. We have broken down the
process of obtaining the regression-based weights into the three parts indicated by boxes (2b), (3)
and (4) which we now discuss in turn.
Box (2b): The average ratings for the sampled programs.
The ratings data in R of Figure A-1 are the ratings given by the sampled faculty members
to the sample of programs that they were requested to rate. A randomly selected faculty member,
i, rates a randomly selected program, j, on a scale of 1 to 6 in terms of his or her perception of its
quality. Denote this rating by $r_{ij}$. The matrix sampling plan used was designed so that a sample of
up to 50 of the programs in a field was rated by a sample of the graduate faculty members in that
same field. Each rater rated about 15 programs, and none rated his or her own program. On
average, each rated program was rated by about 44 faculty raters. The rater sample was stratified
to ensure proportionality by geographic region, program size (measured by number of faculty),
and academic rank. The program sample was stratified to ensure proportionality by geographic
region and program size.
R is the array of all the values of $r_{ij}$. Note that R is an incomplete array because many
faculty members who responded to the questionnaire did not rate programs and, except in the
small fields, many programs in a field were not rated. Box (2b) indicates that we compute the
average of these ratings for program j, and denote this average rating by $\bar{r}_j$. Because each
program’s average rating is determined by a different random sample of graduate faculty raters,
it is highly unlikely that any two programs will be evaluated by exactly the same set of raters.
Denote the vector of the average ratings for the sampled programs in a field by $\bar{r}$.
The values of the average ratings in $\bar{r}$ are the dependent variable in the regression
analyses used to form the regression-based weights.
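Because R is incomplete, the averaging in box (2b) runs only over the ratings that actually exist for each program. A minimal sketch, assuming the ratings are stored as (rater, program, rating) triples; that storage layout, the function name, and the toy data are hypothetical.

```python
from collections import defaultdict


def average_ratings(ratings):
    """Average the available ratings for each program.

    ratings -- iterable of (rater_id, program_id, rating) triples; a
               rater/program pair that was not sampled is simply absent.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _rater, program, value in ratings:
        totals[program] += value
        counts[program] += 1
    return {program: totals[program] / counts[program] for program in totals}


# Hypothetical ratings on the 1-to-6 scale; no rater rates every program.
ratings = [
    ("a", "P1", 5), ("b", "P1", 4),
    ("b", "P2", 2), ("c", "P2", 3),
    ("c", "P3", 6),
]
r_bar = average_ratings(ratings)
print(r_bar)
```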
Box (3): The program variables and standardizing
Denote the value of program variable k for program j by $p_{jk}$, and define the vector of all
program variables for program j by
$p_j = (p_{j1}, p_{j2}, \ldots, p_{j20})$, (3)
and the array with rows given by $p_j$ by P. A cursory examination of the program characteristics
listed in Table A-1 shows that they are on different scales. For example, the number of
publications per faculty member (numbers in the fives and tens), the median GRE scores of
entering students (numbers in the hundreds), and the percentage of entering students who
complete a doctoral degree in 10 years or less (fractions) are reported in values that are of very
different orders of magnitude. If these values are left as they are, the size of any regression
coefficient based on them will be influenced by both the importance of that program variable for
predicting the average ratings (which is what we are interested in), as well as the scale of that
variable (which is arbitrary and does not interest us). The program variables with large values,
such as the median GRE scores, will have very small coefficients to reflect the change in scale in
going from GRE scores (in the hundreds) to ratings (in the 1 to 6 range). Conversely, program
variables with small values, such as proportions, will have larger regression coefficients to
reflect the change in scale in going from numbers less than 1 to ratings (in the 1 to 6 range).
To avoid the ambiguity between the influence of the scale and the real predictive
importance of a variable, we needed to modify the values of the different program variables so
they have similar scales. This would ensure that program variables with the same influence on
the prediction of faculty ratings would have similar regression-coefficient values. Our solution is
the very common one of standardizing the $p_{jk}$-values by subtracting their mean across the
programs in a field and dividing by the corresponding standard deviation. This will result in
program variables that have the same mean (0.0) and standard deviation (1.0) across the
programs in the field. In this way, no program variable will have substantially larger or smaller
values than any other program variable across the programs in a field. For the regressions of box
(4), the standardization was done only over the programs that were sampled for rating.
We denote the values of the standardized program variables with an asterisk ($p_{jk}^*$ and
P*). Two program variables (Student Work Space and Health Insurance) were coded as 1
(present) or -1 (absent). We felt that there was no need for additional standardization of these
two program variables, and they were not standardized to have mean 0 and variance 1.
The standardized program variables for the sampled and rated programs served as the
predictor or independent variables in the regressions that lead to the regression-based weights.
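The standardization of box (3) can be sketched as below. The text does not say whether the standard deviation uses a divisor of n or n - 1, so the population form (`statistics.pstdev`) is an assumption here; the function name, the `skip` argument for the two variables coded as 1 or -1, and the toy data are likewise hypothetical.

```python
from statistics import mean, pstdev


def standardize(P, skip=()):
    """Standardize each column of P to mean 0 and SD 1, except columns in `skip`.

    P    -- list of rows, one row per program
    skip -- column indices left as they are (e.g. variables coded 1 / -1)
    """
    out = [list(row) for row in P]
    for k in range(len(P[0])):
        if k in skip:
            continue
        column = [row[k] for row in P]
        mu, sd = mean(column), pstdev(column)  # assumption: divisor n
        if sd == 0:
            raise ValueError(f"column {k} is constant and cannot be standardized")
        for j in range(len(P)):
            out[j][k] = (column[j] - mu) / sd
    return out


# Hypothetical programs: publications per faculty, median GRE, work space (1 / -1).
P = [
    [2.0, 600.0, 1],
    [4.0, 700.0, -1],
    [6.0, 800.0, 1],
]
P_star = standardize(P, skip={2})
for row in P_star:
    print(row)
```

After standardization the first two columns are identical even though the raw values differ by two orders of magnitude, which is the point of removing the arbitrary scale.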
Box (4): The regressions and the regression-based weights
The statistical problem addressed in box (4) is to use $\bar{r}$ and P* as the dependent and
independent variables, respectively, in a linear regression, to obtain the vector of regression-based
weights, $\hat{m}$, using least squares. It should be noted that only the data in P* for the
sampled programs are used. The data for the non-sampled programs in P* are not used in this
step of the process.
Two immediate problems arise: (1) the number of observations (i.e., the
number of sampled programs in a field) is 50 or less, while the number of independent variables
(i.e., the program variables in P*) is 20, and (2) a number of the program variables are correlated
with each other across the programs in a field. This is a less than ideal situation for obtaining
stable regression coefficients. There are too few observations to hope for stable estimates of the
coefficients for 20 variables. The fact that these variables are also correlated does not help
matters either. If we had ignored these two problems, least-squares regression methods would
have tended to assign coefficients rather arbitrarily to one particular variable or to other variables
that are correlated with it, and how this worked out would depend on which programs were
included in the sample of rated programs. The resulting unstable regression coefficients would
have been unusable for our purposes.
For example, when we fit a linear model that included all 20 of the program
variables, we found, as expected, that for a number of the variables the coefficients and their
signs did not make intuitive sense. They made more sense when we used
various step-wise selection methods for reducing the number of variables used as predictors.
With at most 50 cases, we could not expect to use all 20 variables in the prediction
equations without adjustments.
After examining a variety of approaches, we settled on using a backwards, step-wise
selection method applied to the 20 principal component (PC) variables formed from the 20
program variables (rather than using the original 20 program variables). The regression
coefficients obtained for the remaining PC variables were then transformed back to the scale of the
original 20 program variables, with the result that all 20 program variables now had non-zero
coefficients, but these coefficients were subject to several linear constraints implied by the
deleted PC variables.
The principal component variables are linear combinations of the original 20 program
variables that have two properties: (1) they are uncorrelated in the sample, and (2) they can give
exactly the same predictions as do the original variables—that is, every prediction equation that
is possible with the original variables is also possible to form using the PC variables, using
different regression coefficients. The PC variables are usually ordered by their variances from
largest to smallest, but this plays no role here. There are as many PC variables as there are
original variables—in our case, 20.
If we denote the array of original 20 standardized variables for the sample of rated
programs as P*, then the corresponding array of the 20 PC variables, C, is given by the matrix
multiplication, C = P*V, where V is the 20 by 20 orthogonal matrix specified by, among other
things, the singular value decomposition of P*. After the regression coefficients are estimated
using the PC variables, we get back to the coefficients for the original standardized variables in
P* by transforming the vector of regression coefficients by the transformation, V.
Our step-wise use of the PC variables proceeded as follows. We begin with a least-squares
prediction equation, predicting $\bar{r}$ from C, that includes all of the PC variables. Then a
series of analyses is performed, with one PC variable at a time being left out of the prediction
equation; the PC variable that has the least impact on the fit of the predicted ratings (as measured
by its t-statistic) is removed. This process is repeated, removing one PC variable each time, until
the remaining PC variables each add statistically significant improvements to the fit of the
predictions of the ratings (at the 0.05 level). The result is a set of regression coefficients, the PC
coefficients, $\hat{\gamma}$, which predict the sample of program ratings from a subset of the PC variables,
i.e.,
$\hat{\bar{r}} = C \hat{\gamma}$. (4)
In Equation 4, the caret denotes estimation. Moreover, for the PC variables that have
been eliminated during the backwards selection process, the corresponding PC-coefficients, $\hat{\gamma}_k$,
are zero. These zeros mean that we are setting the coefficients of certain linear combinations of
the original variables to zero rather than setting the coefficients for some of the original program
variables to zero. This was regarded as a virtue, because we did not necessarily eliminate any of
the original program variables from the prediction equation used to find the regression-based
weights. By proceeding this way, we are not forced to give a zero weight to one of two collinear
variables in the step-wise procedure. Instead, both collinear variables will typically load onto the
same principal components and get some weight when the matrix V is applied to the PC
coefficients to obtain the coefficients for the original program variables, i.e.,
$\hat{m} = V \hat{\gamma}$. (5)
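The backward elimination and the back-transformation of Equation 5 can be sketched in plain Python under simplifying assumptions that are not taken from the text: the PC columns are assumed to be centered and mutually orthogonal, so each coefficient can be estimated on its own; the critical value is supplied by the caller as a hypothetical `t_crit(df)` function rather than computed from the t distribution; and the toy matrices C and V are invented for illustration and are not related to each other through an actual decomposition.

```python
import math


def backward_pc_regression(C, r, t_crit):
    """Backward elimination on PC variables, dropping the smallest |t| each step.

    C      -- rows of PC scores; columns assumed centered and mutually orthogonal
    r      -- average ratings, one per sampled program
    t_crit -- callable: residual degrees of freedom -> critical |t| value
    Returns the PC coefficients, with 0.0 for eliminated PCs.
    """
    n, q = len(C), len(C[0])
    r_mean = sum(r) / n
    y = [v - r_mean for v in r]               # the intercept absorbs the mean
    cols = [[row[k] for row in C] for k in range(q)]
    ss = [sum(c * c for c in col) for col in cols]
    # With orthogonal columns each least-squares coefficient is separate.
    gamma = [sum(c * v for c, v in zip(col, y)) / s for col, s in zip(cols, ss)]
    tss = sum(v * v for v in y)
    kept = set(range(q))
    while kept:
        df = n - len(kept) - 1
        if df <= 0:
            raise ValueError("too few observations for the retained PCs")
        rss = max(tss - sum(gamma[k] ** 2 * ss[k] for k in kept), 0.0)
        s2 = rss / df                          # residual variance estimate
        if s2 == 0:
            break                              # perfect fit: nothing to drop
        t = {k: abs(gamma[k]) / math.sqrt(s2 / ss[k]) for k in kept}
        weakest = min(kept, key=t.get)
        if t[weakest] >= t_crit(df):
            break                              # every remaining PC is significant
        kept.remove(weakest)
    return [gamma[k] if k in kept else 0.0 for k in range(q)]


def back_transform(V, gamma):
    """Equation 5: coefficients for the original variables, m = V gamma."""
    return [sum(V[k][l] * gamma[l] for l in range(len(gamma))) for k in range(len(V))]


# Hypothetical data: 8 programs, 3 orthogonal centered PC columns.
C = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
     [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
r = [4.6, 4.4, 4.4, 4.6, 2.4, 2.6, 2.6, 2.4]
# 2.0 is a rough stand-in for the two-sided 5% critical value (assumption).
gamma_hat = backward_pc_regression(C, r, t_crit=lambda df: 2.0)

# Hypothetical orthogonal matrix, used only to show the back-transformation.
V = [[0.6, 0.8, 0.0], [-0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
m_hat = back_transform(V, gamma_hat)
print(gamma_hat, m_hat)
```

In the toy example only the first PC survives, yet two of the original variables receive non-zero coefficients after the back-transformation, which mirrors the point made above about collinear variables.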
In the same way, the matrix of estimated variances and covariances of $\hat{\gamma}$, obtained from the
least-squares output, may be transformed to the corresponding matrix for $\hat{m}$. The variances from
this matrix are used later in box (6) in the computation of the “optimal fraction” for combining
the direct and regression-based weights.
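The text does not write this transformation out. Under the standard rule for the covariance of a linear transformation, and with $\hat{m} = V\hat{\gamma}$ as in Equation 5, it would presumably take the following form, where the symbol $\widehat{\mathrm{Cov}}$ is notation introduced here for the estimated covariance matrix:

```latex
\widehat{\mathrm{Cov}}(\hat{m}) = V \, \widehat{\mathrm{Cov}}(\hat{\gamma}) \, V^{\top},
\qquad
\widehat{\mathrm{Var}}(\hat{m}_k) = \left[ V \, \widehat{\mathrm{Cov}}(\hat{\gamma}) \, V^{\top} \right]_{kk},
\quad k = 1, \ldots, 20 .
```

Rows and columns of $\widehat{\mathrm{Cov}}(\hat{\gamma})$ that correspond to eliminated PC variables would be zero under this reading, since those coefficients are fixed at zero.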
The regression coefficient for the kth program variable, denoted by $\hat{m}_k$, is the regression-based
weight for program characteristic k as a predictor of the average ratings of the programs by
the faculty raters, and $\hat{m} = (\hat{m}_1, \hat{m}_2, \ldots, \hat{m}_{20})$.
The predicted perceived quality rating for a sampled program can be expected to differ
somewhat from the actual average rating for that program. For example, for the two fields
studied in Assessing Research Doctorate Programs: A Methodology Study, the root-mean-square
deviation between the predictions and the average ratings was 0.42 on a 1-to-6 rating scale for
both mathematics and English. In addition, the (adjusted) $R^2$ of the regressions of average ratings
on measured program characteristics was 0.82 for mathematics and 0.80 for English. These
values indicate that the predictions account for about 80 percent of the variability in average
ratings. We regarded these as satisfactory levels of agreement between predicted and actual
ratings for using these methods in this study.
These results show that the predicted perceived quality ratings agree fairly well with the
actual ratings. However, these results do not indicate how well a prediction equation that was
based on a sample of programs will reproduce the predictions of the equation for the whole
population of programs in a field. The data for mathematics, reported in Assessing Research
Doctorate Programs: A Methodology Study, indicate that using 49 programs did a reasonably
good job of reproducing the predictions based on the whole field of 147 mathematics programs. [39]
Thus, we decided that in developing the regression-based ratings, we would use a sample of 50
programs from a field if it had more than 50 programs and use almost all of the programs in
fields with 50 or fewer programs. When there were fewer than 30 programs in a field, it was
combined with a larger discipline with similar direct weights for the purposes of estimating the
regression-based weights. [40] In one case, computer engineering, there were fewer than 25
programs, and this field was combined with the field of electrical and computer engineering to
estimate the regression-based coefficients. [41]
[39] See Appendix G of Assessing Research Doctorate Programs: A Methodology Study, National Research Council
(2003).
[40] The fields for which this was done were:
Small Field                          Surrogate Field
Aerospace engineering                Mechanical engineering
Agricultural economics               Economics
American studies                     English literature
Astrophysics and astronomy           Physics
Entomology                           Plant science
Forestry                             Plant science
Food science                         Plant science
Engineering science and mechanics    Mechanical engineering
Theatre and performance              English literature
There is one final alteration in the values of $\hat{m}$ that needs to be mentioned. The direct
weights, {$\bar{x}_k$}, have absolute values that sum to 1.0. This is not necessarily true of the regression
coefficients, {$\hat{m}_k$}. The scale of $\hat{m}_k$ depends on both the scale of $p_{jk}$ and the scale of the average
ratings, {$\bar{r}_j$}. We decided, because our intent was to combine these two sources of the importance
of the various program variables, that they needed to be on similar scales. We decided to force
them both to sum to 1.0 in absolute value. [42] This allows the direct and regression-based weights
to have negative values where they arise, typically in the regression-based weights, without
requiring anything complicated to deal with this. Using the sum of absolute values allows the
sign of the regression-based weights to be determined by the data rather than by an a priori
hypothesis. Thus, we divided each regression coefficient, $\hat{m}_k$, by the sum of the absolute values
of all the regression coefficients. In this way, both the direct and regression-based weights are
fractional values, mostly positive but some negative, whose absolute sums equal 1.0. The
estimated standard deviations of the {$\hat{m}_k$}, obtained in standard ways from the regression output,
were also divided by this sum to make them the correct size for use in the process of combining
the direct and regression-based weights, discussed below.
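The rescaling just described amounts to dividing both the coefficients and their estimated standard deviations by the same constant. A minimal sketch with hypothetical three-variable values (the function name and the numbers are illustrative only):

```python
def normalize_abs_sum(m_hat, sd_m):
    """Scale coefficients and their standard deviations so that sum(|m_k|) = 1.0."""
    scale = sum(abs(v) for v in m_hat)
    if scale == 0:
        raise ValueError("all regression coefficients are zero")
    # Signs are preserved; only the overall scale changes.
    return [v / scale for v in m_hat], [s / scale for s in sd_m]


# Hypothetical regression coefficients and standard deviations.
m_raw = [0.6, -0.2, 1.2]
sd_raw = [0.10, 0.04, 0.20]
m_norm, sd_norm = normalize_abs_sum(m_raw, sd_raw)
print(m_norm, sd_norm)
```

The negative coefficient stays negative, and each ratio of coefficient to standard deviation is unchanged by the rescaling.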
BOXES (5) AND (6): THE COMBINED WEIGHTS
To motivate our method of combining the direct and regression-based weights, we
start by describing the direct and regression-based ratings. Remembering that the standardized
values of the program variables for program j are denoted by $p_{jk}^*$, the direct rating for program j,
using the average direct weight vector, $\bar{x}$, is $X_j$, given by
$X_j = \sum_{k=1}^{20} \bar{x}_k \, p_{jk}^*$. (6)
The regression-based rating for program j, using the regression-based weight vector, $\hat{m}$,
is $M_j$, given by
$M_j = \sum_{k=1}^{20} \hat{m}_k \, p_{jk}^*$. (7)
[41] The committee had not anticipated this when it developed the taxonomy, or the field would not have been included
as a separate field.
[42] We use the absolute value here because, for time to degree, a higher value should receive a negative weight.
Note that the regression-based rating is a linear transformation of the predicted ratings
used to obtain the regression-based weights, because the constant term of the regression is
deleted, and the weights have been scaled by a common value so that their absolute sum is 1.0.
The procedure for computing regression-based ratings can be used for any program, sampled or
not, in the given field. Simply use $M_j$ as defined in Equation 7 above, where {$p_{jk}^*$} comes from
the data for program j and the {$\hat{m}_k$} are the regression-based weights based on the sample of
programs and raters. [43]
We combined the direct ratings with the regression-based ratings as follows. Let w denote
a policy weight and form the following combination of the direct and regression-based ratings:
$R_j = w M_j + (1 - w) X_j$. (8)
The policy weight, w, is chosen in box (5) of Figure A-1, and is the amount the regression-based
ratings are allowed to influence the combined rating, $R_j$. When w = 0, the regression-based rating
has no influence on $R_j$. When w = 1, the $R_j$ are totally based upon the regression-based
ratings. Any compromise value of w is somewhere between 0 and 1.
We did not actually form both the direct and regression-based ratings in our work.
Instead, we exploited the simple linear form of these given by:
$R_j = w \sum_{k=1}^{20} \hat{m}_k \, p_{jk}^* + (1 - w) \sum_{k=1}^{20} \bar{x}_k \, p_{jk}^* = \sum_{k=1}^{20} f_k \, p_{jk}^*$ (9)
where the combined weight, $f_k$, is given by
$f_k = w \hat{m}_k + (1 - w) \bar{x}_k$. (10)
The representation of the combined rating given in Equations 9 and 10 is a linear
combination of the program variables that uses the combined weights, {$f_k$}, defined in Equation
10. The combined weight $f_k$ is applied to the kth standardized program characteristic, $p_{jk}^*$, for
each k, and then all 20 of these weighted values are summed to obtain the final combined rating
for program j.
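The equivalence between Equation 8 and Equations 9 and 10 is easy to check numerically. The sketch below uses hypothetical three-variable weights and one hypothetical standardized program row; it covers only Equations 6 to 10 and not any later adjustment to the combined weights.

```python
def rating(weights, p_star_row):
    """Weighted sum of standardized program variables (form of Equations 6, 7 and 9)."""
    return sum(w_k * p for w_k, p in zip(weights, p_star_row))


def combined_weights(m_hat, x_bar, w):
    """Equation 10: f_k = w * m_k + (1 - w) * x_k."""
    return [w * m + (1 - w) * x for m, x in zip(m_hat, x_bar)]


# Hypothetical weights (absolute sums equal 1.0) and one standardized program row.
x_bar = [0.5, 0.3, 0.2]
m_hat = [0.3, -0.1, 0.6]
p_star_j = [1.0, -0.5, 2.0]
w = 0.5

X_j = rating(x_bar, p_star_j)                  # direct rating, Equation 6
M_j = rating(m_hat, p_star_j)                  # regression-based rating, Equation 7
R_j = w * M_j + (1 - w) * X_j                  # Equation 8
f = combined_weights(m_hat, x_bar, w)
R_j_from_f = rating(f, p_star_j)               # Equation 9
print(X_j, M_j, R_j, R_j_from_f)
```

Both routes give the same combined rating, which is why the direct and regression-based ratings never need to be formed separately.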
However, because both $\hat{m}_k$ and $\bar{x}_k$ are subject to uncertainty, we made one additional
adjustment to Equation 10 that is described below, following the discussion of how we simulated
the uncertainty in both the direct weights and in the average ratings used to form the regression-
based weights.
[43] We have throughout estimated linear regressions. Is this assumption justified? We can only say that, empirically,
we tried alternative specifications that included quadratic terms for the most important variables (publications and
citations) and did not find an improved fit.
BOXES (1), (1a), (2) AND (2a): SIMULATING THE UNCERTAINTY IN THE
DIRECT AND REGRESSION-BASED WEIGHTS
The direct weight vector, $\bar{x}$, is subject to uncertainty; that is, a different set of respondent
faculty would have led to different values in $\bar{x}$. Disagreement among the graduate faculty on
the relative importance of the 20 program variables is the source of the uncertainty of the direct
weights. The average ratings of the sampled faculty in $\bar{r}$ are also subject to uncertainty; a
different sample of raters or programs would have produced different values in $\bar{r}$. One way to
reflect this uncertainty is to use the sampling distributions of $\bar{x}$ and $\bar{r}$. There are various ways
that these sampling distributions may be realized. We chose an empirical approach that made no
assumptions about the shapes of the various distributions involved and that allowed us to use
computer-intensive methods to let the sampling variability of both $\bar{x}$ and $\bar{r}$ influence the final
ratings and rankings. We examined two empirical approaches, Efron’s bootstrap and a random-
halves (RH) procedure suggested by the committee chairman. We found that both gave very
similar final results in terms of the final ranges of rankings and ratings. The bootstrap requires
taking a sample of N with replacement from the relevant empirical distribution. The RH
procedure requires taking a sample of N/2 without replacement from the same empirical
distribution. We chose to use the RH procedure because it cut the sampling computations in half,
is fairly easy to explain, and, as far as we could tell, gave essentially the same results as the
bootstrap for ranking and rating.
Boxes (1) and (2): The random halves procedure
The RH procedures for $\bar{x}$ and $\bar{r}$ are nearly the same and have the same
justification. X is a complete array whose rows denote the N faculty respondents, while R is an
incomplete array whose rows denote the n sampled faculty raters for a field. In the case of X, the
RH procedure requires a random sample of size N/2 of the faculty respondents. In the case of R,
the RH procedure requires a random sample of size n/2 of the faculty raters. Repeated draws
of these random half samples are then used to simulate the uncertainty in $\bar{x}$ and $\bar{r}$,
respectively.
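The random-halves draw for the complete array X can be illustrated as follows. This is a minimal sketch and not the committee's code: the function name `rh_mean_replications`, the toy array `X`, and the seed are hypothetical, and the incomplete array R (where each rater scores only some programs) would need per-program averaging that is not shown here.

```python
import random
from statistics import fmean

def rh_mean_replications(rows, reps=500, seed=0):
    """Return `reps` vectors of column means, each computed from a random
    half of the rows drawn without replacement (the RH procedure)."""
    rng = random.Random(seed)
    half = len(rows) // 2
    out = []
    for _ in range(reps):
        sample = rng.sample(rows, half)              # N/2 rows, no repeats
        out.append([fmean(col) for col in zip(*sample)])
    return out

# Hypothetical X: 6 respondents, 2 program variables, each row summing to 1.
X = [[0.60, 0.40], [0.50, 0.50], [0.70, 0.30],
     [0.40, 0.60], [0.55, 0.45], [0.65, 0.35]]
reps = rh_mean_replications(X, reps=500, seed=1)
```

Each element of `reps` plays the role of one simulated direct-weight vector; a bootstrap version would instead draw N rows with replacement.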
Alert readers may worry that these half samples will exhibit too much variability in the
resulting averages; after all, a half sample has only half the number of cases as a full sample—
and the bootstrap always takes a full sample of N or n. The explanation of why a half sample
without replacement has essentially the same variability as a full sample with replacement is
most easily seen by considering the variance of the mean of a sample without replacement from a
finite population. It is well known from sampling theory that the variance of the mean from a
sample of size N/2, from a population of size N is, essentially,
$$\operatorname{Var}(\bar{x}_k) = \frac{\sigma_{x_k}^2}{N/2}\left(1 - \frac{N/2}{N}\right) = \frac{\sigma_{x_k}^2}{N}. \tag{11}$$
That is, because of the “finite sampling correction,” the variance from a random half
sample without replacement is exactly the same as the variance of a random sample of twice the
size with replacement (there is a small “N versus N – 1” effect that Formula 11 ignores). This is
why the bootstrap and the RH methods give such similar results in our application to the
uncertainty of the direct weights. There are other reasons to expect the RH method to produce a
useful simulation of the uncertainty of averages.44
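The equivalence behind Formula 11, including the "N versus N – 1" effect it ignores, can be checked exactly on a small population by enumerating every possible sample. This is an illustrative sketch under assumed toy values (`pop` is hypothetical); it enumerates all half samples without replacement and all full samples with replacement instead of simulating them.

```python
from itertools import combinations, product
from statistics import fmean, pvariance

def var_of_means(samples):
    """Variance of the sample mean over an exhaustive list of samples."""
    return pvariance([fmean(s) for s in samples])

pop = [1.0, 2.0, 4.0, 7.0, 8.0, 11.0]    # hypothetical population, N = 6
N = len(pop)

var_half = var_of_means(combinations(pop, N // 2))   # RH: N/2 without replacement
var_boot = var_of_means(product(pop, repeat=N))      # bootstrap: N with replacement
ratio = var_half / var_boot                          # equals N / (N - 1)
```

For this tiny population the two variances differ by the factor N/(N – 1); the factor approaches 1 as N grows, which is why the two methods agree closely for realistic numbers of respondents.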
The same reasoning applies to the RH sampling of the faculty raters in R to simulate the
uncertainty in the average ratings, $\bar{r}$, used to obtain the regression-based weights. The procedure
was to sample a random half of all raters for programs in a field and compute the average rating
for each program from that half sample.
The regression-based weights are subject to uncertainty from two sources. The first is the
uncertainty arising from sampling the faculty raters and, as indicated above, the RH sampling
directly addresses this source. The second is from using average ratings from a sample of
programs rather than all the programs to develop the regression equation from which the
regression-based weights are derived. In the discussion of box (4), above, we gave our reasoning
for believing the sample of 50 programs is adequate, and how we pool the data from other related
fields when the number of programs in a field is smaller than 50. In addition, while the use of
ratings for a sample of programs has the practical value of reducing the workload of the faculty
raters, our implicit use of the predicted average ratings, {Mj}, from Equation 7 above, rather
than actual average ratings, {$\bar{r}_j$}, also reduces some of the uncertainty due to the sampling of the
programs to be rated. For these two reasons, we believe that this second source of uncertainty is
not as important as that simulated by the RH procedure for the uncertainty in the average ratings,
and consequently, for the regression-based weights, $\hat{m}$.
We always drew the RH samples 500 times, and those for $\bar{x}$ were statistically
independent of those for $\bar{r}$. This gives us 500 replications of the direct weights and 500
replications of the regression-based weights that we then combined into 500 replications of the
combined weights, which we describe next.
Box (6): Using the optimal fraction to combine the direct and
regression-based weights.
In deriving the ranges of ratings that reflect the uncertainty in $\hat{m}_k$ and $\bar{x}_k$, simulated
values, mk and xk, are drawn from the sampling distributions of $\hat{m}_k$ and $\bar{x}_k$, respectively, using
independent RH samples from the appropriate parts of R and X. These two simulated values are
to be combined to form a simulated value, fk, for $\bar{f}_k$ in Equation 10. However, the simple
weighted average in Equation 10 only reflects the effect of the policy weighting, w, and ignores
the fact that both mk and xk are independent random draws from distributions, rather than fixed
values. We want to combine mk and xk in such a way as to bring the simulated value, fk, as close
as possible to $\bar{f}_k$ on average, and in a way that will also reflect the policy weight, w,
appropriately. This section outlines our approach to choosing the optimal fraction to apply to mk
to achieve this. The optimal fraction is the amount of weight applied to mk that minimizes the
mean-square error of fk, treating $\bar{f}_k$ as a target parameter to be estimated.
44
The random-halves procedure has a place in the statistical literature, but with other names. It is an example of the
“deleted-d” jackknife as described in Efron and Tibshirani (1993), An Introduction to the Bootstrap. New York:
Chapman and Hall, p. 149, with d = n/2. It is described by Kirk Wolter in a private communication as an example of
the “balanced repeated replication” or “balanced half samples,” and described in Wolter, K. M. (2007), Introduction
to Variance Estimation, 2nd ed. New York: Springer-Verlag.
First, consider a general weighting, fk(u), that uses a fraction, u. This weighting has the
form
$$f_k(u) = u\,m_k + (1 - u)\,x_k. \tag{12}$$
By construction of the RH procedure, the mean of the distribution of mk is $\hat{m}_k$ (the regression
coefficient that is obtained when the data from all n faculty raters are used). Similarly, the
mean of the distribution of xk is $\bar{x}_k$, the mean importance value that is obtained when the data
from all N faculty respondents are averaged. We may regard fk(u) as an estimator of φk, given by
$$\varphi_k = w\,\hat{m}_k + (1 - w)\,\bar{x}_k. \tag{13}$$
The problem then is to find the value of u that will minimize the mean-square error
(MSE) of fk(u), given by
$$\mathrm{MSE}(u) = E\big(f_k(u) - \varphi_k\big)^2, \tag{14}$$
where, in Equation 14, the notation E(·) denotes the expectation, or average, taken over
the independent RH distributions of $\hat{m}_k$ and $\bar{x}_k$. The MSE is a measure of the combined
uncertainty in fk(u).
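The MSE in (14) can be written as
$$\begin{aligned}
\mathrm{MSE}(u) &= E\big(u\,m_k + (1 - u)\,x_k - w\,\hat{m}_k - (1 - w)\,\bar{x}_k\big)^2 \\
&= E\big(u(m_k - \hat{m}_k) + (1 - u)(x_k - \bar{x}_k) + (u - w)\,\hat{m}_k + (w - u)\,\bar{x}_k\big)^2 \\
&= E\big(u(m_k - \hat{m}_k) + (1 - u)(x_k - \bar{x}_k) + (u - w)(\hat{m}_k - \bar{x}_k)\big)^2.
\end{aligned} \tag{15}$$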
The point of re-expressing Equation 14 as Equation 15 is that now, when the squaring is carried
out, all of the terms except the squared ones have zero expected values and can be ignored: the
deviations mk – $\hat{m}_k$ and xk – $\bar{x}_k$ each have mean zero and are independent of one another. If we
denote the variance of the sampling distribution of mk by $\sigma^2(\hat{m}_k)$ and the variance of xk by
$\sigma^2(\bar{x}_k)$, then Equation 15 becomes
$$\mathrm{MSE}(u) = u^2\sigma^2(\hat{m}_k) + (1 - u)^2\sigma^2(\bar{x}_k) + (u - w)^2(\hat{m}_k - \bar{x}_k)^2. \tag{16}$$
It is now a straightforward task to differentiate Equation 16 with respect to u, set the result to
zero, and solve for the optimal u-value, u0k, which we call the optimal fraction. This calculation
results in
$$u_{0k} = \frac{\sigma^2(\bar{x}_k) + w(\hat{m}_k - \bar{x}_k)^2}{\sigma^2(\bar{x}_k) + \sigma^2(\hat{m}_k) + (\hat{m}_k - \bar{x}_k)^2}. \tag{17}$$
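For readers who want the intermediate algebra, the differentiation step can be written out as below. This is a worked restatement that uses only the quantities already defined for Equations 16 and 17.

```latex
\begin{aligned}
\frac{d\,\mathrm{MSE}(u)}{du}
  &= 2u\,\sigma^2(\hat{m}_k) - 2(1-u)\,\sigma^2(\bar{x}_k)
     + 2(u-w)(\hat{m}_k-\bar{x}_k)^2 = 0 \\
\Longrightarrow\quad
u\left[\sigma^2(\hat{m}_k) + \sigma^2(\bar{x}_k) + (\hat{m}_k-\bar{x}_k)^2\right]
  &= \sigma^2(\bar{x}_k) + w(\hat{m}_k-\bar{x}_k)^2 \\
\frac{d^2\,\mathrm{MSE}(u)}{du^2}
  &= 2\left[\sigma^2(\hat{m}_k) + \sigma^2(\bar{x}_k) + (\hat{m}_k-\bar{x}_k)^2\right] \ge 0
\end{aligned}
```

Because the second derivative is non-negative, the stationary point is a minimum of the MSE.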
The optimal fraction in Equation 17 has some useful and intuitive properties. It takes on
the value w when there is no uncertainty about the direct and regression-based weights.
Moreover, w has no influence on the optimal fraction when $\hat{m}_k$ and $\bar{x}_k$ are equal. In that case, the
direct weights and regression-based weights on the kth program characteristic are the same, and
the optimal fraction combines the two simulated values in a way that is inversely proportional to
their variances, so that the value with less variation gets more weight. Note also that the value in
Equation 17 is the same for all of the RH simulated values of mk and xk.
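The properties just described can be checked numerically. The sketch below is illustrative only: the function names and input values are hypothetical, and returning w when the denominator is zero is an assumption made here (in that case every u gives the same MSE).

```python
def mse(u, var_x, var_m, m_hat, x_bar, w):
    """Equation 16: mean-square error of f_k(u)."""
    return (u ** 2 * var_m + (1 - u) ** 2 * var_x
            + (u - w) ** 2 * (m_hat - x_bar) ** 2)

def optimal_fraction(var_x, var_m, m_hat, x_bar, w):
    """Equation 17: the fraction u that minimizes Equation 16."""
    d2 = (m_hat - x_bar) ** 2
    denom = var_x + var_m + d2
    if denom == 0:
        return w   # assumption: MSE is flat in u, so fall back to the policy weight
    return (var_x + w * d2) / denom

# No uncertainty in either weight: the optimal fraction equals w.
u_no_noise = optimal_fraction(0.0, 0.0, 0.30, 0.10, w=0.5)
# Equal weights: w drops out, and the less variable value gets more weight.
u_equal = optimal_fraction(0.04, 0.01, 0.20, 0.20, w=0.9)
# General case.
u_general = optimal_fraction(0.04, 0.01, 0.30, 0.10, w=0.5)
```

In the second call the regression-based weight has the smaller variance (0.01 versus 0.04), so it receives the larger share, 0.04/(0.04 + 0.01) = 0.8, regardless of w.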
The two variances in Equation 17, $\sigma^2(\bar{x}_k)$ and $\sigma^2(\hat{m}_k)$, may be found in standard ways.
The value of $\sigma^2(\bar{x}_k)$ is given by
$$\sigma^2(\bar{x}_k) = \sigma^2(x_k)/N_F, \tag{18}$$
where NF denotes the number of faculty in the field who supply direct weight data, and $\sigma^2(x_k)$
denotes the variance of the individual direct weights given to the kth program variable by these
faculty respondents. The value of $\sigma^2(\hat{m}_k)$ is obtained from the regression output that produces
$\hat{m}_k$ when the data from all faculty raters in a field are used. Its square root, $\sigma(\hat{m}_k)$, is the standard
error of the regression coefficient, $\hat{m}_k$. Finally, because we rescaled the $\hat{m}_k$ so that their absolute
sum was 1.0, the same divisor must be applied to $\sigma(\hat{m}_k)$ to put it on the corresponding scale.
If we now replace the u in Equation 12 with u0k given in Equation 17, we obtain the
combined weight that optimally combines the two simulated values of the weights, mk and xk.
The resulting combined rating is given by
$$R_{0j} = \sum_{k=1}^{20} f_{0k}\, p^*_{kj}, \tag{19}$$
where
$$f_{0k} = u_{0k}\,m_k + (1 - u_{0k})\,x_k, \tag{20}$$
and u0k is given by Equation 17. The vector of optimally combined weights is denoted by $\mathbf{f}_0$
(see footnote 45).
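Equations 19 and 20 amount to an elementwise blend of the two simulated weight vectors followed by a weighted sum over the standardized program variables. The following sketch uses three variables and two programs instead of 20 variables; all names and numbers are hypothetical.

```python
def combined_weights(u0, m, x):
    """Equation 20: f0k = u0k * mk + (1 - u0k) * xk for each variable k."""
    return [u * mk + (1 - u) * xk for u, mk, xk in zip(u0, m, x)]

def combined_ratings(f0, p_star):
    """Equation 19: R0j = sum over k of f0k * p*_kj.
    p_star[k][j] is the standardized value of variable k for program j."""
    n_programs = len(p_star[0])
    return [sum(f0[k] * p_star[k][j] for k in range(len(f0)))
            for j in range(n_programs)]

u0 = [0.5, 0.8, 0.2]           # optimal fractions from Equation 17
m = [0.4, 0.1, -0.2]           # one simulated draw of regression-based weights
x = [0.2, 0.3, 0.0]            # one simulated draw of direct weights
p_star = [[1.0, -1.0],         # variable 1 for programs 1 and 2
          [0.5, -0.5],
          [-1.0, 1.0]]

f0 = combined_weights(u0, m, x)
ratings = combined_ratings(f0, p_star)
```

Here `f0` comes out to approximately [0.30, 0.14, -0.04], and the two ratings to approximately 0.41 and -0.41.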
45
The weights f0k differ little from the weights that would be obtained from Equation 10 with w = ½ in fields with
a large number of programs. For example, the program described in Chapter 5 in economics is one of 117 programs,
and the root mean square difference between the optimal weights calculated from Equation 20 and those from
Equation 10 with w = ½ over the 500 iterations is 0.00468. The average absolute difference in rankings for the 117
programs in economics between those for the optimal weights and those with w = ½ is 3.972 and 3.979 for the 1st
and 3rd quartile ratings, respectively. The average difference in the lengths of the ranking range over the 117
programs was 6.047 for optimal weighting and 6.032 for the w = ½ weighting. These differences may be greater if
the field is composed of a small number of programs with fewer responses by the faculty for the importance weights
and a larger variance on those weights, such as applied mathematics with 33 programs.
The values of R0j from Equations 19 and 20 are used as the 500 simulated values of the
combined ratings for the purposes of determining the ranking interval ranges for each program,
which are discussed below.
In performing the RH sampling to mimic the uncertainty in the direct and
regression-based weights, it should be emphasized that the random half samples from X and R were
statistically independent. This is our justification for assuming that the random draws, mk and xk,
are statistically independent in the calculation of the optimal fraction, u0k.46
46
The fact that the raters for each field were a subset of those who answered the faculty questionnaire may confuse
some into thinking that our independence assumption may not be justified. This is an unfortunate misunderstanding
of the simulation of uncertainty in the rating and ranking process. It is the statistical independence of the two RH
sampling processes that matters, nothing else.
As a final point, we did realize that the approach to calculating the optimal fraction
described above did not take into account any correlation between the direct and
regression-based weights for different program variables. We did examine a method that did, but it simply
produced a matrix version of Equation 17 that reduced to the procedure we used when the
program variables were uncorrelated, and was otherwise difficult to implement with the resources
available to us.
BOX (7): ELIMINATING NON-SIGNIFICANT PROGRAM VARIABLES.
After we had obtained the 500 simulated values of the combined weights by applying
Equations 17 and 20 to the 500 simulated values for the direct and regression-based weights, we
were in a position to examine the distributions of these 500 values of the combined weights for
each program variable. The distributions of the combined weights for some of the program
variables did not contain zero and were not even near zero. However, other program variables
had combined weight distributions that did contain zero. If zero is inside the middle 95 percent
of this distribution, we declare the combined weight for that program variable to be
nonsignificant for the rating and ranking process (in analogy with the usual way that distributions of
parameters are tested for statistical significance). If the combined weight for a program variable
is not significantly different from zero, that variable is dropped from further
computations. This elimination of program variables required us to recalculate everything above
box (7) in Figure A-1. The eliminated program variables are ignored in calculating the direct and
regression-based weights for the other variables. New RH samples are drawn, the direct weights
are retransformed so that the absolute sum of the remaining direct weights is 1.0, the
regressions are re-run using the reduced set of program variables as predictors, and new optimal
fractions are computed to combine the direct and regression-based weights. Finally, the 500
simulated combined coefficients are again tested for statistical significance from zero. This
process is repeated until a final set of combined weights, each of which is significantly different
from zero, is obtained. Only after this testing and retesting process is performed are the final sets
of 500 combined coefficients ready for use in the computation of the intervals of rankings that
are discussed in box (9) of Figure A-1. The values for the combined weights that correspond to
the eliminated variables are set to 0.0 in each of the final 500 simulated values of f0. These 500
vectors of combined weights are used in the production of the ratings that are used to produce the
final intervals of rankings for each program, as discussed later.
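A single pass of the significance screen in box (7) can be sketched as follows. This is an illustration under stated assumptions, not the committee's implementation: the function names and toy data are hypothetical, the linear-interpolation percentile is one common convention that the text does not specify, and the redrawing of RH samples, refitting of regressions, and repetition of the screen are not shown.

```python
def percentile(sorted_vals, q):
    """Percentile by linear interpolation (an assumed convention), q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def nonsignificant(weight_reps):
    """weight_reps[r][k] is the combined weight of variable k in replication r.
    Returns the indices k whose middle 95 percent of values contains zero."""
    drop = []
    for k in range(len(weight_reps[0])):
        vals = sorted(rep[k] for rep in weight_reps)
        if percentile(vals, 2.5) <= 0.0 <= percentile(vals, 97.5):
            drop.append(k)
    return drop

# Hypothetical replications: variable 0 is clearly positive, variable 1 straddles zero.
weight_reps = [[0.1 + 0.2 * r / 499, -0.05 + 0.1 * r / 499] for r in range(500)]
to_drop = nonsignificant(weight_reps)
```

In this toy example only the second variable is flagged, because its simulated weights run from -0.05 to 0.05.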
Empirically, the examination of three fields suggests that this process has two useful
effects. First, the middle of the inter-quartile ranges of rankings of programs is changed very
little, so that the ranges before eliminating nonsignificant program variables and those after this
elimination are centered in nearly the same places47. Second, the widths of these inter-quartile
ranges are slightly reduced or are unchanged. These are the effects that we would expect from
eliminating variables that are having only a noisy effect on the ranking and rating process, and
for this reason, we have continued to include box (7) in our rating and ranking process.
Nonetheless, the inter-quartile intervals do shift more markedly than the medians, when
estimated coefficients are set to zero—largely for those departments near the middle of the
rankings. This is because quartile estimates are more variable than median estimates. There are
even rare instances in which the intervals calculated both ways do not overlap.
BOXES (8), (8a) AND (8b): INCORPORATING UNCERTAINTY INTO THE PROGRAM
VARIABLES
In addition to the uncertainty in the direct and regression-based weights discussed above,
there is also some uncertainty in the values of the program variables themselves. Some of the 20
program variables used to calculate the ratings also vary or have an error associated with their
values due to year-to-year fluctuations. Data for five of the variables (publications per faculty,
citations per publication, GRE scores, Ph.D. completion, and number of Ph.D.’s) were collected
over time, and averages over a number of years were used as the values of these program
variables. If a different time period had been used, the values would have been different. To
express this type of uncertainty, a relative error factor, ejk, was associated with each program
variable value, pkj. The relative error factor was calculated by dividing the standard deviation
over the series by the square root of the number of observations in the series, and then dividing
that number by the value of the variable pkj. For example, the publications per faculty variable is
the average number of allocated publications per allocated faculty over 7 years, and a standard
error value was calculated for this variable as SD/√7. This standard error was then divided by the
value of the publications per faculty variable to get the relative error factor for this program
variable.
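The relative error calculation for a variable measured over several years can be sketched as below. The yearly series is hypothetical, and two details are assumptions made here because the text does not settle them: the sample (n – 1) standard deviation is used, and the variable's value is taken to be the mean of the yearly values.

```python
from math import sqrt
from statistics import fmean, stdev

def relative_error(series):
    """Standard error of the mean of the series divided by the mean itself."""
    value = fmean(series)                          # assumed value of the program variable
    std_err = stdev(series) / sqrt(len(series))    # e.g. SD / sqrt(7) for 7 years
    return std_err / value

# Hypothetical publications per faculty member over 7 years.
pubs_per_faculty = [2.0, 2.5, 1.5, 2.0, 3.0, 2.0, 1.0]
e = relative_error(pubs_per_faculty)
```

For this series the mean is 2.0 and the relative error is about 0.12. A variable whose value is zero would need separate handling, since the division is undefined.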
47
Examination of the effect of this procedure gave correlations between the median rankings with and without the
elimination of nonsignificant variables of .99.
For the other 15 program variables that are used in the ratings, no data on variability were
directly obtained during the study, and we assigned a relative error of 0, 0.1 or 0.2 to these
variables. The variables Student Workspace and Health Insurance were assigned a relative
error of 0, because they were thought to have little or no temporal fluctuation over the
interval considered; for Percent of Faculty Holding Grants, the error assigned was 0.2,
because an examination of data from the National Science Foundation Survey of Research
Expenditure indicated this to be an appropriate estimate. The remaining 12 program variables
were assigned a relative error of 0.1. Each program had its own relative error factor for each
program variable, ejk.
Just as we had simulated values from the sampling distributions of $\bar{x}$ and $\bar{r}$ via RH
sampling, we also wanted to reflect the uncertainty in the values of the program variables
themselves rather than using the fixed values, {pkj}, in computing program ratings. We did this in
the following way. The value, pkj, was perturbed by drawing randomly from the Gaussian
distribution, $N(p_{kj}, (e_k p_{kj})^2)$. This distribution has a mean equal to the variable value pkj and a
standard deviation equal to the relative error, ek, times the variable value, pkj. Thus, the entire
array P is randomly perturbed to a new array, $\tilde{P}$. This perturbing process is repeated 500 times,
and each perturbed array is standardized to have mean 0.0 and standard deviation 1.0 for each of
the 20 program variables to produce 500 standardized arrays, $\tilde{P}^*$.
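The perturb-then-standardize step can be illustrated as follows. This is a sketch with hypothetical names and data; using the population standard deviation for standardizing, and mapping a constant row to zeros, are assumptions made here.

```python
import random
from statistics import fmean, pstdev

def perturb_and_standardize(P, rel_err, rng):
    """P[k][j] is the value of variable k for program j; rel_err[k] is the
    relative error of variable k. Returns one standardized perturbed array."""
    out = []
    for row, e in zip(P, rel_err):
        # Draw each value from N(p, (e * p)^2); e = 0 leaves the value unchanged.
        drawn = [rng.gauss(p, abs(e * p)) for p in row]
        mu, sd = fmean(drawn), pstdev(drawn)
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in drawn])
    return out

# Hypothetical P: 2 variables, 4 programs; the second variable has no error.
P = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 20.0, 30.0, 50.0]]
rel_err = [0.1, 0.0]
rng = random.Random(7)
arrays = [perturb_and_standardize(P, rel_err, rng) for _ in range(500)]
```

Each of the 500 arrays has mean 0 and standard deviation 1 within every variable, so the combined weights act on a common scale.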
BOX (9): THE INTER-QUARTILE RANGES OF RANKINGS
When we reach box (9), we have already calculated 500 replications of the combined weights
after eliminating the nonsignificant program variables for the given field [from box (7)], and from
500 replications of the steps in boxes (8), (8a) and (8b), we have 500 replications of the standardized
perturbed version of P that contains the program variable data for all of the programs to be rated
in the field. Now we use Equations 17, 19, and 20 to combine the replications of the combined
weights with the replications of the standardized perturbed program variables to obtain 500
replications of the combined rating Rj for each program, j. Denote the kth replication of Rj by
$R_j^{(k)}$. To obtain the kth replication of the rankings of the programs, sort the values of $R_j^{(k)}$ over j
from high to low and assign the rank of 1 to the program with the highest rating in this set. In
the case of tied ratings, we use the standard procedure in which the common rank given to the
tied programs is the average of the ranks that would have been given to the tied set of
programs. For each of the replications of the ratings, there is a
corresponding replication of the rankings of the programs, resulting in 500 replications of the
ranking of each program.
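The ranking rule for one replication, including the averaging of tied ranks, can be sketched as below. The function name and the example ratings are hypothetical.

```python
def rank_high_to_low(ratings):
    """Rank 1 goes to the highest rating; tied ratings share the average
    of the ranks they would otherwise occupy."""
    order = sorted(range(len(ratings)), key=lambda j: -ratings[j])
    ranks = [0.0] * len(ratings)
    i = 0
    while i < len(order):
        t = i
        # Extend t to cover every program tied with the one at position i.
        while t + 1 < len(order) and ratings[order[t + 1]] == ratings[order[i]]:
            t += 1
        avg = (i + t) / 2 + 1        # average of the 1-based positions i+1 .. t+1
        for pos in range(i, t + 1):
            ranks[order[pos]] = avg
        i = t + 1
    return ranks

ranks = rank_high_to_low([0.41, -0.20, 0.41, 0.90])
```

The two programs tied at 0.41 would occupy ranks 2 and 3, so each receives 2.5; the result is [2.5, 4.0, 2.5, 1.0].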
Instead of reporting a single ranking of the programs in a field, we report the
inter-quartile range of the rankings for each program. This is an interval starting with the rank that was
at the 25th percentile (also called the first quartile) in the distribution of the 500 replications of
the ranks for the given program, and ending at the 75th percentile (the third quartile) of this
distribution. The interpretation of the inter-quartile range is that it is the middle of the
distribution of rankings and reflects the uncertainty in the direct and regression-based weights
and in the program data values: 25 percent of a program’s rankings in our process fall
below this interval and 25 percent fall above it. The interval itself represents what we would
expect the typical rankings for that program to be, given the uncertainty in the process and the
ratings of the other programs in the field.48
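Extracting the reported interval from a program's simulated ranks can be sketched as follows. The percentile convention (linear interpolation between order statistics) is an assumption, since the text does not say which convention was used, and the short list of ranks is hypothetical; in the process described above there would be 500 of them.

```python
def rank_iqr(rank_reps):
    """Return the 25th and 75th percentiles of one program's simulated ranks."""
    vals = sorted(rank_reps)

    def pct(q):
        # Linear interpolation between neighboring order statistics (assumed).
        pos = (len(vals) - 1) * q / 100
        lo = int(pos)
        hi = min(lo + 1, len(vals) - 1)
        return vals[lo] + (vals[hi] - vals[lo]) * (pos - lo)

    return pct(25), pct(75)

# Hypothetical ranks of one program over 9 replications.
first_q, third_q = rank_iqr([5, 3, 4, 4, 6, 3, 5, 4, 7])
```

For these nine ranks the interval runs from 4 to 5.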
48
The choice of an inter-quartile range, rather than some other range (eliminating the top and bottom quintile, for
example) is arbitrary. IQRs are standard in the statistical literature. Broader ranges would result in greater overlap.
The point of introducing uncertainty in our calculations is that we do not know the “true” ranking of a program. The
purpose of presenting an IQR is to provide a range in which a program’s ranking is likely to fall.