Testing Teacher Candidates: The Role of Licensure Tests in Improving Teacher Quality
Test 1
Preprofessional Skills Test (PPST): Reading Test*
The PPST (Pre-Professional Skills Test): Reading is produced and sold by the Educational Testing Service (ETS). It is one in a series of tests used by several states as a screening test for entry-level teachers. The test is a 40-item multiple-choice test designed to be administered in a one-hour period. In some states the test is administered prior to admission to a teacher preparation program; in other states it may be administered at any time prior to obtaining an initial teaching certificate (license). The test is intended to assess an examinee’s general reading ability.
A. TEST AND ASSESSMENT DEVELOPMENT
• Purpose: According to ETS, the purpose of the test is to measure “the ability to understand and to analyze and evaluate written messages” (Tests at a Glance: Pre-Professional Skills Test, p. 42).
Comment: Stating the purpose of the test publicly and having it available for potential test takers are appropriate and consistent with good measurement practice.
• Table of specifications: What KSAs (knowledge/skills/abilities) are tested (e.g., is cultural diversity included)? Two broad topics are covered: Literal Comprehension (55 percent of the examination) and Critical and Inferential Comprehension (45 percent of the examination). Each broad topic includes several subcategories, none of which speaks directly to cultural diversity (Tests at a Glance, pp. 42–43).
* Impara, 2000d.
Comment: The two broad topics and the more detailed descriptions seem reasonable, but a content specialist could judge more appropriately the quality and completeness of the content coverage.
How were the KSAs derived and by whom? The “skills content domain” [1] was determined by using a job analysis procedure that began in 1988. The job analysis consisted of two phases.
Phase One entailed developing an initial content description (a set of enabling skills appropriate for entry-level teachers). This phase included having ETS test development staff produce an initial set of 119 enabling skills across five dimensions (Reading, Writing, Mathematics, Listening, and Speaking). The staff reviewed the literature, including skills required by various states, and drew on their own teaching experience. These 119 initial enabling skills were then reviewed by an External Review Panel of 21 practicing professionals (working teachers, principals, deans/associate deans of education, state directors of teacher education and certification, and state representatives of the National Parent-Teacher Association) who were nominated by appropriate national organizations. After the External Review Panel reviewed and modified the initial list of enabling skills, the resulting skills were reviewed by a 12-member National Advisory Committee (also nominated from appropriate national organizations). This committee made additional adjustments to the skills and added a sixth dimension to the domain: Interactive Communication Skills. At the end of Phase One there were 134 enabling skills, with 26 in Reading.
The second phase of the job analysis consisted of a pilot survey and a large verification survey. The pilot survey was undertaken to obtain information about the clarity of the instructions and the appropriateness of the content of the survey instrument.
It was administered to six teachers in New Jersey and Pennsylvania. The verification survey included responses from almost 2,800 individuals from constituent groups and practicing teachers. The constituent group sample included 630 individuals from either state agencies or national professional organizations other than teachers (289 responded; 46 percent response rate). The main teacher sample consisted of 6,120 elementary, middle, and secondary teachers (about 120 teachers from each state, including Washington, D.C.). Of these, 2,269 teachers responded (37 percent response rate). A supplemental sample of 1,500 African American and Hispanic elementary, middle, and secondary teachers who were members of the National Education Association was selected to ensure adequate minority representation. There were 236 usable responses (16 percent) from this latter group. All those surveyed were asked to judge the importance of the enabling skills that resulted from Phase One. The rating scale for importance was a five-point scale with the highest rating being Extremely Important (a value of 5) and the lowest rating being Not Important (a value of 1). Based on analyses of all respondents and of respondents by subgroup (e.g., race, subject taught), 84 percent (113) of the 134 enabling skills were considered eligible for inclusion in the Praxis I tests because they had importance ratings of 3.5 or higher on the five-point scale. Of these, 21 of the original 26 Reading enabling skills were retained.
[1] This domain includes all dimensions of PPST: Reading, Writing, Mathematics, Listening, Speaking, and Interactive Communication Skills. The focus of this report is on the Reading dimension.
In addition to determining the average overall and subgroup ratings, reliability estimates were computed (group split-half correlations and intraclass correlations). All split-half correlations exceeded .96. To check for across-respondent consistency, means for each item were calculated for each of 26 respondent characteristics (e.g., subject area, region of the country, race), and the correlations of means of selected pairs of characteristics were calculated to check the extent to which the relative ordering of the enabling skills was the same across different mutually exclusive comparison groups (e.g., men and women; elementary school teachers, middle school teachers, secondary school teachers).
A 1991 ETS report entitled Identification of a Core of Important Enabling Skills for the NTE Successor Stage I Examination describes the job analysis in detail. Also included are the names of the non-ETS participants in Phase One and the teachers who participated in the pilot survey. Copies of the various instruments and cover letters also are included.
Comment: The process described is consistent with the literature for conducting a job analysis. This is not the only method, but it is an acceptable one. Phase One was well done.
The use of professional organizations to nominate a qualified group of external reviewers was appropriate. The National Advisory Committee was also made up of individuals from the various national organizations. Although there was some nonresponse by members of the External Review Panel, the subsequent review by the National Advisory Committee helped ensure an adequate list of skills.
Phase Two was also well done. Although normally one would expect a larger sample in the pilot survey, the use of only six teachers seems justified. It is not clear, however, that these six individuals included minority representation to check for potential bias and sensitivity. The verification survey sample was quite large. The response rates for each of the three separate samples were consistent with (or superior to) those from job analyses for other licensure programs. An inspection of the characteristics of the 2,269 respondents in the teacher sample showed a profile consistent with that of the sampling frame.
Overall the job analysis was well done. It is, however, more than 10 years old. An update of Phase One is desirable. It would consist of a current literature review that includes a review of skills required across a wider variety of states and professional organizations, together with the formation of committees of professionals nominated by their national organizations. Phase Two should also be repeated if the reexamination of Phase One results in a modification of the list of enabling skills.
• Procedures used to develop items and tasks (including qualifications of personnel): ETS has provided only a generic description of the test development procedures for its licensure tests. In addition to the generic description, ETS has developed its own standards (The ETS Standards for Quality and Fairness, November 1999) that delineate expectations for test development (and all other aspects of its testing programs). Thus, no specific description of the test development activities undertaken for this test was available. Reproduced below is the relevant portion of ETS’s summary description of its test development procedures. (More detailed procedures also are provided.)
Step 1: Local Advisory Committee. A diverse (race or ethnicity, setting, gender) committee of 8 to 12 local (to ETS) practitioners is recruited and convened. These experts work with ETS test development specialists to review relevant standards (national and disciplinary) and other relevant materials to define the components of the target domain (the domain to be measured by the test). The committee produces draft test specifications and begins to articulate the form and structure of the test.
Step 1A: Confirmation (Job Analysis) Survey. The outcomes of the domain analysis conducted by the Local Advisory Committee are formatted into a survey and administered to a national and diverse (race or ethnicity, setting, gender) sample of teachers and teacher educators appropriate to the content domain and licensure area. The purpose of this confirmation (job analysis) survey is to identify the knowledge and skills from the domain analysis that are judged by practitioners and those who prepare practitioners to be important for competent beginning professional practice.
Analyses of the importance ratings would be conducted for the total group of survey respondents and for relevant subgroups.
Step 2: National Advisory Committee. The National Advisory Committee (also a diverse group of 15 to 20 practitioners, this time recruited nationally and from nominations submitted by disciplinary organizations and other stakeholder groups) reviews the draft specifications, outcomes of the confirmation survey, and preliminary test design structure and makes the necessary modifications to accurately represent the construct domain of interest.
Step 3: Local Development Committee. After the National Advisory Committee finishes its draft, the local committee of 8 to 12 diverse practitioners delineates the test specifications in greater detail and drafts test items that are mapped to the specifications. Members of the Local Advisory Committee may also serve on the Local Development Committee to maintain development continuity. (Tryouts of items also occur at this stage in the development process.)
Step 4: External Review Panel. Fifteen to 20 diverse practitioners review a draft form of the test, recommend refinements, and reevaluate the fit or link between the test content and the specifications. These independent reviews are conducted through the mail and by telephone (and/or e-mail). The members of the External Review Panel have not served on any of the other development or advisory committees. (Tryouts of items also occur at this stage in the development process.)
Step 5: National Advisory Committee. The National Advisory Committee is reconvened and does a final review of the test, and, unless further modifications are deemed necessary, signs off on it. (ETS, 2000, Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content: A Model for the 2000 Test Development Cycle).
Comment: The procedures ETS has described are consistent with sound measurement practice. However, the procedures were published only recently (in 2000). It is not clear if the same ones were followed when this test was originally developed. Even if such procedures were in place then, it is also not clear if these procedures were actually followed in the development of the test and subsequent new forms.
• Congruence of test items/tasks with KSAs and their relevance to practice: This is a 40-item multiple-choice test designed to assess the 21 skills that resulted from the job analysis. In 1991–1992 a validation study of the item bank used for several of the Praxis series tests was undertaken. The Praxis I component of this study examined the computer-based version of the test. The traditional paper-and-pencil version was not examined. The relationship between the items in the paper-and-pencil version and the computer-based version is not specified in the materials provided. In the validation study 53 educators (including higher-education faculty, practicing teachers, and central office supervisors) from several states examined the 977 Reading items in the item bank.
The panelists answered three questions about each item: Does this question measure one or more aspects of the intended specifications? How important is the knowledge and/or skill needed to answer this question for the job of an entry-level teacher? (A four-point importance scale was provided.) Is this question fair for examinees of both sexes and of all ethnic, racial, or religious groups? (Yes or No)
Panelists received a variety of materials prior to the meeting, held in Princeton, where the evaluation of items took place. At the meeting, panelists received training in the process they were to undertake. The large number of items precluded each panelist from evaluating all items. Each panelist reviewed 426 to 454 items, and at least 20 panelists reviewed each item. Of the 977 Reading items, 14 were flagged for their poor match to the specifications, 46 were flagged for low importance, and 243 were flagged for lack of fairness (if even one rater indicated the item was unfair it was flagged). The ETS Fairness Review Panel subsequently reviewed the items flagged for lack of fairness. The panel retained 161 of the flagged Reading items; the remaining flagged items were eliminated from the pool. (The validation study is described in the November 1992 ETS document Multistate Study of Aspects of the Validity and Fairness of Items Developed for the Praxis Series: Professional Assessments for Beginning Teachers™.)
Comment: The procedures described by ETS for examining the congruence between test items and the table of specifications and their relevance to practice are consistent with sound measurement practice. However, it is not clear if these procedures were followed for the paper-and-pencil version of the test (even though the materials provided indicate that such procedures are used for every test in the Praxis series). The Validation and Standard-Setting Procedures Used for Tests in the Praxis Series™ describes a procedure similar to the multistate study that is supposed to be done separately by each user (state) to assess the validity of the scores by the user. If this practice is followed, there is substantial evidence of congruence. It is not clear what action is taken when a particular user (state) identifies items that are not congruent or job related in that state. Clearly the test content is not modified to produce a unique test for that state. Thus, the overall match of items to specifications may be good, but individual users may find that some items do not match, and this could tend to reduce the validity of the scores in that context.
• Cognitive relevance (response processes: level of processing required): No information was found in the materials reviewed on this aspect of the test development process.
Comment: The absence of information on this element of the evaluation framework should not be interpreted to mean that it is ignored in the test development process. It should be interpreted to mean that no information about this aspect of test development was provided.
B. SCORE RELIABILITY
• Internal consistency: Estimates of internal consistency reliability were provided for the October 1993 and July 1997 test administrations. The reliability estimate (KR-20) for the 1993 administration was .84 and for the 1997 administration .87. (The 1993 data are contained in the January 2000 ETS report Test Analysis Pre-Professional Skills Test Form 3PPS1, and summary comparative data for the 1993 and 1997 tests are in a single-page document, Comparative Summary Statistics for PPST Reading.)
Comment: The score reliabilities are marginally adequate for the purpose of the test, especially given the length of the test (40 items) and the time limit (one hour).
• Stability across forms: It is known that there are multiple forms (e.g., in a footnote in a statistical report it is indicated that in 1997 Form 3TPS2 was spiraled with Form K-3MPS2), but the total number of forms is not indicated, and the evidence of comparability across forms is limited to the equating of the March 1993 administration of Form 3NPS3 to the October 1993 administration of Form 3PPS1. The Tucker method, using common items, was used for this equating. This equating also was used to convert raw scores from Form 3PPS1 to scaled scores on Form 3NPS3. The raw score means for the two tests differed by less than .25, and the standard deviations differed by even less. There were 23 common items (out of 40). (The 1993 data and equating information are contained in the January 2000 ETS report, Test Analysis Pre-Professional Skills Test Form 3PPS1.)
Comment: The ETS standards require that scores from different forms of a test be equivalent. These standards also require that appropriate methodology be used to ensure this equivalence. The Tucker method, using common items, is an appropriate method of equating scores from this test and population. Score distributions from the two alternate forms for which equating data are provided are quite close in their statistical characteristics, and the equating appears to be sound. It is not known if the two forms for which data are provided are typical or atypical. If they are typical, the equating method and results are quite sound and appropriate.
• Generalizability (including inter- and intrareader consistency): No generalizability data were found in the materials provided. Because the test is multiple choice, inter- and intrareader consistency information is not applicable.
Comment: Although some level of generalizability analysis might be helpful in evaluating some aspects of the psychometric quality of the test, none was provided.
• Reliability of pass/fail decision (misclassification rates): No specific data are provided. This absence is expected because each user (state) sets its own unique passing score; thus, each state could have a different pass/fail decision point.
The statistical report for the October 1993 test administration provides conditional standard errors of measurement at a variety of score points, many of which represent the passing scores that have been set by different states. These conditional standard errors of measurement for typical passing scaled scores range from 2.4 (for a passing scaled score of 178) to 2.9 (for a passing scaled score of 170). It is not clear if ETS provides users with these data. The report describing statistics from the October 1993 test administration also indicates that reliability of classification decisions is available in a separate report. It is not clear if this is a single report or a report unique to each state. The method used to estimate reliability of decision consistency is not described except to suggest that it is an estimate of “the extent to which examinees would be classified in the same way based on an alternate form of the test, equal in difficulty and covering the same content as the actual form they took.” The method is reported in a Livingston and Lewis article, “Estimating the Consistency and Accuracy of Classifications Based on Test Scores” (Journal of Educational Measurement, 1995).
Comment: The nature of the Praxis program precludes reporting a single estimate of the reliability of pass/fail decisions because each of the unique users of the test may set a different passing score and may have a unique population of test takers. (Some states require that the test be used for preadmission to teacher preparation programs, and other states administer the test at the time of initial certification.) The availability of these estimates in a separate report is appropriate (although an illustration would have been helpful). A more detailed description of the method used also would have been helpful.
C. STATISTICAL FUNCTIONING
• Distribution of item difficulties and discrimination indexes (e.g., p-values, biserials): ETS’s January 2000 Test Analysis Pre-Professional Skills Test Form 3PPS1 report includes summary data on the October 1993 administration of Form 3PPS1. These data include information on test takers’ speed (99.7 percent of examinees reached item 40) and frequency distributions, means, and standard deviations of observed and equated deltas [2] and biserial [3] correlations.
Discussion of Form 3PPS1 administered in October 1993. Observed deltas were reported for all 40 items. There were two deltas of 5.9 or lower (indicating these were very easy items) and three deltas between 12.0 and 12.9 (the highest on this test form was 12.4), indicating the hardest items were of only moderate difficulty. The average delta was 9.2 (standard deviation of 1.8). The lowest value of the equated deltas was also 5.9. The highest value of the equated deltas was 14.3. The biserial correlations ranged from a low of .36 to a high of .72. Biserials were not calculated for two items (based on the criteria for calculating biserials and deltas, it was inferred that two items were answered correctly by more than 95 percent of the analysis sample). The average biserial was .53 (standard deviation of .10).
Discussion of Form 3TPS2 administered in July 1997. Observed deltas were reported for all 40 items.
The lowest reported delta was 5.9 or lower and the highest was 13.5, indicating that the hardest items were of only moderate difficulty. The average delta was 10.4 (standard deviation of 1.7). Equated deltas had a lower value of 6.4 and an upper bound of 13.6. The biserial correlations ranged from a low of .32 to a high of .68. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .52 (standard deviation of .09).
[2] Delta is “an index of item difficulty related to the proportion of examinees answering the item correctly” (as defined in ETS’s January 2000 Test Analysis Pre-Professional Skills Test Form 3PPS1, p. A.8). Values of delta range from about 6 for easy items to about 20 for difficult items. Equated deltas have been converted to a unique scale established on the group that took the first form of the particular test and permit comparing the difficulty of items on the “new” form with those on any previous form. Deltas are computed only when the percent correct is between 5 and 95 and more than half the analysis sample reaches the item. However, if an item is in the extreme of the distribution, a delta of 5.9 or 20.1 is entered.
[3] The biserial correlation is the estimated product-moment correlation between the total score (a continuous distribution that includes all items, including the item for which the correlation is being computed) and the item score, which is hypothesized to have an underlying continuous distribution that is dichotomized as right or wrong. Biserial correlations are computed only when the percent correct is between 5 and 95 and more than half the analysis sample reaches the item.
Comment: The test appears to be easy for most examinees. The 1993 form was easier than the 1997 form. Using more traditional estimates of difficulty, the average item difficulty (proportion answering correctly) for the test form used in the October 1993 administration was .805. The comparable statistic for the form used in the July 1997 administration was .725. Clearly this test consisted of fairly easy items that tend to correlate well with the total score. Although the items tend to discriminate moderately well, it would be preferable to have items that exhibited greater range in difficulty, even if the average difficulty remained about the same. This preference is expressed because some of the items (especially on the October 1993 test form) were answered correctly by most examinees and likely provided little useful test information.
• Differential item functioning (DIF) studies: DIF is performed using the Mantel-Haenszel index expressed on the delta scale. DIF analyses are performed to compare item functioning for two mutually exclusive groups whenever there are sufficient numbers of examinees to justify using the procedure (N≥200). Various pairs of groups are routinely compared, for example, male/female and whites paired with each of the typical racial/ethnic subgroups.
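The delta index and the Mantel-Haenszel index on the delta scale can be illustrated with a minimal sketch. It assumes the conventional definitions, which the materials reviewed do not spell out: delta = 13 − 4z, where z is the normal deviate for the proportion correct, and MH D-DIF = −2.35 ln(α), where α is the Mantel-Haenszel common odds ratio across matched score levels. The function names and all counts are hypothetical and are not taken from the ETS reports.

```python
from math import log
from statistics import NormalDist


def delta_from_p(p):
    """Conventional delta index (assumed definition): 13 - 4*z,
    where z is the standard normal deviate for proportion correct p.
    Easy items (high p) get low deltas; hard items get high deltas."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)


def mh_d_dif(strata):
    """Mantel-Haenszel DIF on the delta scale (assumed definition).

    strata: one tuple per matched total-score level, holding
    (ref_right, ref_wrong, focal_right, focal_wrong) counts.
    Returns -2.35 * ln(alpha_MH); negative values mean the item
    is relatively harder for the focal group.
    """
    num = den = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score levels
        num += ref_right * focal_wrong / n
        den += ref_wrong * focal_right / n
    alpha = num / den  # common odds ratio across score levels
    return -2.35 * log(alpha)


# Hypothetical counts: identical odds in both groups at every score level.
no_dif = [(40, 10, 40, 10), (30, 20, 30, 20)]
# Hypothetical counts: focal group less likely to answer correctly.
harder_for_focal = [(40, 10, 30, 20)]

if __name__ == "__main__":
    print(round(delta_from_p(0.95), 2))   # an easy item
    print(round(delta_from_p(0.50), 2))   # a middle-difficulty item
    print(round(mh_d_dif(no_dif), 3))
    print(round(mh_d_dif(harder_for_focal), 3))
```

Under these assumed definitions, a proportion correct of .95 maps to a delta of roughly 6.4, which agrees with the stated range of "about 6 for easy items." The sketch omits the significance tests used in practice to classify items.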
The DIF analysis is summarized by classifying items into one of three categories: A, to indicate there is no significant group difference; B, to indicate there is a moderate but significant difference (specific criteria are provided for defining moderate); and C, to indicate a high DIF (high is defined explicitly). DIF data are reported only for the October 1993 test administration. For that test DIF analyses were performed for males/females (M/F) and whites/blacks (W/B) only. For each comparison, 37 of 40 items were classified as A items (no DIF). Across both comparisons, 34 items were classified as A in common. For the M/F comparison, three items were in the B classification (moderate DIF), and for the W/B comparison two items were classified as B and one as C (high DIF). For the M/F comparison, two of the B items were more difficult for males and one was more difficult for females. For the W/B comparison, the two items in the B classification were easier for the white examinees, and the one item classified as a C was more difficult for the white examinees.
Comment: Conducting DIF studies is an appropriate activity. It appears the DIF analysis for this test was done appropriately, and the criteria for classifying items into different levels of DIF help to identify the most serious problems. There is no indication of what actions, if any, might result from the DIF analyses. If the items identified as functioning differently are not examined critically to determine their appropriateness for continued use, or if no other action is taken, the DIF analysis serves no purpose. Unfortunately, the ETS standards document was incomplete, so it was not possible to ascertain what the specific policies are with regard to using the results of the DIF analysis. The absence of information on how the DIF data are used should not be interpreted to mean that the data are ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
D. SENSITIVITY REVIEW
• What were the methods used and were they documented? ETS has an elaborate process in place for reviewing tests for bias and sensitivity. This process is summarized below. There is no explicit documentation on the extent to which this process was followed exactly for this test or about who participated in the process.
The 1998 ETS guidelines for sensitivity review indicate that tests should have a “suitable balance of multicultural material and a suitable gender representation” (Overview: ETS Fairness Review). Included in this review is the avoidance of language that fosters stereotyping, uses inappropriate terminology, applies underlying assumptions about groups, suggests ethnocentrism (presuming Western norms are universal), uses inappropriate tone (elitist, patronizing, sarcastic, derogatory, inflammatory), or includes inflammatory material or topics. Reviews are conducted by ETS staff members who are specially trained in fairness issues at a one-day workshop. This initial training is supplemented with periodic refreshers. The internal review is quite elaborate, requiring an independent reviewer (someone not involved in the development of the test in question). In addition, many tests are subjected to review by external reviewers as part of the test review process. (Recall that one of the questions external reviewers answered in the discussion of the match of the items to the test specifications was a fairness question.)
This summary was developed from Overview: ETS Fairness Review.
Comment: The absence of information on how the sensitivity review for this test was undertaken should not be interpreted to mean there was no review. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
• Qualifications and demographic characteristics of personnel: No information was found on this topic for this test.
Comment: The absence of information on the participants of the sensitivity review for this test should not be interpreted to mean there was no review or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
E. STANDARD SETTING
• What were the methods used and were they documented? The ETS standards require that any cut score study be documented. The documentation should include information about the rater selection process, specifically how and why each panelist was selected, and how the raters were trained. The other aspects of the process should also be described (how judgments were combined, the procedures used, and results, including estimates of the variance that might be expected at the cut score).
For this test, standard-setting studies are conducted by ETS for each of the states that use it (presently 30 states plus the Department of Defense Dependents Schools use this test). Each state has had a standard-setting study conducted. ETS provides each state with a report of the standard-setting study that documents details of the study as described in the ETS standards. No reports from individual states were provided to illustrate the process.
The typical process used by ETS to conduct a standard-setting study is described in the September 1997 ETS document, Validation and Standard Setting Procedures Used for Tests in the Praxis Series. This document describes the modified Angoff process used to set a recommended cut score. In this process panelists are convened who are considered expert in the content of the test. The panelists are trained extensively in the process. An important component of the training is discussion of the characteristics of the entry-level teacher and an opportunity to practice the process with sample test questions. For each question, panelists estimate the number out of 100 hypothetical just-qualified entry-level teachers who would answer the question correctly. The cut score for a panelist is the sum of the panelist’s performance estimates. The recommended cut score is the average cut score for the entire group of panelists.
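The arithmetic of the modified Angoff procedure described above can be sketched as follows. This is an illustration only: the function names and the panel's estimates are hypothetical. It also assumes each "number out of 100" is converted to a proportion before summing, so that the cut score is expressed in raw-score points.

```python
from statistics import mean


def panelist_cut_score(estimates_out_of_100):
    """One panelist's cut score: the sum of per-item estimates, where each
    estimate is the number of 100 just-qualified candidates expected to
    answer the item correctly (converted here to raw-score points)."""
    return sum(estimates_out_of_100) / 100


def recommended_cut_score(panel):
    """Recommended cut score: the average of the panelists' cut scores."""
    return mean(panelist_cut_score(estimates) for estimates in panel)


# Hypothetical panel: three panelists rating a four-item test.
panel = [
    [80, 60, 90, 70],   # panelist 1 -> 3.0 raw-score points
    [70, 50, 80, 60],   # panelist 2 -> 2.6
    [90, 70, 100, 80],  # panelist 3 -> 3.4
]

if __name__ == "__main__":
    for estimates in panel:
        print(panelist_cut_score(estimates))
    print(recommended_cut_score(panel))
```

With these invented ratings the recommended cut score comes to about 3 of 4 raw-score points. An operational study would use all 40 items and 25 to 40 panelists, and would also report the variability of the panelists' cut scores.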
Comment: The absence of a specific report describing how the standard setting for this test was undertaken in a particular state should not be interpreted to mean that no standard-setting studies were undertaken or that any such studies that were undertaken were not well done. It should be interpreted to mean that no reports from individual states describing this aspect of testing were contained in the materials provided. If ETS uses the procedure described for setting a recommended cut score in each of the states that use this test, the process reflects what is considered by most experts in standard setting to be sound measurement practice. There is some controversy in the use of the Angoff method, but it remains the most often used method for setting cut scores for multiple-choice licensure examinations. The process described by ETS is an exemplary application of the Angoff method.

• Qualifications and demographic characteristics of personnel: No information was found for this test that describes the qualifications or characteristics of panelists in individual states. A description of the selection criteria and panel demographics is provided in the September 1997 document, Validation and Standard Setting Procedures Used for Tests in the Praxis Series. The panelists must be familiar with the job requirements relevant to the test for which a standard is being set and with the capabilities of the entry-level teacher. The panelists must also be representative of the state's educators in terms of gender, ethnicity, and geographic region. For this test the panelists must also represent diverse areas of certification and teaching levels. A range of 25 to 40 panelists is recommended for Praxis I tests.

Comment: The absence of information on the specific qualifications of participants of a standard-setting panel for this test should not be interpreted to mean that there are no standard-setting studies or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided other than a description of the criteria recommended by ETS to the state agencies that select panelists.

F. VALIDATION STUDIES

• Content validity: The only validity procedure described is outlined in the description of the evaluation framework criteria above related to their relevance to practice. In summary, panelists rate each item in terms of its importance to the job of an entry-level teacher and in terms of its match to the table of specifications. The decision rule for deciding if an item is considered "valid" varies with the individual client but typically requires that 75 to 80 percent of the panelists indicate that the item is job related. In addition, at least some minimum proportion of items (e.g., 80 percent) also must be rated as being job related. This latter requirement is the decision rule for the test as a whole. ETS does not typically select the panelists for content validity studies. These panels are selected by the user (state agency).
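The two-level decision rule described above (an item-level threshold and a test-level threshold) can be sketched as follows. This is a hypothetical illustration only: the function names, the default thresholds of 75 and 80 percent, and the votes are assumptions chosen from the ranges mentioned in the text, since the actual rule varies by client.

```python
def item_is_job_related(votes, item_threshold=0.75):
    """True if a sufficient share of panelists rated the item job related.

    `votes` holds one boolean per panelist. The 0.75 default is an assumed
    value taken from the 75 to 80 percent range mentioned in the text.
    """
    return sum(votes) / len(votes) >= item_threshold


def form_meets_rule(votes_by_item, item_threshold=0.75, form_threshold=0.80):
    """True if enough items pass the item-level rule (assumed 80 percent)."""
    passing = sum(item_is_job_related(v, item_threshold) for v in votes_by_item)
    return passing / len(votes_by_item) >= form_threshold


# Hypothetical votes: 5 items, 4 panelists each (invented for illustration).
votes_by_item = [
    [True, True, True, True],
    [True, True, True, False],   # 75 percent: passes at the assumed threshold
    [True, True, False, False],  # 50 percent: fails
    [True, True, True, True],
    [True, True, True, True],
]

print(form_meets_rule(votes_by_item))
```

Here four of the five items pass the item-level rule, so the form just meets the assumed 80 percent test-level requirement.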
The criteria for selecting panelists suggested by ETS are the same for a validity study as they are for a standard-setting study. In some cases both validity and standard-setting studies may be conducted concurrently by the same panels.

Comment: The procedures described by ETS for collecting content validity evidence are consistent with sound measurement practice. However, it is not clear if the procedures described above were followed for this test for each of the states in which the test is being used. The absence of information on specific content validity studies should not be interpreted to mean that there are no such studies. It should be interpreted to mean that no specific reports from user states about this aspect of the validity of the test scores were found in the materials provided.

• Empirical validity (e.g., known group, correlation with other measures): No information related to any studies done to collect empirical validity data was found in the materials provided.

Comment: The absence of information on the empirical validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the validity of the test scores was found in the materials provided.

• Disparate impact (initial and eventual passing rates by racial/ethnic and gender groups): No information related to any studies done to collect disparate impact data was found in the materials provided. Because responsibility for conducting such studies is that of the end user (individual states), each of which may have different cut scores and different population characteristics, no such studies were expected.

Comment: The absence of information on disparate impact studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the impact of the testing program of the individual states was found in the materials provided. Because this is a state responsibility, the absence of illustrative reports should not reflect negatively on ETS.

G. FAIRNESS

• Comparability of scores and pass/fail decision across time, forms, judges, and locations: Score comparability is achieved by equating forms of the test to a base form. For this test the base form was part of the older National Teacher Examination (NTE) testing program. The base form is a 1983 test. Although not specified for each test form, at least two (1993) forms were equated using the Tucker equating method and a set of 23 common items. Weights for converting raw scores to scale scores equated to the base form are provided in a statistical report (ETS, January 2000, Test Analysis Pre-Professional Skills Test Form 3PPS1). Many states have set the passing score near the mean of the test. This results in the likelihood of greater stability of equating and enhances the comparability of pass/fail decisions within a state across forms.
Because this test contains only multiple-choice items, the comparability of scores across judges is not relevant. No data are presented relative to score comparability across locations, but such data are available. However, because all examinees take essentially the same forms at any particular administration of the test (e.g., October 2000), the comparability of scores across locations would vary only as a function of the examinee pool and not as a function of the test items.

Comment: If the Tucker method of equating using a set of common items is used for all forms, the equating methodology used is appropriate for this test. The establishment of passing scores that tend to be near the mean reduces the equating error at the passing scores, thus enhancing the across-form comparability of the passing scores. In this type of program, that is a more critical concern than ensuring comparability of scores across the entire range. Comparability of scores and pass/fail decisions across time and locations seems to be reasonable to expect under the conditions described in the ETS materials provided.
• Examinees have comparable questions/tasks (e.g., equating, scaling, calibration): The ETS standards and other materials provided suggest that substantial efforts are made to ensure that items in this test are consistent with the test specifications derived from the job analysis. There are numerous reviews of items both within ETS and external to ETS. Statistical efforts to examine comparability of item performance over time include use of the equated delta. Test equating appears to be accomplished using an appropriate equating method (Tucker with common items). There is no indication that preequating of items is done. Moreover, there is no indication that operational test forms include nonscored items (a method for pilot testing items under operational conditions). The pilot test procedures used to determine the psychometric quality of items in advance of operational administration are not well described in the materials provided; thus, it is not known how well each new form of the test will perform until its operational administration. Note that there are common items across forms that constitute over 50 percent of the test. There also may be other items on the test that have been used previously but not on the most recent prior administration. Thus, it is not known from the materials provided what percentage of items on any particular form of the test are new (i.e., not previously administered other than in a pilot test).

Comment: From the materials provided, it appears that substantial efforts are made to ensure that different forms of this test are comparable in both content and their psychometric properties. The generic procedures that are described, assuming these are used for this test, represent reasonable methods to attain comparability.

• Test security: Procedures for test security at administration sites are provided in ETS's supervisor's manuals.
These manuals indicate the need for test security and describe how the security procedures should be undertaken. The security procedures require that the test materials be kept in a secure location prior to test administration and be returned to ETS immediately following administration. At least five material counts are recommended at specified points in the process. Qualifications are specified for personnel who will serve as test administrators (called supervisors), associate supervisors, and proctors. Training materials for these personnel also are provided (for both standard and nonstandard administrations). Methods for verifying examinee identification are described, as are procedures for maintaining the security of the test site (e.g., checking bathrooms to make sure there is nothing written on the walls that would be a security breach or that would contribute to cheating). The manuals also indicate there is a possibility that ETS will conduct a site visit and that the visit may be announced in advance or unannounced. It is not specified how frequently such visits may occur or what conditions may lead to a visit.

Comment: The test security procedures described for use at the test administration site are excellent. If these procedures are followed, the chances for security breaches are very limited. Of course, a dedicated effort to breach security may not be thwarted by these procedures, but the more stringent procedures that would be required to virtually eliminate the possibility of a security breach at a test site are prohibitive. Not provided are procedures to protect the security of the test and test items when they are under development, in the production stages, and in the shipping stages. Personal experience with ETS suggests that these procedures also are excellent; however, no documentation of these procedures was provided.

• Protection from contamination/susceptibility to coaching: This test consists entirely of multiple-choice items. As such, contamination (in terms of having knowledge or a skill not relevant to the intended knowledge or skill measured by the item and that assists the examinee in obtaining a higher score than is deserved) is not a likely problem. Other than the materials that describe the test development process, no materials were provided that specifically examined the potential for contamination of test scores. In terms of susceptibility to coaching (participating in test preparation programs like those provided by such companies as Kaplan), there is no evidence provided that this test is more or less susceptible than any other test. ETS provides information to examinees about the structure of the test and the types of items on the test, and test preparation materials are available to examinees (at some cost). The descriptive information and sample items are contained in ETS's The Praxis Series™ Tests at a Glance, Praxis I: Academic Skills Assessments (1999).

Comment: Scores on this test are unlikely to be contaminated by examinees employing knowledge or skills other than those intended by the item writers.
This is largely due to the multiple-choice structure of the items and to the extensive item review process that all such tests are subject to if the ETS standard test development procedures are followed. No studies on the coachability of this test were provided. It does not appear that this test would be more or less susceptible than similar tests. The Tests at a Glance provides test descriptions, discussions of types of items, and sample items that are available free to all examinees. For this test more detailed test preparation materials are produced by ETS and sold for $16. Making these materials available to examinees is a fair process, assuming the materials are useful. The concern is whether examinees who can afford them gain a substantial advantage over examinees who cannot. If passing the test is conditional on using the supplemental test preparation materials, the coachability represents a degree of unfairness. If, however, the test can be passed readily without the use of these or similar materials that might be provided by other vendors, the level of unfairness is diminished substantially. It is important that studies be undertaken and reported (or, if such studies exist, that they be made public) to assess the degree of advantage for examinees who have used the supplemental materials.
• Appropriateness of accommodations (ADA): The 1999–2000 Supervisor's Manual for Nonstandard Test Administrations⁴ describes the special accommodations that should be available at each test administration site as needed (examinees indicate and justify their needs when registering for the test). In addition to this manual, there are policy statements in hard copy and on ETS's website regarding disabilities and testing registration, and other concerns that examinees who might be eligible for special accommodations might have. No documentation is provided to ensure that at every site the accommodations are equal, even if they are made available. For example, not all readers may be equally competent, even though all are supposed to be trained by the site's test supervisor and all have read the materials in the manual. The large number of administration sites suggests that there will be some variability in the appropriateness of accommodations; however, it is clear that efforts are made (providing detailed manuals, announced and unannounced site visits by ETS staff) to ensure at least a minimum level of appropriateness.

⁴ Nonstandard administrations include extended time and may include other accommodations such as a reader, writer, or interpreter. Specific and detailed procedures are provided for each of these (and other) accommodations.

Comment: No detailed site-by-site report on the appropriateness of accommodations was found in the materials provided. The manual and other materials describe the accommodations that test supervisors at each site are responsible for providing. If the manual is followed at each site, the accommodations will be appropriate and adequate. The absence of detailed reports should not be interpreted to mean that accommodations are not adequate.

• Appeals procedures (due process): No detailed information regarding examinee appeals was found in the materials provided. The only information found was contained in the supervisor's manuals and in the registration materials available to examinees. The manuals indicated that examinees could send complaints to the address shown in the registration bulletin. These complaints would be forwarded (without examinees' names attached) to the site supervisor, who would be responsible for correcting any deficiencies in subsequent administrations. There is also a notice provided to indicate that scores may be canceled due to security breaches or other problems. In the registration materials it is indicated that an examinee may seek to verify his or her score (at some cost unless an error in scoring is found).

Comment: The absence of detailed materials on the process for appealing a score should not be interpreted to mean that there is no process. It only means that the information for this element of the evaluation framework was not found in the materials provided. Because ETS is the owner of the tests and is responsible for scoring and reporting test results, it is clear that it has some responsibility for handling an appeal from an examinee that results from a candidate not passing the test. However, the decision to pass or fail an examinee is up to the test user (state). It would be helpful if the materials available to the examinee were explicit on the appeals process, what decisions could reasonably be appealed, and to what agency particular appeals should be directed.

H. COSTS AND FEASIBILITY

• Logistics, space, and personnel requirements: This test has no special logistical, space, or personnel requirements beyond those of any paper-and-pencil test. The supervisor's manuals describe the space and other requirements (e.g., making sure left-handed test takers can be comfortable) for both standard and nonstandard administrations. The personnel requirements for test administration also are described in the manuals.

Comment: The logistical, space, and personnel requirements are reasonable and consistent with what would be expected for any similar test. No information is provided that reports on the extent to which these requirements are met at every site. The absence of such information should not be interpreted to mean that logistical, space, and personnel requirements are not met.

• Applicant testing time and fees: The standard time available for examinees to complete this test is one hour. The base costs to examinees in the 1999–2000 year (through June 2000) were a $35 nonrefundable registration fee and a fee of $18 for the Reading test. Under certain conditions additional fees may be assessed (e.g., $35 for a late registration fee; $35 for a change in test, test center, or date). The cost for the test increased to $25 in the 2000–2001 year (September 2000 through June 2001). The nonrefundable registration fee remains unchanged.

Comment: The testing time of one hour for a 40-item multiple-choice test is reasonable.
This is evidenced by the observation that almost 100 percent of examinees finish the test in the allotted time (see statistical information reported above). The fee structure is posted and detailed. The reasonableness of the fees is debatable and beyond the scope of this report. It is commendable that examinees may request a fee waiver. In states using tests provided by other vendors, the costs for similar tests are comparable in some states and higher in others. Posting and making public all costs that an examinee might incur and the conditions under which they might be incurred are appropriate.

• Administration: The test is administered in a large group setting.⁵ Examinees may be in a room in which other tests in the Praxis series with similar characteristics (one-hour duration, multiple-choice format) also are being administered. Costs for administration (site fees, test supervisors, other personnel) are paid for by ETS. The test supervisor is a contract employee of ETS (as are other personnel). It appears to be the case (as implied in the supervisor's manuals) that arrangements for the site and for identifying personnel other than the test supervisor are accomplished by the test supervisor. The supervisor's manuals include detailed instructions for administering the test for both standard and nonstandard administrations. Administrators are told exactly what to read and when. The manuals are very detailed. The manuals describe what procedures are to be followed to collect the test materials and to ensure that all materials are accounted for. The ETS standards also speak to issues associated with the appropriate administration of tests to ensure fairness and uniformity of administration.

⁵ There is a computer administration option. It is a computer-adaptive test. Because that is considered by ETS to be a different test, it is not described in this report.

Comment: The level of detail in the administration manuals is appropriate and consistent with sound measurement practice. It is also consistent with sound practice that ETS representatives periodically observe the administration (either announced or unannounced).

• Scoring and reporting: Scores are provided to examinees (along with a booklet with score interpretation information) and up to three score recipients. Score reports include the score from the current administration and the highest other score (if applicable) the examinee earned in the past 10 years. Score reports are mailed out approximately four weeks after the test date. Examinees may request that their scores be verified (at an additional cost unless an error is found; then the fee is refunded). Examinees may request that their scores be canceled within one week after the test date. ETS may also cancel a test score if it finds that a discrepancy in the process has occurred.
The score reports to institutions or states are described as containing information about the status of the examinee with respect to the passing score appropriate to that institution or state only (e.g., if an examinee requests that scores be sent to three different states, each state will receive pass/fail status only for that state). The report provided to the examinee has pass/fail information appropriate for all recipients. The ETS standards also speak to issues associated with the scoring and score reporting to ensure such things as accuracy and interpretability of scores and timeliness of score reporting. Comment: The score reporting is timely, and the information (including interpretations of scores and pass/fail status) is appropriate. • Exposure to legal challenge: No information on this element of the evaluation framework was found in the materials provided. Comment: The absence of information on exposure to legal challenge should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
I. DOCUMENTATION

• Interpretative guides, sample tests, notices, and other information for applicants: Substantial information is made available to applicants. Interpretive guides (to assist applicants in preparing for the test) are available for a charge of $18. These guides (ETS sent a model guide for the mathematics tests: Content Knowledge, Pedagogy, Proofs, Models, and Problems, Parts 1 and 2) include actual tests (released and not to be used in the future), answers with explanations, scoring guides (for constructed-response tests), and test-taking strategies. The actual tests do not have answer keys, but the sample items are keyed and explanations for the answers are provided. These materials would be very helpful to applicants.

Other information is available at no cost to applicants. Specifically, the Tests at a Glance documents, which are unique for each test, include information about the structure and content of the test, the types of questions on the test, and sample questions with explanations for the answers. Test-taking strategies also are included. ETS maintains a website that can be accessed by applicants. This site includes substantial general information about the Praxis program and some specific information. In addition to information for applicants, ETS provides information to users (states) related to such things as descriptions of the program, the need for using justifiable procedures in setting passing scores, history of past litigation related to testing, and the need for validity for licensure tests.

Comment: The materials available to applicants are substantial and would be helpful. Applicants would benefit from the Tests at a Glance developed for the reading test. It is also likely that some applicants would benefit from the more comprehensive guide.
As noted above, there is some concern about the necessity of purchasing the more expensive guide and the relationship between its use and an applicant's score. Studies are needed on the efficacy of these preparation materials. The materials produced for users are well done and visually appealing.

• Technical manual with relevant data: There is no single technical manual for any of the Praxis tests. Much of the information that would routinely be found in such a manual is spread out over many different publications. The frequency of developing new forms and multiple annual test administrations would make it very difficult to have a single comprehensive technical manual.

Comment: The absence of a technical manual is a problem, but the rationale for not having one is understandable. The availability of information on most important topics is helpful, but it seems appropriate for there to be some reasonable compromise to assist users in evaluating each test without being overwhelmed by having to sort through the massive amount of information that would be required for a comprehensive review. For example, a technical report that covered a specific period of time (e.g., one year) might be useful to illustrate the procedures used and the technical data for the various forms of the test for that period.

COMMENTS

This test seems to be well constructed and has moderate-to-good psychometric qualities. The procedures used for test development, standard setting, and validation are all consistent with sound measurement practices. The fairness reviews and technical strategies used also are consistent with sound measurement practices. The costs to users (states) are essentially nil, and the costs to applicants/examinees seem to be in line with similar programs. Applicants are provided with substantial free information, and even more information is available at some cost. One criticism is that the job analysis is over 10 years old. In that time changes have occurred in the teaching profession and, to some extent, in the public's expectations for what beginning teachers should know and be able to do. Some of those changes may be related to the content of this test.