
5
Evaluating Current Tests

This chapter examines current teacher licensure tests. Using its evaluation framework, the committee evaluated several widely used initial licensure tests; this chapter presents the results of that evaluation, along with the criteria used to select the tests reviewed.

As noted in Chapter 3, most of the commonly used teacher licensure tests come from the Educational Testing Service (ETS) or National Evaluation Systems (NES). In addition, some state education agencies or higher-education institutions develop tests for their states. Since the majority of tests used are from ETS’s Praxis series or NES, the committee focused its review on tests developed by these two publishers. A measurement expert was commissioned under the auspices of the Oscar and Luella Buros Center for Testing to provide technical reviews of selected teacher licensure tests. A subset of available tests was selected for review.

SELECTING TEACHER LICENSURE TESTS FOR REVIEW

Selecting Praxis Series Tests

In selecting Praxis tests for review, the committee considered, in negotiations with ETS, assessments in both the Praxis I and Praxis II series. Several factors guided the selection. The committee wanted the review to:

  • include one Praxis I test;

  • include both content and pedagogical knowledge Praxis II tests;

  • include tests that have both multiple-choice and open-ended formats;

  • cover the full range of teacher grade levels (e.g., K-6, 5–9, 7–12);

  • include, if possible, language arts/English, mathematics, science, and social studies content tests;

  • include tests that are in wide use; and

  • consider shelf life, that is, not include tests that are near “retirement.”

The final set of tests was chosen by the committee through discussions with ETS and the Buros Center for Testing. From the Praxis I set of assessments, the Pre-Professional Skills Test: Reading (paper-and-pencil administration) was selected for review. From Praxis II the committee selected four tests for review: the Principles of Learning and Teaching (K-6); Middle School English/Language Arts; Mathematics: Proofs, Models, and Problems, Part 1; and Biology: Content Knowledge Test, Parts 1 and 2.

Selecting NES Tests

To obtain material on NES-developed tests, the committee contacted NES and the relevant state education agencies in the states listed as using NES tests in the 2000 NASDTEC Manual.1 These efforts did not yield technical information comparable to what the committee received from ETS. As a result, NES-developed tests are not included in the committee’s review, and the committee can make no statements about their soundness or technical quality.

The committee’s inability to comment on NES-developed tests is significant. First, NES-developed tests are administered to very large numbers of teacher candidates (R. Allen, NES, personal communication, July 26, 1999). Second, the disclosure guidelines in the joint Standards for Educational and Psychological Testing specify that “test documents (e.g., test materials, technical manuals, users guides, and supplemental materials) should be made available to prospective test users and other qualified persons at the time a test is published or released for use” (American Educational Research Association et al., 1999:68). Consistent with the 1999 standards, and as it did with ETS, the committee requested information sufficient to evaluate the appropriateness and technical adequacy of NES-developed tests. In response to the committee’s request, an NES representative informed the committee that the requested materials were “under the control and supervision” of its client states and that the committee should seek information directly from the state agencies (R. Allen, NES, correspondence, September 4, 1999).

1 New York, Massachusetts, Arizona, Michigan, California, Illinois, Texas, and Colorado were listed as NES states in the 2000 NASDTEC Manual. Oregon uses ETS and NES tests. Oklahoma’s test development program was in transition when the committee’s study began.

Following the tenets of the 1999 standards, the committee then requested the following data from several state agencies (D. Z. Robinson, committee chair, correspondence, August 8, 2000):

…technical information on state licensing tests, including the processes involved in the tests’ development (including job analysis and the means by which job analyses are translated into tests), technical information related to scoring, interpretation and evidence of validity and reliability, scaling and norming, guidelines of test administration and interpretation, and the means by which passing scores are determined…sufficient documentation to support judgments about the technical quality of the test, the resulting scores, and the interpretations based on the test scores.

In communications with the states, at least two state agencies reported their understanding that the requested technical information could not be disclosed to the committee because of restrictions included in their contracts with NES. Colorado’s Office of Professional Services, for example, pointed the committee to the following contract language (E. J. Campbell, Colorado Office of Professional Service, correspondence, September 19, 2000):

Neither the Assessment, nor any records, documents, or other materials related to its development and administration may be made available to the general public, except that nonproprietary information, such as test objectives and summary assessment results may be publicly disseminated by the State. Except as provided above and as contemplated by Paragraph 15, or as required by a court of competent jurisdiction or other governmental agency or authority, neither the State nor the Contractor, or its respective subcontractor(s), employees, or agents may reveal to any persons(s) any part of the Assessment, any part of the information collected during the Project, or any results of the Project, or any Assessment, without the prior written permission of the other party.

Despite multiple contacts with many of the relevant state agencies over several months, the committee received very little of the requested technical information. Several state agencies provided registration booklets and test preparation guides, and one state provided a summary of passing rates. California officials provided technical documentation for one of the state’s 40 tests, but the committee concluded that the documentation did not include sufficient information for a meaningful technical evaluation.

In addition to contract restrictions on disclosure, state education agencies gave various reasons for not providing some or all of the requested material to the committee, including the following: the technical information was not readily accessible; the technical information appeared in a form that would not be useful to the committee; the technical documentation was not yet complete; and planned revisions of state assessments would limit the usefulness of a current test review. Several state agencies simply declined to provide some or all of the requested information to the committee.

The committee’s lack of success in obtaining sufficient technical material on NES tests currently in use precludes a meaningful technical evaluation by the committee of the quality of these tests or an assessment of their possible adverse impact. The committee urges efforts to ensure that users and other interested parties can obtain sufficient technical information on teacher licensure tests in accordance with the joint 1999 Standards.

EVALUATING THE PRAXIS SERIES TESTS

In this section the overall strengths and weaknesses of the selected Praxis tests are discussed in relation to the committee’s evaluation framework. The analysis is based on the Buros Center for Testing’s technical reviews of the five Praxis tests. The reviews were provided to the committee and shared with ETS in July 2000. The full text of each review is available in the electronic version of the committee’s report on the World Wide Web at www.nap.edu. The reviews are briefly summarized below.

ETS provided separate documentation for each of the tests reviewed, as well as additional documentation on its general test development procedures. The ETS Standards for Quality and Fairness (1999a) serve as the guide for all test development at ETS. Updated in 1999, they were developed to supplement the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999). In many cases these generic procedures formed the basis of the information regarding the specific Praxis tests, supplemented by additional information supporting the development of the individual tests. The generic procedures should therefore be considered the foundation on which the individual assessments were developed.

For Praxis, test development begins with an analysis of the knowledge and skills beginning teachers need to demonstrate (Educational Testing Service, 1999e). These analyses draw on standards from national disciplinary organizations, such as the National Council of Teachers of Mathematics (1989) and the National Research Council (1996), state standards for students and teachers, and the research literature. The resulting knowledge and skill listings are then used to survey teachers about their views on the importance and criticality of potential content. Using the information from these surveys, test specifications describing the content of the tests are developed. Test questions that meet the specifications are written by ETS developers and then reviewed for accuracy and clarity. ETS staff also review items for potential bias, with attention to possible inappropriate terminology, stereotyping, underlying assumptions, ethnocentrism, tone, and inflammatory materials. Occasionally, external reviews also are conducted, but this is not systematic. Test forms are then constructed to reflect the test’s specifications.

Once tests are constructed, passing standards are set (Educational Testing Service, 1999b). States that conduct standard-setting studies determine the scores required for passing. As noted in Chapter 3, passing scores are based on educators’ views of minimally competent teaching performance and policy makers’ goals for improvements in teaching and teacher supply.
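The boxes later in this chapter note that ETS used a modified Angoff procedure with panels of 25 to 40 educators for the multiple-choice tests. As a rough illustration of the core computation (the panel and ratings below are hypothetical, and the sketch omits the discussion rounds and data feedback of an operational study), each panelist estimates the probability that a minimally competent candidate would answer each item correctly; the estimates are averaged across panelists and summed over items to yield a recommended passing score:

# Illustrative sketch of the core modified-Angoff computation
# (hypothetical ratings; not ETS's actual procedure).
ratings = {  # panelist -> per-item probability estimates for a
             # minimally competent candidate (hypothetical)
    "panelist_1": [0.80, 0.60, 0.70, 0.90],
    "panelist_2": [0.75, 0.55, 0.65, 0.85],
    "panelist_3": [0.85, 0.50, 0.60, 0.95],
}

n_items = len(next(iter(ratings.values())))

# Average the estimates across panelists for each item.
item_means = [
    sum(r[i] for r in ratings.values()) / len(ratings) for i in range(n_items)
]

# The recommended raw passing score is the sum of the item averages:
# the expected number-correct score of a minimally competent candidate.
recommended_cut = sum(item_means)
print(f"Recommended raw cut score: {recommended_cut:.1f} of {n_items}")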

Detailed manuals are prepared for the Praxis tests and are provided to test administrators and supervisors. There are separate manuals for standard administrations and those tailored to the needs of candidates with learning or physical disabilities. Manuals also detail security procedures for the tests and test administration.

Overall Assessment of Praxis Tests

With a few exceptions, the Praxis I and Praxis II tests reviewed meet the criteria for technical quality articulated in the committee’s framework. This is particularly true regarding score reliability, sensitivity reviews, standard setting, validation research (although only content-related evidence of validity was provided), costs and feasibility, and test documentation.

However, several areas were of concern to the committee. For four of the tests, concerns remain about the content specifications: for two of these tests the job analysis information was dated; for another the development process may not have been sensitive to the grade-level focus of the test; and for the last there is ambiguity about the possible inclusion of noncontent-relevant material. Only one of the tests reviewed has information on differential item functioning. For four of the five tests reviewed, information on equating strategies is either lacking, inadequate, or problematic. These issues are detailed below. Although these areas of concern are important and need attention from the test developer, all five tests meet the majority of review criteria set forth in this report.

Praxis I: Pre-Professional Skills Test (PPST) in Reading

The PPST in Reading meets all of the review criteria. The test shows strong evidence of being technically sound. The procedures for test development, equating, reliability, and standard setting are consistent with current measurement practices (see Box 5–1). However, since the job analysis on which the content of the test is based is over 10 years old, a study should be conducted to examine whether the components included from the previous job analysis are still current and appropriate and whether additional skills should be addressed.


BOX 5–1 Technical Review Synopsis Pre-Professional Skills Test (PPST): Reading (Paper-and-Pencil Administration)

Description: Forty multiple-choice items; one-hour administration. In some states the test is administered prior to admission to a teacher preparation program; in other states it may be administered at any time prior to obtaining an initial license.

Purpose of the Assessment: To measure ability to understand and evaluate written messages.

Competencies to Be Assessed: Two broad categories are covered: Literal Communication (55%) and Critical and Inferential Comprehension (45%).

Developing the Assessment: Based on a 1988 job analysis and reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties range from 0.72 to 0.80; average item-to-total correlations are in the 0.50 range. Differential item functioning analyses were conducted by considering various pairings of examinee groups. Only a few problematic items were noted. Following ETS’s standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have one hour to complete the 40-item test. Training is provided for administrators for standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state’s responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: Modified Angoff was used with panels of size 25 to 40. Panelists are familiar with the job requirements and are representative of the state’s educators in terms of gender, ethnicity, and geographic region.

Consistency, Reliability, Generalizability, and Comparability: Common-item, nonequivalent group equating is used to maintain comparability of scores and pass/fail decisions across years and forms. Internal consistency estimates range from 0.84 to 0.87; limited information is provided on conditional standard errors of measurement at possible passing scores. States set different passing scores on the reading tests, so classification rates are peculiar to states and, in some cases, licensing years within states.

Score Reporting and Documentation: Results are reported to examinees in about four weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. Guides, costing $18 each, contain released tests with answers and explanations and test-taking strategies. Other information (including Tests at a Glance, which contains information on the content and structure of the test, types of questions on the test, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS’s website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity was reported, based on a 1992 study. Limited evidence is provided on disparate impact by gender and racial/ethnic groups. In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, and 50% for African American examinees. Test-taker pools were not large enough to report passing rates for Asian examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and a $25 fee for the test.

Study of Long-term Consequences of Licensure Program: No information was reported on the long-term consequences of the PPST reading test as a component of a total licensure program.

Overall Evaluation: Overall, the PPST in Reading shows strong evidence of being technically sound. The procedures for test development, equating, validation, reliability, and standard setting are consistent with current measurement practices. The job analysis is over 10 years old, and validity evidence was based on a limited content study.

SOURCE: Impara, 2000d.

Principles of Learning and Teaching (K-6) Test

The Principles of Learning and Teaching (PLT) (K-6) test is well constructed and has moderate to good technical qualities. The procedures for test development and standard setting are consistent with current measurement practices (see Box 5–2). Two areas of concern were raised for the test: statistical functioning and fairness. Some indicators of the statistical functioning of the test items are problematic. In particular, correlations of individual items with overall test performance (biserial correlations) are low for a test of this kind. In addition, no studies of differential item functioning are reported. With regard to the fairness criterion, no material is provided on the methods used to equate alternate forms of the test, an issue that is especially important because candidates appear to be performing better across years. Because of the lack of equating information, it is unclear whether this trend reflects a better-prepared candidate population or easier test forms. Also, because the test is a mix of multiple-choice and open-ended questions, the equating strategies are not straightforward. The job analysis for the test is over 10 years old.
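For readers unfamiliar with these statistics, the following sketch shows how item difficulty (the proportion answering correctly) and a corrected item-total correlation, a close relative of the biserial correlation cited in the review, can be computed from a scored response matrix. The data here are simulated and purely illustrative:

import numpy as np

# Simulated scored responses (1 = correct, 0 = incorrect) for 500
# examinees on 45 items; purely illustrative, not Praxis data.
rng = np.random.default_rng(0)
ability = rng.normal(size=500)
easiness = rng.normal(0.8, 0.5, size=45)
p_correct = 1 / (1 + np.exp(-(ability[:, None] + easiness)))
responses = (rng.random((500, 45)) < p_correct).astype(int)

total_scores = responses.sum(axis=1)

for i in range(3):  # first three items, for illustration
    item = responses[:, i]
    difficulty = item.mean()            # proportion answering correctly
    rest = total_scores - item          # total score with the item removed
    r_item_total = np.corrcoef(item, rest)[0, 1]
    print(f"item {i + 1}: difficulty = {difficulty:.2f}, "
          f"corrected item-total r = {r_item_total:.2f}")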

Middle School English/Language Arts Test

The Middle School English/Language Arts test is well constructed and has reasonably good technical properties. The procedures for test development and standard setting are consistent with current measurement practices (see Box 5–3). Three areas of the test show possible weaknesses in relation to the evaluation framework. First, since the test is derived directly from the High School English/Language Arts test, it is not clear whether the item review process is sufficient and relevant to the middle school level. Second, as with the PLT (K-6) test, no information is provided on differential item functioning across identified groups of examinees. Finally, as with the PLT (K-6) test, information is lacking on the equating strategies used. This test combines multiple-choice and open-ended item formats, which complicates the equating process.


BOX 5–2 Technical Review Synopsis Principles of Learning and Teaching (PLT) (K-6) Test

Description: Forty-five multiple-choice items; six constructed-response tasks; two-hour administration. The test is designed for beginning teachers and is intended to be taken after a candidate has almost completed his or her teacher preparation program.

Purpose of the Assessment: To assess a beginning teacher’s knowledge of a variety of job-related criteria, including organizing content knowledge for student learning, creating an environment for learning, teaching for student learning, and teacher professionalism.

Competencies to Be Assessed: Organizing Content Knowledge for Student Learning (28%), Creating a Learning Environment (28%), Teaching for Student Learning (28%), Teacher Professionalism (16%).

Developing the Assessment: Based on a 1990 job analysis and reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties are typically 0.70; average item-to-total correlations are in the mid-0.30s. No differential item functioning analyses were reported. Following ETS’s standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have two hours to complete the test. Training is provided for administrators for standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state’s responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: Modified Angoff for the multiple-choice items was used with panels of size 25 to 40. Panelists are familiar with the job requirements and are representative of the state’s educators in terms of gender, ethnicity, and geographic region. Either a benchmark or an item-level pass/fail method was used with the constructed-response questions.

Consistency, Reliability, Generalizability, and Comparability: No information is provided on what method was used to maintain comparability of scores across years and forms. Interrater reliability estimates on constructed-response items are all greater than 0.90; overall reliability estimates range from 0.72 to 0.76. Limited information is reported on conditional standard errors of measurement at possible passing scores. States set different passing scores on the test, so classification error rates are peculiar to state and year.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. No interpretive guide specific to this test is available. Some information (including Tests at a Glance, which contains information on the content and structure of the test, types of questions on the test, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS’s website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.


Validation Studies: Content-related evidence of validity is reported. Limited evidence is provided on disparate impact by gender and racial/ethnic groups. In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, 82% for Asian examinees, and 48% for African American examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and an $80 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the test as a component of a total licensure program.

Overall Evaluation: Overall, the test is well constructed and has moderate to good psychometric properties. The procedures for test development, validation, and standard setting are all consistent with current measurement practices. No information was provided on equating alternate forms of the test, and validity evidence is limited to content-related evidence.

SOURCE: Impara, 2000e.

BOX 5–3 Technical Review Synopsis Middle School English/Language Arts Test

Description: Ninety multiple-choice items; two constructed-response tasks; two-hour administration. The test is designed for beginning teachers and is intended to be taken after a candidate has almost completed his or her teacher preparation program.

Purpose of the Assessment: To measure whether an examinee has the knowledge and competencies necessary for a beginning teacher of English/language arts at the middle school level.

Competencies to Be Assessed: Reading and Literature Study (41%), Language and Linguistics (18%), Composition and Rhetoric (41%).

Developing the Assessment: Based on a 1996 job analysis, undertaken to determine the extent to which an earlier job analysis for secondary teachers would apply to middle school teachers, and on reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties were typically 0.73; average item-to-total correlation was 0.37. No differential item functioning analyses are reported. As is ETS’s standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have two hours to complete the test. Training is provided for administrators for standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state’s responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: Modified Angoff was used for the multiple-choice items with panels of size 25 to 40. Panelists are familiar with the job requirements and are representative of the state’s educators in terms of gender, ethnicity, and geographic region. Either a benchmark or an item-level pass/fail method was used with the constructed-response questions.

Consistency, Reliability, Generalizability, and Comparability: No information was provided on the method used to maintain comparability of scores across years and forms. Interrater reliability on constructed-response items was 0.89; overall reliability was estimated at 0.86. Limited information was reported on conditional standard errors of measurement at possible passing scores. States set different passing scores, so classification error rates are specific to states and years.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. No interpretive guide specific to this test is available. Some information (including Tests at a Glance, which contains information on the content and structure of the test, types of questions on the test, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS’s website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity is reported. Limited evidence is provided on disparate impact by gender and racial/ethnic groups. In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, 82% for Asian examinees, and 48% for African American examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and an $80 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the test as a component of a total licensure program.

Overall Evaluation: Overall, the test is well constructed and has moderate to good psychometric properties. The procedures for test development, validation, and standard setting are all consistent with current measurement practices. No information was provided on equating alternate forms of the test, and validity evidence is limited to content-related evidence.

SOURCE: Impara, 2000c.


Mathematics: Proofs, Models, and Problems, Part 1 Test

The Mathematics: Proofs, Models, and Problems, Part 1 test is well constructed and has reasonably good technical properties. The test development and standard-setting procedures are consistent with current measurement practices (see Box 5–4). Specifications for the test are unclear, and it appears that the test may include material not directly related to the content specifications. If this is the case, the possibility of score contamination is a concern because performance by some candidates might be distorted by noncontent-specific information included in the test questions. Furthermore, the test contains open-ended questions, and interrater score agreement is lower than desired for some of these questions. Similar concerns are noted for the PLT (K-6) and Middle School English/Language Arts tests. No information is reported on differential item functioning for this test. Fairness is also a concern because the equating method is questionable if forms have dissimilar content. In addition, sample sizes are too low to have confidence in the accuracy of the equating results.
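Box 5–4 reports that scores on this test are linked to a multiple-choice test through a chained equipercentile procedure. The sketch below shows only the basic equipercentile step (mapping a score to the score on another form with the same percentile rank) using simulated score distributions; an operational chained procedure links the forms through an anchor test and typically applies smoothing:

import numpy as np

# Minimal sketch of the basic equipercentile step. All score
# distributions below are simulated and purely illustrative.
rng = np.random.default_rng(1)
form_x = rng.normal(30, 6, size=2000).round()   # scores on form X
form_y = rng.normal(28, 7, size=2000).round()   # scores on form Y

def percentile_rank(scores, value):
    """Fraction of the group scoring at or below `value`."""
    return np.mean(scores <= value)

def equate(x_score):
    """Form-Y score with the same percentile rank as `x_score` on form X."""
    pr = percentile_rank(form_x, x_score)
    return np.quantile(form_y, pr)

print(equate(33.0))  # form-Y equivalent of a 33 on form X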


BOX 5–4 Technical Review Synopsis Mathematics: Proofs, Models, and Problems, Part 1 Test

Description: Four constructed-response tasks (one mathematical proof, one developing a mathematical model, two problem solving); one-hour administration. The test is designed for beginning teachers and is intended to be taken after a candidate has almost completed his or her teacher preparation program.

Purpose of the Assessment: To measure the mathematics knowledge and competencies necessary for a beginning teacher of secondary mathematics.

Competencies to Be Assessed: To solve the four problems, examinees must understand and be able to work with mathematical concepts, reason mathematically, integrate knowledge of different areas of mathematics, and develop mathematical models of real-life situations.

Developing the Assessment: Based on a 1989 job analysis and through reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average scores for the two problem-solving items (out of a possible 10) are 7.8 and 3.7; mathematical proof average score is 5.4; average for the mathematical model question is 5.0. Intertask correlations range from a low of 0.12 to a high of 0.32. No differential item functioning analyses are reported. ETS’s standard practice is to conduct sensitivity reviews by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have two hours to complete the test. Training is provided for administrators for standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state’s responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials. Competencies from other content areas may be required to solve the problems. This could confound score interpretations.

Standard Setting: ETS uses either a benchmark or an item-level pass/fail method with constructed-response questions. Neither method is well documented in the literature, and there is no specific report for the applications of these methods for this test.

Consistency, Reliability, Generalizability, and Comparability: Total raw scores on the test are equated through raw scores on a multiple-choice test using a chained equipercentile procedure. The equating is based on very small samples. Interrater reliability on constructed-response items is in the mid-0.90s. Classification error rates are peculiar to state and year.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. Guides, costing $31 each, contain released tests with answers, explanations, and test-taking strategies. Other information (including Tests at a Glance, which contains information on the content and structure of the test, types of questions on the test, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS’s website. There is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity is reported. Differential passing rates are reported only for white and African American examinees (due to small sample sizes). In 1998 to 1999, across all states, the average passing rate for white examinees was 82% and for African American examinees 53%.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and a $70 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information is reported on the long-term consequences of Mathematics: Proofs, Models, and Problems, Part 1 as a component of a total licensure program.

Overall Evaluation: Overall, the test is well constructed and has reasonably good psychometric properties. The procedures for test development, validation, and standard setting are consistent with current measurement practices. The equating strategy is problematic, especially given the small sample sizes. The cost of the study guide may be prohibitive for some candidates.

SOURCE: Impara, 2000b.

Biology: Content Knowledge Tests, Parts 1 and 2

The Biology: Content Knowledge Tests, Parts 1 and 2, appear to be well constructed and to have moderate to good psychometric properties. The test development and standard-setting procedures are consistent with current practice (see Box 5–5). No information on differential item functioning was reported for the biology tests, and no information was provided regarding the equating of alternate forms (the form reviewed was the base form). Only limited statistical data are available.

EXAMINING DISPARATE IMPACT

Test fairness issues are important to test quality. In this section of the report, the committee examines data for racial/ethnic minority and majority teacher candidates on several teacher licensing tests; compares these data to data from other large-scale tests; and discusses issues of test bias, the consequences of disparate impact, and the policy implications of the data.

Historically and currently, African American and Hispanic candidates usually have substantially lower passing rates on teacher licensure tests than white candidates (Garcia, 1985; George, 1985; Goertz and Pitcher, 1985; Graham, 1987; Rebell, 1986; Smith, 1987; Gitomer et al., 1999; Mehrens, 1999; Brunsman et al., 1999; Brunsman et al., 2000; Carlson et al., 2000). The size of the gap in passing rates varies across tests and states.


BOX 5–5 Technical Review Synopsis Biology: Content Knowledge Tests, Parts 1 and 2

Description: Each test consists of 75 multiple-choice items; each test is designed to be administered in one hour. These tests are designed for beginning teachers and are intended to be taken after a candidate has almost completed a teacher preparation program.

Purpose of the Assessment: To measure the knowledge and competencies necessary for a beginning teacher in biology in a secondary school.

Competencies to Be Assessed: Part 1: Basic Principles of Science (17%); Molecular and Cellular Biology (16%); Classical Genetics and Evolution (15%); Diversity of Life, Plants, and Animals (26%); Ecology (13%); and Science, Technology, and Society (13%). Part 2: Molecular and Cellular Biology (21%); Classical Genetics and Evolution (24%); Diversity of Life, Plants, and Animals (37%); and Ecology (18%).

Developing the Assessment: Based on a 1990 job analysis and reviews by an external advisory committee.

Field Testing and Exercise Analysis: Part 1: Average item difficulties range from 0.64 to 0.70; average item-to-total correlations are in the mid-0.40s. Part 2: Average item difficulties range from 0.53 to 0.57; average item-to-total correlations are in the upper 0.30s. Differential item functioning analyses were not conducted due to small samples. ETS’s standard practice is to conduct sensitivity reviews by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have one hour to complete each of the 75-item tests. Training is provided for administrators for standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are a state responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: Modified Angoff was used with panels of size 25 to 40. Panelists are familiar with the job requirements and are representative of the state’s educators in terms of gender, ethnicity, and geographic region.

Consistency, Reliability, Generalizability, and Comparability: Equating is used to maintain comparability of scores and pass/fail decisions across years and forms, although the specific method is not identified. Internal consistency estimates are in the mid-0.80s for both tests; limited information was provided on conditional standard errors of measurement at possible passing scores. Classification error rates are specific to state and year.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. Guides, costing $16 each, contain released tests with answers, explanations, and test-taking strategies. Other information (including Tests at a Glance, which contains information on the content and structure of the test, types of questions on the test, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS’s website. There is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity is reported, based on a 1992 study. Limited evidence is provided on disparate impact by gender and racial/ethnic groups. In 1998 to 1999, across all states, average passing rates for Part 1 were 91% for white examinees, 34% for African American examinees, and 71% for Asian examinees; test-taker pools for Hispanic candidates were not large enough to report passing rates. For Part 2, average passing rates by examinee group were as follows: white, 70%; African American, 24%; Hispanic, 35%; Asian, 74%.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and fees of $45 each for Parts 1 and 2.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the tests as components of a total licensure program.

Overall Evaluation: Overall, these tests seem to be well constructed with moderate to good psychometric properties. The procedures for test development, validation, and standard setting are all consistent with current measurement practices. The job analysis is dated, and no information is provided on the procedures used to equate scores on different forms of these tests.

SOURCE: Impara, 2000a.

Before discussing these differences, issues to be considered in comparing racial/ethnic minority and majority group data on teacher licensure tests are noted. These issues bear on the use of licensure tests to make decisions about teacher candidates. The committee considers fairness issues in using licensure tests to judge program quality in Chapter 7.

METHODOLOGICAL NOTE ABOUT COMPARISONS

First-Time and Eventual Passing Rates

Within all racial/ethnic groups, first-time test takers of teacher licensure tests generally have higher passing rates than do repeaters. Moreover, as a byproduct of the differences in passing rates among groups, a larger percentage of minority applicants are likely to be repeaters than are nonminority candidates. Hence, comparisons of average scores and passing rates among groups that are based on a single administration of a test (or on the last test taken during a given year) do not give a complete picture of group differences. Comparisons based on a single test administration are likely to inflate differences in passing rates among groups.

Because of the importance of this distinction, some analysts prefer to emphasize eventual passing rates in examining test fairness issues in licensure decision making; the eventual passing rate is the percentage of a group that passes after several attempts. This approach focuses on the passing rate that corresponds to the percentage of candidates meeting the testing requirements for licensure—at some point in time, if not on the first attempt. Conversely, other analysts focus on initial passing rates, the rates of first-time test takers. The first-time testing group includes four sets of individuals: (1) initial testers who pass, (2) initial testers who fail and never retry the licensing test, (3) test takers who initially fail but eventually pass the licensing test, and (4) test takers who repeat but never pass the licensing test. Initial passing rates are important. Candidates’ initial unsuccessful attempts cause delays and additional costs, even for those who eventually pass. The committee contends that both initial and eventual testing results are important.
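A small numerical illustration of the distinction, using hypothetical attempt histories:

# Hypothetical attempt histories for one cohort: each list holds a
# candidate's pass/fail results in the order the test was taken.
histories = [
    [True],                 # passed on the first attempt
    [False],                # failed once, never retried
    [False, False, True],   # failed initially, eventually passed
    [False, False],         # repeated but never passed
    [True],
]

first_time_rate = sum(h[0] for h in histories) / len(histories)
eventual_rate = sum(any(h) for h in histories) / len(histories)

print(f"first-time passing rate: {first_time_rate:.0%}")  # 40%
print(f"eventual passing rate:   {eventual_rate:.0%}")    # 60%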

Data Combined Across States

Comparisons of passing rates for racial/ethnic groups that are based on data aggregated across states can present interpretation problems. A number of factors, such as differences among states in their passing scores and in the characteristics of their minority and majority teacher candidates, make such data difficult to interpret. Consequently, the committee suggests that readers use extra caution in interpreting the results of any comparisons that aggregate data across states.

Test Scales

Different tests have different scoring scales. Consequently, scores must be converted to a common metric in order to determine whether a gap in average scores between two groups is larger on one test than another. This is usually accomplished by reporting the difference in scores between groups in terms of standard deviation units. The standard deviation difference is computed by dividing the difference between the mean scores for two groups by the standard deviation of the scores. (For readers who are not familiar with this metric, a one standard deviation difference in average scores between groups would roughly correspond to about 75 percent of the high-scoring group having scores that are higher than 75 percent of those in the low-scoring group. Thus, although a one standard deviation difference is quite large, there is still some overlap in the scores of the groups.)
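In symbols (the notation here is added for illustration), with \bar{x}_1 and \bar{x}_2 the two group means and s the standard deviation of the scores, the standardized difference is

d = \frac{\bar{x}_1 - \bar{x}_2}{s}

For example, on the SAT score scale discussed below (standard deviation 100), group means of 501 and 413 yield d = (501 − 413)/100 = 0.88.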


DIFFERENCES BETWEEN MINORITY AND MAJORITY EXAMINEES ON LARGE-SCALE TESTS

As a frame of reference for the discussion that follows, it is useful to note that the differences in average scores among racial/ethnic groups on the teacher licensure tests the committee examined are generally similar to the differences found among these groups on other tests. In one review of test data, Hedges and Nowell (1998) found that the average scores of African American and white test takers on a large number of secondary-level tests differed by 0.82 to 1.18 standard deviation units. Similar differences have been found on the National Assessment of Educational Progress tests (U.S. Department of Education, 1998b). On the 1999 Scholastic Assessment Test (SAT), the difference in average scores between African Americans and white test takers was one standard deviation on the mathematics section and 0.89 standard deviation units on the verbal section (College Entrance Examination Board, 1999).

DIFFERENCES BETWEEN MINORITY AND MAJORITY TEACHER CANDIDATES ON THE SAT

Differences in SAT scores among prospective teachers provide another point of comparison for differences among racial/ethnic groups on teacher licensure tests. ETS recently reported SAT data for over 150,000 teacher candidates who took the Praxis I and Praxis II tests between 1994 and 1997. ETS matched these candidates’ records to the records of those who took the SAT between 1977 and 1997. The last SAT record was used for individuals who took the SAT more than once (some individuals retake the SAT with the goal of improving their scores). Table 5–1 shows the average SAT scores of Praxis examinees. For the general SAT-taking population, the mean and standard deviation of SAT scores are 500 and 100, respectively.

TABLE 5–1 Average SAT Scores for Praxis Test Takers by Population Group, 1994–1997

                     Praxis I Examineesa             Praxis II Examineesb
Ethnicity            N        SAT Math  SAT Verbal   N         SAT Math  SAT Verbal
African American     3,603    413       428          11,510    424       440
Asian American       1,277    517       490          3,810     534       508
Hispanic             602      459       476          5,352     465       473
White                27,506   501       514          135,035   505       518

aAverage SAT math and verbal scores for Praxis I examinees were 491 and 503, respectively.

bAverage SAT math and verbal scores for Praxis II examinees were 498 and 510, respectively.

SOURCE: Gitomer et al. (1999).

Table 5–1 shows that African American and Hispanic Praxis I examinees have lower average SAT scores than Asian American and white teacher candidates. The same pattern appears for Praxis II takers. Table 5–2 shows the differences in mean SAT scores between white and racial/ethnic minority teacher candidates, expressed in standard deviation units. The difference between white and African American Praxis I examinees is slightly less than one standard deviation on both the math and the verbal sections of the SAT. The difference between white and Hispanic examinees is about half this size. For Praxis II examinees the pattern is similar. These standard deviation differences are likely to be conservative estimates because the standard deviation of SAT scores for the total SAT-taking population was used in the computations; this standard deviation is likely to be larger than the standard deviation of SAT scores for Praxis I and II test takers. (SAT standard deviation data were not reported for Praxis test takers.)
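As an arithmetic check, the entries in Table 5–2 can be reproduced from the Table 5–1 means and the SAT population standard deviation of 100. A short sketch (all values copied from Table 5–1):

# Reproducing Table 5-2 from the Table 5-1 means, using the SAT
# population standard deviation of 100 (as the source reports).
SD = 100.0

means = {  # (SAT math, SAT verbal) means from Table 5-1
    "white":            {"praxis_1": (501, 514), "praxis_2": (505, 518)},
    "african_american": {"praxis_1": (413, 428), "praxis_2": (424, 440)},
    "asian_american":   {"praxis_1": (517, 490), "praxis_2": (534, 508)},
    "hispanic":         {"praxis_1": (459, 476), "praxis_2": (465, 473)},
}

for group in ("african_american", "asian_american", "hispanic"):
    for series in ("praxis_1", "praxis_2"):
        d_math = (means["white"][series][0] - means[group][series][0]) / SD
        d_verbal = (means["white"][series][1] - means[group][series][1]) / SD
        print(f"{group:16s} {series}: math {d_math:+.2f}, verbal {d_verbal:+.2f}")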

The mean differences in SAT performance between African American and white Praxis examinees are slightly smaller than those reported for the broader SAT-taking population. This may reflect restriction in range for the groups; that is, the individuals whose records were used in this analysis were college entrants and thus a relatively capable subset of the total SAT-taking population. It may also reflect the fact that the SAT records used were candidates’ last records, obtained by searching 20 years of SAT files for Praxis test takers. As just noted, it also may reflect a difference in the standard deviation of SAT scores for the total SAT-taking population compared to the group that took the Praxis tests. Whether these factors play out differently for minority and majority test takers is unknown.

TABLE 5–2 Differences Between Average SAT Scores of Minority and White Praxis I and II Test Takers in Standard Deviation Units, 1994–1997

                                 Praxis I Examinees       Praxis II Examinees
Differences Between Whites and:  SAT Math  SAT Verbal     SAT Math  SAT Verbal
African Americans                0.88      0.86           0.81      0.78
Asian Americans                  −0.16     0.24           −0.29     0.10
Hispanics                        0.42      0.38           0.40      0.45


DIFFERENCES BETWEEN MINORITY AND MAJORITY CANDIDATES ON TEACHER LICENSING TESTS

Average Scores

Though licensure testing programs generally report passing rates rather than average scores for their candidates, the committee was able to obtain group means from ETS for the Praxis tests it reviewed earlier in this chapter (Educational Testing Service, prepared March 22, 2001). Tables 5–3 and 5–4 report data for the Pre-Professional Skills Test in Reading and the Principles of Learning and Teaching (K-6) test. There were too few test takers in some population groups to report average score data for the Middle School English/Language Arts test; the Mathematics: Proofs, Models, and Problems, Part 1 test; and Parts 1 and 2 of the Biology: Content Knowledge test. These Praxis I and PLT data are from the 1998/1999 administrations of the tests and include the most recent testing records for candidates testing more than once during that period. For both tests, average scores for the groups are presented in Table 5–3, and the differences between group averages (in standard deviation units) are given in Table 5–4. Some of the limitations of these data are described below.

Table 5–3 shows that the average scores of minority candidates on the PPST in Reading and the PLT (K-6) for 1998/1999 were lower than those of white candidates. Table 5–4 shows that the difference between the average scores of African American and white test takers on the 1998/1999 PPST in Reading was 1.2 standard deviations. The difference between Hispanic and white average scores for 1998/1999 was 0.7 standard deviations. The difference in the average scores of African American and white candidates on the PLT (K-6) test for 1998/1999 was 1.2 standard deviations; the average difference between Hispanic and white examinees was 0.7.

TABLE 5–3 Average Praxis Scores for 1998–1999 Test Takers by Racial/Ethnic Group

                     PPST: Reading         PLT (K-6)
Ethnicity            N        Meana        N        Meanb
African American     5,296    172          1,617    159
Asian American       1,114    175          359      169
Hispanic             848      175          206      165
White                38,868   179          15,743   173

aThe most recent scores were included for examinees repeating PPST: Reading in 1998/1999; mean, 178; standard deviation, 6.

bThe most recent scores were included for examinees repeating PLT (K-6) in 1998/1999; mean, 172; standard deviation, 12.

SOURCE: Data provided to the committee by Educational Testing Service on March 22, 2001.


TABLE 5–4 Differences Between the Average Praxis Scores of Minority and White Test Takers in Standard Deviation Units, 1998–1999

Differences Between Whites and:  PPST: Reading  PLT (K-6)
African Americans                1.2            1.2
Asian Americans                  0.7            0.3
Hispanics                        0.7            0.7

Several methodological characteristics of the data may affect the group differences. The first is that the Praxis records used in this analysis combine testing data for first-time examinees with data for those retesting during the application year. The data do not take into account the performance of those in the cohort who retest successfully after the application year. Moreover, the data include the later results for individuals who tested unsuccessfully before the 1998/1999 application year. Application-year reports like these tend to exaggerate group differences because minority examinees tend to be overrepresented in the repeat test-taking population. The average scores for minority test takers are depressed by the inclusion of greater numbers of repeaters who, by definition, are lower scoring.

Second, characteristics of the SAT dataset for Praxis examinees noted earlier affect comparisons between these data and those in Table 5–2. The SAT analyses included 20 years of data, allowing more retesting opportunities and higher scores for some candidates. Further, the standard deviation differences for SAT scores in Table 5–2 were calculated using the SAT population standard deviation, not the standard deviation of SAT scores for Praxis test takers.

Data reports that combine testing records for first-time examinees with those of repeaters are called concurrent reports. The vast majority of state agencies report data for their licensure tests as concurrent reports.
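To illustrate how a concurrent report is assembled, the sketch below keeps only each candidate's most recent record in the application year before computing a summary statistic. The record layout and values are hypothetical; the point is that repeaters enter the report once, through their latest attempt.

```python
import pandas as pd

# Hypothetical attempt-level records for one application year; in a
# concurrent report, a candidate who retests contributes only the most
# recent of these records.
records = pd.DataFrame({
    "candidate_id": [1, 1, 2, 3, 3, 3],
    "test_date": pd.to_datetime(["1998-10-01", "1999-03-01", "1998-11-15",
                                 "1998-09-20", "1999-01-10", "1999-06-05"]),
    "score": [168, 174, 181, 160, 166, 171],
})

# Keep the latest record per candidate, then summarize
latest = (records.sort_values("test_date")
                 .groupby("candidate_id", as_index=False)
                 .last())
print(latest)                                      # one row per candidate
print("Mean of most recent scores:", latest["score"].mean())
```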

Licensure testing programs generally report passing rates instead of average scores because passing rates indicate how many candidates gain access to the profession. States using ETS tests generally report passing rate data for the Praxis tests, and states using NES tests typically do as well. Passing rate data are described next.

Passing Rates

Table 5–5 shows passing rates for the two tests discussed above. It gives the average passing rates on PPST: Reading and PLT (K-6) by racial/ethnic group for 1998/1999 test takers.


TABLE 5–5 Praxis Passing Rates for Test Takers by Racial/Ethnic Group, 1998–1999

                     PPST: Reading^a          PLT (K-6)^c
Ethnicity            N         % Passing^b    N         % Passing
African American     3,874     50             1,219     48
Asian American       670       59             280       82
Hispanic             375       65             163       65
White                21,944    86             12,569    86

^a Data for 29 states and U.S. Department of Defense Dependents Schools were included.

^b Average passing rates were calculated by averaging across the passing rates resulting from the application of state passing scores to data for candidates reporting scores to their respective states. States were equally weighted in computing the averages. Note that the number of candidates taking each test exceeds the number reporting data back to a state. For the PLT test, 90 to 100% of examinees reported scores to their states. For the PPST reading test, the percentages reporting by racial/ethnic group ranged from 50 to 80%. There was no discernible relationship between the groups' reporting rates and passing rates.

^c Data for 12 states were included.

SOURCE: Data provided to the committee by Educational Testing Service on August 17, 2000.

The candidate data are a subset of those used for Table 5–3 because some candidates take Praxis but do not report their scores back to their states. PPST data are reported in Table 5–5 for candidates reporting scores back to the 29 states using the test in 1998/1999; PLT data are shown for candidates reporting scores to 12 states. For the PLT test, between 90 and 100 percent of examinees reported scores to their states in 1998/1999; for PPST: Reading, the percentages reporting by racial/ethnic group ranged from 50 to 80 percent. There was no discernible relationship between the groups' reporting rates and the passing rates on PPST: Reading. Average passing rates were calculated by applying each state's passing score to the data for candidates reporting scores to that state; the resulting in-state passing rates were then averaged across states, with states weighted equally. The differences between group passing rates on the two tests are shown in Table 5–6.
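The averaging procedure described above can be made concrete with a short sketch. The state names, passing scores, and score lists below are invented for illustration; only the two-step method (in-state rates first, then an unweighted average across states) mirrors the description in the text.

```python
# Hypothetical scores reported to three states, each with its own passing
# score. Step 1: apply each state's passing score to its own candidates.
# Step 2: average the in-state passing rates, weighting states equally.
states = {
    "State A": {"cut": 172, "scores": [168, 175, 181, 170, 179]},
    "State B": {"cut": 176, "scores": [174, 182, 169, 177]},
    "State C": {"cut": 170, "scores": [171, 165, 180, 178, 173, 169]},
}

in_state_rates = []
for name, s in states.items():
    rate = sum(score >= s["cut"] for score in s["scores"]) / len(s["scores"])
    in_state_rates.append(rate)
    print(f"{name}: {rate:.0%} passing")

# Equal weights: every state counts once, however many candidates it reports
print(f"Equally weighted average: {sum(in_state_rates) / len(in_state_rates):.0%}")
```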

TABLE 5–6 Differences Between Praxis Passing Rates for Minority and White Test Takers, 1998–1999

Differences Between Whites and:   PPST: Reading   PLT (K-6)
African Americans                 36^a            38
Asian Americans                   27              4
Hispanics                         21              21

^a Differences are in percentage points.


The data in Table 5–5 show substantial disparities between the passing rates of white and minority test takers on both tests. As Table 5–6 shows, the gap between African American and white test takers in 1998/1999 was 36 percentage points on the PPST reading test and 38 on the PLT (K-6) test. For Hispanics the differences were 21 percentage points on both tests; for Asian Americans they were 27 and 4 percentage points, respectively.

Like the data in Tables 5–3 and 5–4, these data have limitations. They are subject to two types of misinterpretation due to data aggregation. As already noted, they confound the scores of initial and repeat test takers, and group differences may be amplified because minority candidates are overrepresented among repeat test takers. The data also may misrepresent similarities and differences in passing rates across groups within states. These average passing rates combine data across states using the same tests (based on states' own passing scores, which vary). States also differ in demographic profile; Texas, for example, has a higher percentage of Hispanic candidates than many other states. One group may be more or less likely than another to test in states with relatively low passing scores. The combination of different passing scores and different demographic profiles across states makes direct comparisons of passing rates across groups problematic.

Nonetheless, the pattern in these results is similar to the patterns observed between minority and majority examinees on the National Board for Professional Teaching Standards (NBPTS) assessments. Certification rates of slightly over 40 percent for white teachers have been reported (Bond, 1998); the reported certification rate for African American teachers was 11 percent, some 30 percentage points lower than the rate for white teachers. The NBPTS assessments are performance based and differ in format from the Praxis tests. The NBPTS assessments and the differences between them and conventional tests are described in Chapter 8.

The pattern in the Praxis results is also seen on licensure tests in certain other professions. For example, a national longitudinal study of graduates of American Bar Association-approved law schools found initial passing rates on the bar exam to be 61 percent for African Americans, 81 percent for Asians, 75 percent for Hispanics, and 92 percent for whites (Wightman, 1998). The corresponding eventual passing rates (after as many as six attempts) were 78, 92, 89, and 97 percent, respectively. Thus, the 31 percentage point difference between passing rates for African Americans and whites on initial testing shrank to a 19-point gap after as many as six attempts. The differences between Hispanics and whites dropped from 17 percentage points to 8. These data also must be interpreted with care, however. Like the Praxis results, data were aggregated across states that have very different passing scores and compositions of minority candidates. To illustrate, although states have different essay sections, almost all of them use the same multiple-choice test. On that test, minority students in one large western state had substantially lower scores than their white classmates. Nevertheless, they still had higher scores than the mostly white candidates in another state. These states also had quite different passing standards and different percentages of minority candidates.

Analogous data are found for medical licensure tests (but because the same passing score is used nationwide, these data are less subject to concerns about misinterpretations of aggregated data). On the first part of the medical tests, a difference of 45 percentage points has been reported for initial passing rates of white and African American medical students, but the difference in their eventual passing rates dropped to 11 points. Similarly, the 25 percentage point difference in initial passing rates between these groups on the second part of the exam dropped to a 9-point difference in eventual passing rates (Case et al., 1996).

The initial and eventual passing rates for lawyers and physicians may be affected by their common use of intensive test preparation courses for these exams. Test preparation courses are less widely available for teacher licensure tests. There may be other differences between these doctoral-level licensing tests and teacher licensure tests that play out differently for minority and majority examinees.

The committee was able to obtain information on initial and eventual passing rates for teacher licensure tests from only two states: California and Connecticut. These two datasets avoid some of the interpretation problems posed by aggregating data across states. They also allow examination of group differences on candidates' first attempts and on later attempts. Eventual passing rates are important because they are determinative; they relate fairly directly to the licensure decision. The initial rates matter as well, since candidates who initially fail but eventually pass may experience delays and additional costs in securing a license.

Table 5–7 shows the number and percentage of candidates who passed the California Basic Educational Skills Test (CBEST) on their first attempt in 1995/1996 and the percentage of the 1995/1996 cohort that passed the CBEST by the end of the 1998/1999 testing year. Table 5–9 provides analogous data for California's Multiple Subjects Assessment for Teachers (MSAT) examination. First-time passing rates on the 1996/1997 test are given, along with passing rates for that cohort by 1998/1999. Tables 5–8 and 5–10 give group differences for these tests.

Table 5–7 shows that initial passing rates for 1995/1996 minority candidates on the CBEST exam were lower than for white examinees. The difference between African American and white initial passing rates was 38 percentage points. The gap between rates for Mexican Americans and whites was 28 percentage points, and the difference between Latinos/other Hispanics and whites was 22 percentage points. The passing rates for all groups increased after initially unsuccessful candidates took the test one or more additional times; and as the eventual rates show, the differences between passing rates for minority and majority groups decreased. The gap between African American and white candidates' CBEST passing rates fell from 38 percentage points to 21.

TABLE 5–7 Passing Rates for the CBEST by Population Group, 1995–1996 Cohort

                          First-Time Passing Rates   Eventual Passing Rates
Ethnicity                 N^a       % Passing        N         % Passing
African American          2,599     41               2,772     73
Asian American            1,755     66               1,866     87
Mexican American          3,907     51               4,344     88
Latino or other Hispanic  2,014     47               2,296     81
White                     25,928    79               26,703    94

^a The size of the 1995–1996 cohort differs for the first-time and eventual reports because first-time rates consider candidates who took all three CBEST sections on their first attempt; eventual rates consider candidates who took each CBEST section at least once by 1998/1999.

SOURCE: Data from Carlson et al. (2000).

The gap between Mexican Americans and whites dropped from 28 to 6 points, and the gap between Latino/other Hispanic examinees and white candidates dropped from 22 to 13 percentage points.

In these data, eventual passing rates tell a different story than initial rates do. The committee contends that both sets of rates should inform policy makers' judgments about the disparate impact of licensure tests on minority and majority teacher candidates.

For the MSAT, initial and eventual passing rates for all groups were lower than the corresponding CBEST passing rates, and passing rates for minority candidates were again lower than majority passing rates. The difference between African American and white candidates on the first MSAT attempt was 49 percentage points; by the end of the 1998/1999 testing year, the difference had dropped to 42 percentage points. A 35 percentage point difference between Mexican American and white candidates on the first attempt dropped to 26 percentage points by the end of the third year. The difference between Latino/other Hispanic and white candidates dropped from 33 to 22 percentage points.

Tables 5–11, 5–12, 5–13, and 5–14 provide similar data for Connecticut teacher candidates. The structure of the Connecticut dataset differs from that of the California data in that passing rates are shown for all Connecticut candidates who tested between 1994 and 2000. For the California analyses, the records of first-time candidates in a given year were matched to any subsequent testing attempts made in the next several years. The Connecticut analyses begin with initial testers in 1994, and the dataset follows these individuals over the next six years. The dataset also includes initial testers from 1995; the records of these candidates are matched to any retest attempts occurring in the next five years.


TABLE 5–8 Differences Between CBEST Passing Rates for Minority and White California Candidates, 1995–1996 Cohort

Differences Between Whites and:   First-Time Passing Rates   Eventual Passing Rates
African Americans                 38^a                       21
Mexican Americans                 28                         6
Latinos or other Hispanics        22                         13
Asian Americans                   13                         7

^a Differences are in percentage points.

TABLE 5–9 Passing Rates for the MSAT by Population Group, 1996–1997 Cohort

                          First-Time Passing Rates   Eventual Passing Rates
Ethnicity                 N         % Passing        N         % Passing
African American          424       24               424       46
Asian American            543       62               543       81
Mexican American          989       38               989       62
Latino or other Hispanic  428       40               428       66
White                     7,986     73               7,986     88


SOURCE: Data from Brunsman et al. (1999).

TABLE 5–10 Differences Between MSAT Passing Rates for Minority and White California Candidates, 1996–1997 Cohort

Differences Between Whites and:   First-Time Passing Rates   Eventual Passing Rates
African Americans                 49^a                       42
Mexican Americans                 35                         26
Latinos or other Hispanics        33                         22
Asian Americans                   11                         7

^a Differences are in percentage points.


TABLE 5–11 Passing Rates for Praxis I: Computer-Based Test by Population Group, 1994–2000 Connecticut Candidates

                     First-Time Passing Rates   Eventual Passing Rates
Ethnicity            N         % Passing        N         % Passing
African American     354       48               452       55
Asian American       96        54               227       66
Hispanic             343       46               442       59
White                8,852     71               10,035    81

SOURCE: Data provided to the committee by the State of Connecticut Department of Education on February 9, 2001. See text for a description of this dataset.

TABLE 5–12 Differences Between Praxis I: Computer-Based Test Passing Rates for Minority and White Connecticut Candidates, 1994–2000

Differences Between Whites and:   First-Time Passing Rates   Eventual Passing Rates
African Americans                 23^a                       26
Asian Americans                   17                         15
Hispanics                         25                         22

^a Differences are in percentage points.

TABLE 5–13 Passing Rates on the Praxis II: Elementary Education Tests by Population Group, 1994–2000 Connecticut Candidates

                     First-Time Passing Rates   Eventual Passing Rates
Ethnicity            N         % Passing        N         % Passing
African American     64        33               122       64
Asian American       38        66               48        83
Hispanic             66        54               95        78
White                2,965     68               3,877     89

SOURCE: Data provided to the committee by the State of Connecticut Department of Education on February 9, 2001. See the text for a description of this dataset.


TABLE 5–14 Differences Between Praxis II: Elementary Education Tests Passing Rates for Minority and White Connecticut Candidates, 1994–2000

Differences Between Whites and:   First-Time Passing Rates   Eventual Passing Rates
African Americans                 35^a                       25
Asian Americans                   2                          6
Hispanics                         14                         11

^a Differences are in percentage points.

Likewise, records for first-time candidates from 1996 are included along with any retest records generated in the next four years. Similarly, first-time takers from 1997, 1998, and 1999 are included with retesting records from the following three, two, and one years, respectively. For each candidate initially testing between 1994 and 2000, the latest testing record is treated as the eventual testing record. Because of this structure, candidates who tested unsuccessfully for the first time in 2000 and passed in 2001 or later do not have their later successful attempts included in the analysis. The eventual passing rates reported in Tables 5–11 through 5–14 are therefore conservative estimates.
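With invented records, the sketch below shows how first-time and eventual passing rates fall out of a longitudinal file structured this way: each candidate's earliest attempt feeds the first-time rate, the latest attempt on file feeds the eventual rate, and any pass occurring after the observation window simply is not seen. Column names and values are hypothetical.

```python
import pandas as pd

# Invented attempt-level records spanning the observation window.
attempts = pd.DataFrame({
    "candidate_id": [1, 1, 2, 3, 3, 4],
    "test_date": pd.to_datetime(["1994-05-01", "1995-02-01", "1996-03-01",
                                 "1997-04-01", "1999-06-01", "2000-11-01"]),
    "passed": [False, True, True, False, False, False],
})

by_candidate = attempts.sort_values("test_date").groupby("candidate_id")
first_time_rate = by_candidate.first()["passed"].mean()  # earliest attempts
eventual_rate = by_candidate.last()["passed"].mean()     # latest attempts on file

print(f"First-time passing rate: {first_time_rate:.0%}")  # 25%
print(f"Eventual passing rate: {eventual_rate:.0%}")      # 50%
# Candidate 4 failed in 2000; a pass in 2001 would lie outside the file,
# which is why eventual rates computed this way are conservative.
```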

The Connecticut results show some of the same patterns as the California data. Minority passing rates were lower than majority passing rates on the initial and eventual administrations for both tests. The differences decreased for Hispanic and Asian American candidates on Praxis I and for African American and Hispanic test takers on the Praxis II Elementary Education tests. The differences between African American and white candidates on the Praxis I tests increased slightly from the initial testing to the eventual testing.

Although data for only a small number of tests are reported in Tables 5–7 through 5–14, in each case minority teacher candidates had lower average scores and lower passing rates than nonminority candidates. These differences appear on initial attempts at licensure testing; the gaps narrow but do not disappear when candidates have multiple testing opportunities. The committee does not know how well these results generalize to other states. It contends that data on initial and eventual passing rates for minority and majority candidates should be sought from other states so that a broader picture of disparate impact on teacher licensure tests can be developed.

THE MEANING OF DISPARITIES

The differences in average scores and passing rates among groups raise at least two important questions. First, do the scores reflect real differences in competence, or are the tests' questions biased against one or more groups? Second, are the inferences drawn from specific test results (i.e., that some candidates have mastered some of the basic knowledge, skills, and abilities that are generally necessary to practice competently) sufficiently well grounded to justify the social outcomes of differential access to the teaching profession for members of different groups?

Bias

The finding that passing rates for one group are lower than those for another is not sufficient to conclude that the tests are biased. Bias arises when factors other than knowledge of a test's content result in systematically higher or lower scores for particular groups of test takers. Several factors can contribute to possible test bias: item bias, inappropriate test content, and unequal opportunity to learn.

Some researchers have found evidence of cultural bias on teacher tests that are no longer in use, especially tests of general knowledge (Medley and Quirk, 1974; Poggio et al., 1985, 1986). These findings have led to speculation that tests relying more heavily on general life experiences and cultural knowledge than on a specific curriculum that can be studied may unfairly disadvantage candidates whose life experiences differ substantially from those of majority candidates. This would especially be the case if the content and referents represented on certain basic skills or general knowledge tests, for example, were more commonly present in the life experiences of majority candidates (Bond, 1998). At least some developers of teacher licensure tests, though, put considerable work into eliminating bias during test construction. Items are examined for potentially biasing language or situations, and questionable items often are repaired or removed (Educational Testing Service, 1999a). Additionally, items that show unusually large differences among groups are reexamined for bias and may be removed from scoring. Committee members disagree about the effectiveness of the statistical and other procedures test developers use to reduce cultural bias in test items: some contend that these procedures are effective in identifying potentially biased items, others are more skeptical of the methods' ability to detect biased questions, and some worry that the procedures are not systematically applied.
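To make the statistical screening idea concrete, the sketch below computes one widely used flagging statistic, the Mantel-Haenszel index, on simulated item responses. The group sizes, the 0–20 matching score, and the response model are all invented, and the routine illustrates only the core idea (match examinees on total score, then compare the odds of a correct answer across groups); it is not a description of any publisher's operational procedure. On the delta scale used here, absolute values of roughly 1.5 or more are conventionally taken to flag an item for review.

```python
import numpy as np

def mantel_haenszel_delta(item_correct, is_focal, total_score):
    """Mantel-Haenszel DIF index for one item, on the delta scale.

    Examinees are stratified by total test score so that focal- and
    reference-group members of similar overall performance are compared.
    """
    num = den = 0.0
    for s in np.unique(total_score):
        k = total_score == s
        a = np.sum(~is_focal[k] & (item_correct[k] == 1))  # reference, right
        b = np.sum(~is_focal[k] & (item_correct[k] == 0))  # reference, wrong
        c = np.sum(is_focal[k] & (item_correct[k] == 1))   # focal, right
        d = np.sum(is_focal[k] & (item_correct[k] == 0))   # focal, wrong
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    alpha = num / den              # common odds ratio; 1.0 means no DIF
    return -2.35 * np.log(alpha)   # delta scale; large |delta| invites review

# Simulated responses: the chance of a correct answer depends only on the
# matching score, not on group membership, so delta should be near zero.
rng = np.random.default_rng(0)
n = 2000
is_focal = np.arange(n) < 400                 # 400 focal, 1,600 reference
total_score = rng.integers(0, 21, size=n)     # matching variable (0-20)
p_correct = 0.2 + 0.03 * total_score          # same curve for both groups
item_correct = (rng.random(n) < p_correct).astype(int)

print(round(mantel_haenszel_delta(item_correct, is_focal, total_score), 2))
```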

Other researchers have reservations about the content of pedagogical knowledge tests. They argue that expectations about appropriate or effective teaching behaviors may differ across kinds of communities and teaching settings and that tests of teacher knowledge that rely on particular ideologies of teaching (e.g., constructivist versus direct instruction approaches) may be differentially valid for different teaching contexts. Items or expected responses that overgeneralize notions about effective teaching behaviors to contexts in which they are less valid may unfairly disadvantage minority candidates, who are more likely to live and work in these settings (Irvine, 1990; Ladson-Billings, 1994; Delpit, 1996).

Perhaps most important, the fact that members of minority groups have had less access to high-quality education for most of this country’s history (National Research Council, 2001), and that disparate impact occurs across a wide range of tests could suggest that differential outcomes reflect differential educational opportunities more than test bias. In addition to uneven educational opportunities, some contend that these differences may relate to differences between groups in test preparation and test anxiety (Steele, 1992). At the same time, concerns have been raised that the disparities in candidate outcomes on some teacher licensing tests exceed those on other tests of general cognitive ability (Haney et al., 1987; Goertz and Pitcher, 1985). One explanation for these larger historical differences is that there have been geographic differences in the concentrations of test takers of different groups taking particular tests and that these are correlated with differences in educational opportunities available to minorities in different parts of the country (Haney et al., 1987). This hypothesis also may explain why differences among groups are much smaller on some teacher tests than on others and why the pattern for Hispanics does not necessarily follow that for African Americans.

Another explanation is that minority candidates for teaching are drawn disproportionately from the lower end of the achievement distribution among minority college students. Darling-Hammond et al. (1999) suggest this could arise if the monetary rewards of teaching are especially low for minority group members relative to other occupations to which they now have access.

Consequences

When there are major differences in test scores among groups, it is important to evaluate the extent to which the tests are related to the foundational knowledge needed for teaching or to a candidate's capacity to perform competently as a teacher. If minority candidates pass the test at a lower rate than their white peers, the public should expect substantial evidence that the test (and the standard represented by the passing scores in effect) is appropriate. For example, the test should be a sound measure of foundational skills, such as the basic literacy skills or subject matter knowledge teachers need to provide instruction effectively, or it should accurately assess skills that make a difference in teacher competence in the classroom. This concern for test validity should be particularly salient when large numbers of individuals who are members of historically underrepresented minority groups have difficulty passing the tests.

Lower passing rates for minority candidates on teacher licensure tests mean that a smaller subset of the already small number of minority teacher candidates will move into the hiring pool as licensees and that schools and districts will have smaller pools of candidates from which to hire. This outcome poses problems for schools and districts seeking a qualified and diverse teaching force. Currently, 13 percent of the teaching force is minority, while minority children make up 36 percent of the student population (U.S. Department of Education, 2001). There are many reasons to be concerned about the small number of minority teachers (Darling-Hammond and Sclan, 1996). First, minority teachers serve as role models for minority and majority students alike. Second, minority teachers can bring a special level of understanding to the experiences of their minority students and a perspective on school policies and practices that is important to include. Finally, minority teachers are more likely to teach in central cities and schools with large minority populations (Choy et al., 1993; National Education Association, 1992). Because minority teachers represent a relatively large percentage of teacher applicants in these locations, a smaller pool of minority candidates could contribute to teacher shortages in these schools and districts.

There are different perspectives on whether these problems should be the focus of policy attention and, if so, what should be done about them. From a legal perspective, evidence of disparate outcomes does not, by itself, warrant changes in test content, passing scores, or procedures. Although Title VII of the Civil Rights Act of 1964 requires that employment procedures with a significant differential impact based on race, sex, or national origin be justified by test users as valid and consistent with business or educational necessity, court decisions have been inconsistent about whether the Civil Rights Act applies to teacher licensing tests. In two of three cases in which teacher testing programs were challenged on Title VII grounds, the courts upheld use of the tests (in South Carolina and California), ruling that evidence of the relevance of test content was meaningful and sound. Both courts ruled that the tests were consistent with business necessity and that valid alternatives with less disparate impact were not available.2

In the third case, Alabama discontinued use of its teacher licensing test based on findings of both disparate impact and the failure of the test developer to meet technical standards for test development. The court pointed to concerns about content-related evidence of validity and to arbitrary standards for passing scores as reasons for overturning use of the test. These cases and other licensure and employment testing cases demonstrate that different passing rates do not, by themselves, signify unlawful practices. The lawfulness of licensure tests with disparate impact comes into question when validity cannot be demonstrated.

2 In its interim report (National Research Council, 2000), the committee reported the ruling in a case involving the California Basic Educational Skills Test (Association of Mexican American Educators v. California, 183 F.3d 1055, 1070–1071, 9th Cir., 1999). The court subsequently set aside its own decision and issued a new ruling on October 30, 2000 (Association of Mexican American Educators v. California, 231 F.3d 572, 9th Cir., en banc). The committee did not consider this ruling.


POLICY OPTIONS

The disadvantages that many minority candidates face as a result of their teacher licensure test scores are not a small matter, and these disparate outcomes affect society in a variety of ways. The committee contends that the effects of group differences on licensure tests are so substantial that it will be difficult to offset their impact without confronting them directly. To the extent that differences in test performance are a function of uneven educational opportunities for different groups, reducing disparities in the educational opportunities available to minority candidates throughout their educational careers is an important policy goal. This will take concerted effort over a sustained period of time. In the shorter run, colleges and universities that prepare teacher candidates needing greater developmental support may require additional resources to invest in, and to help ensure, minority students' educational progress and success.

The committee also believes it is critically important that, where there is evidence of substantial disparate impact, work must be done to evaluate the validity of tests and to strengthen the relationships between tests and the knowledge, skills, abilities, and dispositions needed for teaching. In these instances the quality of the validity evidence is very important.

CONCLUSION

The committee used its evaluation framework to evaluate a sample of five widely used tests produced by the Educational Testing Service. The tests the committee reviewed met most of its criteria for technical quality, although there were some areas for improvement. The committee also attempted to review a sample of National Evaluation Systems tests. Despite concerted and repeated efforts, though, the committee was unable to obtain sufficient information on the technical characteristics of tests produced by NES and thus could draw no conclusions about their technical quality.

On all of the tests that the committee reviewed, minority candidates had lower passing rates than nonminority candidates on their initial testing attempts. Though differences between the passing rates of candidate groups eventually decrease because many unsuccessful test takers retake and pass the tests, eventual passing rates for minority candidates are still lower than those for nonminority test takers.

The committee concludes its evaluation of current tests by reiterating the following:

  • The profession’s standards for educational testing say that information sufficient to evaluate the appropriateness and technical adequacy of tests should be made available to potential test users and other interested parties. The committee considers the lack of sufficient technical information made available by NES and the states for evaluating NES-developed tests to be problematic. The concern is heightened because NES-developed tests are administered to very large numbers of teacher candidates.

  • The initial licensure tests currently in use rely almost exclusively on content-related evidence of validity. Few, if any, developers are collecting evidence about how test results relate to other relevant measures of candidates’ knowledge, skills, and abilities.

  • It is important to collect validity data that go beyond content-related validity evidence for initial licensing tests. However, conducting high-quality research of this kind is complex and costly. Examples of relevant research include investigations of the relationships between test results and other measures of candidate knowledge and skills, or of the extent to which tests distinguish candidates who are at least minimally competent from those who are not.

  • The processes used to develop current tests, the empirical studies of test content, and common-sense analyses suggest the importance of at least some of what is measured by these initial licensure tests. Beginning teachers should know how to read, write, and do basic mathematics; they should know the content areas they teach.

  • The lower passing rates for minority teacher candidates on current licensure tests pose problems for schools and districts in seeking a qualified and diverse teaching force. Setting substantially higher passing scores on licensure tests is likely to reduce the diversity of the teacher applicant pool, further adding to the difficulty of obtaining a diverse school faculty.
