In 2012, we created two checklists for evaluators to use as tools for evaluating K-12 and credentialing assessment programs. The purpose of these checklists is to help evaluators review testing programs thoroughly by distilling the best practices for testing outlined in key professional guidance documents, including the Standards for Educational and Psychological Testing, the U.S. Department of Education's Standards and Assessment Peer Review Guidance, the Standards for the Accreditation of Certification Programs, the Code of Fair Testing Practices in Education, and the Rights and Responsibilities of Test Takers.

The checklists were developed to allow evaluation of five aspects of testing: 1) Test Development, 2) Test Administration, 3) Reliability Evidence, 4) Validity Evidence, and 5) Scoring and Reporting. A separate checklist was developed for each area; each checklist presents detailed indicators of quality testing programs that evaluators can mark as observed (O), not observed (N), or not applicable (NA) as they conduct evaluations. Three examples of checklist items are included below (one each from the Test Development, Test Administration, and Scoring and Reporting checklists).

The checklists are intended to be used by those wishing to evaluate K-12 or credentialing assessment programs against consensus criteria regarding quality standards for such programs. One of the main sources informing development of the original checklists was the guidance provided in the then-current edition of the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999). However, much has changed in testing since the publication of the 1999 Standards, and the Standards were revised in 2014 to address emerging methods and concerns related to K-12 and credentialing assessment programs. Consequently, revised checklists have been produced to reflect the new Standards.

The latest edition of the Standards, as compared to the 1999 edition, pays greater attention to testing diverse populations and the role of new technologies in testing. For example, the following three key revisions to the Standards are reflected in the new checklists:

  1. Validity and reliability evidence should be produced and documented for subgroups of test takers. Testing programs should collect validity evidence for various subgroups of test takers from different socioeconomic, linguistic, and cultural backgrounds, rather than relying only on aggregate evidence for the entire sample of test takers. A focus on validity evidence within distinct subgroups helps ensure that test interpretations remain valid for all members of the intended testing population.
  2. Tests should be administered in an appropriate language. Given that test takers can come from linguistically diverse backgrounds, evaluators should check that tests are administered in the most appropriate language for the intended population and intended purpose of the test. Interpreters, if used, should be fluent in both the language and content of the test.
  3. Automated scoring methods should be described. Current tests increasingly rely on automated scoring methods to score constructed-response items previously scored by human raters. Testing programs should document how automated scoring algorithms are used and how scores obtained from such algorithms should be interpreted.

Although these three new themes in the Standards illustrate the breadth of the checklists' coverage, they provide only a sample of the changes embodied in the full version of the revised checklists, which contain approximately 100 specific practices, distilled from contemporary professional standards for assessment programs, that testing programs should follow. The revised checklists are particularly helpful in that they provide users with a single-source compilation of the most up-to-date and broadly endorsed elements of defensible testing practice. Downloadable copies of the revised checklists for K-12 and credentialing assessment programs can be found at bit.ly/checklist-assessment.

About the Authors

Gregory J. Cizek


Professor, Educational Measurement and Evaluation, University of North Carolina at Chapel Hill

Gregory J. Cizek is Professor of Educational Measurement and Evaluation at the University of North Carolina at Chapel Hill, where he teaches courses in applied measurement and evaluation. He has authored over 300 books, chapters, articles, and conference papers related to his scholarly interests in standard setting, validity, and test security. Dr. Cizek is a member and past president of the National Council on Measurement in Education (NCME); he has managed credentialing testing programs, worked on state and national assessment programs, and served as vice president of a local school board; he began his career as an elementary school teacher. Dr. Cizek received the AERA Division D award for Significant Contribution to Educational Measurement and Research Methodology (2006) and the NCME award for Outstanding Dissemination of Educational Measurement Concepts (2007). He received his PhD in Measurement, Evaluation, and Research Design from Michigan State University.

Audra Kosh


Doctoral Student, Learning Sciences and Psychological Studies, University of North Carolina at Chapel Hill

Audra Kosh is a doctoral student in Learning Sciences and Psychological Studies at the University of North Carolina at Chapel Hill. She is currently a research fellow at MetaMetrics, where she works on developing mathematics assessments and conducts research on automatic item generation. Ms. Kosh previously taught eighth-grade mathematics in Prince George's County Public Schools, Maryland, and worked as a research analyst for Westat in Rockville, Maryland. Her research interests include educational measurement, mathematics learning and teaching, and informal learning opportunities. Ms. Kosh holds an M.A.T. in Secondary Mathematics from American University.


EvaluATE is supported by the National Science Foundation under grant number 1841783. Any opinions, findings, and conclusions or recommendations expressed on this site are those of the authors and do not necessarily reflect the views of the National Science Foundation.