Blame the Tests

by E. D. Hirsch, Jr.
January 15th, 2013

In Praise of Samuel Messick (1931–1998), Part III

The chief practical impact of the No Child Left Behind Act (NCLB) has been its principle of accountability. Adequate yearly progress, the law stated, must be determined by test scores in reading and math—not just for the school as a whole, but for key groups of students.

Now, a decade later, the result of the law, as many have complained, has been a narrowing of the school curriculum. In far too many schools, the arts and humanities, and even science and civics, have been neglected—sacrificed on the altar of tests without any substantial progress nationwide on the tests themselves. It is hard to decide whether to call NCLB a disaster or a catastrophe.

But I disagree with those who blame this failure on the accountability principle of NCLB. The law did not specify what tests in reading and math the schools were to use. If the states had responded with valid tests—defined by Messick as tests that are accurate and that have a productive effect on practice—the past decade would have seen much more progress.

Since NCLB, the long-term trend assessment of the National Assessment of Educational Progress (NAEP) shows substantial increases in reading among the lowest-performing 9-year-olds—but nothing comparable in later grades. It also shows moderate increases in math among 9- and 13-year-olds.

So, it seems that a chief educational defect of the NCLB era lay in the later-grades reading tests; they simply do not have the same educational validity as the tests in early-grades reading and in early- and middle-grades math.

****

It’s not very hard to make a verbal test that predicts how well a person will be able to read. One proven method used by the military is the two-part verbal section of the multiple-choice Armed Forces Qualification Test (AFQT), which is known for its success in accurately predicting real-world competence. One section of the AFQT Verbal consists of 15 items based on short paragraphs on different subjects and in different styles, to be completed in 13 minutes. The other section is a vocabulary test with 35 items to be completed in 11 minutes. This 24-minute test predicts the range of your verbal abilities, your probable job competence, and your future income level as well as any verbal test does. It is a short, cheap and technically valid test. Some version of it could even serve as a school-leaving test.

Educators would certainly protest if that were done—if only because such a test would give very little guidance for classroom practice or curriculum. And this is the nub of the defects in the reading tests used during the era of NCLB: They did not adequately support curriculum and classroom practice. The tests in early-grades reading and in early- and middle-grades math did a better job of inducing productive classroom practice, and their results show it.

Early-grades reading tests, as Joseph Torgesen and his colleagues showed, probe chiefly phonics and fluency, not comprehension. Schools are now aware that students will be tested on phonics and fluency in early grades. In fact, these crucial early reading skills are among the few topics for which recent (pre-Common Core) state standards had begun to be highly specific. These more successful early reading tests were thus different from later ones in a critical respect: They actually tested what students were supposed to be taught.

Hence in early reading, to its credit, NCLB induced a much greater correlation than before between standards, curriculum, teaching and tests. The tests became more valid in practice because they induced teachers to teach to a test based on a highly specific subject matter—phonics and fluency. Educators and policymakers recognized that teaching swift decoding was essential in the early grades, tests assessed swift decoding, and—mirabile dictu—there was an uptick in scores on those tests.

Since the improvements were impressive, let’s take a look at what has happened over the past decade among the lowest-performing 9-year-olds on NAEP’s long-term trend assessment in reading.

Note that there is little to no growth among higher-performing 9-year-olds, presumably because they had already mastered phonics and fluency.

Similarly, early- and middle-grades math tests probed substantive grade-by-grade math knowledge, as the state standards had become ever more specific in math. You can see where I’m going: Early reading and math improved because teachers typically teach to the tests (especially under NCLB-type accountability pressures), and the subject matter of these tests began to be more and more defined and predictable, causing a collaboration and reinforcement between tests and classroom practice.

In later-grades reading tests, where we have failed to improve, the tests have not been based on any clear, specific subject matter, so it has been impossible to teach to the tests in a productive way. (The lack of alignment between math course taking and the NAEP math assessment for 17-year-olds is similarly problematic.) Of course, there are many reasons why achievement might not rise. But specific subject matter, both taught and tested, is a necessary—if not sufficient—condition for test scores to rise.

In the absence of any specific subject matter for language arts, teachers, textbook makers, and test makers have conceived of reading comprehension as a strategy rather than as a side effect of broad knowledge. This inadequate strategy approach to language arts is reflected in the tests themselves. I have read many of them. An inevitable question is something like this: “The main idea of this passage is…” And the theory behind such a question is that what is being tested is the ability of the student to strategize the meaning by “questioning the author” and performing other puzzle-solving techniques to get the right answer. But, as readers of this blog know, that is not what is being tested. The subject matter of the passage is.

This mistaken strategy-focused structure has made these tests not only valueless educationally, but worse—positively harmful. Such tests send out the misleading message that reading comprehension is chiefly strategizing. That idea has dominated language arts instruction in the past decade, which means that a great deal of time has been misspent on fruitless test-taking activities. Tragically, that time could have been spent on science, humanities and the arts—subjects that would have actually increased reading abilities (and been far more interesting).

The only way that later-grades reading tests can be made educationally valid is by adopting the more successful structure followed in early reading and math. An educationally valid test must be based on the specific substance that is taught at the grade level being tested (possibly with some sampling of specifics from previous and later grades for remediation and acceleration purposes). Testing what has been taught is the only way to foster collaboration and reinforcement between tests and classroom practice. An educationally valid reading test requires a specific curriculum—a subject of further conversations, no doubt.