The Work of a Great Test Scientist Helps Explain the Failure of No Child Left Behind

by E. D. Hirsch, Jr.
January 10th, 2013

In Praise of Samuel Messick 1931–1998, Part II

In a prior post I described Messick’s unified theory of test validity, which judged a test not to be valid if its practical effects were null or deleterious. His epoch-making insight was that the validity of a test must be judged both internally for accuracy and externally for ethical and social effects. That combined judgment, he argued, is the only proper and adequate way of grading a test.

In the era of the No Child Left Behind law (2001), the looming specter of tests has been the chief determiner of classroom practice. This led me to the following chain of inferences: Since 2001, tests have been the chief determiners of educational practices. But these tests have failed to induce practices that have worked. Hence, according to the Messick principle, the tests that we have been using must not be valid. Might it be that a new, more Messick-infused approach to testing would yield far better results?

First, some details about the failure of NCLB. Despite its name and admirable impulses it has continued to leave many children behind:


NCLB has also failed to raise verbal scores. The average verbal level of school leavers stood at 288 when the law went into effect, dropped to 283 in 2004, and stood at 286 in 2008.

Yet this graph shows an interesting exception to this pattern of failure, and it will prove to be highly informative under Messick’s principle. Among 4th graders (age 9) the test-regimen of NCLB did have a positive impact.

Moreover, NCLB also had positive effects in math:

This contrast between the NCLB effects in math and reading is even more striking if we look at the SAT, where the test takers are trying their best:

So let’s recap the argument. Under NCLB, testing in both math and reading has guided school practices. Those practices were more successful in math and in early reading than in later reading. According to the Messick principle, therefore, reading tests after grade 4 had deleterious effects and cannot have been valid tests. How can we make these reading tests more valid?

A good answer to that question will help determine the future progress of American education. Tune in.


  1. I was depressed to hear news headlines here in CA announcing that Tom Torlakson, the state superintendent of education, plans to implement new state tests that measure “critical thinking and problem solving skills”. I know what words like this mean to my fellow educators: teach critical thinking and problem solving skills. Goodbye meaty content.

    I agree with Messick: great tests will lead to great teaching. I fear our ed leaders still don’t grasp what great tests should look like.

    Comment by Ponderosa — January 10, 2013 @ 9:49 pm

  2. Attempting to “teach” critical thinking and problem solving in our schools translates into the de-intellectualization of our students, as in their dumbing down. How can anyone think critically or solve a problem without first establishing a rich foundation of background information on the topic which would then allow them to address an issue intelligently?

    Comment by Paul Hoss — January 11, 2013 @ 8:27 am

  3. I wonder whether those fourth graders who showed reading progress would likewise have shown an increase in literary knowledge, historical knowledge, and so forth.

    I suspect not (though I might be wrong). I suspect that in the case of the reading tests, the increases can be attributed to test preparation, which can boost scores in the short term, at certain levels, in certain situations.

    Perhaps fourth graders are likelier than other age groups to see their test scores boosted through test prep. Many can decode and comprehend simple texts. Yet those unacquainted with multiple-choice tests may stumble on certain questions or run out of time. Thus, extensive practice can raise their scores (without increasing their knowledge).

    In the upper grades, students must know more in order to do well on the tests, yet it is not clear what they must know (beyond a few points of grammar and a few ELA terms). Test prep probably has less of an effect on the scores, except in writing, where, if students follow the directions to the letter, they are likely to get a decent score, even if what they write makes little sense.

    This is just a hypothesis–but if it is correct, then the fourth-grade tests may be no more valid than the eighth- and twelfth-grade tests. They show increases because of their particular sensitivity to test preparation.

    Math is a different matter. There, it is much clearer what students are supposed to know, so test preparation and subject-matter instruction are not so far flung.

    In any case, the old question rises up again: why treat reading comprehension tests as the measure of effectiveness, when r.c. is only one aspect, and possibly a side effect, of what students should be learning?

    Comment by Diana Senechal — January 11, 2013 @ 11:31 am

  4. I posted this anecdote on another website, as my most recent encounter with the products of a system that talks about critical thinking and problem solving but disdains content. I was at a local bakery when the power went out; the cash register could be opened but could make no calculations. None of the 4 clerks (20s-30s) could calculate sums of purchases and figure sales tax, even with the aid of a calculator. It took me 10 minutes to show them how to do it (“Wow, you must be a math teacher!” No, I just learned sixth-grade arithmetic). They had no conceptual understanding of the problem and no idea of the relationship between percentages and decimals; that 6% can be expressed as .06. They also had no number sense. A data-entry error resulting in $16.00 due for a $10.00 purchase did not register with them as incorrect. The district uses Everyday Math. I’m betting that the same disdain for content permeates the system. (My kids were not schooled here.)

    When I was growing up, in the 50s-60s, very few of the adults in my small town had more than a HS education and many of the older people had less. However, they were all literate, numerate and had solid general knowledge. I never saw signs like the recent ones in my current abode: “Happy Holiday’s”, “Celebrate and Festive With Us” and “Sign-up Now for Classes” (at a CC). Sigh

    Comment by momof4 — January 11, 2013 @ 11:41 am

  5. Years ago, I remember reading some sample passages and questions from an 8th-grade test; I think it was the NAEP. One talked about the Blue and the Gray, mentioned specific battles and Sherman’s March to the Sea. Students without knowledge that the Blue and the Gray meant the US Civil War, the outcome of various battles and the identity of mentioned generals would have been in big trouble. Another passage required previous knowledge of the structure and movement of the solar system, seasons, tides etc. Content knowledge.

    Comment by momof4 — January 11, 2013 @ 1:03 pm

  6. How can we say with any confidence that anything is attributable to NCLB? If scores go up after 2002, how do we know it’s due to NCLB? If scores go down after 2002, how do we know that’s due to NCLB? If scores stay flat after 2002, maybe it means NCLB was a good effect that was countered by some bad effect that has not been identified. Or maybe NCLB was a bad effect that was countered by some other good effect. Or maybe NCLB was neutral, but a good effect also existed that would have raised scores had it not been for another bad effect that lowered scores.

    The comments of both Diana Senechal and momof4 seem very relevant to me, and they would be equally relevant if NCLB had never existed.

    How do we know that NCLB has any actual relevance to educational achievement? We have no doubt that NCLB has a lot of relevance to what educators are forced to deal with, but that is not quite the same as saying it has any real relevance to actual teaching and learning. Maybe NCLB is like a cruel wind that blows mercilessly on weary travelers, forcing them to hold their coats tight as they trudge forward, but maybe that cruel wind has no real effect on the path, or the destination, or even the speed of those beleaguered travelers.

    Comment by Brian Rude — January 12, 2013 @ 1:25 am

  8. I think that one thing NCLB has done is to require disaggregation of scores for various groups, so a good school-wide average is not enough. In my kids’ old district, there were specific, often magnet, programs put in place in URM-majority schools so the school average would be good. However, NCLB rules revealed a strongly bi-modal score distribution; lots of kids failing and lots of proficient (really, well-beyond).

    However, the AYP requirement is often ridiculous. I recently read that my kids’ old HS is in danger of failing to make AYP. Since over 95% of kids go to 4-yr college (few requiring any remediation and most going to competitive or elite schools), it’s hard to make progress when almost all are proficient (or better) at the start. Conversely, it’s (relatively) easy to make progress when you start at less than 10% passing.

    Comment by momof4 — January 15, 2013 @ 11:29 am
