Blame the Tests

by E. D. Hirsch, Jr.
January 15th, 2013

In Praise of Samuel Messick 1931–1998, Part III

The chief practical impact of NCLB has been its principle of accountability. Adequate yearly progress, the law stated, must be determined by test scores in reading and math—not just for the school as a whole, but for key groups of students.

Now, a decade later, the result of the law, as many have complained, has been a narrowing of the school curriculum. In far too many schools,  the arts and humanities, and even science and civics, have been neglected—sacrificed on the altar of tests  without any substantial progress nationwide on the tests themselves. It is hard to decide whether to call NCLB a disaster or a catastrophe.

But I disagree with those who blame this failure on the accountability principle of NCLB. The law did not specify what tests in reading and math the schools were to use. If the states had responded with valid tests—defined by Messick as tests that are both accurate and have a productive effect on practice—the past decade would have seen much more progress.

Since NCLB, NAEP’s long-term trend assessment shows substantial increases in reading among the lowest-performing 9-year-olds—but nothing comparable in later grades. It also shows moderate increases in math among 9- and 13-year-olds.

So, it seems that a chief educational defect of the NCLB era lay in the later-grades reading tests; they simply do not have the same educational validity as the tests in early-grades reading and in early- and middle-grades math.


It’s not very hard to make a verbal test that predicts how well a person will be able to read. One accurate method used by the military is the two-part verbal section of the multiple-choice Armed Forces Qualification Test (AFQT), which is known for its success in accurately predicting real-world competence. One section of the AFQT Verbal consists of 15 items based on short paragraphs on different subjects and in different styles to be completed in 13 minutes.  The other section of the AFQT Verbal is a vocabulary test with 35 items to be completed in 11 minutes. This 24-minute test predicts as well as any verbal test the range of your verbal abilities, your probable job competence and your future income level. It is a short, cheap and technically valid test. Some version of it could even serve as a school-leaving test.

Educators would certainly protest if that were done—if only because such a test would give very little guidance for classroom practice or curriculum. And this is the nub of the defects in the reading tests used during the era of NCLB: They did not adequately support curriculum and classroom practice. The tests in early-grades reading and in early- and middle-grades math did a better job of inducing productive classroom practice, and their results show it.

Early-grades reading tests, as Joseph Torgesen and his colleagues showed, probe chiefly phonics and fluency, not comprehension. Schools are now aware that students will be tested on phonics and fluency in early grades. In fact, these crucial early reading skills are among the few topics for which recent (pre-Common Core) state standards had begun to be highly specific. These more successful early reading tests were thus different from later ones in a critical respect:  They actually tested what students were supposed to be taught.

Hence in early reading, to its credit, NCLB induced a much greater correlation than before between standards, curriculum, teaching and tests. The tests became more valid in practice because they induced teachers to teach to a test based on a highly specific subject matter—phonics and fluency. Educators and policymakers recognized that teaching swift decoding was essential in the early grades, tests assessed swift decoding, and—mirabile dictu—there was an uptick in scores on those tests.

Since the improvements were impressive, let’s take a look at what has happened over the past decade among the lowest-performing 9-year-olds on NAEP’s long-term trend assessment in reading.

Note that there is little to no growth among higher-performing 9-year-olds, presumably because they had already mastered phonics and fluency.

Similarly, early- and middle-grades math tests probed substantive grade-by-grade math knowledge, as the state standards had become ever more specific in math. You can see where I’m going: Early reading and math improved because teachers typically teach to the tests (especially under NCLB-type accountability pressures), and the subject matter of these tests began to be more and more defined and predictable, causing a collaboration and reinforcement between tests and classroom practice.

In later-grades reading tests, where we have failed to improve, the tests have not been based on any clear, specific subject matter, so it has been impossible to teach to the tests in a productive way. (The lack of alignment between math course taking and the NAEP math assessment for 17-year-olds is similarly problematic.) Of course, there are many reasons why achievement might not rise. But specific subject matter, both taught and tested, is a necessary—if not sufficient—condition for test scores to rise.

In the absence of any specific subject matter for language arts, teachers, textbook makers, and test makers have conceived of reading comprehension as a strategy rather than as a side effect of broad knowledge. This inadequate strategy approach to language arts is reflected in the tests themselves. I have read many of them.  An inevitable question is something like this: “The main idea of this passage is….” And the theory behind such a question is that what is being tested is the ability of the student to strategize the meaning by “questioning the author” and performing other puzzle-solving techniques to get the right answer. But, as readers of this blog know, that is not what is being tested. The subject matter of the passage is.

This mistaken strategy-focused structure has made these tests not only valueless educationally, but worse—positively harmful. Such tests send out the misleading message that reading comprehension is chiefly strategizing. That idea has dominated language arts instruction in the past decade, which means that a great deal of time has been misspent on fruitless test-taking activities. Tragically, that time could have been spent on science, humanities and the arts—subjects that would have actually increased reading abilities (and been far more interesting).

The only way that later-grades reading tests can be made educationally valid is by adopting the more successful structure followed in early reading and math. An educationally valid test must be based on the specific substance that is taught at the grade level being tested (possibly with some sampling of specifics from previous and later grades for remediation and acceleration purposes). Testing what has been taught is the only way to foster collaboration and reinforcement between tests and classroom practice. An educationally valid reading test requires a specific curriculum—a subject of further conversations, no doubt.

The Work of a Great Test Scientist Helps Explain the Failure of No Child Left Behind

by E. D. Hirsch, Jr.
January 10th, 2013

In Praise of Samuel Messick 1931–1998, Part II

In a prior post I described Messick’s unified theory of test validity, which judged a test not to be valid if its practical effects were null or deleterious. His epoch-making insight was that the validity of a test must be judged both internally for accuracy and externally for ethical and social effects. That combined judgment, he argued, is the only proper and adequate way of grading a test.

In the era of the No Child Left Behind law (2001), the looming specter of tests has been the chief determiner of classroom practice. This led me to the following chain of inferences: Since 2001, tests have been the chief determiners of educational practices. But these tests have failed to induce practices that have worked. Hence, according to the Messick principle, the tests that we have been using must not be valid. Might it be that a new, more Messick-infused approach to testing would yield far better results?

First, some details about the failure of NCLB. Despite its name and admirable impulses it has continued to leave many children behind:


NCLB has also failed to raise verbal scores. The average verbal level of school leavers stood at 288 when the law went into effect, dropped to 283 in 2004, and stood at 286 in 2008.

Yet this graph shows an interesting exception to this pattern of failure, and it will prove to be highly informative under Messick’s principle. Among 4th graders (age 9) the test regimen of NCLB did have a positive impact.

Moreover, NCLB also had positive effects in math:

This contrast between the NCLB effects in math and reading is even more striking if we look at the SAT, where the test takers are trying their best:

So let’s recap the argument. Under NCLB, testing in both math and reading has guided school practices. Those practices were more successful in math and in early reading than in later reading. According to the Messick principle, therefore, reading tests after grade 4 had deleterious effects and cannot have been valid tests. How can we make these reading tests more valid?

A good answer to that question will help determine the future progress of American education. Tune in.

The End of Education Reform

by Robert Pondiscio
September 21st, 2009

A remarkable speech by Chester Finn of the Fordham Institute is all the more remarkable for the lack of chatter it has generated in the edusphere.  Titled “Is It Time to Throw in the Towel on Education Reform?” the September 9 speech at Rice University notes that a broad consensus on education reform that has existed for better than two decades is coming apart at the seams.  “The overriding goal of that consensus was to boost America’s academic achievement at the K-12 level,” Finn notes, and it gave rise to “a tsunami of standards-based reform.”

He cites several major developments contributing to the fraying of that consensus.  Among them: unhappiness with NCLB and a palpable backlash against testing that “goes to the heart of standards-based reform.”  On school choice, he points out, far too many charters and schools of choice have been “disappointingly mediocre.”  Then there are the results of the reform era:

Despite all the reforming, U.S. scores have remained essentially flat, graduation rates have remained essentially flat, and our international rankings have remained essentially flat. You can find some upward blips but you can also find downward blips. Big picture, over 25 years, is flat, flat, flat. In other words, all the reforming has yielded little or nothing by way of stronger outcomes.

Finn also cites “principled critiques by serious people” as another crack in the ed reform wall:

E.D. Hirsch’s new book may be its most cogent example, at least until Diane Ravitch’s next book emerges—of both standards-based reform and school choice on grounds that these structural changes neglect crucial issues of content and pedagogy—neglect what actually goes on in classrooms between teacher and learner—while narrowing the curriculum and weakening the common culture. 

Has the reform consensus “outlived its usefulness?”  Finn compares American education to the situation the nation found itself in when the Articles of Confederation proved insufficient to the needs of the new nation.  “We may be at a similar stage with regard to our public-education system,” he notes. “Further tugging and kicking at it from the banks of the Potomac is not going to modernize it.”

I’m suggesting to you that American education today resembles America itself in 1785. The old arrangement isn’t working well enough and probably cannot be made to. A new constitution is needed. It’s in that sense that we should throw in the towel on education reform and think instead about reinvention.

Checker briefly lists his ideas for “essential ingredients” of this new constitution, including national standards and measures; portable statewide “weighted-student” financing; and the replacement of traditional school districts “with an array of virtual systems and regional or national operators (some of them technology-based).”

A Bouquet of Dandelions

by Robert Pondiscio
June 18th, 2009

A study by the Center on Education Policy casts doubt on the conventional wisdom that No Child Left Behind causes teachers to shortchange high- and low-performers, given the law’s incentives to get students to the proficient level.

“If accountability policies were indeed shortchanging high- and low-achieving students, we would expect to see stagnation or decline at the basic and advanced levels,” says Jack Jennings, CEP’s President. “Instead, the percentages of students scoring at the basic-and-above and advanced levels have increased much more often than they have decreased, especially in the lower grades.”

Hear, hear for higher test scores at all achievement levels.  But how does that show high-achieving students aren’t suffering under NCLB?  Testing is a measure of where students are, not where they could or even should be.  If there’s anything I learned teaching at a struggling school, it’s that the stronger students are largely assumed to be doing fine despite being neglected–a point nailed precisely in the Jack Kent Cooke Foundation’s “Achievement Trap” report a few years back.

Such children are dandelions.  They will find a way to grow even in the harshest conditions.  I can walk out onto the sidewalk and gather a bouquet of dandelions growing up through the pavement cracks.  That doesn’t prove I’m a good gardener.


by Robert Pondiscio
May 4th, 2009

Former Ed Secy Margaret Spellings is the latest boldface name in the edusphere to say last week’s NAEP numbers show that NCLB is working.  Over at Common Core, Diane Ravitch takes a close look at the numbers and says, er…not so fast.  Her takeaway:

First, our students are making gains, though not among 17-year-olds. Second, the gains they have made since NCLB are smaller than the gains they made in the years preceding NCLB. Third, even when they are significant, the gains are small. Fourth, the Long Term Trend data are not a resounding endorsement of NCLB. If anything, the slowing of the rate of progress suggests that NCLB is not a powerful instrument to improve student performance.

The different takes on the NAEP tell Checker Finn that what we really need is an independent education-achievement audit agency “to sort out the claims and counterclaims about student performance and school achievement.”

Advocates always do this sort of thing—reaching for whichever data they think make the most convincing case for their accomplishments, exertions and assertions (and, of course, making or implying causations that no reputable scientist would accept). This will continue. And usually the advocates get away with it because anybody who disputes their claims is also seen as having his/her own ax to grind. That’s why America would be so much better off with an independent education-performance audit bureau.

A fine idea, but as with a newspaper ombudsman or “public editor,” there will always be some question about how the auditor’s judgment is colored by the interest of whoever is signing the check.  Apropos of which, I keep running into this quote from David Simon, the creator of The Wire.

“You show me anything that depicts institutional progress in America – school test scores, crime stats, arrest stats – anything that a politician can run on [or] anything that somebody can get a promotion on, and as soon as you invent that statistical category 50 people in that institution will be at work trying to figure out a way to make it look as if progress is actually occurring when actually no progress is.”

Sounds cynical, I know.  But hard to argue.

African-American Students Report to the Gym

by Robert Pondiscio
April 30th, 2009

So now it’s come to this.

Students at a Sacramento-area high school attended standardized test pep rallies – er, sorry…Heritage Assemblies – organized by race to pump up each ethnic group to take state tests.  “Students could go to any rally they wanted,” the Sacramento Bee reports, “but the gatherings were designated for specific races – African Americans in the gym, Pacific Islanders in the theater, Latinos in the multipurpose room.”

The paper describes a scene in the gym at Laguna Creek High School, where students gathered before a large outline of Africa on the wall. “Last year we scored the highest percentage increase of any group,” Vice Principal Hasan Abdulmalik hollered at the crowd.


Laguna Creek High School Principal Doug Craig said dividing the students by race allowed staff to talk about test scores without making any one ethnic group feel singled out in a negative manner. “Is it racist? I don’t believe it is,” Craig tells the paper, which reports the practice of holding race-specific test prep rallies has become more common in California.  

Gathering and reporting data based on ethnic groups is one of the few unambiguous wins of the NCLB era.  It’s pushed the achievement gap to the front of our education agenda.  But I’m not sure holding “heritage rallies” even rises to the level of well-intentioned but wrong-headed.  At best, it’s yet another example of how schools are putting their problems–and their desperation–on the backs of kids. And a particularly disturbing example at that.

Update:  I was remiss in not tipping my hat to Anthony Rebora, who brought this item to my attention via his forum at Teacher Magazine.

Location, Location, Location

by Robert Pondiscio
February 19th, 2009

The real estate agent’s mantra — location, location, location — also works for schools.  Just as an identical home can fetch different prices in different places, an identical school can make AYP in some states, but not in others. 

That’s the upshot of a terrific new report by the Fordham Foundation, The Accountability Illusion, which looked at 36 actual schools (18 elementary, 18 middle schools) and determined whether each one would make AYP under the accountability rules of 28 different states.  The answer: it depends on the state.

In Massachusetts – a state that ensures students have to score high in order to be considered proficient and one with relatively challenging annual targets and AYP rules – only one of 18 elementary schools was projected to make AYP. In Wisconsin, with lower proficiency standards and more lenient annual targets and rules, 17 schools were projected to do so. Same kids, same schools – different states, different rules.

“In short,” the report concludes, “how a school is labeled under NCLB depends largely on the state in which it’s located. This can demoralize educators in states with tough AYP rules while letting under-performing schools in lenient states slip under the accountability radar screen. It also creates the illusion of a national accountability system where there isn’t one.”

Here’s the executive summary of Fordham’s report, and here’s a video interview with Checker Finn about it.  And if you are one of those who prefer to laugh rather than weep in the face of outrage, Matthew Ladner of Jay Greene’s blog turns this whole miasma into a parody of the Budweiser “Real Men of Genius” ad campaign.  “Here’s to you, Mr. Wisconsin No Child Left Behind compliance guy!” Hilarious.

Can we now officially say that accountability as currently conceived and practiced is a joke?  A bad school in Massachusetts is a good school in Arizona. Failure in Nevada is magically redefined as success when it moves to Wisconsin.  Our crazy quilt of accountability systems only breeds cynicism about the whole enterprise (why improve schools when you can lower the bar?) and makes it baby simple to evade responsibility and all but impossible to reach informed conclusions about your child’s school. 

One standard, one yardstick, or else don’t bother.  Instead of location, location, location, let’s try transparency, transparency, transparency.

Heresy Watch

by Robert Pondiscio
February 18th, 2009

Things We Dare Not Say Dept.:  A survey of principals across Minnesota shows 97% think it is not possible for the state’s schools to meet the goals of universal proficiency set out under No Child Left Behind. The survey was released Tuesday by the St. Paul-based think tank Minnesota 2020 and the state’s principal associations.

According to the survey, 97 percent of responding principals say that the law’s main goal, to have every student proficient on math and reading tests by 2014, is unattainable. More than 70 percent of the principals say their schools spend more time and resources on test preparation in the law’s wake, and 40 percent say they have taken away class time from arts and other subjects.

Remember the recent comments from Palo Alto schools Superintendent Kevin Skelly who said educators are “deluding themselves” if they think the achievement gap can be completely closed?  The scales have fallen from his eyes. “During the past week I have thought about my comments and had a chance to discuss them with staff and parents,” Skelly said last week. “Their comments have caused me to change my thinking on this.”

When Patty Fisher of the San Jose Mercury News asked him what exactly he had changed his thinking about, Skelly took a pass.  “I want to move beyond my comments in the newspaper,” he said. “There was a sense that I was giving up on kids and saying kids couldn’t achieve, and I could see why they took it that way.”  So does he really believe that any child – let alone every child – has “limitless” potential, Fisher wanted to know.

“The less I say at this point, the better,” says Skelly.

Duncan Bangs the Drum for National Standards

by Robert Pondiscio
February 12th, 2009

“If we accomplish one thing in the coming years,” Education Secretary Arne Duncan said this week, “it should be to eliminate the extreme variation in standards across America.”  Speaking at the American Council on Education’s annual meeting, Duncan said:

I know that talking about standards can make people nervous—but the notion that we have fifty different goalposts is absolutely ridiculous. A high school diploma needs to mean something—no matter where it’s from. We need standards that are college-ready and career-ready, and benchmarked against challenging international standards. We also need to break the culture of blame in which colleges blame high schools and high schools blame grade schools and grade schools blame parents for our failures.

Duncan was specifically speaking of “high school standards” in his remarks, but EdWeek’s David Hoff notes his comments suggest “he’ll be pushing the issue in any reauthorization [of NCLB] that happens under his watch.”  Duncan also talked up national standards in an interview with EdWeek’s Alyson Klein last week.

“The Last Laugh Belongs to Bush”

by Robert Pondiscio
January 15th, 2009

School accountability driven by disaggregated data is “not just George W. Bush’s education legacy; it’s the jewel of any domestic achievement,” writes Richard Whitmire on Politico.  The president of the National Education Writers Association says finding shortcomings in the law is not difficult, but he dismisses the idea that the new administration will eviscerate No Child Left Behind.

The notion that Obama would gut a law exposing the maleducation of millions of black children is a fantasy. That’s why Democrats won’t break NCLB. They’ll start by changing the name of the law, ridding its association with the much-despised Bush. But the last laugh belongs to Bush, because his Texas-style accountability will survive. And that’s what makes No Child Left Behind, regardless of any name change, Bush’s lasting legacy.