The MET Research Paper: Achievement of What?

by Guest Blogger
December 19th, 2010

by Diana Senechal

A new study by the Measures of Effective Teaching (MET) Project, funded by the Bill and Melinda Gates Foundation, finds that students’ perceptions of their teachers correlate with the teachers’ value-added scores; in other words, “students seem to know effective teaching when they experience it.” The correlation is stronger for mathematics than for ELA; this is one of many discrepancies between math and ELA in the study. According to the authors, “outside the early elementary grades when students are first learning to read, teachers may have limited impacts on general reading comprehension.” This peculiar observation should raise questions about curriculum, but curriculum does not come up in the report.

When the researchers combined student feedback and math value-added (from state tests) into a single score, they found that “the difference between bottom and top quartile was .21 student standard deviations, roughly equivalent to 7.49 months of schooling in a 9-month school year.” For ELA, the difference between top and bottom quartile teachers was much smaller, at .078 student-level standard deviations.
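To put the two figures on the same footing, the arithmetic can be sketched roughly as follows. This is an illustration, not the MET researchers’ method: it assumes the relationship between standard deviations and months of schooling is linear, backs out the rate implied by the quoted math figure, and applies that same rate to ELA. The function name and the shared rate are assumptions made here for illustration.

```python
def months_per_sd(effect_sd: float, months: float) -> float:
    """Months of schooling implied by one student-level standard deviation."""
    return months / effect_sd

# Figures quoted from the report for math: 0.21 SD is roughly 7.49 months.
rate = months_per_sd(0.21, 7.49)

# Assumption: the same linear rate applies to ELA (the report may use another).
ela_months = 0.078 * rate

print(f"Implied rate: {rate:.1f} months per SD")
print(f"ELA difference of 0.078 SD: about {ela_months:.1f} months")
```

Under these assumptions, the ELA gap between top and bottom quartile teachers comes to less than three months of schooling.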

What are students learning in ELA? Beginning in fourth grade, students appear to gain just as much in reading comprehension from April to October as from October to April; that is, the summer months away from school do not seem to affect their gains. According to the researchers, “the above pattern implies that schooling itself may have little impact on standard reading comprehension assessments after 3rd grade.” They posit, somewhat innocently, that “literacy includes more than reading comprehension … It involves writing as well.” The lack of teacher effects applied mainly to the state tests; when the researchers administered the written Stanford 9 Open-Ended Assessment for ELA, the teacher effects were larger than for math.

What explains the relatively low teacher effects on the ELA state tests? The researchers offer two possibilities: (a) teacher effects on reading comprehension are small after the early elementary years and (b) the tests themselves may fail to capture the teachers’ impact on literacy. Both of these hypotheses seem plausible but tangential to the central problem: this amorphous concept of “literacy.” Why should schools focus on “literacy” in the first place? Why not literature and other subjects?

A curious detail may offer a clue to the problem: the correlation between value-added on state tests and the Stanford 9 in ELA is low (0.37). That is, teachers whose students see gains on the ELA state tests are not very likely to see gains on the Stanford 9 as well. (The researchers do not state whether the reverse is true.) The researchers thought some of this might be due to the “change in tests in NYC this year.” When they removed NYC from the equation, the correlation was significantly higher. (But the New York math tests changed this year as well, and this apparently did not affect things: the correlation for math between the state and BAM value-added is “moderately large” at 0.54.)

Is it not possible that NYC suffers from a weak or nonexistent ELA curriculum, more so than the other districts in the study? Certainly curriculum should be considered, if an entire district shows markedly different results from the others.

In math, there usually is a curriculum. It may be strong or weak, focused or scattered, but there is actual material that students are expected to learn. In ELA, this may or may not be the case. In schools and districts with a rigorous English curriculum (as opposed to a literacy program), students read and discuss challenging literary works, study grammar and etymology, write expository essays, and  more. In the majority of New York City public schools, by contrast, this kind of concrete learning is eschewed; lessons tend to focus on a reading strategy, and students practice the strategy on their separate books. New York City has taken the strategy approach since 2003 (and in some cases much earlier); Balanced Literacy, or a version of it, is the mandated ELA program in most NYC elementary and middle schools. The MET researchers do not consider curriculum at all; they seem to assume that a curriculum exists in each of the schools and that it is consistent within a district.

In short, when analyzing teacher effects on achievement gains, the researchers forgot to ask: achievement of what? This is not a trivial question; the answers could shed light on the value-added results and their implications. It may turn out that the curricular differences are too slight or vague to make a difference, or that they do not significantly affect performance on these particular tests. Or the investigation of such differences may turn the whole study upside down. In any case, it is a mistake to ignore the question.

Diana Senechal taught for four years in the New York City public schools and holds a Ph.D. in Slavic languages and literatures from Yale. Her book, Republic of Noise: The Loss of Solitude in Schools and Culture, will be published by Rowman & Littlefield Education in late 2011.

 
 

 

 

Ed Reform as the Compliance Police

by Robert Pondiscio
November 8th, 2010

Has the battle cry of ed reform evolved from “Just win, baby!” to “Just comply, baby?”

Time was when ed reform had a single focus:  accountability for results, observes Fordham’s Mike Petrilli.   But now, frustrated with the glacial pace of improvement and results, the impulse is to push for “change anywhere, anytime, anyhow—even if that means engaging in the same sort of regulating and rule-making and program-creating and money-spending  that we once abhorred.”

The most obvious example Petrilli cites is Race to the Top which, rather than reward results, “lavished money on those jurisdictions willing to pledge themselves to a set of prescriptive reforms.” Then too, there are reformers pushing teacher quality who “rightly point out that today’s evaluation systems are a total joke,” Petrilli writes.

“But here’s their mistake: they are doing this pushing primarily at the state level, even though states don’t employ teachers—districts do. Of course, the reformers understand this, and thus have started to worry about how to “implement” statewide teacher evaluation systems. How do you make sure that districts, and principals, actually use the new evaluation instruments that the state develops? That they truly differentiate among teachers, and take action accordingly? There’s only one way to be sure: we’d better have a strategy to enforce compliance.”

The choice reformers face is between results-based reforms, like charter schools, and process-based reforms, like improved teacher evaluations, Mike argues.

“A smart person once said that the true test of one’s character isn’t how one handles adversity, but how one handles power. The school reform movement performed magnificently when facing adversity. But now that it has power, is it going to stick to its focus on results, or is it going to become the compliance police instead? Hold on to power (for benign purposes, of course!) or give it away?”

“Tight on ends, loose on means” has become “do it my way.”

Pitchers, Teachers, and Data

by Robert Pondiscio
June 28th, 2010

Over on Twitter, my friend Stephanie Germeraad, who is nearly as passionate about sports as she is about education, suggests education ought to steal a page from baseball when it comes to teacher seniority.  Commenting on the decline of legendary closer Trevor Hoffman, she tweets a quote from Alan J. Borsuk: “Schools can learn from baseball.  Brewers wouldn’t start Hoffman just because he’s been pitching longer.”  The point is that seniority is no guarantee of quality. Fair enough.  But here’s a sobering truth:  We are far more capable of measuring the effectiveness of relief pitchers like Hoffman than classroom teachers. 

If you’re a casual baseball fan, you might know a few “facts” about the pitchers on your favorite team: their won-loss record, their ERA (the number of “earned runs” allowed per nine innings), or their WHIP (walks and hits per inning pitched). To an expert, such statistics scratch the surface at best, and may even be irrelevant. Wins are a function of a team’s offense, for example, as much as a pitcher’s effectiveness, while ERA and WHIP are strongly influenced by the defensive ability of the other eight men on the field. An outfielder with greater range, for example, will record an out on a ball that a lesser defender lets fall for a hit. Same pitch, same swing, different outcome.
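For readers who want to see the arithmetic behind those two casual-fan statistics, here is a minimal sketch based only on the definitions given above. The function names and the sample season line are hypothetical, and innings pitched are assumed to be passed as a true decimal (box scores often write thirds of an inning as .1 and .2, which this sketch does not handle).

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_pitched


def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings_pitched


# Hypothetical season line, invented for illustration only.
print(round(era(earned_runs=20, innings_pitched=60.0), 2))
print(round(whip(walks=15, hits=45, innings_pitched=60.0), 2))
```

Neither formula says anything about the fielders behind the pitcher, which is exactly the limitation described above.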

Among baseball geeks, you often hear discussions of fielding independent pitching, or “FIP,” a measure of the things a pitcher is directly responsible for, such as strikeouts, home runs and walks. FIP helps you understand how well a pitcher pitched, regardless of how well the team played behind him. Data even helps teams decide what kind of pitchers are best suited to their stadiums through analysis of “park effects.” A fly ball pitcher (yes, they keep track of fly balls, line drives and ground balls hit off every pitcher) might prosper in a big stadium like New York’s Citi Field, but allow lots of home runs in a bandbox like Philadelphia’s Citizens Bank Park. A pitcher who “pitches to contact” (i.e., doesn’t strike out a lot of hitters) is fine if your team’s defense is strong. If not, you might spend more to sign pitchers who are strikeout artists. Data even helps spot problems as they occur. Fans of the New York Mets are concerned that all-star pitcher Johan Santana’s fastball is topping out below 90 miles an hour of late, making his changeup, a slow-speed pitch, less likely to fool hitters expecting the fastball.

To a baseball fan, statistics are a revelation. The granularity and specificity are illuminating. You can see, if you’re so inclined, a pitcher’s FIP, ERA, strikeouts, and his strikeout-to-walk ratio. The percentage of batted balls that were hit on the ground, in the air, or for line drives can speak volumes about a pitcher’s effectiveness. When a player’s agent goes to negotiate his contract, he can even discuss his “Wins Above Replacement” (WAR), a statistic that measures the total value of a player over a given season compared to a replacement-level player.

If these kinds of numbers thrill you, adding depth and nuance to your love of baseball, thank Bill James.  It is no overstatement to say that no one has had a greater impact on baseball in the last 25 years than James, who pioneered and named the field of sabermetrics, the use of detailed statistics to analyze baseball team and player performance.   James has made a career of demonstrating the factors that lead to teams scoring runs and winning games, and how the efforts of individual players contribute to wins.  Some of his insights have been legendary and have overthrown time-honored beliefs about the game–why RBIs matter less than on-base percentage, for example. Or why stolen base attempts tend to hurt a team’s offense.  Before Bill James, baseball was all batting averages, bromides and intangibles–a century of baseball men who knew what they knew based on experience, instinct and rudimentary data.

We are in the test scores, bromides and intangibles era of measuring teacher quality. If you’re a principal, wouldn’t you love to know the “school effects” of teacher performance when it came time to make hiring decisions? Would it change your perception of merit pay if there were a classroom equivalent of FIP–the factors directly under a teacher’s control? What if we could compensate teachers based on their replacement value compared to an average first-year teacher?

“‘It’s far more than win/loss/ERA/WHIP’ is the clubhouse mantra,” Stephanie tweeted, defending her assertion that education can profit from baseball’s example. “Difference is, baseball doesn’t say they therefore can’t do it,” she wrote. Not quite right. In baseball there is data–lots of it–to measure effectiveness clearly and fairly. Difference is, “it’s far more than test scores” is not a mantra in ed reform.

Education awaits its Bill James.

Teachers Union Disbands; Reformers Skeptical

by Robert Pondiscio
January 13th, 2010

No, Randi Weingarten did not announce the dissolution of the AFT in her big speech yesterday.  She talked about  her willingness to be more flexible on issues of how teachers are evaluated, promoted, and drummed out of the profession, including the use of test scores.   But one gets the sense that no matter what she had to say, the reaction from the ed reform commentariat would be variations on “the devil is in the details.”

Is Weingarten’s stance a big deal? You decide. Coverage from the Washington Post, New York Times and EdWeek. Reactions from Eduwonk, Joanne Jacobs, EdWeek’s Stephen Sawchuk and The New Republic’s Seyward Darby.

For now, I’ll channel inveterate skeptic Alexander Russo of This Week in Education on this one and merely ask: has anyone read or heard any takes on the speech that surprised them? Or is everyone just speaking their talking points?

“An Unavoidable Element of Subjectivity”

by Robert Pondiscio
September 10th, 2009

Schools need much more than merit pay to recruit and retain good teachers, argues Kevin Carey at the Quick and the Ed.  “They need strong leadership, good facilities, safe working conditions, and the right kind of organizational culture,” he writes. “You can’t paper over the lack of those things by simply tacking on a salary bonus, even a big one, to the existing steps-and-lanes pay scale.”

Carey’s reasoned (and reasonable) take on merit pay feels like a welcome departure from the teacher-quality-and-test-scores über alles refrain more commonly sung by accountability hawks.  Especially in his recognition that “we need to build schools great people want to teach in, and that means fully recognizing their value in all ways, including pay.”

The great schools of the future will be professional meritocracies in a way today’s public schools are not, but not by adding test scores to the mechanistic logic of an industrial-age salary scale. Rather, they’ll spend a great deal of energy on getting the conditions and culture right, and then negotiate substantially higher and substantially more variable salaries with individual teachers. It will be an expensive, time-consuming, imperfect process with an unavoidable element of subjectivity. It will also be much, much better than what most schools use today.

Agreed.  I’d also wager there isn’t one teacher in a thousand who wouldn’t welcome merit pay in a school that spent “a great deal of energy on getting the conditions and culture right.” 

The phrase “unavoidable element of subjectivity” also strikes me as a recognition of the infinite complexity teachers face in working with our most disadvantaged students (any attempt to move past mindless “teachers fear accountability” sloganeering is a welcome development). Guest-blogging over at Joanne Jacobs, the always insightful Diana Senechal captures the dilemma of nuance-averse accountability well. “With dumbed-down tests, vapid literacy programs, an overwhelming focus on test prep at the exclusion of essential subjects, and unreliable rating systems, we end up taking a yardstick to a void–and declaring miracles whenever we please,” she wrote. The flip side of that, the thing that teachers reasonably fear, is that it is too easy to declare failure whenever we please, and hold teachers solely responsible when they are too often reduced to foot soldiers with no control over what or even how they teach.

This cannot be said often enough: teachers are not by nature accountability-averse.  They are, however, sensibly averse to having an extraordinarily difficult and complex task measured by crude and simplistic tools.

Update:  John Thompson, a vocal teacher advocate who also viewed Carey’s post favorably, takes up a similar theme at This Week in Education.  “I’ve never understood why ‘reformers,’ who are angered by the terrible results of policies set by principals and central offices, respond by attacking teachers who do not set those policies. But the answer, which the New Teacher Center makes clear, is not to attack principals but to use ‘contextual data’ to enhance teacher and principal quality and create a learning culture which attracts and retains educators.”

Diane Ravitch on Teacher Evaluation and Value-Added

by Guest Blogger
November 18th, 2008

by Diane Ravitch

In his post, “Getting Value-Added Right,” Robert raises excellent questions, and his restaurant metaphor is apt. The value-added growth model, as Dan Willingham notes in the comments section and his post on the Britannica Blog, is not ready for prime time. There are too many intervening variables to hold teachers solely accountable for the test-score growth of every student. Given high rates of mobility, there is a large fluctuation in the student population in schools. As Thomas J. Kane and Douglas O. Staiger point out in one of their papers, the inherent volatility of test scores makes them a poor basis for an accountability system:

The imprecision of test score measures arises from two sources. The first is sampling variation, which is a particularly striking problem in elementary schools. With the average elementary school containing only sixty-eight students per grade level, the amount of variation stemming from the idiosyncrasies of the particular sample of students being tested is often large relative to the total amount of variation observed between schools. The second arises from one-time factors that are not sensitive to the size of the sample; for example, a dog barking in the playground on the day of the test, a severe flu season, a disruptive student in a class, or favorable chemistry between a group of students and their teacher. Both small samples and other one-time factors can add considerable volatility to test score measures.
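The first source of imprecision, sampling variation, can be roughly quantified. The sketch below is an illustration rather than Kane and Staiger’s own calculation: it assumes each grade is a simple random sample of students, that scores are scaled so the student-level standard deviation is 1, and that successive cohorts are independent. The function name is hypothetical, and one-time factors such as the barking dog are left out entirely.

```python
import math


def standard_error_of_mean(student_sd: float, n_students: int) -> float:
    """Standard error of a group mean under simple random sampling."""
    return student_sd / math.sqrt(n_students)


# 68 students per grade is the figure quoted above; scores are assumed to be
# scaled so that the student-level standard deviation is 1.
se = standard_error_of_mean(1.0, 68)

# A year-to-year change compares two such means; with independent cohorts
# (an assumption), the errors add in quadrature.
change_se = math.sqrt(2) * se

print(f"Standard error of a grade-level mean: {se:.3f} student SDs")
print(f"Standard error of a one-year change:  {change_se:.3f} student SDs")
```

Under these assumptions, a grade’s average score can move by a noticeable fraction of a standard deviation from one year to the next through chance alone.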

There are many, many reasons why one-year changes in scores are not reliable. There are many reasons why it is hard to give credit or blame for students’ test score gains and losses from year to year. Until we have better tests and have ironed out many of the confounding variables, we cannot draw credible inferences about teacher performance from test scores, and it is unfair to use such data to dispense rewards and punishments.

There is another reason to worry about value-added growth models that determine a teacher’s fate and compensation. If we turn teaching into an activity whose sole purpose is to produce gains on tests that we know are mainly low-level and dumbed-down, we will not make education better. We may succeed in destroying it altogether. We had better find ways to emphasize the quality of curriculum (think Core Knowledge) and to de-emphasize the number of times that kids are asked to check off a box on standardized tests in the course of a month. Otherwise, our education system will be far worse than ever.

Diane blogs on education at Bridging Differences — ed.

How Not to Evaluate Teachers

by Robert Pondiscio
November 3rd, 2008

UVA professor and Core Knowledge board member Dan Willingham, who routinely graces this blog with his observations, is now blogging over at Britannica Blog.  His first post is up today, and it’s a barn burner: How NOT to Evaluate Teachers.  Plans to evaluate teachers based on standardized test scores are “fatally flawed,” he writes.

Obviously, the measure cannot be based on a one-time test score, because a student’s achievement is a product of (at least) his home environment, neighborhood, and prior schooling. So you must try to assess how much the student learns over the course of the year. But these “value added” measures bring lots of thorny statistical problems. For example, suppose your plan is to administer a test in the Autumn and one in the Spring, and to compare them to see how much students have gained. Well, some Autumn test-takers will have moved by the Spring.  Can’t you just ignore those scores? No, because low-income students are more likely to move than high-income students, and low-income students tend to score lower. So if you ignore missing data, you’re biasing the estimate.
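One way the bias can arise is easy to see with a toy example. The sketch below uses an invented class of ten students, not real data: every student gains exactly 10 points, low-income students start lower, and more of them move before the Spring test. It assumes the gain is estimated by comparing the Autumn average of everyone tested with the Spring average of whoever remains; other ways of handling the missing scores would behave differently.

```python
from statistics import mean

# Hypothetical class, invented for illustration: (group, autumn, spring, moved).
# Every student truly gains 10 points; spring scores for movers are unobserved.
students = (
    [("low", 40, 50, True)] * 3 + [("low", 40, 50, False)] * 2
    + [("high", 60, 70, True)] * 1 + [("high", 60, 70, False)] * 4
)

true_gain = mean(spring - autumn for _, autumn, spring, _ in students)

# Naive comparison: autumn average of everyone tested vs. spring average of
# whoever is still enrolled.
autumn_mean = mean(autumn for _, autumn, _, _ in students)
spring_mean = mean(spring for _, _, spring, moved in students if not moved)
naive_gain = spring_mean - autumn_mean

print(f"True average gain:  {true_gain:.1f}")
print(f"Naive average gain: {naive_gain:.1f}")
```

In this toy case the naive estimate overstates the gain, simply because the students who left were disproportionately the lower scorers. Restricting both averages to the students who stayed would avoid that particular distortion, but the result would then describe only the stayers.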

Dan lists other problems that he says are old stuff to statisticians, and concludes “there’s nothing wrong with using value-added measures in research, with all the caveats of the method understood, as one in an array of tools to address a research question. But using it as a measure of an individual teacher’s efficacy is foolish.”

Hiring and Firing

by Robert Pondiscio
September 30th, 2008

Jay Mathews, the dean of education reporters, takes a strong stand on teacher retention, arguing that giving principals the unfettered power to hire and fire teachers is “crucial” to closing the achievement gap.

This is a difficult choice and a hard time for D.C. teachers. They are fine people who have chosen a tough profession and put their hearts into their work. Many fear being judged by principals who were not skillful teachers themselves and have little clue as to what helps kids learn and what doesn’t. But I don’t see any way the city’s children are going to get the instruction they deserve — the imaginative, fun-loving, firm teaching found at schools like KEY — unless principals are given the power to hire and fire teachers based on demonstrated skill and improved learning in class.

Mathews cites the example of the KIPP DC:KEY Academy, where principal Sarah Hayes dismissed two teachers who were not cutting it, despite efforts to improve.  “If KEY were a traditional school, Hayes’s only reasonable option would have been to mentor the teachers, note her dissatisfaction on their evaluations and recommend that they not be kept after a two-year probation,” he writes.  “That is the way it goes in most school systems. Staffing rules, tenure agreements and low expectations tend to favor weak teachers unless they do something awful.”