The trouble with value-added-modeling
March 6, 2012 6:52 AM   Subscribe

 
Jesus, that's horrible. It actually looks like there is marginally more correlation in the "language arts--math" graph than in the "same subject--different grade" graph.
posted by OmieWise at 6:59 AM on March 6, 2012 [1 favorite]


Linda Darling-Hammond wrote a great op-ed in Education Week.
posted by smirkette at 7:03 AM on March 6, 2012 [2 favorites]


Holy fuck. That is pretty damning evidence.
posted by ook at 7:03 AM on March 6, 2012


I don't usually take subjective metrics too seriously, but that's amazing. Those poor teachers would be almost better off rolling dice to get their ratings.
posted by bonehead at 7:04 AM on March 6, 2012 [1 favorite]


It must suck to be a teacher in NY.
posted by dobie at 7:05 AM on March 6, 2012 [3 favorites]


As my stats prof (who's a quant education researcher) says, these tests are much better at measuring a kid's socioeconomic status and parental education than their reading and math skills.
posted by smirkette at 7:10 AM on March 6, 2012 [17 favorites]


Depressing, but not really shocking, considering how teachers have become the go-to punching bag for politicians of all stripes over the last few years. People like Bill Gates and Michelle Rhee are part of the problem: loudmouth dilettantes with zero actual knowledge of what they claim to care so much about.
posted by Edgewise at 7:16 AM on March 6, 2012 [4 favorites]


People with non-teaching jobs are evaluated with imperfect, spurious metrics, too.
posted by downing street memo at 7:16 AM on March 6, 2012 [5 favorites]


and apparently, NYC's "worst" teacher is a very talented teacher who takes on challenging students who are still learning English. Yes, let's punish her.

/hamburger
posted by jb at 7:17 AM on March 6, 2012 [8 favorites]


By the way, it should be said that value-added modeling is a tool and is not a bad thing if used correctly. My understanding is that there are a very few teachers who consistently, year after year, show up in the bottom or top range of the ranking. When you start averaging over many years, or, better yet, averaging over many teachers in the same school, you really start to damp the noise. That's how the numbers were intended to be used. But once the numbers are in the hands of policymakers, the modelers lose control of their use, of course.
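Here's a toy sketch of what I mean, with completely made-up numbers (this is not the NYC data or the real model; it just treats a single-year score as the true teacher effect buried in noise twice as large):

import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_years = 1000, 5
true_effect = rng.normal(0, 1, n_teachers)             # what we actually want to measure
# a single-year score is the true effect buried in noise twice as large
yearly = true_effect[:, None] + rng.normal(0, 2, (n_teachers, n_years))

one_year = yearly[:, 0]
multi_year = yearly.mean(axis=1)                        # average over 5 years

print(np.corrcoef(true_effect, one_year)[0, 1])         # around 0.45: mostly noise
print(np.corrcoef(true_effect, multi_year)[0, 1])       # around 0.75: averaging damps the noise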
posted by escabeche at 7:18 AM on March 6, 2012 [7 favorites]


So, if every statistical/quantitative effort so far designed to assign numeric 'goodness' scores to teachers is an abject failure, can we agree that it's a really, really difficult and extremely subjective process to distinguish 'good' teachers from 'bad' teachers, at a macro level? If so, does that mean we can finally drop the canard about teachers' unions being terrible because they prevent us from "firing the bad teachers?" Because it looks like the NY teachers' union is the only thing standing in the way of the NY Post et al. trying to railroad a bunch of people out of their jobs because of a horrible statistical model.
posted by Mayor West at 7:20 AM on March 6, 2012 [14 favorites]


Holy fuck. That is pretty damning evidence.

Holy fuck. That is a pretty damning fallacy throughout the article's argument: "if you don't agree with our premise, then you don't really 'know' education"!
posted by The 10th Regiment of Foot at 7:33 AM on March 6, 2012 [2 favorites]


You want to improve education? Give examinations to lawmakers. Shame the ones who fail. If legislators were required to understand such basic things as statistics, history, and the principles of education, they would be less likely to impose stupid-assed laws and constraints on teachers.
posted by dances_with_sneetches at 7:35 AM on March 6, 2012 [16 favorites]


Escabeche, that is a LOT of noise there. How many years are needed to get useful information from that sort of analysis?

It also seems trivially easy to game that sort of ranking. Administrators can put these kids in with the teacher they favor, and put those kids in with the one they want an excuse to fire. Three years later, the first teacher gets a $10,000 bonus and the other one is kicked to the curb, entirely unemployable as a teacher, through no fault of their own. There's no way to control against that sort of manipulation of the ranking system, and it will happen because people are people.

The third post on the same subject by the same person graphs children's performance in 5th grade and their past performance in 4th grade. Students' performance is clustered tightly and predictably together, highly dependent each year on their performance the year before and not at all dependent on their schools or teachers.

My personal belief, backed up by nothing much, is that home and other external environments shape the students and determine their test scores and ability to advance far more than teachers do.
posted by jsturgill at 7:38 AM on March 6, 2012 [7 favorites]


The people who make decisions about which teachers to fire should also be judged on these scores. They should be retained or fired based on the improvement shown in the following year among the teachers they decided to retain. (Somehow, I doubt the school administrators will agree.)

On second thought, maybe the administrators should all just be fired.
posted by dsword at 7:53 AM on March 6, 2012 [9 favorites]


can we agree that it's a really, really difficult and extremely subjective process to distinguish 'good' teachers from 'bad' teachers, at a macro level?

Yes. Macro is exactly the problem. Having been in school for 18 years of my 42 on the planet, I'm pretty certain I could tell a good teacher from a bad one after about 2 hours of observing in the classroom. Of course, that micro, time-intensive method doesn't work for politics or for advancing a bureaucratic agenda, and it also doesn't work well for the vast middle - the 'average' teacher.
posted by spicynuts at 7:55 AM on March 6, 2012


Macro is exactly the problem. Having been in school for 18 years of my 42 on the planet, I'm pretty certain I could tell a good teacher from a bad one after about 2 hours of observing in the classroom.

Eh, research shows that students' ratings aren't much good either. Students have a really crappy idea of what makes a good teacher, since what "seems" good to them ("easy" learning) may not be the best way to learn.

Don't pretend like you have a good grasp of this difficult problem just because you've sat in a lot of classrooms.
posted by Philosopher Dirtbike at 8:04 AM on March 6, 2012 [6 favorites]


Where's everybody from yesterday's thread on NYC schools to tell us how this is all the kids' fault for "not taking ownership of their education"?
posted by Jon_Evil at 8:06 AM on March 6, 2012 [1 favorite]


Holy fuck. That is a pretty damning fallacy throughout the article's argument: "if you don't agree with our premise, then you don't really 'know' education"!
For people who know education, this is shocking, but there are people who probably are not convinced by my explanation that these should be more correlated if the formulas truly measured learning. Some might think that this really just means that just like there are people who are better at math than language arts and vice versa, there are teachers who are better at teaching math than language arts and vice versa.

So I ran a different experiment for those who still aren’t convinced. There is another scenario where a teacher got multiple ratings in the same year. This is when a middle school math or language arts teacher teaches multiple grades in the same year. So, for example, there is a teacher at M.S. 35 who taught 6th grade and 7th grade math. As these scores are supposed to measure how well you advanced the kids that were in your class, regardless of their starting point, one would certainly expect a teacher to get approximately the same score on how well they taught 6th grade math and 7th grade math. Maybe you could argue that some teachers are much better at teaching language arts than math, but it would take a lot to try to convince someone that some teachers are much better at teaching 6th grade math than 7th grade math. But when I went to the data report for M.S. 35 I found that while this teacher scored 97 out of 100 for 6th grade math, she only scored a 6 out of 100 for 7th grade math.

Again, I investigated to see if this was just a bizarre outlier. It wasn’t. In fact, the spreads were even worse for teachers teaching one subject to multiple grades than they were for teaching different subjects to the same grade.
posted by ennui.bz at 8:07 AM on March 6, 2012 [1 favorite]


Spicynuts, I imagine that a good portion of your 18 years in school was not spent making an honest attempt to ascertain the quality of your teachers. Because you were a small child for a good chunk of it. And then you were a teenager.

Also, the fact that all these administrators and lawmakers spent an equal or greater amount of time in school should lend credence to the notion that simply going to school does not make you a qualified judge of a teacher.
posted by Jon_Evil at 8:14 AM on March 6, 2012 [1 favorite]


To expand a little on my drive-by comment: it sounds as though, in education, "learning" is as difficult to measure as "return on investment" is in business. In other words, you can measure how smart a kid is, and maybe how smart she is at the end of the year versus how smart she was at the beginning. But it's very difficult to assign causation for that result.

In business, good managers deal with this by embracing ambiguity, picking goals that they believe will lead to profitability/ROI, and designing feedback systems that re-inform the goals on a regular basis, based on real-world results. If the goals are unfeasible and employees are overworked, you tone them down. If the goals are being achieved but profits don't follow, you shift the goals. And if folks aren't achieving their goals, you re-train, re-motivate, or fire them.

Do people unfairly lose their jobs in this process? Yes. Absolutely. Most people posting in this thread have jobs, and I guarantee that you personally answer to your boss for a goal or metric you don't have full control over. But the idea is that good businesses gradually figure out how to set goals and how to set employees on the right path to achieve them. No evaluation system is perfect, but the idea is to make it good enough that the principal goal of the business - profitability - is achieved, and then continually make it better.

I do not understand why teachers should be exempt from this process, which, yes, doesn't preclude the possibility of a talented person being fired. Look around you; talented people are fired all the time.
posted by downing street memo at 8:20 AM on March 6, 2012 [2 favorites]


@Philosopher Dirtbike
Eh, research shows that students' ratings aren't much good either. Students have a really crappy idea of what makes a good teacher, since what "seems" good to them ("easy" learning) may not be the best way to learn.

Here's another link for you.
Carrell and West (2010) show that student evaluations negatively correlate with student success in the following class. I think it might have been on the blue at some point.
posted by yeolcoatl at 8:24 AM on March 6, 2012


I do not understand why teachers should be exempt from this process

Did someone claim that teachers shouldn't be evaluated? No, this is a discussion about a particular metric used to evaluate them, and why it is a nonsensically bad choice and does not even remotely measure what it claims to measure.
posted by RogerB at 8:25 AM on March 6, 2012 [9 favorites]


We can't even begin to discuss "improving" the quality of education until we've tripled the number of teachers we have and then doubled their salaries. That will bring us up to a level where we can even begin to discern the teacher effects.
posted by DU at 8:29 AM on March 6, 2012 [2 favorites]


Did someone claim that teachers shouldn't be evaluated? No, this is a discussion about a particular metric used to evaluate them, and why it is a nonsensically bad choice and does not even remotely measure what it claims to measure.

My point is that every metric is flawed, and you have to start somewhere.
posted by downing street memo at 8:31 AM on March 6, 2012


jsturgill: the point of VAM is that it tracks kids over time. To really do the gaming you describe, the administrator has to predict which kids will get worse. That might be possible, but anything that is predictable from what the analyst knows goes in the model and doesn't affect the teacher effect. That kids do largely the same relative to one another year to year is exactly what the modelers expect.

These things come with huge error bars, but those bars are knowable. It's true that the single-year delta in a classroom is dominated by external factors; my understanding is that you need more student-years to get a reliable estimate (discounting the first few as teachers ramp up) than most districts are willing to wait. The things are in theory workable but in practice too noisy to be very useful other than as a warning bell to trigger observation.
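A bare-bones sketch of that point, with invented numbers and made-up variable names, plus a deliberately hostile assignment of students (this isn't the real model, just the logic of conditioning on the prior-year score):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
prior = rng.normal(0, 1, n)
# the hostile administrator sends every below-median student to teacher B
teacher_b = (prior < np.median(prior)).astype(int)
# teacher B is actually perfectly average: true effect of zero
post = 0.7 * prior + 0.0 * teacher_b + rng.normal(0, 1, n)
df = pd.DataFrame({"post": post, "prior": prior, "teacher_b": teacher_b})

# naive comparison of raw scores: teacher B looks terrible
print(df.groupby("teacher_b")["post"].mean())
# growth model that conditions on the prior score: teacher B's coefficient is ~0
print(smf.ols("post ~ prior + teacher_b", data=df).fit().params["teacher_b"])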
posted by a robot made out of meat at 8:32 AM on March 6, 2012 [1 favorite]


I do not understand why teachers should be exempt from this process...

If you have a world-view, and every conceivable situation seems to fit into that world-view, maybe it's time to reconsider that world-view. Just because teachers are employees and get paid doesn't mean that there is any relationship between what a school does and what a business does:

A PUBLIC SCHOOL IS NOT A FOR-PROFIT BUSINESS.

Notice what's missing: profit!

My point is that every metric is flawed, and you have to start somewhere.

You've put yourself in a very small box here, or rather you are putting the world into a very small box. Enormous, complicated, and very successful businesses were founded before there was the idea, or even the means, to collect data on even semi-meaningful business metrics.
posted by ennui.bz at 8:33 AM on March 6, 2012 [10 favorites]


The "business" part is irrelevant, ennui.bz; in both cases, the problem is evaluating the performance of individuals in complex environments.
posted by downing street memo at 8:34 AM on March 6, 2012


Oh, and FWIW the testing approach is mostly a response to how unaccountable, tool-less, and irrational administrators can be. They can't hire and fire the way the business world does because of restrictive contracts. When they do get that power they go nuts some substantial fraction of the time (or such is the story teachers tell me). I think it would be a worthwhile experiment to give administrators more power and apply the VAM metrics to keep them honest, coupled with some direct cross-review and other measures.
posted by a robot made out of meat at 8:36 AM on March 6, 2012


My point is that every metric is flawed, and you have to start somewhere.

That's a bit broad. Every study is flawed, yes. Not all of them are all over the map like this is. There are degrees of suck involved. Is your argument here "well, yeah, that looks pretty terrible, but then nothing's perfect, so we're good"?
posted by middleclasstool at 8:37 AM on March 6, 2012 [4 favorites]


In business ... If the goals are unfeasible and employees are overworked, you tone them down.

Honestly, I simply do not believe you about this. I don't mean you're lying or anything, just mistaken. Can you list five instances where this happened in the US in Fortune 500 companies? Oh, and where it happened at management's behest and not because a union raised a stink, or because they ran afoul of some regulation or law?

If the goals are being achieved but profits don't follow, you shift the goals.

Again, I just don't believe that this actually happens. Can you list five instances where this happened in the US in Fortune 500 companies that were not simply instances of a new executive throwing out the old executive's metrics?
posted by ROU_Xenophobe at 8:38 AM on March 6, 2012 [1 favorite]


Could anyone point me to the full data set? I've been trying to find it, but without any luck.
posted by evidenceofabsence at 8:39 AM on March 6, 2012


the point of VAM is that it tracks kids over time. To really do the gaming you describe, the administrator has to predict which kids will get worse

Not get worse, just improve the least. And that's easy to game if you have the test scores, which administrators presumably do. Want to fuck over a grade 8 teacher? Send him or her students who haven't been improving well in grades 4-7.
posted by ROU_Xenophobe at 8:42 AM on March 6, 2012


...the problem is evaluating the performance of individuals in complex environments.

Well, I thought your point was that managers ultimately make decisions based upon the profitability of the group. The whole problem for your business-manager types is that there isn't anything approaching the hardness of profitability for schools. So, people keep trying to think one up so that US schools can be run as well as GM.

Just because profitability starts to look ambiguous when you consider ROI doesn't change the fact that education just doesn't have basic accounting profits to start with.
posted by ennui.bz at 8:43 AM on March 6, 2012 [1 favorite]


There might be a combination of tests that can show teaching ability. This one is not it. It doesn't even have face validity. Only a 0.35 correlation coefficient between years? You can't use that for personnel decisions.
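To put a rough number on that (a quick simulation that assumes the year-to-year scores are bivariate normal with r = 0.35, which is an idealization):

import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.35
year1 = rng.normal(size=n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # correlated with year1 at r

bottom1 = year1 <= np.quantile(year1, 0.2)    # the "bottom 20%" you'd flag in year one
bottom2 = year2 <= np.quantile(year2, 0.2)
# of the teachers flagged in year one, how many are still flagged in year two?
print((bottom1 & bottom2).sum() / bottom1.sum())   # ~0.35: roughly two thirds move out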
posted by demiurge at 8:45 AM on March 6, 2012 [1 favorite]


This has been addressed a bit already, but does anyone here have a decent background in the methodology used/discussed in the post?

Two questions that come to mind immediately are whether single-year data are relevant to these VAM scores (gut reaction suggests that they sound like longitudinal metrics), and whether his assertion that a correlation coefficient of .24 is small (shockingly so, even) holds up.

In some contexts, a .24 correlation is taken as concrete proof. And not knowing anything about these metrics/this field, I'm trying to figure that out in context.
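One crude back-of-the-envelope check (where the number of teachers is purely my guess): a .24 correlation can be wildly "significant" in a big sample while explaining almost none of the variance.

import numpy as np
from scipy import stats

r = 0.24
n = 1000   # a guess at the number of teachers being compared, not the real figure
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(t, p)      # easily "significant" with a sample this large
print(r**2)      # ...yet only about 6% of the variance is explained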
posted by graphnerd at 8:47 AM on March 6, 2012 [3 favorites]


My point is that every metric is flawed, and you have to start somewhere.
posted by erniepan at 8:56 AM on March 6, 2012


every metric is flawed, and you have to start somewhere

Why are we even bothering with all this messy data, then? Let's just fire people based on a coin flip. If the corporate world (effectively) does it, it must be good enough for education!
posted by RogerB at 8:57 AM on March 6, 2012 [1 favorite]


Not get worse, just improve the least. And that's easy to game if you have the test scores, which administrators presumably do. Want to fuck over a grade 8 teacher? Send him or her students who haven't been improving well in grades 4-7.

Yes, worse relative to peers is (hopefully) improving least. a) My understanding is that they take ranks instead of using the raw number, so kids who are already low getting lower doesn't do much. b) If Dr. Evil is using old test scores, I, the analyst, also know those scores and can plop them into a better growth model. c) Does not growing before strongly predict continuing to not grow? Seems like negatively correlated catch-up is possible.
posted by a robot made out of meat at 8:58 AM on March 6, 2012


Could anyone point me to the full data set? I've been trying to find it, but without any luck.


http://www.ny1.com/content/top_stories/156599/now-available--2007-2010-nyc-teacher-performance-data#doereports

posted by Glomar response at 9:00 AM on March 6, 2012 [1 favorite]


Not get worse, just improve the least. And that's easy to game if you have the test scores, which administrators presumably do. Want to fuck over a grade 8 teacher? Send him or her students who haven't been improving well in grades 4-7.

That's not how value-added metrics work, or should work if done right. They're not an absolute measurement of student progress. Those low-performing students will have predicted scores that are based on and take into account their previous student history and incoming grade level proficiency, and they will be compared to similarly low-performing students within the same district or state pool.
posted by BurntHombre at 9:01 AM on March 6, 2012 [1 favorite]


My point is that every metric is flawed, and you have to start somewhere.

I don't think the analysis, or my post, is meant to argue that teachers shouldn't be evaluated -- only that naked VAM scores are almost certainly a very poor basis on which to make firing decisions. Nobody involved in developing the scores thinks that's how they should be used, as far as I know. But the worry is that they'll end up being used in exactly that way.

It seems to me that your claim here is something like "a flawed measure is better than nothing." But it's certainly not clear that's true, whether in education or business. If you substantially increase the extent to which random fluctuation drives firing of teachers, you make the job less attractive, which means that you're effectively lowering the salary you pay teachers without saving the taxpayer any money. That's a cost. Now the hope is that a well-designed measure and well-thought-out use of the measure will have the benefit of improving test scores in the school system, or even improving the quality of education on more general metrics. What many people are claiming is that raw VAM is so weakly related to outcomes that it's very unlikely the benefit outweighs the cost -- so that it is better, in fact, not to start at all than to start at that particular spot.

I should say, though, that the people who run school districts are not fools, and as far as I know no school district has actually proposed firing teachers for a bad one-year VAM score. The DC schools, for instance, use an IMPACT score, which mixes VAM with in-class ratings by principals (and maybe other measures besides, I'm not sure), and I don't think they make any firing decisions based on a single annual number. The problem is that there's undoubtedly going to be political pressure on them to do exactly that, especially once the individual numbers are public, and I don't think there's a way to avoid the numbers being public once they're computed; they really are public records.
posted by escabeche at 9:03 AM on March 6, 2012


Well, I thought your point was that managers ultimately make decisions based upon the profitability of the group. The whole problem for your business manager types is there isn't anything approaching the hardness of profitability for schools. So, people keep trying to think one up so that US schools can be run as well as GM.

Look, I'm not arguing that the analogy is exact. But just like "profit" seems generally to be the fruit of business, "learning" seems to be the fruit of education. We know what a profitable business looks like, and what an educated student looks like, but what we have relatively little visibility into is how business people and teachers achieve their respective fruits, or what impact they have at all.

The whole point is that profitability is the result of a complex system, just like learning, and the vast, vast majority of that complex system is out of the control of managers and teachers. The macroeconomy and SES/home conditions, respectively, determine most of the outcome before the action even starts.

These discussions tend to go the direction of acknowledging the significant non-teaching inputs into student learning and throwing up one's hands. If a teacher doesn't cause learning, however we construe that, we say, maybe he got a bad class, or maybe the administrators were up in her shit all year and made it impossible to teach. Look at this thread - people are positing scenarios in which an administrator hates a teacher so much that she gives him a bad class, on purpose. Maybe this happens, but is it a macro problem?

My point is that everyone else has similar constraints on their ability to actually do anything at work. Businesses embrace minimal heuristics for judging performance and make do with what they have; in education, anyone who even suggests a heuristic is attacked as a corporate stooge trying to privatize schools.
posted by downing street memo at 9:04 AM on March 6, 2012


I read about this in the NYTimes a couple of days ago. They did an initial analysis by examining specific schools. I was absolutely stunned when I read one analysis that asserted that, at one school they examined, 1/3 of the teachers were below average, 1/3 were average, and 1/3 were above average.

[facepalm]
posted by charlie don't surf at 9:04 AM on March 6, 2012 [1 favorite]




"In some contexts, a .24 correlation is taken as concrete proof."

It depends on a back-of-the-envelope computation I'm not doing, but I think these numbers certainly might be concrete proof up to social-science standards that the VAM score is not totally unrelated to some aspect of what the teacher is doing in the classroom. But that's a pretty low bar to clear, if you're asking the measure to make personnel decisions for you.
posted by escabeche at 9:10 AM on March 6, 2012 [1 favorite]


Anybody who's interested in a deeper look at the model used to produce the scores can check out the technical report here.

Honestly, I don't think much of these blog posts; the statistical analysis is pretty shallow and in some cases based on wrong assumptions (for example, he makes a major point about percentile rankings not going up between teachers' first and second years, but in years of employment 1-3 the percentile rankings are relative to other teachers in the same year of employment, and therefore you would not expect the average ranking to change between years 1, 2, and 3).

The technical report also has this interesting correlation table showing how stable rankings are across years:
Correlation between 2007-08 and 2008-09 value added
        Math   ELA
Grade 4 0.48   0.24
Grade 5 0.45   0.33
Grade 6 0.62   0.24
Grade 7 0.50   0.20
Grade 8 0.59   0.24
So it looks like correlations are surprisingly high for math but lower for English. At any rate, the low correlations shouldn't surprise one given the very wide confidence intervals on many estimates.
posted by myeviltwin at 9:12 AM on March 6, 2012 [3 favorites]


Why are we even bothering with all this messy data, then? Let's just fire people based on a coin flip. If the corporate world (effectively) does it, it must be good enough for education!

Yes, what I am suggesting is a weighted coin flip, just how everyone else is evaluated in the world.
posted by downing street memo at 9:16 AM on March 6, 2012


This is an incredibly complex and difficult subject and, frankly, the linked article is so self-evidently axe-grindy and question-begging that I can't see it shedding much light or contributing to a useful discussion.
posted by yoink at 9:21 AM on March 6, 2012 [2 favorites]


I was just googling for that, myeviltwin. Thank you!
posted by smirkette at 9:22 AM on March 6, 2012


Rated each year as “exceeding expectations,” she showed positive value-added scores in most subjects every year, except for the year she taught 4th grade, when English-language learners, or ELLs, are mainstreamed in Houston. The pattern of lower scores in classes with large numbers of ELLs is well known. As another teacher said: “I’m scared I might lose my job if I teach in an [ELL] transition-grade level, because my scores are going to drop, and I’m scared I’m going to get fired.” When teachers avoid these classes, high-need students are increasingly taught by less effective novices.

I'm not here to defend anyone's particular implementation, but clearly if there is a feature that's known, it should be adjusted for. If kids are transitioning from special to mainstream, then there isn't comparable previous-year data, and it's inappropriate to use that method. It's also possible she got unlucky that year; as I pointed out, the measures are extremely variable.
posted by a robot made out of meat at 9:24 AM on March 6, 2012


myeviltwin, that report is interesting, but it's from the Value-Added Research Center, which obviously wants to make its own model look good. They chose to do their in-depth analysis on 5th grade math, which has better numbers than English. I'm not saying they did anything unethical, but I would like to look at a similar document from a more neutral source.
posted by demiurge at 9:28 AM on March 6, 2012


Students have a really crappy idea of what makes a good teacher

I didn't mean students, I mean me as a 42 year old.
posted by spicynuts at 9:37 AM on March 6, 2012



Spicynuts, I imagine that a good portion of your 18 years in school were not spent making an honest attempt to ascertain the quality of your teachers.


I guess I didn't make my point well enough - I was trying to say that identifying a BAD teacher and identifying a GREAT teacher should be relatively simple from observation. It's the broad middle or average and the layers of distinction in between that are extremely difficult to judge.
posted by spicynuts at 9:38 AM on March 6, 2012


Article VI of the US Constitution specifies that “no religious test shall ever be required as a qualification to any office or public trust under the United States.” However, given the amount of faith that Administrators and Legislators need to believe in the legitimacy of teacher ratings based on standardized test scores, I assert that hiring and firing teachers based on this data is nothing more than an unconstitutional form of state-sanctioned paganism.
posted by Dr.Rhetoric at 9:44 AM on March 6, 2012 [1 favorite]



I shake my head. We are trying to make an educational model that was concocted for the 19th century work for the 21st.

When a person needed to know reading, writing and basic math, huge, crowded classrooms were fine. Your child had a fairly decent chance at getting the minimum amount of education needed to get on with his or her life.

Clearly this isn't working today, yet we're doing almost everything we can think of to keep this going instead of re-thinking the whole thing.

Think about how specialized our jobs are now. Does it still make sense to teach all the way through college, a survey of everything one needs to know? No. So why is it still out there?

I have a liberal arts education and I love the arts, so that's not what I'm talking about. The existing system fails to take into account each student's aptitude, ability and talent.

The typical classroom has a couple of Special Ed kids who are being mainstreamed, a couple of kids with behavioral issues, some kids who have a homelife so horrific you wonder how they wake up and get to school in the morning, some kids whose parents are so checked out that they don't even know when report cards go home, or what school hours are, some kids who have no interest in the class and a few who love the subject. Oh and throw in the FOBs (fresh off the boat) who don't speak the language.

So you take this squad of kids, each with his or her own issues, and desires and goals and you cram them into an aging classroom in a building that would have 100 health and safety violations if it didn't belong to the county and by virtue of being in that building, we are said to be educating them.

As Miss Krabapple says, HA!

One teacher, no matter how talented, can't possibly address all the needs of all these kids. Not even part way. So you make accommodations for the Speds, you try to keep the behavioral cases from freaking out and grinding the class to a halt, you see if one of the other kids who speaks the native language of the FOBs can explain the lesson, and you try to keep the kids engaged in the lesson.

This is the hardest job in the world.

By all means, try to measure that shit. I dare you.
posted by Ruthless Bunny at 9:50 AM on March 6, 2012 [8 favorites]


I didn't mean students, I mean me as a 42 year old.

Your opinion about what makes a good teacher is, as you said, based on your experiences as a student. Given that students can't tell what makes a good teacher, it stands to reason that your argument for why you'd be able to do it, with "18 years" experience as a student, is flawed.

I was trying to say that identifying a BAD teacher and identifying a GREAT teacher should be relatively simple from observation.

Well, identifying a BAD teacher and identifying a GREAT teacher should be relatively simple from student outcomes, too, but it isn't that simple. Perhaps the lesson of this FPP is that what one thinks is obvious is sometimes untrue.
posted by Philosopher Dirtbike at 9:55 AM on March 6, 2012


When a person needed to know reading, writing and basic math, huge, crowded classrooms were fine. Your child had a fairly decent chance at getting the minimum amount of education needed to get on with his or her life.

Clearly this isn't working today


One of the more stubborn counterintuitive findings in educational research is that class size seems to have relatively little impact on learning outcomes. What matters is total student contact-hours per teacher (i.e., a teacher teaching five 20-student classes will do worse than one teaching two 40-student classes).
posted by yoink at 9:56 AM on March 6, 2012


I needed to take a breather before responding. The scatterplot/correlation approach is exactly the sort of thing that my high school students like to do for their research projects. What the blogger does not address, at all, is the actual model that is being used.

Hierarchical Linear Modeling (HLM) is an advanced regression model that nests individual student growth (change) within a classroom taught by a certain teacher. What that means is that the model is looking for the effect of a particular teacher on the slope of the student's growth line, and whether that is better or worse than the mean slope. The model controls for the following student and classroom level variables:

The student-level variables included in the model (the X variables in equation 1) include gender, race, English language learner (current and former), free- and reduced-price lunch, disability (by special education services recommended), summer school enrollment, absences and suspensions (lagged one year to avoid potential endogeneity problems), retained in grade before pretest year, change in school between pretest and posttest year, and new to city in pretest year. The classroom-level variables included in the model (the Z variables in equation 1) include class size, classroom averages of pretests and most of the student-level variables in X (variables were excluded if there was insufficient variation at the classroom level to measure a precise effect), and proportion of students in the classroom new to city in the posttest year (a variable excluded from X because of its rarity among individual students in the analysis sample).


Raudenbush, the grandfather of all of this, does say that it is still a complex problem and that the model should only be part of the teacher assessment. This is state of the art; its practitioners know its flaws, but it is nothing like the straw man that the blog post is attacking.
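For anyone curious what the shape of such a model looks like in code, here is a bare-bones sketch with synthetic data and made-up variable names (nothing like the DOE's full specification, which also models growth across multiple years):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# fake student-level data standing in for the report's X and Z variables
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "teacher_id": rng.integers(0, 100, n),
    "pretest": rng.normal(0, 1, n),
    "ell": rng.integers(0, 2, n),          # English language learner flag
    "frl": rng.integers(0, 2, n),          # free/reduced-price lunch flag
    "class_size": rng.integers(18, 32, n),
})
teacher_bump = rng.normal(0, 0.2, 100)     # a small true teacher effect, so there is something to find
df["posttest"] = (0.7 * df["pretest"] - 0.2 * df["ell"]
                  + teacher_bump[df["teacher_id"]] + rng.normal(0, 1, n))

# students nested within teachers; the per-teacher random intercept is the
# "value added" piece: how far a classroom sits above or below the modeled expectation
model = smf.mixedlm("posttest ~ pretest + ell + frl + class_size",
                    data=df, groups=df["teacher_id"])
fit = model.fit()
teacher_effects = {t: re.iloc[0] for t, re in fit.random_effects.items()}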
posted by cgk at 10:03 AM on March 6, 2012 [5 favorites]


I didn't mean students, I mean me as a 42 year old.

I've spent lots and lots of hours flying in planes, and I once watched that documentary about the stewardess who landed that jumbo jet. Piloting a 747 doesn't look too tough to me; I'm sure I can pick it up as I go. Who wants to fly in coach while I test out my totally-unassailable theory?

I was trying to say that identifying a BAD teacher and identifying a GREAT teacher should be relatively simple from observation.

Great! Publish your methodology, making sure you specifically address why we are privileging your opinion above the rest of the teeming masses of humanity who all claim to be trivially able to evaluate teacher performance based on their years spent as students, and we'll subject it to the same rigor we subject every other theory to. I'm sure it will prove to be the unifying theory that blows hundreds of years of study out of the water with its breathtaking comprehensiveness and simplicity.

Is this what we sound like when we mansplain things, or go all Engineer Syndrome on a problem? Because I solemnly swear never to do it again.
posted by Mayor West at 10:05 AM on March 6, 2012


Just to add a perspective from a "pro-VAM" side here's a link to the original Los Angeles Times piece when they reported on their in-depth statistical study into seven years of value-added data from LAUSD teachers. And here's a link to their database with all the teacher ratings averaged over the last seven years. Play around in the database for a while and you'll see that with seven years of data to work with, you seldom find massive discrepancies between math effectiveness and English effectiveness of the kind the FPP blog piece is describing.

I don't think there's really much doubt that over a sufficient length of time VAM measures something. I also don't think there's much doubt that what it measures overlaps in significant ways with what we would mean when we use a phrase like "teacher effectiveness." I also don't doubt that there are kinds of excellence that it is relatively insensitive to and that it would be absurd to rely on snapshot data rather than longitudinal data in assessing teacher performance.
posted by yoink at 10:08 AM on March 6, 2012


I've spent lots and lots of hours flying in planes, and I once watched that documentary about the stewardess who landed that jumbo jet. Piloting a 747 doesn't look too tough to me; I'm sure I can pick it up as I go. Who wants to fly in coach while I test out my totally-unassailable theory?


What a stupid analogy. We aren't talking about doing...we're talking about evaluating. You're going to know pretty much right away, sitting in coach, if the pilot sucks or not though, aren't you? We're all gonna have to agree to disagree on this.
posted by spicynuts at 10:12 AM on March 6, 2012


The model controls for the following student and classroom level variables:

The idea behind controlling for all those things in HLM is that what's left over - the value added score - should be a less noisy picture of what you're attempting to measure. And yet, the scatterplots show mostly noise. Noting that the value added scores come from an HLM isn't a defense against the critique. I routinely look at scatterplots of values from hierarchical models as a check to make sure that my model makes sense, and that I'm measuring something. Simple scatterplots are powerful.
posted by Philosopher Dirtbike at 10:19 AM on March 6, 2012


I'd like to dig in on this more at some point, but it does seem like the wrinkle for this particular round of teacher-judging is that by doing a whole bunch of math to try to account for all the different ways in which environment and student history and so forth affect test scores, they managed to come up with a different set of nonsensical numbers. This NY Times article covers the angle whereby teachers from the 'good' schools where the kids come from affluent backgrounds and get super high test scores were rated badly because the model expected them to have even better outcomes than they did.

Cgk, I'm not sure I understand your point; yes, it's a complicated regression model that tries to control for all kinds of background factors; doesn't that mean that it should be reasonable to expect some correlation in a teacher's own scores from classroom to classroom or discipline to discipline?
posted by yarrow at 10:23 AM on March 6, 2012 [1 favorite]


To defend spicynuts a little bit here, I took his comment as calling for evaluation by observation rather than VAM, but maybe I'm wrong there.
posted by Navelgazer at 10:26 AM on March 6, 2012


Question: Could we also grade the parents using the same methodology, and using between-student variation within the same classroom?
posted by The Ted at 10:41 AM on March 6, 2012 [1 favorite]


This shouldn't come as much of a surprise. The reason New York originally objected to releasing these numbers is that they aren't worth very much, as far as statistical rigor goes. They were worried about people drawing conclusions from the numbers. Not conclusions they didn't like, but conclusions at all, or at least conclusions other than "Gee, these numbers don't tell us very much."

And hey, look! These numbers don't tell us very much.
posted by valkyryn at 10:46 AM on March 6, 2012 [1 favorite]


The idea behind controlling for all those things in HLM is that what's left over - the value added score - should be a less noisy picture of what you're attempting to measure.

Shouldn't it be the opposite? You've reduced the data to "teacher value add" + some idiosyncratic error term. You've basically taken everything out that should make the data "look neater" on a scatter chart. The error term becomes much bigger as a factor in the data, and you've removed a lot of the factors that are correlated with performance on both the English and the math exams?

Serious question. I pretty much forget all but the most basic stats from undergrad.
posted by JPD at 10:56 AM on March 6, 2012


See, that seems like it depends; if "teacher value add" is a real thing and you've really isolated it, then it should correlate with itself in different measurements. If what's left over after pulling out the other factors looks random on a scatterplot, then either you haven't managed to pull out the thing you're looking for or it doesn't exist.

The "control" factors might or might not make the data look neater depending on their distributions and intensity relative to each other.
posted by yarrow at 11:04 AM on March 6, 2012


@downing street memo

Yes, what I am suggesting is a weighted coin flip, just how everyone else is evaluated in the world.

The article is not saying that a weighted coin flip is bad, the article is saying that the coin that is currently being flipped is almost completely unweighted, and we should find a different coin that is weighted more.
posted by yeolcoatl at 11:07 AM on March 6, 2012


The error term becomes much bigger as a factor in the data, and you've removed a lot of the factors that are correlated with performance on both the English and the math exams?

Your model of a student outcome will include variance from teacher effects, all the crap you're trying to control for, and error. You could just naively take the average outcome for every teacher and call that your evaluation. But it has sources of variance you can remove. So, you apply a statistical model to estimate the teacher effect by controlling for all the crap. That new measure will be a better, less variable estimate of the teacher effect.

That's why doing a scatterplot on the HLM estimates should show a stronger correlation than on the naive estimate, because sources of variance (noise) have been statistically removed.
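A toy illustration, with every number invented: the raw classroom average is a noisy read on the teacher effect, and subtracting out the part you can predict from a covariate (call it "ses" here) gives a measure that tracks the true effect more closely.

import numpy as np

rng = np.random.default_rng(1)
teachers, per_class = 200, 25
true_fx = rng.normal(0, 0.3, teachers)             # the teacher effect we actually want
ses = rng.normal(0, 1, (teachers, per_class))      # stand-in for "all the crap"
score = true_fx[:, None] + 1.5 * ses + rng.normal(0, 0.7, (teachers, per_class))

naive = score.mean(axis=1)                         # raw classroom averages
slope = np.polyfit(ses.ravel(), score.ravel(), 1)[0]
adjusted = (score - slope * ses).mean(axis=1)      # remove the predictable part first

print(np.corrcoef(true_fx, naive)[0, 1])           # noisier estimate of the teacher effect
print(np.corrcoef(true_fx, adjusted)[0, 1])        # cleaner estimate, same data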
posted by Philosopher Dirtbike at 11:11 AM on March 6, 2012


Do the scatterplots on the blog post represent raw data, or the data after it has been adjusted to account for the controlled factors?
posted by jsturgill at 11:13 AM on March 6, 2012


Do the scatterplots on the blog post represent raw data, or the data after it has been adjusted to account for the controlled factors?

They're estimates from a statistical model which attempts to control for various factors.
posted by Philosopher Dirtbike at 11:20 AM on March 6, 2012


So cgk is operating under a misunderstanding about what the scatterplots represent?
posted by jsturgill at 11:23 AM on March 6, 2012


So cgk is operating under a misunderstanding about what the scatterplots represent?

I'm not sure what cgk's understanding is; cgk is correct that a hierarchical linear model was used, but that's not really relevant to whether one can use a scatterplot to examine the estimates. Like I said, I use scatterplots for examining estimates in my own research. I've even published scatterplots of estimates from hierarchical (nonlinear) models in scientific journals. There's nothing wrong with that at all, and it can often be quite informative.
posted by Philosopher Dirtbike at 11:32 AM on March 6, 2012


I'm not sure what cgk meant either, but it read to me at the time like s/he was asserting the scatterplots were a straw man consisting of data that did not actually represent the true model being used, presumably (since s/he bolded it) because the blog post didn't use corrected data.

Since it is corrected data being used in the post, which does represent what the model has to tell us, what's the straw man?
posted by jsturgill at 11:38 AM on March 6, 2012


FWIW: I'm primarily a HS history & language arts teacher, but I've taught middle school and subbed frequently at the elementary level. I absolutely agree with the author's premise regarding the difference (or lack thereof) between teaching math & language arts in elementary, or the differences between different grades within a single subject.

In elementary school, it really is more about teaching methods, classroom management and such than about subject knowledge. In middle school, you do sometimes see sharp differences in behavior, achievement & the rest between grades--your 6th graders are angels, your 7th graders seem to be mostly raving psychos, or vice versa--but in the end, if you're a solid teacher at one grade, you're probably fine with the others.

I'm not well-educated in statistics, but seeing such sharp differences in performance really does make me question the measuring method much more than the teacher.
posted by scaryblackdeath at 11:47 AM on March 6, 2012 [1 favorite]


yarrow: the correlation between disciplines argument is exactly what raised my eyebrow. I have done field work in a system where I was told "most of our students don't have science until middle school." Much depends on teacher content area knowledge. Teaching math and teaching reading are not the same and, controlling for student and classroom level variables, the fact that teachers differ between one and the other is one of the least surprising things anyone could tell me.

Teacher value add is mediated by classroom variables. Think of it this way: I am good at teaching stats and get good reviews; if I taught Mandarin, I would stink, and my reviews and student performance would reflect that. I may still consider myself a good teacher, but my performance will be mediated by the classroom you put me in.

On preview, I am arguing that the first scatterplot is only scandalous if his assumed correlation between subjects is true, and I reject that assumption. That is what the above paragraph is about. I also see that I was talking about things that were off topic from the original link as I veered into other articles on the blog, so you are all correct in pointing out that I have deviated from the topic. So allow me to be more narrow and also point out that he does have some more interesting scatter plots.
posted by cgk at 11:47 AM on March 6, 2012


your 6th graders are angels, your 7th graders seem to be mostly raving psychos, or vice versa

Based on my experience substitute teaching, the 7th graders are much more likely to be the raving psychos :)
posted by Philosopher Dirtbike at 11:59 AM on March 6, 2012


My point is that every metric is flawed, and you have to start somewhere.

That's not really a point; it's a platitude. "Starting somewhere" with a very flawed system of measurement can lead to bad decisions, and then a backlash that ends up discrediting the system; that is obviously worse than doing nothing. I believe that we have to start somewhere useful. I very much agree with measuring teacher performance...well. I don't believe that good teaching is such an elusive quality as to be beyond measurement. But I think that this article clearly demonstrates that this is not occurring under the current regime of testing.
posted by Edgewise at 12:00 PM on March 6, 2012


Think of it this way: I am good at teaching stats and get good reviews; if I taught Mandarin, I would stink, and my reviews and student performance would reflect that. I may still consider myself a good teacher, but my performance will be mediated by the classroom you put me in.

There is so much more to teaching primary and secondary school than the subject. Especially in primary school, for which the plot in question was made, many of the skills that would make one an effective teacher (relating to and communicating with students, controlling the classroom, etc) have nothing to do with the subject taught. And no one is teaching stats or Mandarin in elementary school (ok, there may be some Mandarin in NY, but you get the point). It seems to me reasonable to expect there to be a correlation between math and language arts scores. We can argue about what we feel the STRENGTH of that correlation should be, but it is perfectly reasonable to wonder why we don't see one.
posted by Philosopher Dirtbike at 12:16 PM on March 6, 2012


Yes, and if there really is no correlation then the powers that be in NYC schools should be questioning why they don't send kids to different teachers to learn math v. language arts.
posted by yarrow at 12:20 PM on March 6, 2012


(That particular scatterplot was only elementary school teachers.)
posted by yarrow at 12:21 PM on March 6, 2012


The essential problem with using statistical data to measure human intellectual growth is validity. While the testing corporations that crank out the standardized tests can offer mountains of data to indicate some degree of reliability (the ability to get results that are repeatable), they cannot speak to their inability to demonstrate that their tests actually measure what they claim. If you start with bad data, it doesn't matter how often you get the same answers with it when you manipulate it.

The need by particular political factions for numerical information with which to judge teachers cannot logically surmount the quality of the information. Suppose your brakes are going bad on your car, and you need the brakes to last until payday because otherwise you can't afford to have them fixed. Your need to have the brakes last until payday has nothing to do with the physics of how long the brakes will actually last. Your need simply isn't relevant. Similarly, the need to generate data on how well teachers are doing based upon their students' test scores cannot surmount the lack of relevance this information has if there is no validity to the tests in the first place.

The English tests, for example, often try to quantify that which simply cannot be quantified. If a student taking a writing test is given a multiple choice question that asks "Which of the following sentences would make the best thesis statement for an essay on the topic of X," I would respond that there is no right answer to that question even if the test constructors have a NEED for one answer to be better than the rest. While some examples of thesis statements might be arguably stronger than others, to say that one statement is the best out of four choices is simply to treat epistemological constructs as though they have clear ontological origins. They do not. Even very popular opinions will always remain opinions no matter how badly some political group needs them to be something else. Neither the state nor testing companies have a right to claim (as they do when forcing students to bubble in one of four choices on a Scantron sheet) that there is one correct interpretation of a metaphor or that there is an exact science to the placement of commas.

I stand by my previous assertion. Treating statistical information as though it offers genuine insight when it is based upon magical thinking is a form of paganism. The Constitution guarantees that the government does not have the right to impose a religion upon any of us; forcing public school employees to have faith in numerical data that cannot account for the validity of the information it offers is no different than forcing them to worship idols that have been carved by human hands.
posted by Dr.Rhetoric at 12:33 PM on March 6, 2012 [1 favorite]


JPD: "
Shouldn't it be the opposite? You've reduced the data to "teacher value add" + some idiosyncratic error term. You've basically taken everything out that should make the data "look neater" on a scatter chart. The error term becomes much bigger as a factor in the data, and you've removed a lot of the factors that are correlated with performance on both the english and the math exams?
"

The unexplained error in a model of score as a function of teacher value and household income will always be less than the unexplained error in a model of score against teacher value only. You are not "taking out" the other factors while leaving the error in; you are conditioning on them, which reduces the variance.

I am really disappointed by the OP blog post, and by how many people immediately jump on the outrage bandwagon. Real-life data is messy; I would be much more worried if the scatterplot were a straight line. Also, from his Part 3:

the high correlation in this plot reveals that the primary factor in predicting the scores for a group of students in one year is the scores of those same students in the previous year. If there was a wide variation between teachers’ ability to ‘add value’ this plot would look much more random. This graph proves ...

The only thing the graph "proves" is that the guy has an axe to grind and/or a very poor understanding of the actual models. It's like looking at climate data by scattering one year's average station temperatures against the previous year's and saying "the primary factor in predicting next year's temperature is this year's, and if there was wide variation in human carbon output, which you fools think should cause an increase, this plot would look much more random". Blech
posted by ivancho at 12:40 PM on March 6, 2012


Not sure if this makes me mostly sad or more grrrrrr.

The typical classroom has a couple of Special Ed kids who are being mainstreamed, a couple of kids with behavioral issues, some kids who have a homelife so horrific you wonder how they wake up and get to school in the morning, some kids whose parents are so checked out that they don't even know when reports cards go home, or what school hours are, some kids who have no interest in the class and a few who love the subject. Oh and throw in the FOBs (fresh off the boat) who don't speak the language.

So you have a teacher that's darn good at working with these kids, and the administration knows it, and they care that the kids are with a darn good teacher. But if you throw them all into the same classroom, where they can work at close to the same level and progress comfortably, the poor teacher gets the penalty strike.

So admin throws them into various classrooms, the kids are a level behind and never catch up, the kids working at grade level are neglected, fall behind, get bored and act out, and several teachers get the strike.

What to do, what to do?

It's teh sucks.

Remember gifted and talented? Remember VoTech? When kids were actually assessed for their needs, those programs actually WORKED. You never want to stereotype a kid and limit him/her in what they can do, but so many kids being 'mainstreamed' are left behind. And so many kids that are capable of academic excellence don't get what they need, unless they're in a district with tons of money.
posted by BlueHorse at 12:44 PM on March 6, 2012 [1 favorite]


JPD: "You've basically taken everything out that should make the data "look neater" on a scatter chart.."

Uf. On re-read, you are exactly right and can safely ignore my irrelevant math blather about unexplained error above.
posted by ivancho at 12:54 PM on March 6, 2012


Real life data is messy; I would be much more worried if the scatterplot was a straight line.

Well, yes, so would I, but there's a middle ground between apparently explaining nothing and explaining everything. We're talking about peoples' careers here; if I got scatterplots like these in my own research, I'd be disappointed and probably would have trouble saying I'm measuring much of anything with my value-added score. Maybe I'd be able to turn out a paper saying how interesting it is that the correlation is apparently so small.

But much more rides on this than just a publication, or an academic discussion of modeling; can you really look at those scatterplots of value added scores for the same teacher, teaching different grades, and tell me that the model is doing a good job, and yielding enough information to penalize teachers by?

Based on my experience as a statistical modeler, my answer would be "no". Do you see a problem with the basic logic in part 2? Wouldn't you expect to see a relationship in those scatterplots?
posted by Philosopher Dirtbike at 1:09 PM on March 6, 2012


Philosopher Dirtbike: "can you really look at those scatterplots of value added scores for the same teacher, teaching different grades, and tell me that the model is doing a good job, and yielding enough information to penalize teachers by? "

I don't know enough about the policy implementation to judge whether the model is being used correctly. But I know enough statistics to see that when someone dismisses a graph with "There is almost no correlation." and the actual report turns out to show:
Correlation between math and ELA value added, same teachers:
            2008-09 value added    Multi-year value added
Grade 4     0.52                   0.55
Grade 5     0.39                   0.49
Grade 6     0.41                   0.48
that someone is probably not interested in an objective evaluation of the data.

I have plenty of misgivings about firing people based on z-scores, or grading on the curve, or any other such rat-race unpleasantness. There are many arguments, statistical or otherwise, against basing policies purely on the above model - but in my opinion those scatterplots are not a meaningful analysis, they are just noise.

Just as a test, I ran a basic simulation of 700 teachers, where teacher value-added stddev was 0.5x the noise (pretty generous, I think), generating score increases for two separate grades. My rank scatterplot looks more random than his, my correlation is 0.18, and 10% of teachers have a difference of 60 points or more. So does that mean that value-added is a real signal? No, just that plots without context and correlation numbers without proper statistical analysis mean very little.
posted by ivancho at 3:53 PM on March 6, 2012 [1 favorite]
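
A minimal sketch in Python of the kind of simulation ivancho describes, for anyone who wants to poke at it themselves. The exact specification of his simulation isn't given, so the classroom-level noise term, the seed, and the conversion to percentile ranks are assumptions; the 700 teachers and the 0.5x ratio come from the comment.

    import numpy as np
    from scipy.stats import rankdata, pearsonr

    rng = np.random.default_rng(0)

    n_teachers = 700   # from the comment
    teacher_sd = 0.5   # teacher effect stddev, 0.5x the noise, as described
    noise_sd = 1.0     # classroom-level noise on each grade's estimate (assumed)

    # One "true" value-added effect per teacher, applied in two separate grades.
    true_effect = rng.normal(0, teacher_sd, n_teachers)

    # Each grade's estimate = true effect + independent classroom-level noise.
    grade_a = true_effect + rng.normal(0, noise_sd, n_teachers)
    grade_b = true_effect + rng.normal(0, noise_sd, n_teachers)

    # Convert to percentile-style ranks, as in the published reports.
    rank_a = 100 * rankdata(grade_a) / n_teachers
    rank_b = 100 * rankdata(grade_b) / n_teachers

    r, _ = pearsonr(rank_a, rank_b)
    gap = np.mean(np.abs(rank_a - rank_b) >= 60)
    print(f"rank correlation: {r:.2f}")            # roughly 0.2 for this setup
    print(f"share with rank gap >= 60: {gap:.1%}")

With these assumed numbers, the expected correlation between the two grades' estimates is teacher_var / (teacher_var + noise_var) = 0.25 / 1.25 = 0.2, which is in the same ballpark as the 0.18 reported in the comment.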


How about we use the "did you get emails from graduated students thanking you for preparing them for a difficult college class?" metric, because frankly that's the only one I give a damn about, or measure myself by. Of course, that one's a bit hard to put a number on and let you use people as political punching bags, so I guess it won't catch on.
posted by Dr.Enormous at 5:53 PM on March 6, 2012


That 4th grade v. 5th grade ELA scatter plot is interesting--it's not even linear. I'm only just starting to cut my teeth on quant research, but even I know you can't just run a linear model through data that shows some kind of curvilinear relationship without some sort of transformation.

I'm a total data analysis n00b, so I won't be offended by corrections (in fact, please do! I need to learn this stuff).

Dr. Rhetoric's point about the validity of the test measures is a good one, in my opinion. Most educators I know think that portfolios of student work do a much better job of demonstrating and documenting student mastery in all subjects. But finding statewide ways to "measure" these would be cost prohibitive. Sadly these policy makers, politicians, and would-be reformers don't understand that quant can't do everything--it's not supposed to; you need qualitative research as well. Quant can point you to typical & atypical points, but you have to actually engage with people to find out why their results are such as they are, and that's where the real insight comes in. I can't find who originally said it, but I hear it all the time in class: "There is no such thing as a good [mathematical] model, but some are useful."

There are many education scholars who used to espouse the current testing model, but now see that it's counterproductive and inaccurate; Diane Ravitch, one of the initial supporters of NCLB, is probably the most vocal of these. Sadly, policy makers are too busy dismantling the public school system to pay attention to the actual frigging research.
posted by smirkette at 7:42 PM on March 6, 2012
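
To make the curvilinearity point concrete, here's a toy sketch in Python with synthetic data (not the NYC scores): a straight-line correlation can come out near zero even when a simple transformation, here adding a squared term, explains most of the variation.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data with a curvilinear (quadratic) relationship plus noise.
    x = rng.uniform(-2, 2, 500)
    y = x**2 + rng.normal(0, 0.5, 500)

    # A straight-line fit sees almost nothing...
    r_linear = np.corrcoef(x, y)[0, 1]

    # ...but a quadratic fit (one simple transformation) explains most of it.
    coeffs = np.polyfit(x, y, deg=2)
    resid = y - np.polyval(coeffs, x)
    r2_quadratic = 1 - np.var(resid) / np.var(y)

    print(f"linear correlation:      {r_linear:+.2f}")    # near zero
    print(f"R^2 with quadratic term: {r2_quadratic:.2f}")  # large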


teacher value-added stddev was 0.5x the noise

Out of academic interest, is .5σ considered demonstrative?

In my world, anything less than 3σ is inconclusive; indeed, some regulated or standard methods require 10σ to measure an effect quantitatively.
posted by bonehead at 8:56 PM on March 6, 2012


Check out Enrico Fermi's and General Groves's exchange in the "theory of citing". Really, great teachers, like great generals, are lucky, not gifted.
posted by SPrintF at 9:15 PM on March 6, 2012


In my world, anything less than 3σ is inconclusive, indeed some regulated or standard methods require 10σ to measure an effect quatitatively.

Not the same σ; you're talking about standard error of an estimate, ivancho's talking about standard deviation relative to a noise term.

So does that mean that value-added is a real signal? No, just that plots without context and correlation numbers without proper statistical analysis mean very little.

What you demonstrated is that when there's a lot of noise relative to signal, it can be hard to discern your relationship. I wouldn't argue that it isn't there - but I would argue that we need to base pay on a technique that separates teacher quality from noise, because we shouldn't be making teacher pay contingent on mostly noise. A scatterplot CAN tell you quite a bit, and it is an indispensable tool. It can be supplemented by a correlation coefficient, but actually a scatterplot tells you much more than a single number. With respect to "proper statistical analysis", I don't know what you mean. Scatterplots are part of every "proper" statistical analysis I do, and the value-added scores come from a "proper" statistical model.

I don't know enough about the policy implementation to judge whether the model is being used correctly. But I know enough statistics to see that when someone dismisses a graph with "There is almost no correlation." and the actual report turns out to show:

You know that we're talking about different years right? The scatterplot in question is from 2010, those numbers from the report are older. There's no necessary conflict between the two, besides the fact that previous good correlations would cause us to expect larger correlations in 2010; but if we don't see them, there's possibly something wrong. Maybe his analysis is wrong. But if it is right...

I suppose it's time to just look at the data myself.
posted by Philosopher Dirtbike at 1:36 AM on March 7, 2012


There's a long article in the Washington Post today about a teacher who was fired based on value added modeling.
posted by OmieWise at 5:14 AM on March 7, 2012


ivancho's talking about standard deviation relative to a noise term

That's exactly what I'm talking about too---signal response magnitude over baseline variation.
posted by bonehead at 6:24 AM on March 7, 2012


That's exactly what I'm talking about too---signal response magnitude over baseline variation.

You're talking about variation of a statistic relative to its theoretical behavior which is not the same as variation of a quantity relative to another quantity. I assure you that tiny tiny variations in phenomena which are precisely estimated are regarded as decisive in your field.
posted by a robot made out of meat at 8:10 AM on March 7, 2012
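
A quick sketch in Python of the distinction being drawn here; the numbers are illustrative, not from the teacher data. An effect that is tiny relative to the noise (0.1σ in this toy example) can still end up many standard errors from zero once enough observations are pooled, because the standard error of the estimate shrinks with sample size while the signal-to-noise ratio does not.

    import numpy as np

    rng = np.random.default_rng(2)

    effect = 0.1     # true effect: only 0.1x the noise stddev
    noise_sd = 1.0

    for n in (25, 2_500, 250_000):
        sample = rng.normal(effect, noise_sd, n)
        se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
        z = sample.mean() / se                 # effect in standard-error units
        print(f"n={n:>7}: {effect/noise_sd:.1f} noise-sd effect, "
              f"{z:+.1f} standard errors from zero")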


You can't separate quantities if the natural variability of the overall system is greater than the variance between quantities. Deconvolution-type models only get you so far.

I wouldn't argue that it isn't there - but I would argue that we need to base pay on a technique that separates teacher quality from noise, because we shouldn't be making teacher pay contingent on mostly noise.

This is a problem I wrestle with all the time, over-interpreting signal-to-noise. Standard reporting uncertainties vs population variability is a classic model problem in any quantitative assessment. A difference of .5σ gets into the realm where model confidences are quite low. I would not go to court with such a result, for example.

If people are going to have pay, advancement and even tenure decisions made from shaky statistics, I think it's fair to ask if the school boards are putting far too much weight on a model that may indeed show indications of teacher quality. However, I'd want a lot better quality of information to make assessments that have such high human costs. What's the acceptable p value to fire someone?
posted by bonehead at 8:56 AM on March 7, 2012


You can't separate quantities if the natural variability of the overall system is greater than the variance between quantities. Deconvolution-type models only get you so far.

The noise terms live at the per-test and per-kid level. There are repeat observations varying one factor but not the others, ie multiple kids taking multiple tests. The inference on a teacher could be arbitrarily precise with enough otherwise uncorrelated kids taking tests with uncorrelated measurement error. Quoting a single quantity as the "noise" for all targets of inference is meaningless.
posted by a robot made out of meat at 9:32 AM on March 7, 2012


The inference on a teacher could be arbitrarily precise with enough otherwise uncorrelated kids taking tests with uncorrelated measurement error.

The problem is that sample sizes are limited to 20 or 30 kids per year. Most teachers only teach a single class. Firing decisions are being made on relatively small sample sizes. Is that enough to give sufficient confidence in the model result? I'm really not convinced that this is the case.

Further, it's easy to imagine problems in individual classes. Samples are correlated: schools have geographic catchment areas at minimum. There are going to be a lot of confounding factors.

Quoting a single quantity as the "noise" for all targets of inference is meaningless.

You're absolutely right, of course. I was simplifying too much. Of course, this just makes the problem of teasing out meaningful signal that much more difficult.
posted by bonehead at 10:00 AM on March 7, 2012
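
As a rough back-of-the-envelope version of both points, here is a sketch in Python of how the standard error of a single teacher's estimated effect scales with the number of uncorrelated test-takers. The student-level and teacher-level standard deviations are assumed for illustration, and this ignores the correlated-error and catchment-area worries raised above.

    import numpy as np

    student_sd = 1.0   # per-student noise, in test-score SD units (assumed)
    teacher_sd = 0.5   # assumed spread of true teacher effects, for scale

    # Standard error of a class-mean estimate vs. number of test-takers.
    for n_kids in (25, 50, 100, 500):
        se = student_sd / np.sqrt(n_kids)
        print(f"{n_kids:>4} kids: se = {se:.2f} "
              f"({se / teacher_sd:.1f}x the assumed teacher-effect spread)")

Under these assumptions, a single class of 25 gives a standard error that is still 40% of the assumed spread in teacher quality, which is roughly why single-year estimates jump around so much; pooling several years of classes is what buys the precision.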


Philosopher Dirtbike: "You know that we're talking about different years right?"

I missed that, sorry. FWIW, the 09-10 data has a correlation of 0.5044 for same-teacher math/ELA, so his "There is almost no correlation." is still utterly ridiculous. Even if that one number was lower, it would only mean that we should focus on more stable multi-year estimates.

Yes, of course scatterplots are indispensable for proper analysis - but you can't look at them without context. Sure, it looks fuzzy - but how fuzzy is too fuzzy? What if you are combining it with 4 other signals? What correlation should you expect if your model was valid and what if not? All of these are questions he never even touched upon; it was all just 'plot fuzzy, formula bad, go viral' grar.

From OmieWise's helpful article, we learn that value-added is somewhere between 20%(NY) and 50%(DC) of a teacher's evaluation, the remainder being classroom observations and other more subjective measures. And apparently the particular teacher's situation of really good subjective evaluation and really bad value-added score is an outlier, potentially caused by inflated previous year scores. So perhaps we can ease up on the 'careers are ruined by a cold statistical black box with bad correlations' outrage. Yes, 50% is probably too much weight - this seems like a more valid conversation than whether we should just drop all noisy metrics.


Frankly, I am not even trying to support the current system, or the value-added stats model - as others noted, until the political and societal factors are improved, education will still be a mess, measured or not. All I wanted to say is that the OP blog post seems like a bad source and a misuse of statistics and perhaps we should get more info before raging over a couple of shoddy graphs.
posted by ivancho at 10:13 AM on March 7, 2012 [1 favorite]
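
For anyone who wants to check the 0.5044 figure against the released data, a sketch of the calculation in Python; the file and column names below are placeholders, not the actual ones in the data dump, and would need to be adjusted.

    import pandas as pd

    # Placeholder file/column names -- adjust to match the actual released reports,
    # which have one row per teacher/subject with a value-added percentile.
    math = pd.read_csv("math_reports_0910.csv")
    ela = pd.read_csv("ela_reports_0910.csv")

    merged = math.merge(ela, on="teacher_id", suffixes=("_math", "_ela"))
    r = merged["value_added_pctile_math"].corr(merged["value_added_pctile_ela"])
    print(f"same-teacher math/ELA correlation: {r:.4f}")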


The problem is that sample sizes are limited to 20 or 30 kids per year. Most teachers only teach a single class. Firing decisions are being made on relatively small sample sizes. Is that enough to give sufficient confidence in the model result? I'm really not convinced that this is the case.

Sure, and that's why these estimates come with huge estimated standard errors and why people are checking those versus empirical discrepancies (year-to-year variability, class-to-class variability).
posted by a robot made out of meat at 12:00 PM on March 7, 2012


They don't report SEs, but the raw data dump does have upper and lower confidence bounds for the value added percentiles.

And they're just terrible. The average interval is over 50 points wide. For a percentile.

The only reasonable CIs were for scores near the extremes. For teachers between the 20th and 80th percentiles, the confidence intervals averaged 65 points wide, and 32 percent of scores had a lower bound under 25 but an upper bound over 75.

These don't get a lot better with their multi-year measure. Only 9% of scores had confidence bounds outside 25 and 75 points, but the average CI for a percentile score between 20 and 80 is still 53 points wide.

So for most teachers, the best the model can do is throw up its little digital hands in frustration and say "Somewhere between shitty and great, exclusive."
posted by ROU_Xenophobe at 1:57 PM on March 7, 2012
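
A sketch in Python of the kind of tallying described above, for anyone who wants to reproduce it from the raw data dump; the column names are placeholders and the thresholds are taken from the comment.

    import pandas as pd

    # Placeholder column names -- the released single-year reports publish a
    # percentile score plus lower and upper confidence bounds for each teacher.
    df = pd.read_csv("teacher_value_added_0910.csv")

    df["ci_width"] = df["pctile_upper"] - df["pctile_lower"]
    middle = df[(df["pctile"] >= 20) & (df["pctile"] <= 80)]
    spans_middle = (df["pctile_lower"] < 25) & (df["pctile_upper"] > 75)

    print("average CI width, all teachers:     ", round(df["ci_width"].mean(), 1))
    print("average CI width, 20th-80th pctile: ", round(middle["ci_width"].mean(), 1))
    print("share spanning under-25 to over-75: ", f"{spans_middle.mean():.0%}")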


There's stuff that people do well, but algorithms don't. Sometimes it's worth doing the hard work of coming up with an algorithm that can replicate human performance — developing decent computer vision so you can have a robot that will detect defective products 24 hours a day, 7 days a week is a win. But I'm not sure evaluating teacher performance is one of these tasks.

I can only say this because, unlike many areas where I'm happy for government to create large organizations to get things done, I've thought that education is something that works better locally. This may just be due to my experience attending schools that were a bit small minded but very well funded.
posted by benito.strauss at 2:45 PM on March 7, 2012

