More evidence that student evaluations of teaching evaluate gender bias
January 11, 2016 6:50 AM

Inside Higher Ed: There’s mounting evidence suggesting that student evaluations of teaching are unreliable. But are these evaluations, commonly referred to as SET, so bad that they’re actually better at gauging students’ gender bias and grade expectations than they are at measuring teaching effectiveness? A new paper argues that’s the case, and that evaluations are biased against female instructors in particular in so many ways that adjusting them for that bias is impossible.

The article discusses two recent studies.

Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness
Abstract: Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show:
  • SET are biased against female instructors by an amount that is large and statistically significant
  • the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded
  • the bias varies by discipline and by student gender, among other things
  • it is not possible to adjust for the bias, because it depends on so many factors
  • SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness
  • gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.
These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.
and

What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching
Abstract: Student ratings of teaching play a significant role in career outcomes for higher education instructors. Although instructor gender has been shown to play an important role in influencing student ratings, the extent and nature of that role remains contested. While difficult to separate gender from teaching practices in person, it is possible to disguise an instructor’s gender identity online. In our experiment, assistant instructors in an online class each operated under two different gender identities. Students rated the male identity significantly higher than the female identity, regardless of the instructor’s actual gender, demonstrating gender bias. Given the vital role that student ratings play in academic career trajectories, this finding warrants considerable attention.
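For readers unfamiliar with the method the first abstract names, a nonparametric permutation test can be sketched in a few lines. This is a toy illustration with invented ratings, not the papers' data or code:

```python
import random

# Hypothetical 1-5 SET scores for instructors perceived as male vs. female.
perceived_male = [4, 5, 4, 4, 5, 3, 4, 5]
perceived_female = [3, 4, 3, 4, 3, 4, 3, 3]

def mean(xs):
    return sum(xs) / len(xs)

# Observed gap in mean ratings between the two groups.
observed = mean(perceived_male) - mean(perceived_female)

# Permutation test: if perceived gender were irrelevant, shuffling the
# group labels should produce gaps this large reasonably often.
pooled = perceived_male + perceived_female
n = len(perceived_male)
random.seed(0)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:n]) - mean(pooled[n:])
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed gap: {observed:.3f}, permutation p-value: {p_value:.3f}")
```

The appeal of this kind of test, and presumably why the authors chose it, is that it makes no assumption that ratings are normally distributed; a small p-value simply says random relabeling rarely reproduces a gap as large as the one observed.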
previously discussed on Inside Higher Ed: Students Praise Male Professors
posted by leahwrenn (44 comments total) 40 users marked this as a favorite
 
If you don't have access to the second article and want to read it, if you MeMail me, I...know someone...who can hook you up.
posted by leahwrenn at 6:52 AM on January 11, 2016 [4 favorites]


this is interesting. my partner teaches students, and was really disappointed with her evaluation for a particular course. but she then did improve the score (hugely) with the help of some uni support dept. that argues that they do measure something, no?
posted by andrewcooke at 7:07 AM on January 11, 2016


this is interesting. my partner teaches students, and was really disappointed with her evaluation for a particular course. but she then did improve the score (hugely) with the help of some uni support dept. that argues that they do measure something, no?

The problem comes from comparing evaluations between instructors, rather than comparing two different courses from the same professor -- the comparisons between professors are used for retention and promotion decisions, so if the evaluations are as biased as these and other studies show, then the comparisons are going to be equally flawed.
posted by Dip Flash at 7:13 AM on January 11, 2016 [20 favorites]


yeah, sorry, i misread the summary - it's possible (of course) for them to measure quality to some extent and still be biased (we were so happy with the improved score....)
posted by andrewcooke at 7:16 AM on January 11, 2016


Well, my wife is always much, much, much better evaluated than me, so this just proves I suck a little bit more than I already thought I did. Or that she rocks more.
posted by signal at 7:16 AM on January 11, 2016 [5 favorites]


it may also depend on culture. there's something weird about latin "macho" culture and women in science academia. i am no expert, so don't want to say anything dumb, but it seems that things are different to the usa/uk in some ways.
posted by andrewcooke at 7:19 AM on January 11, 2016


One immediate solution might be a rule that women's teaching evaluations may only be compared to other women.

But there are also other biases to be addressed. Someone teaching a required course can be rated much lower than when they teach an elective, even though their teaching technique is similar or the same. Thus a non-tenured instructor assigned to teach large intro courses could be adversely affected compared to another who taught primarily electives.
posted by jb at 7:19 AM on January 11, 2016 [7 favorites]


I've actually seen my own teaching evaluations jump wildly based on how I graded. Higher average = better evaluations, including in the comments. I didn't change, my students' grades did.
posted by jb at 7:21 AM on January 11, 2016 [3 favorites]


Women's teaching could only be compared to other women? Hm. I bet that Black professors get rated worse than white ones; how does that factor in? (That is, where we find this kind of gender bias we usually also find racial bias; it would surprise me if Black faculty were treated equally with white faculty of their gender while men and women generally were not.) Visibly queer professors? Visibly butch women and visibly femme-y men? I bet they all get rated worse than straight, gender-normative people.

Are we only to compare butch women professors to butch women professors? Native cis male professors to Native cis male professors? It seems like this whole thing suggests that student evaluations in their current form aren't gonna do it.
posted by Frowner at 7:25 AM on January 11, 2016 [16 favorites]


andrewcooke: "there's something weird about latin "macho" culture and women in science academia. "

Sort of off-topic but related: when I started teaching a design studio, as a newcomer, not a designer and never having taught a studio before, I noticed that the (mostly female) students would come to me with questions and requests for instructions more than they did to the female professors, even though they (the female professors) were designers and had been teaching the studio for years. It might be in part because I'm older than my co-professors, but I felt a definite whiff of patriarchy in the air.
posted by signal at 7:40 AM on January 11, 2016 [3 favorites]


Curious:
In the French data, male students tended to rate male instructors higher than they rated female instructors, but little difference was observed among female students. In the U.S. data, female students tended to rate perceived male instructors higher than they rated perceived female instructors, with little difference in ratings by male students. In both cases, however, the bias still positively impacted male instructors and disadvantaged female ones.
posted by modernnomad at 7:40 AM on January 11, 2016 [5 favorites]


Someone teaching a required course can be rated much lower than when they teach an elective, though their teaching technique is similar/the same. Thus a non-tenured instructor assigned to teach large intro courses could be adversely affected compared to another who taught primarily electives.

I know my university at least attempts to counteract this by asking students if they're taking the course to fulfill a requirement for their major, as an elective, or to fulfill a general education requirement. The published evaluations then have a "comparison group" which is based on class size and the primary reason(s) students are taking the class.
posted by damayanti at 7:56 AM on January 11, 2016 [3 favorites]


it may also depend on culture.

There are two data sets: a large French one, covering essentially all first-year social science evaluations at a university over a 5-year period, and a smaller US one, for a single online course. While the French results have considerably larger sample sizes and better randomization, the results from both are broadly similar. As the authors note: "There is no evidence that this is the exception rather than the rule".

You can argue that more work needs to be done, sure, but there's some cross-cultural validation here already.
posted by bonehead at 8:00 AM on January 11, 2016 [3 favorites]


One immediate solution might be a rule that women's teaching evaluations may only be compared to other women.

I bet even within women's evaluations, there's variation for gendered factors like age and looks that don't exist as strongly among male professors.

I'm in a professional program right now, and I have a pretty good idea of how seriously my fellow students take their teaching evaluations, so I think teaching evaluations should, on the whole, not be used for any kind of comparisons at all. They should be used entirely to let individual professors track their own improvement.
posted by jacquilynne at 8:15 AM on January 11, 2016 [13 favorites]


that argues that they do measure something, no?

Oh, it measures something. Just not what it's supposed to.

Students are pretty terrible at rating educational quality. It's like asking a child to tell you how good a meal is. In the end you will serve them pancakes smothered in chocolate and covered with Oreos in your quest for a high evaluation, and they will tell you that you are the best parent in the universe. Later, when the stomach ache hits or the diabetes kicks in, they may have a different view.

I like the freakonomics guy's view. He said he didn't want his children to love him while they were children. He wanted them to love him when they were adults.
posted by srboisvert at 8:26 AM on January 11, 2016 [14 favorites]


I haven't yet read this newest study, but it's contrary to the majority of (well-done) research in this area, which (surprisingly) has found little or no relationship between gender and student ratings of teaching*. I don't yet know if this is a new finding that should be relied upon because this study is methodologically superior to older studies, but I'm very wary of much of the research in this area because much of it is quite bad. I'm a higher ed scholar and I really try not to be a research snob about work that touches on my areas of expertise, but many faculty members reach too far when they get outside of their discipline and experience, especially when the topic naturally engenders controversy, strong emotions, and media attention (which is completely understandable given the completely inappropriate high-stakes (mis)uses to which many departments and institutions subject these ratings!).

Incidentally, many of the issues already discussed illustrate why the scholars in this area refer to these as "ratings" of instruction and not "evaluations" of instruction; novices are simply not qualified to perform evaluations but they can certainly provide valuable input!

* Hativa's 2013 book "Student Ratings of Instruction: Recognizing Effective Teaching" is my current go-to resource on this topic and I'm looking specifically at pp. 81-82.
posted by ElKevbo at 8:38 AM on January 11, 2016 [7 favorites]


I like the freakonomics guy's view. He said he didn't want his children to love him while they were children. He wanted them to love him when they were adults.

The article isn't about children, it's about university students. They may be childish adults while they're under age 30, but still not children.

My view is that feedback should be non-numerical.
posted by mattamatic at 8:45 AM on January 11, 2016 [4 favorites]


Oh it measures something. Just not what it supposed to.

That's as far as possible from works like Herndon's The Way It Spozed to Be...
It's a couple of studies among many that seek to describe, not prescribe, nor require conclusion by a simile and hyperbole that's a breath away from Spare the rod, etc. And the love of children framed by zero-sum or exclusivity?...ugh!
posted by lazycomputerkids at 8:53 AM on January 11, 2016


I bet even within women's evaluations, there's variation for gendered factors like age

I'm convinced that age plays a significant role. I distinctly remember my cohort of female undergrads having much more fraught working relationships with our older female instructors than with our younger ones. Of course, in hindsight I can understand where they were coming from a lot better than I did as a punk kid - they'd all gone through a lot to get where they were! But some of them were tough on us constructively, while others were like the cliché of the parent who beats their kids because they got beaten themselves. I remember coming in the first day of a class taught by the program chair. She opened with, "It's an insult to me at my age and experience to be forced to teach undergrads. I'm making it my personal mission to make sure half of you not only fail this class, but wash out of the program." I would have been just as horrified if a male professor had done it.

On a lighter note, my inner ten-year-old wants to believe that Stark and Boring brought Ottoboni on board to make their byline sound less grim.
posted by The Underpants Monster at 8:59 AM on January 11, 2016 [9 favorites]


When I started as a faculty member, the goal for tenure was supposed to be to demonstrate improvement in teaching, rather than attain an absolute level of teaching quality. That, at least, I think teaching evaluations could be useful for (now it's the attainment of 'excellence', which is harder to demonstrate).

Perhaps the University was on the right track before. . .

Our department does peer evaluations in addition to student evaluations (student evaluations are often not that great in physics; when someone is beating the school of arts and sciences averages that's seen as a sign of excellent teaching). I have no doubt that peer evaluations can be affected by unconscious bias, too, though.

In my own experience, grades definitely have an effect on evaluations -- one interesting thing I've found is that, in teaching a large course multiple times where the fractions of A/B/C/D stayed largely constant, my evaluations were much, much better when A grades started around 90% than when they started closer to 80%. This might be part of why people often get poor evaluations in physics (where numerical grades tend to be lower than in many classes).
posted by janewman at 9:00 AM on January 11, 2016 [1 favorite]


I'm convinced that age plays a significant role.

It's worth cautioning then that all of the data sets in this paper are for TAs/junior lecturers. I'd bet almost none of these evals were for folks over 35 in the datasets as they're described. However, the results are about male and female instructors of the same age group.
posted by bonehead at 9:15 AM on January 11, 2016 [1 favorite]


I may have shared this story before. My favorite (mis)use of student evaluations was at my old job, where everyone's scores were compared to the corresponding question's "college norm" (yep, the average of everyone's score, across the entire small liberal arts college), and if your score was on the bad side of the college norm, that was viewed by the administration as problematic for your progress towards tenure and promotion---even if it was only on the bad side by a few hundredths, for scores that were assigned as discrete values between 1 and 5 inclusive (and in practice, were almost always between 1 and 3, with 1 being "excellent").

Just think about that for a minute...

Nowadays, when I'm handing out student evaluations, I tell students that the administration interprets the numbers as grades that the students are giving me: 5 = A, 4 = B, etc. This seems to correspond to better/more meaningful evaluations. I know that when I have to fill out evaluations myself as a student, I'm very resistant to scoring the instructor as "excellent", but I'm happy to say they did an "A" job at teaching the course.
posted by leahwrenn at 9:20 AM on January 11, 2016 [4 favorites]
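To put numbers on why "on the bad side by a few hundredths" is absurd as a tenure criterion, here is a quick sketch with invented 1-5 ratings (not leahwrenn's actual data), comparing such a gap to the ordinary sampling noise in a single section's mean:

```python
import math
import statistics

# Hypothetical ratings for one section on a 1-5 scale, where 1 = "excellent"
# (as in the comment above, scores cluster between 1 and 3).
ratings = [1, 1, 2, 1, 3, 2, 1, 2, 1, 1, 2, 3, 1, 1, 2, 2, 1, 1, 2, 1]

m = statistics.mean(ratings)
# Standard error of the mean: the typical random wobble you'd expect
# in a section average of this size.
se = statistics.stdev(ratings) / math.sqrt(len(ratings))
print(f"section mean: {m:.2f}, standard error: {se:.2f}")
```

With twenty discrete ratings the standard error comes out around 0.15, so a gap of a few hundredths between a section mean and a college-wide norm is well inside the noise and says essentially nothing about the instructor.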


modernnomad:
"In the U.S. data, female students tended to rate perceived male instructors higher than they rated perceived female instructors, with little difference in ratings by male students."
My wife has mentioned conversations with her fellow female teachers about how the girls in their school tend to be more deferential to male teachers while the boys didn't seem to show any bias. This seems to somewhat support that observation.
posted by charred husk at 9:27 AM on January 11, 2016 [1 favorite]


Evals are nothing more than a mindless feedback system:
Easy Grading + Tells jokes in lectures = GOOD EVALS
Hard Grading + Less jokes in lectures = SHITTY EVALS

An Institution that heavily weighs evals in hiring = MOAR GRADE INFLATIONS PLEASE
posted by ovvl at 9:31 AM on January 11, 2016 [4 favorites]


I hate to burst bubbles but again turning to the Hativa book (that I often just leave sitting on my desk for quick reference) the majority of research has not found a relationship between either (a) expected grades and student ratings or (b) actual grades and student ratings. In other words, easy teachers are not the ones who get the best ratings. In fact, the relationship that appears to be most interesting is the one between effort and ratings with courses requiring more effort earning higher ratings (Hativa, 2008, pp. 53-61).
posted by ElKevbo at 9:53 AM on January 11, 2016 [12 favorites]


leahwrenn, thank you. That's a great heuristic. I'll use it and pass it on.
posted by Dashy at 10:05 AM on January 11, 2016


ElKevbo, I'm very much appreciating your comments, thank you.
posted by alasdair at 10:21 AM on January 11, 2016 [1 favorite]


in some cases (physics 101, for example, at u chile) the same course (ie the same physics) is taught to multiple streams by different profs (each with their own presentations). the students all take the same exam (written by a committee including the profs lecturing). this gives a separate way to assess the relative strengths of the lecturers (by how well their students scored). seems that would be an interesting way to cross-check these results.
posted by andrewcooke at 10:23 AM on January 11, 2016 [1 favorite]


I got AWESOME student evaluations from my one semester of teaching, so this makes me sad (I am a man). Then again, much of the teaching was absolutely rubbish...
posted by alasdair at 10:24 AM on January 11, 2016


I'd be interested in seeing how this kind of thing plays out at smaller institutions. One major difference between a lot of the findings at larger places vs. small liberal arts colleges, is that we have a lot of "repeat business" in our classes. Anecdotally, I know quite a few of the harder professors do have higher ratings, even though their students aren't all making A's in those classes. But I think different things might matter more for residential vs. commuter students, and if you're going to see those profs again. I don't have a prediction as to how this would work with gender bias, but it might play out differently there too. As a faculty member at a women's college, there's a whole different situation there.
posted by bizzyb at 10:36 AM on January 11, 2016


I hate to burst bubbles but again turning to the Hativa book (that I often just leave sitting on my desk for quick reference) the majority of research has not found a relationship between either (a) expected grades and student ratings or (b) actual grades and student ratings. In other words, easy teachers are not the ones who get the best ratings. In fact, the relationship that appears to be most interesting is the one between effort and ratings with courses requiring more effort earning higher ratings (Hativa, 2008, pp. 53-61).

A friend of mine is a lecturer and a social psychologist with a speciality in bias. He has told me that the research didn't find that grades affected evaluations at first - but later found grades could affect the evaluation of female instructors.

Another friend has noted that she consistently receives lower teaching scores in her required course (itself on a challenging subject) than in her elective course.
posted by jb at 10:37 AM on January 11, 2016


Scott Carrell and Jim West's Journal of Political Economy article should be included. (here)
posted by scunning at 10:40 AM on January 11, 2016


ElKevbo: Any chance you could link to published journal articles, which more of us have access to than a specific book?
posted by hydropsyche at 11:19 AM on January 11, 2016 [2 favorites]


From the description of the book on Amazon, it appears to be primarily an argument for the effectiveness of teaching evaluations. The author appears to be an expert in her field (relevant PhD with an academic position), but I'm skeptical due to how one-sided the summary is, given that there are also well-designed and contradictory studies such as this one.

And TBH, given how pervasive gender bias is in many different types of evaluations, I would be surprised if it didn't affect student evaluations. If student evaluations aren't actually affected, I want to know why: what is special about them?
posted by Kutsuwamushi at 11:36 AM on January 11, 2016 [1 favorite]


hydropsyche: Unfortunately I don't have a lot of available time today to comment on Metafilter, at least not right now while I'm at work, so I don't have time to hunt down many of the specific studies. But a good literature review of this general topic can be found in IDEA Paper #50 Student Ratings of Teaching: A Summary of Research and Literature by Stephen Benton and Bill Cashin. Cashin, in particular, has been involved in this research for decades and is probably the foremost expert on this topic.

A few papers on the topic discussed above about the relationship between student ratings and grades/workload:

Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44(5), 495-518.

Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students' evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of Educational Psychology, 92(1), 202.

Remedios, R., & Lieberman, D. A. (2008). I liked your course because you taught me well: The influence of grades, workload, expectations and goals on students' evaluations of teaching. British Educational Research Journal, 34(1), 91-115.

And a few papers on the relationship between gender and student ratings:

Feldman, K. A. (1993). College students' views of male and female college teachers: Part II: Evidence from students' evaluations of their classroom teachers. Research in Higher Education, 34 (2), 151-211.

Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 2001: 45–56. doi: 10.1002/ir.3
posted by ElKevbo at 12:14 PM on January 11, 2016 [10 favorites]


I am really nervous about getting my evaluations back from last semester. I have always got really great numbers on my evaluations in the past (although occasional stupid comments about my hair, being a bitch, or how "hot" I was). But last semester for the first time I co-taught with someone else who dropped the ball a lot. I spent the semester putting out fires that resulted from things he forgot to do or couldn't be bothered with. I couldn't always help the students since the class was more in the other guy's field than mine, and so I'd refer queries to him and he wouldn't always reply. I think objectively I was more responsive to the students and did much more work for the course, and I was the one who created and provided pretty much all the "scaffolding" material for the students who needed more help. But it also means I was more visible, and, being female, while my colleague was male, I suspect I might end up taking the fall for the problems when it comes to the evaluations. We will see.
posted by lollusc at 2:59 PM on January 11, 2016 [2 favorites]


another data point (sorry to post again, but this is kind-of a correction to earlier). i just got back from visiting paulina (who is in hospital tonight (nothing serious)) where we discussed this. it turns out that when she was first graded she went and cross-checked against grades given to others who had given the course for the first time. and surprise saw exactly this - that women had been graded lower than men. she asked on the local women prof's email list, but no-one else had heard of such a thing. so i guess chile isn't an exception...
posted by andrewcooke at 3:29 PM on January 11, 2016


I'd really like to know from which U.S. university they pulled half of their data. Does anyone know? Higher education is heavily influenced by the surrounding culture, so this may have a huge impact on the data.
posted by Muncle at 3:43 PM on January 11, 2016


I'd really like to know from which U.S. university they pulled half of their data. Does anyone know? Higher education is heavily influenced by the surrounding culture, so this may have a huge impact on the data.

I don't think this is necessarily what you mean, but mentions of "surrounding culture" kind of feels like a dog whistle for "it's the fault of people of color or poor people" because, inevitably, it seems the ideal higher education setting is full of rich white people. How many people are really going to say with a straight face that student evaluations aren't biased against women, regardless of the university?
posted by hoyland at 3:56 PM on January 11, 2016 [1 favorite]


Well, there are going to be differences in teaching expectations and class styles if it was an Ivy vs. a large Agricultural campus vs. a regional branch campus vs. a small liberal arts school vs. a community college. It's a valid question and I don't think it's a dog whistle. It's certainly true that professors have different approaches and priorities when it comes to teaching at different kinds of institutions; it'd be interesting and instructive to know where these data are coming from.
posted by ChuraChura at 4:10 PM on January 11, 2016 [1 favorite]


Hoyland, I am sincerely NOT going down the road of blaming people of color. You are making a pretty strong assumption about my intent. Let me clarify: I'm referring to the school's surrounding culture and the focus of study at that particular university.
There are big differences between Montana State University for agriculture, Yale or Harvard for law, Massachusetts Institute of Technology for biology, School of the Art Institute of Chicago for fine arts, or UC Berkeley for sociology.
I'd just like to see more data to determine whether there are any big variations across academic disciplines for a certain bias.
posted by Muncle at 4:38 PM on January 11, 2016 [1 favorite]


Somehow I'm not surprised to hear that hotness and easy grades are pretty much what give you good evals.

This reminds me of my ex's brief stint as a TA--he complained that "he's cute!" was written on his evals.
posted by jenfullmoon at 5:55 PM on January 11, 2016


Nowadays, when I'm handing out student evaluations, I tell students that the administration interprets the numbers as grades that the students are giving me: 5 = A, 4 = B, etc.

My university actually phrases it this way on the evaluation - it asks students to give the course a letter grade. I am happy to say that none of my students failed me last semester, although I did get one D.
posted by pemberkins at 4:54 AM on January 12, 2016


I'd have to search back through my collection of articles on this topic for links, but when I was reviewing studies on bias in SET, some articles indicated that there were quite a lot of complicating factors:
  • women in STEM fields who conformed versus nonconformed to gender roles had different outcomes than women in arts and humanities who conformed versus nonconformed to gender roles
  • different trends across the topic area being taught
  • trends specific to the student group being taught regardless of the topic area of the course (with engineers and business majors tending to be more conservative and to display more gender bias in SET)
  • interactions between pedagogical methods and gender (more active pedagogies can get lower ratings, especially given that many SET questionnaires are designed to elicit responses about lecturing effectiveness; more women than men use active pedagogies versus traditional lecture)
  • the actual wording of questions, etc.
Some studies suggest that bias in student surveys is masked in part by increased effort from female instructors.

The study that came out last year, mentioned in the FPP, a perfect double-blind, controlled study of two instructors with two sections each of a coordinated online course, clearly indicates that there is gender bias in SET. I believe it was the subject of its own FPP last year, in fact. Even that study shows some nuances in the effects, however, e.g. with the largest gender effect being seen in the question about how well-organized the instructor was, but some other questions showing no statistically significant gender bias from students. I think the FPP makes a good and interesting point that students' gender bias doesn't always show up in the ways one might assume: they exhibit gender bias in reporting on some verifiable, quantitative data about instructors, such as time to return marked work, but not always a large or statistically significant gender bias in some of the questions that ask about more subjective, personality-related traits of their instructors (though others certainly are subject to gender bias).
posted by eviemath at 3:13 PM on January 12, 2016 [1 favorite]




This thread has been archived and is closed to new comments