During the pandemic, I think many faculty gained a new appreciation for how deeply their colleagues felt about their own grading practices and the grading norms in their departments and disciplines because we all had to have active conversations about whether to consciously skew (or even forgo) grades in order to take account of the difficult conditions that students were facing at the other end of the Zoom classroom.
In these discussions, at many institutions the faculty divided into a number of factions:
A small but active group of evangelical “ungraders” who had already been experimenting with various forms of alternative practices in assigning course grades, such as contract grading, student self-assessment and self-grading, portfolio assembly, and so on.
Faculty who were vocally skeptical of and uninterested in traditional grading, and who were aware of the weakness of evidence for its efficacy, but who were also wary of ungrading (or savvy about the increase in labor time that many styles of ungrading entail). These faculty often used a very narrow range of marks (usually B to A) for the vast majority of student work, reserving other grades for issues like massive lack of attendance or incomplete major assignments; they usually advocated spending the bulk of their marking time on direct narrative feedback and evaluation. (Essentially this is the way that Hampshire College and other experimental institutions of that era approached evaluation.)
Faculty who conceded that grading was a bothersome chore that didn’t measure the full performance or capabilities of students but still felt it was an important component of maintaining standards and rigor, and the only way to efficiently measure whether students had acquired key skills or understanding and could progress to the next part of a curricular sequence successfully.
Faculty who deeply, profoundly believed that grades were an essential tool for pushing students to do their best work, who embraced the compulsory or coercive character of grading practices. These faculty also often believed that there is a natural and necessary hierarchy of capability and dedication within a population of students in relationship to a particular subject or discipline, and that the job of grades is to reveal that hierarchy and assign students their proper place within that distribution.
The striking thing to me was how strongly people in each group felt about their take on grades (despite everybody agreeing that grading is an unpleasant part of the job). Those feelings are rooted both in our own experiences of grading and in our personal visions of how education and society connect.
You might expect, given how important grades are to us as professionals and how strongly our institutions, our students, and our publics feel about grades, that there would be a ton of evidence behind any given approach to grading.
On one hand, well, there is a long history of believing that we are just beginning to understand how to connect grading and marking to our work as teachers, that we’re going beyond just relying on our intuition, that we need evidence.
Check this passage out, for example:
That’s from 1927.[1]
Try this one from 1913 for another example:[2]
Rhetorically, these could be from any number of studies of grading published in the last decade that promise that clarity from systematic research is coming any day now.[3] These early authors describe research used to establish grade curving: forcing student populations to conform to what was presumed to be a natural distribution of intelligence and ability, and removing the subjective opinion of instructors from the process of that distribution. This is a way of thinking that is still strongly with us today, even if the eugenics-tinged assumptions operating in 1913 and 1927 are not, and even though the research methods used would come nowhere close to passing muster by contemporary standards.
I bring this point up in order to underscore a serious problem with the kind of advocacy offered by Jessica Grose this week in her New York Times newsletter. Grose argues that “lenient grading” cannot save struggling high school students, but that cracking down on chronic absenteeism can. By “lenient grading” she means policies that are spreading in K-12 institutions that make 50% or 60% the lowest grade a student can receive even if they do not turn anything in while also changing the threshold for failing out of a class or subject to that lowest mark—a change justified in diverse ways (trying to eliminate stigma, trying to encourage students, trying to improve student achievement).
Grose’s major complaint against these changes (and others, such as the California Math Framework) is that there is no evidence for their efficacy. Proponents, she says, are “advocating those policies based largely on, well, vibes”. She concedes that she’s heard from teachers who have had good experiences with the 50 percent “floor” and she agrees that “teaching is an art, as well as a science”. But, she goes on, there’s no evidence at all, and surely that can’t be right? Why are we adopting policies for which there is no evidence of efficacy?
My problem with Grose’s point here is not that I believe in what she’s calling “lenient grading” as a formal K-12 philosophy or that I think the California Math Framework is a great change. (I’ve read some strong critiques of it that sound perfectly justified to me.) The issue is that there is also no evidence, by her standards, for anything that “grade leniency” is meant to replace. There’s no evidence that grading to the curve produces better outcomes in the longer term; there’s no evidence that giving out a requisite number of Fs accomplishes anything in particular. Most of what teachers do at the K-12 level or the college level with grading or evaluation is not backed by rigorous research that meets contemporary standards of replicability or data collection. None of the camps I mentioned at the beginning are really standing on firmly established, evidence-based foundations.
Or to be less harsh, there is evidence, there is research, but it is inconclusive. Necessarily so, and not because of research design per se—though there are basic methodological and ethical problems with this kind of work because experimenting on human subjects for whatever reason is hard and dangerous. (Grose might note that you can’t provide evidence of the efficacy of something you haven’t tried, or have tried in only tentative or limited ways: there’s no way to simulate a new educational approach on a computer.) In fact, we experiment on human subjects all the time without bothering to think about it, and moreover, we are the products of a century or more of such experiments by powerful state and civil institutions. Every time the government makes a policy or an NGO adopts a new working theory for its projects, it’s experimenting on human beings, usually without any protocols and without any rigorous monitoring of outcomes.
The more important limit here lies not in research methods but in the fact that there is such a variety of basic views about what education is for. We can measure student achievement in terms of improvement or decline, but only achievement within systems whose purpose is debatable or uncertain. I can establish a rigorous metric for measuring the quality of writing, for measuring knowledge of history, for measuring the ability to do algebra. None of those metrics answers the basic questions: why do I want people to be able to write in the way that I deem to be “quality”, why do I want people to know history, why do I want people to be able to do algebra more competently?
Those questions often are answered tersely, truistically, blandly, as if it is obvious what the answers would be. Because isn’t improvement always good? Are you saying you want people to decline and fail? It depends. I would venture that Americans today compared to Americans in 1970 are on average less good at keeping a hula hoop going in a spiral around their waists. I would guess I could build a metric that might verify my guess. Would I then argue that we should be seeking improvements in our hula-hooping? No, because it doesn’t matter.
On most of the things we can build measures for, it’s not clear that there is any moment in the past when student populations broadly achieved the standards that we today regard as unachieved. We can trace declines in student performance without being able to attest that something possible in the past is no longer possible today.
What we have in many cases is a theory about what would happen if only student achievement was persistently improving year after year after year. And take a look at those theories if you want to see claims for which there is no evidence.
There’s a skills gap! The reason people are underemployed is they can’t write or do algebra! If they could, everybody would be fully employed at good-paying jobs! You might find evidence for a skills gap (but it’s thin) and evidence that writing and algebra open up some careers (but not in and of themselves), but not that the problem of underemployment or income inequality would be fixed by better student achievement.
You can shift the terms of that theory from jobs to empowerment, autonomy, or development of self, to more intrinsic measures of self-worth, motivation and possibility, and you still won’t find a lot of evidence, at least none that ends the argument. You can argue that if we made achievement more equitably available to all students, we’d have a more equal and diverse society. Maybe, but that’s not a thing you can prove with evidence.
Most of what we want students to achieve, whether in particular subjects or in general, is about what we want society to be, about our hopes for a better world, about what we think is ethical, fair, or generative in how we relate to one another. If you believe that in general people only do what is necessary or productive if they’re made to (whether by direct commands or by incentives), you will never accept, whatever the evidence, that an approach to education that works otherwise is appropriate, if for no other reason than that you would feel that young people educated in that other system would be ill-prepared for a world that remained coercive, hierarchical, or controlled. You will be well-disposed to Grose’s advocacy for coercive approaches to absenteeism first and foremost because you think that’s the way it ought to be.
If you believe that everyone is already perfect just the way they are, that all people already know what is most important to them, you will never accept an education system that encodes the idea of transformation, that students are incomplete or lacking in some respect and that education is required to repair that condition.
Values are not the opposite of evidence. We don’t just feel our values: they’re based on our experiences in the world and our evaluation of data and information, often within rigorous systems. But complaining that we lack evidence for a change in policy fails not only because the status quo is equally bereft of evidence, but also because it’s the equivalent of bringing a knife to a gun fight. Evidence alone doesn’t answer a values problem; it only enriches and informs the struggle over values. Measuring whether achievement improves in a system that defers or avoids arguments about why what’s being measured actually matters traps us all in a Punch-and-Judy show that never ends.
[1] Spence, Ralph Beckett. The Improvement of College Marking Systems. New York: Teachers College, Columbia University, 1927.
[2] Finkelstein, I. E. The Marking System in Theory and Practice. Baltimore: Warwick & York, Inc., 1913.
[3] For example, Guskey, Thomas R., and Susan M. Brookhart, editors. What We Know About Grading: What Works, What Doesn’t, and What’s Next. ASCD, 2019.
[By the by, I would very much want to split your number 4 into the first half (which I agree with) and the second half (which I disagree with strongly).]
I think a big problem in this debate is that the side Grose is arguing with is not really willing to say what they think straightforwardly, and thus ends up proffering the kind of unsatisfying claims about evidence you are taking issue with. In particular, I think there are many people in the "ed biz" who think:
1. The primary thing provided by educational institutions is credentials that are then relied upon by other actors in the economy and society (including subsequent educational institutions).
2. Actually mastering the skills/practices that are currently required for those credentials is much less valuable for marginal students than getting those credentials.
3. Therefore, we should simply provide those credentials to many currently-struggling students, which would benefit them substantially, regardless of their success at mastering the associated skills etc.
(I don't think this is true but I certainly understand why people think it's plausible.)
But, for I think understandable reasons, those people think saying 1-3 publicly will be received very badly. Their solution is, unfortunately, to try to just do 3 without persuading anyone, really.
In contrast, I think the values Grose's side espouses (that success in graded school settings reflects valuable qualities and useful learning, and that the kinds of measurable outcomes she cites and that social scientists like to measure are uncomplicatedly good indicators of a successful life) are, while certainly contestable, at least explicitly or implicitly stated by their advocates, as you see in that column.