In writing an email to a colleague and friend this week to answer his simple question, I found myself spontaneously unloading about two decades’ worth of thinking and reading about assessment in higher education. Sorry, friend.
Better I unburden myself here, where my readers can take it or leave it as they see fit. Buckle up.
My basic starting point in the email was that at many colleges and universities, the pandemic has interrupted the established workflow of institutional assessment: in part institutions paused that labor to unburden people, and in part because the work of both faculty and staff was so different during the pandemic that it would have needed its own assessment standards, which seems like a lot of effort for something we hope is going to end. My thought was that this interruption could be an opportunity to rethink the whole thing comprehensively.
Rethinking is not abandonment. At its core, professionalism centers on an individual and collective responsibility to continuously examine, correct and improve working practices. We need assessment. There are good ways to do assessment.
This is where I start on this issue. The problem with institutionally controlled assessment in higher education as it took shape about twenty years ago was that it began with a blanket assumption that no useful assessment was taking place and that in the absence of assessment, the quality of faculty practice in both teaching and scholarship was in peril. There were three problems with this assertion: 1) many faculty, as part of their professional ethos, continuously performed various forms of self-assessment at all stages of their workflow and adjusted their practices as a result; 2) faculty were extensively vetted at hiring, tenure, promotion and in publication in ways that incorporated assessment; 3) examinations and other assignments in courses were a way of doing direct assessment of learning objectives. Most faculty then and now are constantly thinking about whether what they just did in the classroom worked, whether the course they just taught worked, whether the major they're a part of is working, and making continual iterative adjustments to their work.
All of that got ignored because from the perspective of institutional-level interest in assessment, none of that was separable from the working practices of faculty except (at best) as retrospective testimony, and thus, none of it was comparable or measurable from the perspective of the administrative hierarchy nor shareable with external publics. The latter loomed large during George W. Bush’s administration because of a sustained argument that higher education should be understood as a commodity and that consumers were entitled to information in advance of committing their money to such an expensive service—a position that was paralleled by the push for school choice in public K-12 education. Existing data that might inform choices about higher education (graduation rates, actual costs to matriculants, size and composition of faculties, endowment sizes, and so on) were deemed insufficient: consumers had to also know that faculty were being evaluated on performance and were constantly improving their performance; accreditors had to know that; and administrative leaders had to know that. To know that, they had to create ways of measuring performance that stood outside of testimonies about faculty labor and faculty expertise.
I’m still a bit bitter that the forms of professional self-assessment that most of us practiced (and still practice) were and are so easily regarded as valueless. Somewhere in the 1990s, the idea that dedicated professionals (indeed, most workers) could be trusted in some sense to do their work became such a profound heresy against managerial authority that no one in management since then has even felt the need to defend creating external systems to monitor and govern the performance of employees. That’s been one of the issues around working remotely during the pandemic—rather than just adopting a performance-based standard (did the work get done and was it good? if yes, then who cares whether someone was playing Minesweeper or changing diapers during the time designated for work?) the dogma is still “I need to monitor people while they work because if I don’t they won’t work well or at all”. (Which is, if we’re talking about evidence, a substantially untested idea, like many such dogmas.)
But I do understand the problem that people who felt that need to create systemic forms of evaluation were facing when it came to teaching and scholarship in higher education. Other professions naturally create some metrics that are reasonable proxies for the quality of individual professional labor. You can track surgical outcomes, including preventable errors, and many surgeries are comparable, so it’s a huge dataset where anomalously bad (and good) performance can be identified with confidence. You can track attorneys’ billing hours, arbitration outcomes, trial outcomes, and so on. You can track commissions on the sales floor. You can track comparable outputs of manufacturing teams. You can check the on-base percentage of baseball hitters and the subscription growth of Substack newsletters. A lot of those metrics can have perverse or unintended outcomes, as anybody who has shopped in a department store where the sales staff is earning commissions knows very well. But teaching in higher education in the United States was and still is a puzzle to measure, as is scholarship.
Why? In K-12 education you can use high-stakes standardized testing as one way to evaluate teaching effectiveness. You can use courses that have fairly standard content across institutions as another way. Again, those strategies have had perverse consequences for students and teachers, as Jerry Muller’s The Tyranny of Metrics and Cathy O’Neil’s Weapons of Math Destruction have documented. (And as any professor who was teaching when the first wave of No Child Left Behind-affected students arrived in college can tell you.) But American higher education doesn’t have anything that allows for those comparisons. There are probably a hundred or more universities and colleges in the United States where some courses in African history are taught to undergraduates, but if you put my syllabi together with all the others, you’d see that we all make different choices about what to teach and how to teach it, and we all assess what we’ve taught in different ways, to the point that comparing our work except in a loose and qualitative way is difficult (and you need a person who knows the field to even begin to try that much).
People charged with institutional assessment (or the consumers of data created by assessment, like accrediting agencies) might have dreamed a bit of breaking faculty on the wheel and creating the same kinds of standardizations that make other professions easier to monitor, and who knows? Maybe someday, but likely not, because to do so would destroy the value proposition that most universities and colleges cling to closely: that they are unique, special, different, and in some way or another, better than other institutions, at least for the students they hope to recruit. Whether you’re selling a high-end artisanal salsa brand at Whole Foods or Tostitos Chunky Salsa to go with some Doritos, you do not want anyone to create a single standardized salsa metric that everyone has to follow.
So what we’ve ended up with, by and large, is a system where faculty, in their departments, their divisions and their individual teaching, are pushed to articulate concrete learning objectives for courses, majors, and graduation, and institutions are pushed to create concrete mission objectives, in order to create valid and concrete measures of success and improvement. In making these systems, most institutions have had to derogate or sideline objectives, values or missions which are not measurable even if there’s a valid argument to be made about those objectives or missions. So, for example, I might believe that an important objective of teaching African history in an American college is to give students a wider range of understanding of human possibility and human life, or to position them to challenge racism in new ways, but unless I can create a test or instrument that demonstrates that after a semester’s worth of work, students have measurably progressed towards those objectives, it’s no good for assessment.
I can set similar objectives for my scholarly work, but that’s even more unmeasurable and thus of no use whatsoever—all that can be measured is the amount that I publish, the size of grants, the number of citations I get, the “impact factor” of my work, all of which may be bad proxies for what many of us regard as scholarly quality or what we would describe as the purpose or objective of scholarly work.
Whether assessment driven by creating measurable objectives with concrete outcomes generally improves the quality of instruction or the quality of scholarship isn’t really something we commonly discuss. It’s just assumed that there’s some form of feedback mechanism not unlike the way that the procedural and surgical checklists described by Atul Gawande have functioned in reducing medical errors. You see the checklist, you’re compelled to follow the checklist, you avoid the avoidable errors and become more self-consciously aware of the possibility of error at the same time. Embedded in the checklist idea is a notion that if people believe they’re doing a good job without having some form of external evidence, they will drift towards unreflective self-compliments and rationalizations; they will look at whatever it is that they’re doing now and deem it to be exactly what they meant to do. If there are medical errors, the unreflective surgeon will deem them unavoidable. But along comes the checklist (or other form of evidence-based attention to practice) and it turns out that errors can be reduced and practices improved. I buy that. It’s why I think some kind of assessment of teaching and scholarly practice is important.
The problem lies in the tautology of thinking that in order to do assessment you have to create new data that can be measured without asking whether what can be measured is actually what you need to know to independently assess your performance. Suppose, for example, I have a very easily concretized learning objective in a microbiology class: that students should learn specific lab skills that apply to microbiological research and should retain those skills as they continue their studies or work in biological and medical contexts. I can test them as they enter the class and as they leave the class and that’s a way to measure whether they met my learning objectives. But the end of the semester is a terrible time to find out whether what I’ve taught has been retained in a useful way—there’s a lot of educational research that documents that an end-of-semester test is not necessarily a good measure of longer-term retention of information and skills. To really measure the objectives of my course, I should be trying to assess what students are doing next semester, next year, in three years, in ten years. Which is where we get to a point quickly where that’s so hard to measure that it’s effectively unmeasurable. So you could say, “That’s why this should be departmental, then, not individual”: check on your majors at the beginning and the end of their studies. But then you either have to generalize the learning objectives across a course of study that might have high variability because of electives to the point of unmeasurability (or non-comparability) or you have to rigorously standardize the progression of a major and enforce that progression with some form of high-stakes assessment. Moreover, the more you make this about departmental assessment, the more difficult it gets to track this back to the quality of the professional labor of individuals, which was supposedly the point in the first place.
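To make that concrete, here is a minimal sketch of what such entry/exit measurement usually amounts to: a normalized-gain calculation of the kind common in education research. All the scores here are invented for illustration.

```python
# A minimal sketch of pre/post "learning gain" measurement, with
# invented scores. Normalized gain (Hake's g) is a standard metric in
# education research: what fraction of the available headroom did a
# student actually gain over the semester?
pre_scores = [40, 55, 62, 30, 71]    # hypothetical entry-test percentages
post_scores = [70, 80, 75, 65, 90]   # hypothetical exit-test percentages

def normalized_gain(pre: float, post: float) -> float:
    """g = (post - pre) / (100 - pre); 1.0 means all possible gain realized."""
    return (post - pre) / (100 - pre)

gains = [normalized_gain(p, q) for p, q in zip(pre_scores, post_scores)]
print([round(g, 2) for g in gains])
print(f"mean normalized gain: {sum(gains) / len(gains):.2f}")
# Note what this cannot tell you: whether any of it is retained next
# semester, next year, or in ten years, which is the point above.
```

The arithmetic is trivial; the problem lies entirely in what two test dates at either end of a semester can and cannot capture.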
And if you move to make it departmental and about an entire course of study, you still can’t touch what you need to measure, which is post-graduate retention and use of your learning objectives. That’s the thing: any long-serving faculty professional knows that the time frame in which students acquire and deploy what they’ve learned is hugely variable. The end of a course is maybe the worst time ever to evaluate how successful a professor has been. There’s a course I took as an undergraduate that had a huge impact on me and has been deeply useful to me throughout my life. You could test me now on the texts we read and the things we wrote and I’d still do great. The best time to evaluate me on the course’s learning objectives would probably have been about five years after I took it.
I hear all the time about Swarthmore students, some of them women or underrepresented minorities, who really struggled in STEM classes while they were here—sometimes individual classes, sometimes entire programs of study. Most of them tested poorly and it would have seemed as if the learning objectives of the course were not met in their cases. But a decade later, many of them are successful academic researchers, medical professionals, engineers, and so on, and they often testify that those courses all made sense to them later—they were able to put it together at some point beyond the terminus of a single course. I don’t want to build a highly standardized sequence that measures pedagogical success and failure in rigid semester-long segments—if we’re serious about inclusivity and equity, we need to recognize that learning doesn’t happen with that kind of clockwork precision.
These are nearly impossible things to measure in practical terms. So we create the kind of data we can manage to create. Which is a whole different problem, and a major reason faculty dislike many styles of institutional assessment: they have become a major addition to the workload without any acknowledgment of that added labor. Because faculty know that the things we can readily measure via such insertions into our workflow are not the things that we want or need to know about our performance—some of what we want to know is not tangible or measurable, some of it can’t be measured in any practical way—we end up doing additional work that feels as if it is of no value to the professionals undertaking it. Imagine how long a department at an institution as small as Swarthmore would have to spend creating data about learning objectives using an unchanging standardized procedure before the resulting dataset would be large enough that you could say something remotely rigorous about it. By the time you could say anything, it would be decades later and you’d be commenting on the trends in the professional labor of faculty who’d long since retired.
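To put rough numbers on that intuition, here is a back-of-the-envelope power calculation. Everything in it is a hypothetical of mine, not Swarthmore’s actual figures: a dozen majors per cohort and a “modest” effect size of 0.3.

```python
# Rough power calculation behind the "decades later" claim, using
# hypothetical numbers: ~12 majors per graduating cohort, and we want
# to detect a modest change (Cohen's d = 0.3) in some learning-objective
# measure at the conventional alpha = 0.05 and power = 0.8.
from statsmodels.stats.power import TTestIndPower

majors_per_year = 12   # hypothetical cohort size for a small department
n_per_group = TTestIndPower().solve_power(
    effect_size=0.3, alpha=0.05, power=0.8
)
print(f"students needed per group: {n_per_group:.0f}")                     # ~175
print(f"cohort-years per group:    {n_per_group / majors_per_year:.1f}")  # ~15
# Comparing "before" and "after" some change in practice would take
# roughly 30 cohort-years of identically collected data.
```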
The data we make now is only rarely helpful to the practices of the moment. It feels as if it is going to be used either to enable the assertion of new forms of supervisory authority (and thus to erode professional autonomy) or for a performative reassurance to external agencies and publics that our quality is being continuously monitored and improved.
And yet I’ve said I believe in assessment. What could that be? Let me try something as simple as Gawande’s checklist, where there’s a productive disconnect between the list and the assessment. The list is something doctors have populated for themselves, out of their own professional practice. What ought you to do before surgery? When a patient is on a ventilator? What evidence-based practices are important that we already know about? Make a checklist. Go over it each and every time before a procedure, when monitoring a patient in intensive care, when using a medical instrument or device. The assessment is on the other end: do our survival rates improve? Does our measured rate of medical error go down? Do measurements of long-term recovery and improved functioning improve? You don’t make the professionals responsible for the collection of those metrics: they’re embedded in the institution’s workflow, and you don’t need to screw around with mission statements to decide whether they matter: survival, improvement, thriving are hard-coded into the nature of the institution. It’s true that correlation isn’t causation, but if you make one change (introducing checklists and requiring everyone to use them unfailingly) and those other metrics change for the better, you’re justified in thinking that the checklists had something to do with that change. You can keep refining the checklists as new evidence about procedural efficacy becomes available from actual research, but you might also accept that at some point, the improved performance created by checklists does not have to keep ticking up incrementally every year. Instead, the checklists are a procedural improvement that you maintain constantly as a guarantor of quality.
What might be like that for faculty labor? I’ll stick to teaching here—measuring scholarly performance is a different discussion, though also very important.
I’ve frequently referred to Daniel Chambliss and Christopher Takacs’ 2014 book How College Works in my online writing, because it made a huge impact on me when I first read it. It’s based on a long-term mixed-method study that the authors did of students and graduates at Hamilton College. Among the most important things that they argue as a result of this research is that if you want to understand how to improve and maintain good academic outcomes (as measured by grades, graduation rates, completion of programs of study, reported satisfaction of students and graduates, etc.—the data most institutions already collect, much as hospitals collect data about medical outcomes), there are two things that matter vastly more than anything else.
1) Did a student make a friend or friends within their first year of attending a particular institution? and 2) Did a student eventually make a strong connection to a faculty member or other academic professional who really understood that student and was able to help them shape a narrative about their studies, talents, interests and future? Chambliss and Takacs observe that this could happen quite late in a student’s time at the college and it would still have the same powerful effects as if it had happened early on—it would retroactively reorder their academic work up to that point.
All the other things that faculty, administrators and assessors debate and think about? Relatively unimportant by comparison. And here’s the other thing they found: when it comes to making strong connections of that kind, the Hamilton faculty sorted into a power law distribution. That is, a very small number of faculty were responsible for a large number of reported connections. As in any power law, there was a long tail—some faculty were named as producing one or two such connections. But plainly a small number of people had the knack for doing something professionally that turned out to have a huge impact on student success.
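To make the shape of that distribution concrete, here is an illustrative toy calculation. The faculty headcount and the strict 1/rank tail are my assumptions for the sketch, not Hamilton’s actual data.

```python
# Illustrative only: allocate "strong connections" across a hypothetical
# 180-person faculty with a Zipf-like 1/rank tail, then ask how much of
# the total the top decile accounts for.
import numpy as np

n_faculty = 180                       # hypothetical faculty size
ranks = np.arange(1, n_faculty + 1)
connections = 1.0 / ranks             # connection count proportional to 1/rank
share = connections / connections.sum()

top_decile = int(n_faculty * 0.10)    # the 18 most-named faculty
print(f"top 10% of faculty account for {share[:top_decile].sum():.0%} "
      "of reported connections")      # ~61% under these assumptions
```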
What if we threw out all the attempts to declare and measure learning objectives and missions and said simply, “Let’s do some research to identify what it is that the faculty who make most of those connections are doing”? I have some intuitions both about the material preconditions of success and the professional practices involved. Being on the tenure track, feeling secure in your work, trusting your administration and your colleagues, having the time to do this work, and teaching at the right student-to-faculty scale are some of the preconditions. The practices, I’d wager, include being available and accessible in a variety of ways, being open, being emotionally intelligent in engagement with a wide variety of dispositions, being trustworthy, being deeply trained in your own scholarly field but also being aware of other fields and disciplines. But maybe there are techniques and habits that we’d discover on investigation that would surprise me.
So here you have something that you can monitor from your existing data-collection workflows rather than embedding new and intrusive data-creation obligations in the workflow of professionals. Maybe you add something to post-graduation satisfaction surveys: did you form a strong useful connection to one or more faculty? If so, who? You can change practices via something like a checklist or other form of professional development: you disseminate what you know about how faculty successfully form connections and how forming connections improves outcomes. You stay focused on that one simple thing that makes a complex difference rather than trying to endlessly herd cats into making diverse measurable data that will never ever be standardized or large-scale enough to say anything useful or rigorous about in a timeframe that can recursively affect ongoing practices. You leave professionals the room to digest and incorporate the “connection checklist” according to their own dispositions and talents, but you keep working on making that checklist a part of mindful, self-reflective professional preparation and practice. You support the material and programmatic preconditions for forming connections. What you want is to “flatten the curve” of that power law—to see more connections made by more people and thus ensure that no matriculated student goes without that vital contribution to their success.
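As a sketch of what that monitoring could look like from existing data, suppose the post-graduation survey yields one row per reported connection. The table layout and the use of a Gini coefficient as the “flatness” measure are my assumptions, not an existing instrument.

```python
# Track whether the connection "curve" is flattening across cohorts,
# given a hypothetical survey table: one row per reported connection,
# with the respondent's graduation year and the faculty member named.
import numpy as np
import pandas as pd

def gini(counts: np.ndarray) -> float:
    """Gini coefficient of per-faculty connection counts:
    0 = connections evenly spread, near 1 = concentrated in a few."""
    c = np.sort(counts.astype(float))
    n = len(c)
    return float((2 * np.arange(1, n + 1) - n - 1) @ c / (n * c.sum()))

surveys = pd.DataFrame({
    "grad_year": [2018, 2018, 2018, 2019, 2019, 2019],
    "faculty_named": ["A", "A", "B", "A", "B", "C"],
})  # stand-in rows; a real survey would have many more

counts_by_year = surveys.groupby("grad_year")["faculty_named"].value_counts()
for year, counts in counts_by_year.groupby(level=0):
    print(year, f"gini = {gini(counts.to_numpy()):.2f}")
# A Gini drifting downward across cohorts would suggest the connections
# are being made by more of the faculty, not just the same few.
```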
Whatever you do, assess the assessments in a way that’s accountable to the professionals who are being assessed (and that takes seriously the expertise they have about their work and about assessment itself). Audit the labor involved in data creation, and evaluate whether any of it is actually used once it’s made. Don’t bracket off the important goals that aren’t measurable or the ones that are too hard to measure—don’t make measurability the self-fulfilling criterion that directs all efforts to determine whether we are doing what we ought, as well as we might.
Image credit: Photo by William Warby on Unsplash