Interactive dilemmas about AI and grading
For forty years, machine learning in education automated logistics: scanning forms, checking syntax, tracking submissions.
Generative AI does something different. It automates judgment.
That shift changes what it means to teach, to assess, to be assessed. These scenarios explore the trade-offs—not to give answers, but to surface the questions we should be asking.
Forty-three ungraded essays sit in your inbox. Your department expects a manuscript draft by Friday. Your partner went to bed two hours ago.
Yesterday, a colleague mentioned they've been using AI to grade student writing. "It's like having a TA who never sleeps," they said. "I got ten hours of my week back."
You've always believed that grading is where real teaching happens—in the margins, in the questions, in the careful attention to a student's developing voice.
But that belief doesn't grade essays. And right now, you're drowning.
What do you do?
You upload the essays. Within seconds, grades appear alongside detailed feedback: thesis clarity, evidence usage, paragraph structure.
You spot-check a few. The assessments are... reasonable. Not exactly what you'd write, but defensible.
Three weeks later, a student visits office hours. "I don't understand this feedback," she says, pointing to a comment. "What does 'consider strengthening the logical throughline' even mean?"
You don't remember writing that. Because you didn't.
How do you respond?
You tell her the truth. Her expression shifts. "So... a computer graded my essay?"
You spend twenty minutes discussing her argument together. It's a good conversation, better, maybe, than margin comments would have been.
That night, you see a post in a student forum: "Apparently Prof. [You] is using ChatGPT to grade papers? We're paying $60k a year for AI feedback?"
Forty-seven upvotes. Your chair wants to meet tomorrow.
Or you pull up her essay and improvise an explanation. It works. She nods, thanks you.
Walking back to your office, you feel the weight of a small deception. The feedback wasn't wrong. But it wasn't yours. And now you've claimed it.
Or rewind to that first night: you never upload the essays. Grading them yourself takes until 2:30 AM. You write comments about Sarah's emerging analytical voice. You note where Marcus's argument loses its thread. You catch a citation error that would have cost Jamie points later.
At the department meeting Friday, your chair shares a report: faculty using AI grading tools report 40% more time for research. Publications are up.
A colleague leans over. "The students can't tell the difference anyway."
Your colleague Panos just ran an experiment. He used an AI voice agent to proctor oral exams for his 87 students, then had a "council of LLMs" grade their responses.
Total cost: $15.
He shares the results at a faculty lunch.
He turns to you. "You should try it. It's incredibly efficient, and honestly? It might assess understanding better than our written exams."
What's your reaction?
Panos walks you through the setup. Students schedule a 15-minute slot. The AI asks questions, follows up on vague answers, probes understanding. Three different models grade the transcript independently.
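Mechanically, the council step is simple. A minimal sketch of what it might look like, in Python; the model names, the grade_with_model stub, and the median-with-spread aggregation are all assumptions here, since the scenario never says how Panos combines the three scores:

from statistics import median

# Hypothetical sketch of a "council of LLMs" grading step: three models
# score the same transcript independently, then the scores are aggregated.
# Nothing here is Panos's actual pipeline; grade_with_model is a stub
# standing in for a real LLM API call.

MODELS = ["model-a", "model-b", "model-c"]  # three independent graders

def grade_with_model(model: str, transcript: str, rubric: str) -> float:
    """Placeholder for a real call: send transcript + rubric to one
    model, parse its reply into a 0-100 score. Stubbed for illustration."""
    return {"model-a": 82.0, "model-b": 85.0, "model-c": 78.0}[model]

def council_grade(transcript: str, rubric: str) -> dict:
    scores = [grade_with_model(m, transcript, rubric) for m in MODELS]
    return {
        "scores": scores,
        "final": median(scores),              # robust to one outlier grader
        "spread": max(scores) - min(scores),  # wide spread: flag for a human
    }

print(council_grade("(student transcript)", "(course rubric)"))
# {'scores': [82.0, 85.0, 78.0], 'final': 82.0, 'spread': 7.0}

The median damps one eccentric grader; the spread is worth surfacing, because a council that disagrees is exactly the case where a human should look.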
"The consistency is remarkable," he says. "No tired grader giving harsher scores at midnight. No unconscious bias about who 'sounds smart.'"
You think about your own grading: how your standards drift across a stack of 50 exams, how a strong essay early in the stack can make the next one seem weaker by comparison.
Then you think about what it means to be examined by a machine. The student speaking into a void, judged by something that doesn't know them, can't see their nervousness, can't offer a reassuring nod.
What matters more?
When you voice that worry, Panos shrugs. "Traditional exams are stressful too. At least this tests what they actually know, not how well they write under time pressure."
But you're thinking about your students with anxiety. The ones who freeze when speaking. The ones for whom a human examiner's patience makes the difference between demonstrating knowledge and going blank.
"Did you offer accommodations?" you ask.
"The AI doesn't judge their affect," he replies. "In some ways, that's its own accommodation."
You're not sure if that's liberating or chilling.
You decide to pilot AI oral exams in one section. The consistency is real—scores cluster tightly, feedback is uniform, no student can claim unfair treatment.
But then a student comes to your office hours. She's crying.
"The AI kept asking me to clarify," she says. "I knew the answer, but I couldn't explain it the way it wanted. With a real person, I could have drawn a diagram. I could have seen if they understood."
Her score: C+. You would have given her a B+.
You think about your own education. The professor who noticed you were struggling and asked a gentler follow-up question. The examiner who smiled when you finally found the right words.
Assessment has never been just about measuring knowledge. It's about being seen—having someone recognize your thinking, even when it's imperfect.
Can an algorithm do that? Does it need to?
"What if the point of oral exams isn't just the grade?" you say to Panos. "What if it's the last time some of these students ever have a real intellectual conversation with an expert in the field?"
He pauses. "That's a $15,000 conversation, if you cost out faculty time."
You're serving on the academic integrity committee when someone raises a troubling point.
"We have seventeen pages of policy about students using AI. But nothing about faculty using AI to grade."
Silence. Then the murmurs begin.
You learn that at least a dozen faculty are already using AI grading tools. Some departments have informal norms. Most have nothing. Students haven't been told.
What position do you take?
You advocate for disclosure requirements. If faculty use AI in grading, students should know. The committee agrees to draft a policy.
But then the questions multiply:
"Does spell-check count?"
"What about Gradescope's OCR features?"
"If I ask ChatGPT to help me phrase feedback, is that AI grading?"
The line between "tool" and "automation of judgment" blurs. Every attempt at definition creates new edge cases.
Six months later, you have a 12-page draft that no one fully understands—and the technology has already changed twice.
Or you argue for minimal regulation. Faculty have always chosen their own pedagogical methods. Why should this be different?
The committee moves on. No policy is written.
A year later, a student lawsuit makes the news. A graduate student failed her comprehensive exams. She discovered the grading rubric was generated by AI, and that one of the three "faculty graders" was actually a language model.
Her lawyer argues: "She was promised assessment by qualified experts. She received assessment by software. That's breach of contract."
Your institution's general counsel calls an emergency meeting.
You've been using a hybrid approach: AI for initial feedback on mechanics and structure, your own comments for substantive engagement. It's working well. Mostly.
Then you review Kai's essay.
The AI flagged it harshly: "organizational inconsistency," "unconventional thesis placement," "informal register inappropriate for academic writing."
Suggested grade: C+
But when you read it yourself, something different happens. The essay is fragmented, yes. Risky. But also alive. Kai has done something you haven't seen all semester—genuine thinking on the page, wrestling with ideas in real time.
It might be the best thing you've read all year.
What do you do?
You change the grade to A-. You write a personal note: "This took risks that paid off. Don't let anyone convince you to play it safe."
Kai comes to office hours, grateful. "I wasn't sure if that approach would work."
But later, you wonder: how many students did the AI grade down for taking risks you never saw? You only caught Kai's essay because something in the first paragraph made you read more closely.
The AI processes 50 essays in the time it takes you to really see one. What brilliance is being quietly, consistently filtered out?
Or you keep the C+. The rubric exists for a reason. If you make exceptions for writing you personally like, you're introducing bias, not eliminating it.
Kai doesn't come to office hours. The next essay is competent, conventional, forgettable. It gets a B+.
You think about what you've taught: play it safe, hit the marks, don't take risks.
A colleague mentions that studies show AI grading systems tend to reward "safe" writing—clear structure, predictable moves, conventional voice. The creative outliers get flagged as errors.
You're in your second year on the tenure track. Your mentor—a full professor who's been unfailingly supportive—takes you aside after a department meeting.
"I need to be honest with you," she says. "Your teaching evaluations are good, but your publication record needs work. The committee will notice."
She glances around, then lowers her voice.
"I've been using AI to grade for the past year. It's not perfect, but it's given me back ten hours a week. That's how I finished my book."
She pauses. "I'm telling you because I care about your career. The people who figure this out are going to have an advantage. I don't want you to be the one who falls behind."
What do you feel?
You take her advice. You start using AI grading. You do get more writing done.
But you notice something: the work feels different now. You used to think about your students constantly—their ideas, their struggles, their growth. Now you think about your manuscript.
At a conference, you run into a grad school friend who's still hand-grading everything. She looks exhausted. Her publication record is thin. She's worried about her contract renewal.
"I just can't bring myself to do it," she says. "Grading is how I really teach."
You don't tell her what you've been doing. You're not sure why.
Or you thank your mentor and don't take her advice. The next two years are hard. You publish less than your cohort. Your tenure case is borderline.
In the committee meeting, a colleague argues for you: "Her teaching is exceptional. Students cite her feedback as transformative."
Another colleague responds: "We're not hiring teachers. We're hiring researchers who also teach."
You get tenure—barely. But you watch junior colleagues who used AI grading sail through with stronger files. One of them confides: "I feel bad about it sometimes. But I couldn't have made it otherwise."
A student named Jordan asks to see you. They seem nervous.
"I need to show you something," they say, pulling out a folder. Inside are printouts of feedback from three different professors, all in your department.
"I ran these through a detector. They all have the same patterns—same sentence structures, same phrases, same rhythm. I'm pretty sure they were written by AI."
You look at the papers. Jordan is right. The feedback is good—but it's unmistakably uniform in ways human writing isn't.
"Are we being graded by robots?" Jordan asks. "Because we're paying a lot of money to learn from people."
How do you respond?
You take Jordan's concerns to your chair. The response is defensive.
"Faculty are allowed to use tools to assist their work. This isn't cheating—it's efficiency."
"But the students don't know," you say.
"Do they need to? The feedback is accurate. The grades are appropriate. What's the harm?"
You think about Jordan, who spent hours analyzing their feedback, trying to learn from what they thought was a professor's expert insight. The harm is hard to name but easy to feel.
Jordan's petition gains 200 signatures. The student paper picks it up. The administration promises to "study the issue."
"Did the feedback help you?" you ask.
Jordan pauses. "I mean... yeah. I guess. But that's not the point."
"Isn't it? If you learned something, if you improved, does it matter whether a person or a program wrote the comment?"
Jordan looks at you for a long moment. "When I get feedback from a professor, I think: someone who knows this field, who's read thousands of essays, who's thought deeply about writing—that person noticed this about my work. That means something."
They gather their papers. "If it's just an algorithm, it means nothing at all."
Forty years of machine learning in education automated logistics: scanning forms, checking syntax, tracking submissions.
Generative AI automates judgment.
That shift changes what it means to teach, to assess, to be assessed. And most institutions haven't begun to grapple with it.
The supposed gains are many: time saved, consistency improved, feedback accelerated. But the losses may be harder to measure. The professor who noticed a student's voice emerging. The comment that asked exactly the right question. The human recognition that said: I see you. I'm paying attention. Your thinking matters to me.
These things don't scale. Perhaps that's the point.
There are no right answers here. Only trade-offs we should make with open eyes.