Paul Gestwicki's Blog: specifications grading

Tuesday, January 2, 2024

Notes from "Grading for Growth"

I read portions of Grading for Growth as part of my preparations for the Spring's preproduction class. The book makes a case for "alternative grading" and establishes four pillars for the practice. These are:

Clearly defined standards
Helpful feedback
Marks indicate progress
Reattempts without penalty

I've been reading Talbert's blog for some time, and it's that last one that gives me some difficulty. I was hoping that reading the book would help me understand some practical matters such as grading management and dealing with skills that build upon each other. However, I found myself taking more notes about CS222 Advanced Programming and CS315 Game Programming than about CS390 Game Studio Preproduction.

I have read Nilson's Specifications Grading and many articles on alternative grading, so I skimmed some of the content and case studies. The first case study was the most relevant to me: a calculus class in which the professor used standards-based grading (SBG). It was contributed by Joshua Bowman at Pepperdine University, and it is published at the Grading for Growth blog.

One of the tools that I had not fully considered before is the gateway exam. Bowman gives a ten-question exam early in the semester. Students must pass at least nine of the ten questions to pass the exam. Students get five chances to retake the exam, and passing it is required to earn a B- or better in the course. This is potentially useful for dealing with some of the problems I have faced in CS222, where students come in with high variation in their understanding of programming fundamentals while also suffering from second-order ignorance. A formalized assessment could very well help with this.
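As a rough sketch of the gateway rules described above, the bookkeeping might look like the following. The class, function, and constant names are hypothetical, "five chances to retake" is read here as one initial attempt plus five retakes, and the twelve-step letter scale (with C+ directly below B-) is an assumption rather than anything taken from Bowman's course.

```python
from dataclasses import dataclass, field

QUESTIONS = 10          # from the post: a ten-question exam
PASS_MARK = 9           # from the post: at least nine questions passed
MAX_ATTEMPTS = 1 + 5    # assumption: one initial attempt plus five retakes

# Assumed letter scale, lowest to highest.
GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


@dataclass
class GatewayRecord:
    """One student's gateway exam history (hypothetical structure)."""
    scores: list = field(default_factory=list)

    def passed(self) -> bool:
        # Any single passing attempt is enough.
        return any(score >= PASS_MARK for score in self.scores)

    def can_attempt(self) -> bool:
        # No retakes after passing, and none once the attempts run out.
        return not self.passed() and len(self.scores) < MAX_ATTEMPTS

    def record(self, questions_passed: int) -> None:
        if not 0 <= questions_passed <= QUESTIONS:
            raise ValueError("score must be between 0 and 10")
        if not self.can_attempt():
            raise RuntimeError("no attempts remaining")
        self.scores.append(questions_passed)


def capped_grade(earned: str, record: GatewayRecord) -> str:
    """Apply the gateway rule: without a pass, the grade stays below B-."""
    if record.passed():
        return earned
    ceiling = GRADES.index("B-") - 1
    return GRADES[min(GRADES.index(earned), ceiling)]
```

The gateway acts only as a ceiling: a student who never passes keeps whatever lower grade they earned, while a student who passes on any attempt is unaffected by earlier failures.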

Another useful idea from the reading is the distinction between a revision and a new attempt. In my own teaching, I have allowed revisions, but in CS222 I frequently find myself suggesting that students begin assignments anew with new code or contexts. This was never a clear requirement but rather a strong suggestion. Separating these two ideas could increase clarity about the significance of an error or misunderstanding. In particular, this could help with an error mode that I have seen in CS222: a student submits a source code evaluation, I critique the evaluation, and the student resubmits an evaluation that restates what I just pointed out in the critique. This masks the distinction between a student who has learned the material and one who can effectively parrot my commentary. The problem could be avoided if I required new attempts in cases where I am using my feedback to direct the student's attention to what they have missed rather than to point out small oversights.

Regular readers may recall that I experimented with specifications-based grading in my section of CS222 in Fall 2023. I only laid out cases for A, B, C, D, and F grades, similarly to how I have implemented specs grading in CS315. The reading suggested that +/- grades can also be laid out in a specification, using them for "in between" cases.

I regularly air my frustrations with the incoherent concept of "mid-semester grades," but a piece of advice from the book struck me as useful. There was a recommendation to only give A, C, or F grades at midsemester. This is probably the right level of granularity for the task. The alternative, which I also recently came across in a blog somewhere, was to have students write their own mid-semester evaluations as a reflective exercise.

Bowman and others separate their standards into core and auxiliary. This could be useful in both CS222 and CS315, where I tend to weave together content that the syllabus requires students to know with content that I think is useful from my experience.

The authors directly address the problem that reassessments have to be meaningful. Unlimited resubmissions will inevitably lead to students' throwing mediocre attempts at the problem in hopes that it goes away. The authors suggest two techniques for ensuring reassessments are meaningful. The first is to gate the possibility of reassessment behind meaningful practice, which probably works better in courses with more objective content such as mathematics courses. The other is to require a reflective cover sheet. I have required students to give memos explaining the resubmission, but I've never given them a format for what this entails. This has led to my accepting many "memos" that show little evidence of understanding, usually when my patience is exhausted. Formalizing the memo process would benefit everyone involved.

Those are all helpful ideas for this summer, when I will likely take elements of CS222 and CS315 back to the drawing board, but what about the resubmission rate issue that I was actually looking into? Well, I found quite a surprise. The authors suggest exactly what I have been doing for years: using a token-based system or throttling resubmissions. The real puzzle here then is what exactly they mean by "reattempts without penalty," since it's not what those words actually mean together. Only being able to reattempt a subset of substandard assignments is a penalty, since from a pure learning point of view, there's no essential reason to prevent the rest. That is, the penalty is coming from the practical matter that teachers cannot afford to teach every student as if that student were their only responsibility. This finding was anticlimactic, but part of me expected that it would be what I would find. There's no silver bullet, and if I have neither seen nor invented something better in 20+ years of alternative grading experience, then it does not exist.

(It's funny to actually type out "20+ years of alternative grading experience," but it's true. It's also one of those things that's making me feel old lately.)

Tuesday, December 19, 2023

On the ethics of obscurity

Years ago, I experimented with what is now called "specifications grading" in CS222. I set up a table that explained to a student how their performance in each category would affect their final grade. These are not weighted averages of columns but declarations of minima. For example, to get an A may require earning at least a B in all assignments, an A on the final project, and a passing grade on the exam. This gave a clarity to the students that was lacking when using more traditional weighted averages. While publishing weighted average formulae for students technically makes it possible for them to compute their grade for themselves, in practice, I have rarely or never found a student willing to do that level of work. Hence, weighted averages, even public ones, leave grades obscure to the students, whereas specification tables make grades obvious.

What my experiment found was that specifications grading made students work less than weighted averages did. The simple reason for this is that if a student sees that their work in one category has capped their final grade, they have no material or extrinsic (that is, grade-related) reason to work in other columns. Using the example above, if a student earns a C on an assignment and can no longer earn an A in the class, they see that they may as well just get a B in the final project, too, since an A would not affect their final grade.
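The table of minima and the capping effect described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual CS222 table: only the A row comes from the example in the post, the B, C, and D rows are hypothetical, and the function and category names are invented for the sketch.

```python
# Letter grades ordered lowest to highest so they can be compared by position.
ORDER = ["F", "D", "C", "B", "A"]


def at_least(grade: str, minimum: str) -> bool:
    return ORDER.index(grade) >= ORDER.index(minimum)


# Each row declares the minima for one final grade, checked from the top down.
# Only the A row is from the post; the other rows are hypothetical.
SPEC = [
    ("A", {"assignments": "B", "project": "A", "exam_passed": True}),
    ("B", {"assignments": "C", "project": "B", "exam_passed": True}),
    ("C", {"assignments": "D", "project": "C", "exam_passed": False}),
    ("D", {"assignments": "F", "project": "D", "exam_passed": False}),
]


def final_grade(assignments: list, project: str, exam_passed: bool) -> str:
    """Return the highest grade whose minima are all met (assignments non-empty)."""
    worst = min(assignments, key=ORDER.index)  # "all assignments" means the worst one
    for grade, minima in SPEC:
        if (at_least(worst, minima["assignments"])
                and at_least(project, minima["project"])
                and (exam_passed or not minima["exam_passed"])):
            return grade
    return "F"


print(final_grade(["A", "B", "B"], "A", True))  # A
print(final_grade(["A", "C", "B"], "A", True))  # B: the single C caps the grade
print(final_grade(["A", "C", "B"], "B", True))  # B: the project A made no difference
```

The last two calls show the motivational problem directly: once one assignment has earned a C, an A and a B on the final project produce the same course grade.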

This semester in CS222, I decided to try specifications-based final grades again. It probably does not surprise you, dear reader, that I got the same result: students lost motivation to do their best in the final project because of their poor performance on another part of the class. It's worse than that, though: the final project is completed by teams, and some team members were striving for and could still earn top marks while other team members had this door closed to them. That's a bad situation, and I am grateful for the students who candidly shared the frustration this caused them.

The fact is that students can and do get themselves into this situation with weighted averages as well. A student's performance in individual assignments may have doomed them to a low grade in the class despite their performance on the final project, for example. However, as I already pointed out, this is obscured to them because of their unwillingness to do the math. What this means—and I have seen it countless times—is that students will continue to work on what they have in front of them in futile hope that it will earn them a better grade in the course.

And that's a good thing.

The student's ends may be unattainable, but the means will still produce learning. That is, the student will be engaged in the actual important part of the class.

Good teaching is about encouraging students to learn. That is why one might have readings, assignments, projects, quizzes, and community partners: these things help engage students in processes that result in learning. It is a poor teacher whose goal is to help students get grades rather than to help them learn. Indeed, every teacher who has endeavored to understand the science of learning at all knows that extrinsic rewards destroy intrinsic motivation.

What are the ethical considerations of choosing between a clear grading policy that yields less student learning and an obscure one that yields more? It seems to me that if learning is the goal, then there is no choice here at all. How far can one take this—how much of grading can we remove without damaging the necessary feedback loops? This is the essential question pursued by the ungrading movement, which I need to explore more thoroughly.

I also wonder, why exactly haven't we professors banded together and refused to participate in grading systems that destroy intrinsic motivation?

Saturday, April 27, 2019

Two missing specifications in HCI

I finished grading my students' final projects for the Spring 2019 HCI class (CS445, formerly CS345). Before the start of the semester, I wrote about how I would try specifications grading in the course. After an afternoon of grading, I realized that I missed two important specifications. I will share them here so that I have a better chance of remembering when planning Fall's class, since I've been assigned to teach the course again.

I should have had a specification requiring all non-trivial processing to be done off of the event thread. This is, of course, a requisite for any kind of multi-threaded UI programming. I specifically chose a data source that would require handling slow load times and long processing times so that my students could practice this technique. I developed a sample project in the first half of the semester based around this common practice, and I explained to them why it was important. However, I neglected to have a specification about it. Three hours before the final project was due, I had a student ask for some last-minute troubleshooting. He said that he added a spinner while some images loaded, but it wasn't showing up. Of course, it wasn't showing up because he was loading the image on the event thread. I showed him (again) the example from earlier in the semester and explained (again) why this pattern was necessary. From their final technical presentations, it was clear that he was the only person in the class of roughly twenty students who understood this crucial point. I believe this is an instance of the old standard motto: if it's important, make it worth points. I simply missed it in my specifications.
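The spinner problem can be reproduced without any GUI toolkit. The sketch below is a minimal Python model, not the course's sample project: the event queue, the repaint mechanism, and every function name are stand-ins invented for illustration. Repaint requests are queued like any other event, so nothing is drawn until the current handler returns.

```python
import queue
import threading
import time

events = queue.Queue()      # stand-in for a toolkit's event queue
state = {"view": "empty"}   # what the UI should currently show
painted = []                # what the user actually saw, in order


def paint():
    painted.append(state["view"])


def set_view(view):
    state["view"] = view
    events.put(paint)       # repaints are queued, not performed immediately


def slow_load_image():
    time.sleep(0.2)         # simulate a slow data source
    return "image"


def click_blocking():
    """Bug: loads on the event thread, so the spinner is never painted."""
    set_view("spinner")
    set_view(slow_load_image())
    events.put(None)        # sentinel that ends the demo loop


def click_background():
    """Fix: load on a worker thread and post the result back as an event."""
    set_view("spinner")

    def worker():
        image = slow_load_image()

        def finish():       # runs later on the event thread
            set_view(image)
            events.put(None)

        events.put(finish)

    threading.Thread(target=worker, daemon=True).start()


def run(handler):
    """Run a tiny event loop until the sentinel; return what was painted."""
    state["view"] = "empty"
    painted.clear()
    events.put(handler)
    while (event := events.get()) is not None:
        event()
    return list(painted)


print(run(click_blocking))    # ['image', 'image']
print(run(click_background))  # ['spinner', 'image']
```

In the blocking version both queued repaints run only after the load has finished, so the spinner state is overwritten before it is ever drawn. Real toolkits differ in the details (many coalesce repaint requests), but the reason the student's spinner never appeared is the same.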

The other specification deals with acceptance testing. There are two relevant specifications in the evaluation plan, one at B-level and one at A-level. Specification B.5.R says that the final report "describes the methods by which the solution was evaluated," and A.2.R says that "The documented solution evaluation includes both quantitative and qualitative elements that explicitly align with this semester's readings." The B-level specification is designed to be broad: you can earn a B on the project by doing any kind of acceptance testing. The A-level specification is designed to be more focused: do a mixed methods evaluation based on a theory we studied this semester. None of the five teams explicitly aligned their evaluations with the semester's readings. This didn't stop two of the groups from marking that specification as complete in their respective checklists, casting serious doubt on the implicit claim that they had conducted the required self-assessment of which the checklist is a result. (Perhaps, then, I need to add more rigor to the self-assessment itself, requiring them to link their claim to the artifacts.)

The problem with the acceptance testing actually goes much deeper than dishonest claims of completion. Among those who conducted any kind of acceptance testing, there was no evidence of their having learned anything from the assigned readings and exercises relating to Steve Krug's and Jakob Nielsen's theories. Instead, they followed ad hoc approaches that were poorly designed and yielded unreliable results. They did actually use quantitative and qualitative approaches, in keeping with A.2.R, but they did not do these well. For example, many groups asked questions like "What did you think of the application?" and then reported "3/6 users say they liked it." I pointed out in my feedback that 50% of users claiming they liked it is different from 50% of users liking it. More importantly, "liking" the application was not one of our design goals: we were designing systems to be used. Yet, only one of the groups conducted a task-based evaluation, where the user was asked to accomplish a goal using their system. Task-based evaluation is what I expected, and task-based evaluation is what I wanted. However, I wanted the students to realize that this kind of evaluation was the right choice, so I left the specification open to other options. The other options were demonstrably worse. Hence, in the future, and particularly in this introductory HCI course, I should just require them to follow the best practice rather than give them the choice to shoot themselves in the proverbial feet.

I have to wonder if the students would have spontaneously met these criteria if they had taken notes during our discussions