RoboTeach: Semi-automatic method for grading a million homework assignments
From Strata:
Organize solutions into clusters and “force multiply” feedback provided by instructors
One of the hardest things about teaching a large class is grading
exams and homework assignments. In my teaching days a “large class” was a few hundred students (still a challenge for the TAs and instructor), but in the age of MOOCs, classes with a few (hundred) thousand students aren’t unusual.
Researchers at Stanford recently
combed through over one million homework submissions from a large MOOC
class offered in 2011. Students in the machine-learning course submitted
programming code for assignments that consisted of several small
programs (the typical submission was about 16 lines of code). While over
120,000 students enrolled, only about 10,000 completed all homework
assignments (about 25,000 submitted at least one assignment).
The researchers were interested in finding ways to ease the
burden of grading the large volume of homework submissions. The premise
was that by sufficiently organizing the “space of possible solutions,”
instructors would need to provide feedback on only a few submissions, and that feedback could then be propagated to the rest.
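The propagation idea can be sketched in a few lines. This is an illustrative toy, not the researchers' system: the cluster assignments and feedback strings below are hypothetical, standing in for the output of the similarity analysis described later in the article.

```python
# Sketch: instructor grades one representative per cluster; the same
# feedback is then copied to every submission in that cluster.

def propagate_feedback(clusters, instructor_feedback):
    """clusters: {cluster_id: [submission_id, ...]}
    instructor_feedback: {cluster_id: feedback for one representative}
    Returns {submission_id: feedback} covering every submission."""
    graded = {}
    for cluster_id, members in clusters.items():
        note = instructor_feedback.get(cluster_id, "no feedback yet")
        for submission_id in members:
            graded[submission_id] = note
    return graded

# Hypothetical clusters and instructor notes:
clusters = {0: ["s1", "s2", "s3"], 1: ["s4", "s5"]}
feedback = {0: "Vectorize the inner loop", 1: "Off-by-one in the update step"}
print(propagate_feedback(clusters, feedback)["s3"])  # Vectorize the inner loop
```

The force multiplication is in the ratio: two pieces of instructor feedback cover five submissions here, and the effect grows with cluster size.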
Domain-specific metrics
Organizing the space of homework submissions required a bit of domain expertise. The researchers settled on two dimensions: functional variability and coding style (syntactic variability). Unit test results were used as a proxy for functional variability. In the machine-learning course unit test results were numbers, and programs were considered functionally equal if the resulting output vectors were the same. Abstract syntax trees (an AST is a tree representation of code structure) and tree edit distance were used to measure the stylistic similarity of code submissions.
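Both metrics can be illustrated with a small sketch. The functional check mirrors the article's description (equal output vectors under shared inputs); for the syntactic side, a full tree edit distance (e.g. Zhang–Shasha) is more involved, so a bag-of-AST-node-types difference stands in here as a deliberately crude proxy. All function names and sample submissions are hypothetical.

```python
import ast
from collections import Counter

def output_vector(fn, inputs):
    """Functional fingerprint: run a submission on shared test inputs."""
    return [fn(x) for x in inputs]

def functionally_equal(fn_a, fn_b, inputs):
    """Two programs are 'functionally equal' if their output vectors match."""
    return output_vector(fn_a, inputs) == output_vector(fn_b, inputs)

def ast_profile(source):
    """Count AST node types -- a cheap stand-in for tree structure."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def syntactic_distance(src_a, src_b):
    """Symmetric difference of node-type counts (NOT true tree edit distance)."""
    profile_a, profile_b = ast_profile(src_a), ast_profile(src_b)
    return sum((profile_a - profile_b).values()) + sum((profile_b - profile_a).values())

# Two functionally identical submissions with different style:
sub_a = "def f(x):\n    return x * 2\n"
sub_b = "def f(x):\n    y = x * 2\n    return y\n"
print(syntactic_distance(sub_a, sub_b))  # small but nonzero
print(functionally_equal(lambda x: x * 2, lambda x: x + x, [0, 1, 2]))  # True
```

The point of the two axes is that they disagree in useful ways: `sub_a` and `sub_b` above are functionally identical but stylistically distinct, which is exactly the kind of pair an instructor may want to grade once and comment on twice.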
The figure above shows the landscape of ~40,000 student
submissions to the same programming assignment in Coursera’s Machine
Learning course. Nodes represent submissions and edges are drawn between
syntactically similar submissions. Colors correspond to performance on a
battery of unit tests (with red submissions passing all unit tests). In
particular, clusters of similarly colored nodes correspond to multiple
similar implementations that behaved in the same way (under unit tests).
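A graph like the one described can be assembled by thresholding pairwise syntactic distance. The sketch below is a toy, assuming a `distance` function is supplied (in practice something like the AST-based measure discussed above); the length-difference distance used in the example is purely illustrative.

```python
from itertools import combinations

def build_similarity_graph(submissions, distance, threshold):
    """submissions: {submission_id: source}. Connects pairs whose
    pairwise distance is at most `threshold`. Returns an adjacency dict."""
    graph = {sid: set() for sid in submissions}
    for a, b in combinations(submissions, 2):
        if distance(submissions[a], submissions[b]) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical submissions; toy distance = difference in source length.
subs = {"s1": "return x*2", "s2": "return 2*x", "s3": "return x**3"}
toy_distance = lambda u, v: abs(len(u) - len(v))
print(build_similarity_graph(subs, toy_distance, 0))
# s1 and s2 end up connected; s3 is isolated
```

Once the graph exists, connected clusters of same-colored nodes are exactly the groups that can share one piece of instructor feedback. Note the naive all-pairs loop is quadratic; at the scale of 40,000 submissions one would need indexing or sampling to avoid ~800 million comparisons.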