No more marking? Not quite

The principle behind Comparative Judgement sounds excellent: through a process of side-by-side comparisons, a team of assessors arrives at a statistically robust evaluation of the relative quality of the various pieces of work. Human beings are notoriously bad at absolute judgements (such as applying a mark scheme to an essay, discussed perceptively by Daisy Christodoulou in The Adverb Problem), but are able to make comparative judgements of quality reliably and quickly. With this in mind, and inspired by David Didau’s excellent posts such as this, I was keen to explore what it had to offer to the work of a busy English department.

Getting data right is essential for me, and we’ve gone a long way to make our use of marks (as opposed to verbal or written feedback) more reliable through regular moderation exercises and the introduction of much simpler assessment terminology for quick feedback, similar to what Tom Sherrington is developing for KS3. Getting teachers to think in terms of rank order before they apply formal mark schemes is also very helpful: work out the mark for your best essay and your worst essay, and you’ve at least got the range of marks to work within.

However, what Comparative Judgement offers, according to the website, takes this to a new level of rigour and reliability. The question is, how useful is this system for a department where each teacher is ploughing through piles of books each week, and where getting everyone around a table to mark together is a scarce luxury?

The Trial

We decided to undertake a comparative judgement exercise on our Lower 6th mock exam, where all our Pre-U English Literature candidates were taking the same unseen critical appreciation paper. The logic here was that we would have a reasonable number of responses to a small range of questions, and that the data would be meaningful in giving a snapshot of student performance across a whole cohort.

The Procedure

As soon as the examination was completed, I collected the scripts and anonymised them, covering up student names with a sticky label (if you want to replicate our experiment, stock up on sticky labels – they play a very important role!) and identifying each with a random code of letters and numbers. I then scanned the scripts as PDFs and uploaded them to the website.

Setting up the judging on the website took me to the limits of my capacity to understand statistics, and the guidance on their website presupposes a rather higher level of comfort with statistics than, I think, most non-mathematicians possess. In the end I used the parameters set out below:


The Scale field is set to Median=50, Standard deviation=15, as recommended by the guide, to give a meaningful spread of results. I chose Distributed for Script Selection Type (also recommended in the guidance) and left Anchor Scores off. The CJ Estimation button turned out to be a godsend: click it each time you change the number of judgements or alter the judging setup, and it re-calculates the number of judgements each item receives.
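For the curious, the Scale setting amounts to a linear rescaling of the raw model estimates. Comparative judgement tools typically fit a Bradley-Terry or Rasch-style model that gives each script an ability estimate in logits, which is then stretched and shifted onto the reporting scale. This is my assumption about the mechanics, not the website's documented algorithm; a minimal sketch with made-up numbers:

```python
import statistics

def rescale(thetas, target_median=50, target_sd=15):
    """Linearly rescale raw ability estimates (logits) so the cohort
    lands on the reporting scale: centre on the target median and
    stretch to the target standard deviation."""
    med = statistics.median(thetas)
    sd = statistics.pstdev(thetas)
    return [target_median + (t - med) * (target_sd / sd) for t in thetas]

# Hypothetical raw estimates for five scripts, weakest to strongest
raw = [-1.2, -0.4, 0.0, 0.6, 1.8]
scaled = rescale(raw)
print([round(s, 1) for s in scaled])
```

Because the transformation is linear, the rank order of the scripts is untouched; only the units change, which is why a scaled score of 75.7 can be read directly as a distance above the cohort median in standard-deviation terms.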

A word on the wording. I decided for this exercise to be very open in my wording, omitting any mention of assessment objectives or grade descriptors. Hence, it was purely the judgement of a professional English teacher on which essay had done a better job of responding to an unseen text. I believe we can make good judgements without assessment objectives introducing artificial hindrances: ‘does the response do justice to the text?’ should be our guiding principle.

Thereafter it was quite straightforward. Once the candidates were uploaded, I entered the e-mail addresses of my team of judges and sent each judge the unique URL for their allocation. Then I sat back and waited to see what would happen.

The Judging

The process of reading twenty pairs of essays took about three hours for each judge, with the median time per judgement hovering around the five-minute mark. I would note that judging becomes considerably quicker as you go along: each judge soon begins to see essays repeated, so you can in some cases make near-instant judgements. From a leadership perspective, it’s important to encourage your team to stick with the judgement process and reassure them it will get quicker. Three hours seems a reasonable amount of time to mark thirty examination essays, so per person it’s no more arduous than normal marking. However, with six people in the team, that’s a total of eighteen hours of department time invested in this exercise. The question is, was it worth it?

And here’s where the sticky labels came in again: I asked colleagues to make brief notes while reading each essay, summing up their thoughts or its particularly salient points. These were written on a sticky label so that, once the judging was complete, we could stick the labels to each student’s work, giving every student feedback not from one judge but from five or six. How well this worked, I will discuss later.

What we learnt

When you download the Candidates spreadsheet, you’ll see something which looks like this:


The first column is the anonymised ID, and the scaled score uses (for this assessment) the median of 50 and standard deviation of 15 set earlier. The other interesting column is infit, which measures the level of agreement between judges.

The first thing we learned was that candidates performed as we expected them to: those with the best GCSE and ALIS scores did best, though there were one or two surprises where strong students had underperformed. So far, not worth eighteen hours.

We were able to tell that our two top candidates were noticeably better than the next four or five, sitting over one and a half standard deviations above the median with scaled scores of 75.7 and 74.3. That said, two standard deviations is the conventional threshold for statistical significance, so we’re not going to get too excited, yet.

Slightly further down the list, we began to see interesting clusters emerging: groups of candidates whose scaled scores were spread by only two or three points. This told us that, in our collective view, these candidates had performed effectively identically. We also discovered that, with the exception of one outlier who had just had a bad day, scores below the median were less widely spread than those above it.

The implications for marking are several. Firstly, I feel (and I’ve no statistical evidence to back this up beyond empirical observation) that teachers are often reluctant to give two very different pieces of work the same mark. We try to manufacture differences, giving one piece an 18 and another a 19 in order to satisfy our own prejudices or instincts. Undertaking comparative judgement led us to the view that there are not only different ways to achieve an 18, but that two 18s may look very different and do very different things. Secondly, it also avoided the opposite problem of weaker essays being disproportionately harshly marked because our perception of their relative weakness is skewed by their status as outliers.

The ‘Infit’ column is also worth a close look. Where a candidate scores below 1 on this, there is a high degree of agreement between judges on the quality of their work. Where it exceeds 1, the judges are more divided. What surprised me is how much we disagreed over our best candidate: their scaled score was 75.7 (over one and a half standard deviations above the median), but their infit was 1.59. This prompted some productive discussions of what we were calling a ‘good’ essay.
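For readers who want to peek under the bonnet: the website does not spell out its formula, but the standard Rasch infit is an information-weighted mean square of residuals, comparing how a script actually fared in each comparison with how the model expected it to fare. A hedged sketch of that statistic for a single script, assuming a simple Bradley-Terry model of pairwise wins (the function and data are illustrative, not the site’s actual implementation):

```python
import math

def infit(judgements, theta):
    """Information-weighted mean-square fit for one script.
    judgements: list of (opponent_theta, won) pairs for this script;
    theta: this script's own ability estimate (in logits).
    Values near 1 mean results were about as predictable as the model
    expects; values well above 1 mean judges disagreed surprisingly often."""
    num = den = 0.0
    for opponent, won in judgements:
        p = 1.0 / (1.0 + math.exp(-(theta - opponent)))  # expected win prob
        x = 1.0 if won else 0.0
        num += (x - p) ** 2       # squared residual (surprise)
        den += p * (1.0 - p)      # model variance (information)
    return num / den
```

A strong script that beats weaker rivals and loses to stronger ones produces small residuals and an infit below 1; a strong script that unexpectedly loses to a much weaker one produces a large residual and an infit well above 1, which is exactly the pattern behind our best candidate’s 1.59.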

Arriving at the marks

The process of marking was quite straightforward. I took the top essay and applied the mark scheme to it, arriving at a mark out of 25. Then I divided the scaled scores into groups, deciding that where there was a cluster of essays with similar scaled scores (±5 was my rule of thumb), they would receive the same mark. I then decided on a mark for the bottom essay and worked through the essays from top to bottom, assigning marks without looking at the mark scheme again, using the comparative judging data as my guide. To test the efficacy of this, I gave the essays to colleagues and asked them, ‘does this feel like an X to you?’. While this falls foul of the anchoring effect, we did, as a team, agree that the marks were right.
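The ±5 clustering rule can be sketched in a few lines. This is a minimal illustration with made-up scores, and the greedy one-pass grouping (start a new cluster whenever a score falls more than five points below the top of the current cluster) is just one reasonable reading of the rule of thumb, not a definitive procedure:

```python
def cluster_scores(scores, gap=5):
    """Group scaled scores into clusters that share a mark.
    Scores are sorted descending; a new cluster begins when a score
    falls more than `gap` points below the first (highest) score of
    the current cluster."""
    ordered = sorted(scores, reverse=True)
    clusters = [[ordered[0]]]
    for s in ordered[1:]:
        if clusters[-1][0] - s <= gap:
            clusters[-1].append(s)   # close enough: same mark
        else:
            clusters.append([s])     # gap too wide: new mark band
    return clusters

# Hypothetical scaled scores for six scripts
print(cluster_scores([75.7, 74.3, 61.0, 58.2, 57.9, 44.0]))
# → [[75.7, 74.3], [61.0, 58.2, 57.9], [44.0]]
```

Anchoring a mark scheme mark to the top cluster and another to the bottom, then stepping the marks down band by band, is then just a matter of walking the cluster list from first to last.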

Was it worth it?

In short, yes, because it brought us together as a team, with everyone having spent a good chunk of time scrutinising a whole cohort’s set of exam essays. We could have a really meaningful discussion about teaching implications, and we knew that the data we were providing to tutors and heads of year was meaningful and accurate. The department meeting following the exercise was a really fruitful, challenging and illuminating session.

We also learned some interesting facts about ourselves and our marking. As HoD I got a breakdown of the reliability of my team’s judgements (using an ‘infit’ score as described above), which will provide an interesting talking point when thinking about professional development (though I would never judge a teacher’s performance on this exercise alone: a busy week, an ill child, a winter cold could all throw someone’s performance off badly).

For our students? I’m less sure. I think it was interesting for some to get feedback from many different markers, and many were able to spot patterns of comments which the markers had made about their work. However, for those over whose work judges were less able to agree (resulting in a high infit score), the contradictions between the comments were confusing, not enlightening. Careful feedback from class teachers and opportunity for students to reflect meaningfully was essential for them to get the most out of it.

Will we do it again?

Yes, but I’m not sure when. We might use it for a sampling exercise for large-scale exams (such as fifth form mocks), and I could see it being an interesting way of getting classes to carry out peer-assessment: I could set them up as judges and assign them a number of judgements to make for homework. If the work were typed, then it would reduce the chance that they would let personal loyalty or animosity guide their views. However, the amount of time taken to generate data means that we will have to continue to find the elegant compromises which make the work of a teacher so endlessly rewarding.
