Comparative Judgement, Part 2

Some weeks ago I wrote a detailed blog post on a comparative judgement exercise my department undertook to evaluate its effectiveness in assessing Year 12 English Literature unseen mock examinations. David Didau and Chris Wheadon both offered detailed, thoughtful comments on our methodology, and I resolved to try the process again, this time using my Year 11 students as markers. This blog post reports what I learned from the process, and I hope will prove useful to teachers interested in this powerful tool.

The method

All students had completed a 1 hour timed piece of writing (IGCSE English Language ‘Directed Writing’). I anonymised their work, assigning a random three-number code to each student. These I scanned, uploaded to and set up my class as judges. I briefed the students on what I wanted them to do, telling them they should average about 30 seconds per judgement. I included myself in the judging panel, to determine my infit with the students.

The results

were interesting.

Here are our top three candidates, according to the comparative judgement exercise:

CJ Top 3

There was a fair level of consensus on the best essay (standardised score of 86.23, comfortably outside 2 standard deviations from the mean of 50 (σ=15), whose fluent, sparky prose clearly caught our class’s eye. The next best candidate was also highly praised, but less uniformly: their infit score was above 1, though not by much. The third best was a long way back, and the top of a big cluster of results.

The thing was, our judges had got it wrong. The best candidate set up a line of argument in their introduction, and promptly completely contradicted themself on page two. Equally, after a purposeful start, our second-placed candidate lapsed into those tiresome ‘semi-sentences’ so beloved of teenagers (omitting verbs, or using commas and full-stops as one were only a sort of non-committal version of the other). Neither was our best. Our best got off to a slow start, but by the end of the essay had developed an argument of genuine sophistication and force.

My Conclusions:

  1. Judges can make reliable judgements quickly, but a quick judgement is not always a reliable of a complex argument;
  2. Teenagers are seduced by the meretricious, and not inclined to read carefully;
  3. This exercise taught me a lot more about what my students thought was good, than about the genuine quality of the work. Therefore, it was an unmitigated, unexpected success.
    1. If my class agrees that a flawed essay is brilliant, I need to address that in my teaching.

And so, my journey of comparative judgement continues.


I strongly recommend you have a look at Chris Wheadon’s blog and David Didau’s post on Rethinking Assessment for more on this fascinating development in how we assess our students’ work.

Comparative Judgement, Part 2

Reading aloud allowed

My first post on this blog was an attempt to put into words my thoughts having completed a comparative judgement trial based on Lower 6th essays completed for a timed examination. I am exceedingly grateful to David Didau and Chris Wheadon for their generous and thoughtful comments, which have greatly expanded my understanding of the process and its rewards.

One of the most telling criticisms made of our methodology was the time it took for a teacher to arrive at a judgement. Our median time was about five minutes; Chris Wheadon claims evidence that reliable judgements can be achieved in as little as seven seconds with a median of about half a minute. There are a number of factors which slowed us down, but it prompts me to ask: how often do we read a student’s work really closely?

I would suggest, not that often. The standard model of marking for an English teacher is about as bad as you can get: a pile of essays must be waded through, and each assigned a slot on an arbitrary scale. This makes our reading bad in two ways: firstly, we have a pile of essays to mark, so are disinclined to spend long on each essay; secondly, we tend to look for things we can tick, ‘analysis of language’, technical vocabulary and suchlike.

I’ve tried a range of moderation strategies in my department, from blind marking and submission of marks anonymously (try using Google Forms for this), ranking exercises, paired marking and suchlike. Each has their place, yet none come close in terms of real value to sitting down together around a table and reading students’ work aloud. I pick out a pretty random selection of essays (controlling for gender if it’s obviously skewed), anonymise them and then, we read them. Aloud.

Reading aloud is slow and effortful, horribly inefficient when it comes to that pile of essays. Yet it assures three things:

  • Every teacher around the table has taken in every word that candidate has written. Silent reading cannot guarantee this, even with trained professionals (our attention may be elsewhere, or the handwriting may just be too crabbed to read);
  • We have had to make sense of the work. To read aloud is to turn words into meaning, the voice articulating ideas and their relationship to one another;
  • We will arrive at a shared view of what the writer has actually said.

Having done this, we can begin to apply the mark scheme, working carefully on thoughts which are fresh in our mind. Going through four or five essays takes about three quarters of an hour, but it is time very well spent.

Does it make a difference? I think it does. It reminds us what we’re looking for and what we think is a good answer, as well as reminding us to look beyond the easily-ticked technical terms and suchlike. Equally importantly, it’s led to more consistent marking, with less variation between teachers, and, hence, more reliable data on which we can base our discussions.

Why slow reading matters

One of the most humbling experiences of my career occurred while at a standardisation meeting for Pre-U English Literature. We’d read essays aloud around the table, dissected their arguments and I’d enjoyed every moment of it. Later in the session, the team of examiners were divided into pairs to take away a small pile of scripts and mark in tandem, each moderating the other’s work. I was paired up with a colleague some decades my senior, and it was totally instructive to watch her work. When she read a student’s essay, it was as if she were in the room with that student, talking to them as they wrote, reading and re-reading to ensure she was completely certain of what they had said. When they misquoted, she knew instantly (these essays were on Hamlet, Measure for Measure, Henry IV Part 1, as well as works by Pinter, Churchill and Jonson); when they misrepresented the play, she picked them up on it; when they illuminated something, she praised them to the skies.

Ever since then, I’ve known I ought to replicate that care and attention. I’d like to say I attempt it with every pile of IGCSE essays, but I’d be lying through my teeth. Sometimes, though, I do lock myself in my classroom and read aloud, trying to put a voice to the words on the page. I listen better that way.

Reading aloud allowed

No more marking? Not quite

The principle behind Comparative Judgement sounds excellent: through a process of side-by-side comparisons a team of assessors arrive at a statistically robust evaluation of the relative quality of the various pieces of work. Human beings are notoriously bad at absolute judgements (such as applying a mark scheme to an essay, discussed perceptively by Daisy Christodolou in The Adverb Problem), but able to make comparative judgements of quality reliably and quickly. With this in mind, and inspired by David Didau’s excellent posts such as this, I was keen to explore what it had to offer to the work of a busy English department.

Getting data right is essential for me, and we’ve gone a long way to make our use of marks (as opposed to verbal or written feedback) more reliable through regular moderation exercises and the introduction of much simpler assessment terminology for quick feedback, similar to what Tom Sherrington is developing for KS3. Getting teachers to think in terms of rank order before they apply formal mark schemes is also very helpful: work out the mark for your best essay and your worst essay, and you’ve at least got the range of marks to work within.

However, what Comparative Judgement offers, according to the website, takes this to a new level of rigour and reliability. The question is, how useful is this system for a department where each teacher is ploughing through piles of books each week, and where getting everyone around a table to mark together is a scarce luxury?

The Trial

We decided to undertake a comparative judgement exercise on our Lower 6th mock exam, where all our Pre-U English Literature candidates were taking the same unseen critical appreciation paper. The logic here was that we would have a reasonable number of responses to a small range of questions, and that the data would be meaningful in giving a snapshot of student performance across a whole cohort.

The Procedure

As soon as the examination was completed, I collected the scripts and anonymised them, covering up student names with a sticky label (if you want to replicate our experiment, stock up on sticky labels – they play a very important role!) and identifying each with a random code of letters and numbers. The scripts I then scanned as .pdfs and uploaded them to the website.

Setting up the judging on the website took me to the limits of my capacity to understand statistics, and the guidance on their website presupposes quite a higher level of comfort with stats than, I think, most non-mathematicians would possess. In the end I used the parameters set up below:


The Scale field is set to the Median=50, Standard deviation=15 recommended by the guide to give a meaningful spread of results. I chose Distributed for Script Selection Type (recommended in the guidance), and left Anchor Scores off. The CJ Estimation button turned out to be a godsend: make sure you click it each time you change the number of judgements or make changes to the judging to re-calculate the number of judgements each item receives.

A word on the wording. I decided for this exercise to be very open in my wording, omitting any mention of assessment objectives or grades descriptors. Hence, it was purely the judgement of a professional English teacher on which essay had done a better job of responding to an unseen text. I believe we can make good judgements without the need for assessment objectives to introduce artificial hindrances: does the response do justice to the text should be our guiding principle.

Thereafter it was quite straightforward. The candidates being uploaded, I entered the e-mail addresses of my team of judges and sent the unique url for each judge’s allocation to the team. Then, I sat back and waited to see what would happen.

The Judging

The process of reading twenty pairs of essays took about three hours for each judge, with the median time per judgement varying lying around the five minute mark. I would note that the median time becomes considerably quicker as you go along: each judge soon begins to see essays repeated, so you can in some cases make instant judgements. From a leadership perspective, it’s important to encourage your team to stick with the judgement process and reassure them it will get quicker. Three hours seems a reasonable amount of time to mark thirty examination essays, so per person, it’s no more arduous than normal marking. However, with six people in the team, that’s a total of eighteen hours of department time invested in this exercise. The question is, was it worth it?

And here’s where the sticky labels came in again: I asked colleagues to make brief notes while they read an essay summing up their thoughts or particularly salient points about each one. These were to be done on a sticky label so that, once the judging was complete, we could stick the labels to each students’ work so they got feedback not from one judge, but from five or six. How well this worked, I will discuss later.

What we learnt

When you download the Candidates spreadsheet, you’ll see something which looks like this:


The first column is the anonymised id, and the scaled score uses (for this assessment) the median of 50 and a standard deviation of 15. The other interesting column is infit, which measures the level of agreement between judges.

The first thing we learned was that candidates performed as we expected them to: those with the best GCSE and ALIS scores did best, though there were one or two surprises where strong students had underperformed. So far, not worth eighteen hours.

We were able to tell that our two top candidates were noticeably better than the next four or five, being almost one whole standard deviation above the mean, with scaled scores of 75.7 and 74.3. That said, two standard deviations is the usual minimum measure of statistical significance, so we’re not going to get too excited, yet.

Slightly further down the list, we began to see interesting clusters emerging, groups of candidates whose scaled scores were spread by only two or three points. This told us that, in our collective view, these candidates had performed effectively identically. We also discovered that, with the exception of one outlier who had just had a bad day, the range of scores below the median was less widely spread than those above it.

The implications for marking are several. Firstly, I feel (and I’ve no statistical evidence to back this up beyond empirical observation) that teachers are often reluctant to give two very different pieces of work the same mark. We try to manufacture differences, giving one piece an 18 and another a 19 in order to satisfy our own prejudices or instincts. Undertaking comparative judgement led us to the view that there are not only different ways to achieve an 18, but that two 18’s may look very different and do very different things. Secondly, it also avoided the opposite problem of weaker essays being disproportionately harshly marked because our perception of their relative weakness is skewed by their status as outliers.

The ‘Infit’ column is also worth a close look. Where a candidate scores below 1 on this, it means that there is a high degree of agreement between judges on the quality of their work. Where it exceeds 1, the judges are more divided. What surprised me is how much we disagreed over our best candidate: their scaled score was 75.7 (one standard deviation above the mean), but their infit was 1.59. This prompted some productive discussions of what we were calling a ‘good’ essay.

Arriving at the marks

The process of marking was quite straightforward. I took the top essay and applied the mark scheme to it, arriving at a mark /25. Then I divided up the scaled scores into groups, deciding that where there were a cluster of essays with similar scaled scores (+/-5 was my rule of thumb), they would receive the same mark. I then decided on a mark for the bottom essay and worked down essays from top to bottom, assigning marks without again looking at the mark scheme, but using the comparative judging data as my guide. To test the efficacy of this, I gave the essays to colleagues and asked them, ‘does this feel like an X to you?’. While this falls foul of the anchoring effect, we did, as a team, agree that the marks were right.

Was it worth it?

In short, yes, because it brought us together as a team with everyone having spent a good chunk of time scrutinising a whole cohort’s set of exam essays. We could have a really meaningful discussion about teaching implications, and we knew that the data we were providing to tutors and heads of year was meaningful and it was right. The department meeting following the exercise was a really fruitful, challenging and illuminating session.

We also learned some interesting facts about ourselves and our marking. As HoD I got a breakdown of the reliability of my team’s judgements, (using an ‘infit’ score as described above), which will provide an interesting talking point when thinking about professional development (though I would never judge a teacher’s performance on this exercise alone: a busy week, an ill child, a winter cold could all throw someone’s performance off badly).

For our students? I’m less sure. I think it was interesting for some to get feedback from many different markers, and many were able to spot patterns of comments which the markers had made about their work. However, for those over whose work judges were less able to agree (resulting in a high infit score), the contradictions between the comments were confusing, not enlightening. Careful feedback from class teachers and opportunity for students to reflect meaningfully was essential for them to get the most out of it.

Will we do it again?

Yes, but I’m not sure when. We might use it for a sampling exercise for large-scale exams (such as fifth form mocks), and I could see it being an interesting way of getting classes to carry out peer-assessment: I could set them up as judges and assign them a number of judgements to make for homework. If the work were typed, then it would reduce the chance that they would let personal loyalty or animosity guide their views. However, the amount of time taken to generate data means that we will have to continue to find the elegant compromises which make the work of a teacher so endlessly rewarding.



No more marking? Not quite