No more marking? Not quite

The principle behind Comparative Judgement sounds excellent: through a process of side-by-side comparisons a team of assessors arrives at a statistically robust evaluation of the relative quality of the various pieces of work. Human beings are notoriously bad at absolute judgements (such as applying a mark scheme to an essay, discussed perceptively by Daisy Christodoulou in The Adverb Problem), but are able to make comparative judgements of quality reliably and quickly. With this in mind, and inspired by David Didau’s excellent posts such as this, I was keen to explore what it had to offer to the work of a busy English department.

Getting data right is essential for me, and we’ve gone a long way to make our use of marks (as opposed to verbal or written feedback) more reliable through regular moderation exercises and the introduction of much simpler assessment terminology for quick feedback, similar to what Tom Sherrington is developing for KS3. Getting teachers to think in terms of rank order before they apply formal mark schemes is also very helpful: work out the mark for your best essay and your worst essay, and you’ve at least got the range of marks to work within.

However, what Comparative Judgement offers, according to the website, takes this to a new level of rigour and reliability. The question is, how useful is this system for a department where each teacher is ploughing through piles of books each week, and where getting everyone around a table to mark together is a scarce luxury?

The Trial

We decided to undertake a comparative judgement exercise on our Lower 6th mock exam, where all our Pre-U English Literature candidates were taking the same unseen critical appreciation paper. The logic here was that we would have a reasonable number of responses to a small range of questions, and that the data would be meaningful in giving a snapshot of student performance across a whole cohort.

The Procedure

As soon as the examination was completed, I collected the scripts and anonymised them, covering up student names with a sticky label (if you want to replicate our experiment, stock up on sticky labels – they play a very important role!) and identifying each with a random code of letters and numbers. I then scanned the scripts as PDFs and uploaded them to the website.

Setting up the judging on the website took me to the limits of my capacity to understand statistics, and the guidance on the website presupposes a rather higher level of comfort with stats than, I think, most non-mathematicians would possess. In the end I used the parameters set up below:


The Scale field is set to the Median=50, Standard deviation=15 recommended by the guide to give a meaningful spread of results. I chose Distributed for Script Selection Type (recommended in the guidance), and left Anchor Scores off. The CJ Estimation button turned out to be a godsend: make sure you click it each time you change the number of judgements or make changes to the judging to re-calculate the number of judgements each item receives.
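For anyone wondering what the Scale setting does in practice, here is a minimal Python sketch of one way such a rescaling could work. It assumes the tool simply shifts and stretches its raw estimates in a straight line; I don't know how the website actually implements it, and the function name `rescale` and the raw values are invented purely for illustration.

```python
import statistics

def rescale(raw_estimates, centre=50.0, spread=15.0):
    """Linearly map raw estimates so their median is `centre` and their
    (population) standard deviation is `spread`. Hypothetical helper."""
    mid = statistics.median(raw_estimates)
    sd = statistics.pstdev(raw_estimates)
    if sd == 0:
        raise ValueError("all estimates are identical; nothing to spread out")
    return [centre + spread * (x - mid) / sd for x in raw_estimates]

# Invented raw estimates for seven scripts (not our real data).
raw = [-2.1, -0.8, -0.3, 0.0, 0.4, 1.1, 1.9]
scaled = rescale(raw)
print([round(s, 1) for s in scaled])
```

The rank order of the scripts is untouched by this step; only the numbers used to report it change.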

A word on the wording. I decided for this exercise to be very open in my wording, omitting any mention of assessment objectives or grade descriptors. Hence, it was purely the judgement of a professional English teacher on which essay had done a better job of responding to an unseen text. I believe we can make good judgements without the need for assessment objectives to introduce artificial hindrances: ‘does the response do justice to the text?’ should be our guiding principle.

Thereafter it was quite straightforward. The candidates being uploaded, I entered the e-mail addresses of my team of judges and sent the unique url for each judge’s allocation to the team. Then, I sat back and waited to see what would happen.

The Judging

The process of reading twenty pairs of essays took about three hours for each judge, with the median time per judgement lying around the five-minute mark. I would note that the median time becomes considerably quicker as you go along: each judge soon begins to see essays repeated, so you can in some cases make instant judgements. From a leadership perspective, it’s important to encourage your team to stick with the judgement process and reassure them it will get quicker. Three hours seems a reasonable amount of time to mark thirty examination essays, so per person, it’s no more arduous than normal marking. However, with six people in the team, that’s a total of eighteen hours of department time invested in this exercise. The question is, was it worth it?

And here’s where the sticky labels came in again: I asked colleagues to make brief notes while they read an essay, summing up their thoughts or particularly salient points about each one. These were to be done on a sticky label so that, once the judging was complete, we could stick the labels to each student’s work so they got feedback not from one judge, but from five or six. How well this worked, I will discuss later.

What we learnt

When you download the Candidates spreadsheet, you’ll see something which looks like this:


The first column is the anonymised id, and the scaled score uses (for this assessment) the median of 50 and a standard deviation of 15. The other interesting column is infit, which measures the level of agreement between judges.
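To give a feel for how a pile of ‘this one beat that one’ decisions can turn into a score per script, here is a minimal sketch using the Bradley-Terry model, a common statistical model for paired comparisons. That the website uses this model, or this fitting method, is my assumption; the function name `fit_bradley_terry`, the script codes and the judgements are all invented. The sketch also assumes every script wins at least once and loses at least once.

```python
import math
from collections import defaultdict

def fit_bradley_terry(judgements, iterations=200):
    """judgements: list of (winner, loser) pairs.
    Returns a dict of script -> log-strength, centred on zero.
    Fitted with the standard iterative (minorise-maximise) update."""
    items = sorted({s for pair in judgements for s in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {s: 1.0 for s in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over every opponent j that script i was compared with.
            denom = sum(n / (strength[i] + strength[j])
                        for pair, n in pair_counts.items() if i in pair
                        for j in pair - {i})
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {s: v * len(items) / total for s, v in updated.items()}

    logs = {s: math.log(v) for s, v in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {s: v - mean for s, v in logs.items()}

# Invented judgements: each pair of scripts is compared three times.
judgements = [("A", "B"), ("A", "B"), ("B", "A"),
              ("B", "C"), ("B", "C"), ("C", "B"),
              ("A", "C"), ("A", "C"), ("A", "C")]
estimates = fit_bradley_terry(judgements)
for script, value in sorted(estimates.items(), key=lambda kv: -kv[1]):
    print(script, round(value, 2))
```

The raw estimates that come out are centred on zero; a scaled score would then be a relabelling of them onto a friendlier range.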

The first thing we learned was that candidates performed as we expected them to: those with the best GCSE and ALIS scores did best, though there were one or two surprises where strong students had underperformed. So far, not worth eighteen hours.

We were able to tell that our two top candidates were noticeably better than the next four or five, with scaled scores of 75.7 and 74.3: more than one and a half standard deviations above the mean on a scale centred on 50 with a standard deviation of 15. That said, two standard deviations is the usual minimum measure of statistical significance, so we’re not going to get too excited, yet.
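As a quick check on the arithmetic, taking the scale’s centre of 50 and standard deviation of 15 at face value, the distance of each top score from the centre can be worked out in a couple of lines. The helper name `sd_units` is mine, not the website’s.

```python
def sd_units(score, centre=50.0, sd=15.0):
    """How many standard deviations a scaled score sits from the centre."""
    return (score - centre) / sd

for score in (75.7, 74.3):
    print(f"{score}: {sd_units(score):.2f} standard deviations above the centre")
# 75.7: 1.71 standard deviations above the centre
# 74.3: 1.62 standard deviations above the centre

# The score that would sit exactly two standard deviations above the centre.
two_sd_threshold = 50.0 + 2 * 15.0
print(two_sd_threshold)  # 80.0
```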

Slightly further down the list, we began to see interesting clusters emerging: groups of candidates whose scaled scores were spread by only two or three points. This told us that, in our collective view, these candidates had performed effectively identically. We also discovered that, with the exception of one outlier who had just had a bad day, the scores below the median were less widely spread than those above it.

The implications for marking are several. Firstly, I feel (and I’ve no statistical evidence to back this up beyond empirical observation) that teachers are often reluctant to give two very different pieces of work the same mark. We try to manufacture differences, giving one piece an 18 and another a 19 in order to satisfy our own prejudices or instincts. Undertaking comparative judgement led us to the view not only that there are different ways to achieve an 18, but that two 18s may look very different and do very different things. Secondly, it also avoided the opposite problem of weaker essays being disproportionately harshly marked because our perception of their relative weakness is skewed by their status as outliers.

The ‘Infit’ column is also worth a close look. Where a candidate scores below 1 on this, it means that there is a high degree of agreement between judges on the quality of their work. Where it exceeds 1, the judges are more divided. What surprised me is how much we disagreed over our best candidate: their scaled score was 75.7 (well over one standard deviation above the mean), but their infit was 1.59. This prompted some productive discussions of what we were calling a ‘good’ essay.
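For the statistically curious, here is a minimal sketch of how an infit figure of this kind can be calculated, using the information-weighted mean square familiar from Rasch-style models: the squared ‘surprise’ of each result, added up and divided by the amount of surprise the model expected. Whether the website computes exactly this is an assumption on my part. The function names are hypothetical, and the quality values are invented numbers on the model’s raw scale, not scaled scores.

```python
import math

def win_probability(theta_a, theta_b):
    """Modelled chance that script A is judged better than script B."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

def infit(theta, comparisons):
    """comparisons: list of (opponent_theta, won) pairs for one script.
    Returns sum of squared residuals divided by sum of expected variances."""
    squared_residuals = 0.0
    expected_variance = 0.0
    for opponent_theta, won in comparisons:
        p = win_probability(theta, opponent_theta)
        outcome = 1.0 if won else 0.0
        squared_residuals += (outcome - p) ** 2
        expected_variance += p * (1.0 - p)
    return squared_residuals / expected_variance

# A strong script (2.0) compared four times with middling scripts (0.0).
agreed = infit(2.0, [(0.0, True)] * 4)                      # every judge prefers it
contested = infit(2.0, [(0.0, True)] * 3 + [(0.0, False)])  # one judge disagrees
print(round(agreed, 2), round(contested, 2))
```

A single unexpected loss for a strong script is enough to push the figure above 1, which fits our experience of a top essay that divided the judges.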

Arriving at the marks

The process of marking was quite straightforward. I took the top essay and applied the mark scheme to it, arriving at a mark /25. Then I divided up the scaled scores into groups, deciding that where there was a cluster of essays with similar scaled scores (+/-5 was my rule of thumb), they would receive the same mark. I then decided on a mark for the bottom essay and worked down essays from top to bottom, assigning marks without again looking at the mark scheme, but using the comparative judging data as my guide. To test the efficacy of this, I gave the essays to colleagues and asked them, ‘does this feel like an X to you?’. While this falls foul of the anchoring effect, we did, as a team, agree that the marks were right.
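The grouping step could be sketched in Python along the following lines. This is only one reading of my +/-5 rule of thumb: a new group starts whenever a score falls more than five points below the top score in the current group. The script codes and all scores other than the top two are invented, and the sketch deliberately stops short of assigning marks, since that part remained a matter of judgement.

```python
def group_by_score(scores, tolerance=5.0):
    """scores: dict of script code -> scaled score.
    Returns a list of groups (best first); each group is a list of
    (code, score) pairs within `tolerance` of the group's top score."""
    groups = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for code, score in ranked:
        if groups and groups[-1][0][1] - score <= tolerance:
            groups[-1].append((code, score))
        else:
            groups.append([(code, score)])
    return groups

# Hypothetical cohort: only 75.7 and 74.3 come from our real results.
scores = {"A7Q": 75.7, "K2M": 74.3, "P9X": 61.0,
          "T4B": 58.5, "Z1C": 57.2, "R8D": 41.0}
groups = group_by_score(scores)
for group in groups:
    print([code for code, _ in group])
```

Each group would then receive a single mark, working down from the mark given to the top essay to the mark given to the bottom one.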

Was it worth it?

In short, yes, because it brought us together as a team with everyone having spent a good chunk of time scrutinising a whole cohort’s set of exam essays. We could have a really meaningful discussion about teaching implications, and we knew that the data we were providing to tutors and heads of year was meaningful and it was right. The department meeting following the exercise was a really fruitful, challenging and illuminating session.

We also learned some interesting facts about ourselves and our marking. As HoD I got a breakdown of the reliability of my team’s judgements (using an ‘infit’ score as described above), which will provide an interesting talking point when thinking about professional development (though I would never judge a teacher’s performance on this exercise alone: a busy week, an ill child, a winter cold could all throw someone’s performance off badly).

For our students? I’m less sure. I think it was interesting for some to get feedback from many different markers, and many were able to spot patterns of comments which the markers had made about their work. However, for those over whose work judges were less able to agree (resulting in a high infit score), the contradictions between the comments were confusing, not enlightening. Careful feedback from class teachers and opportunity for students to reflect meaningfully was essential for them to get the most out of it.

Will we do it again?

Yes, but I’m not sure when. We might use it for a sampling exercise for large-scale exams (such as fifth form mocks), and I could see it being an interesting way of getting classes to carry out peer-assessment: I could set them up as judges and assign them a number of judgements to make for homework. If the work were typed, then it would reduce the chance that they would let personal loyalty or animosity guide their views. However, the amount of time taken to generate data means that we will have to continue to find the elegant compromises which make the work of a teacher so endlessly rewarding.




13 thoughts on “No more marking? Not quite”

  1. Wow! This is certainly thorough and I’m really impressed you approached the task in this way. I’m in the process of doing a CJ project with Chris Wheadon and the approach we’re taking is nothing like as time-consuming. We’re using objective, anonymous ‘experts’ who have no connection to us or our students. (This costs a very well spent £100!) I’ll share the results as we get them.

    Many thanks, DD


  2. Great to hear about your trial! You pick up on some of the advantages and disadvantages well. The gains relate to sharing of standards, differentiation of marks, feedback on judges. As David says, your judging was very slow – the transition away from marking takes time. Judging is instinctive, so should be quick and easy. We’ve found that English teachers can make good judgements about GCSE essays in 7 seconds, with a median time of 30 seconds. The process is hugely slowed when you ask judges to take notes – which is extraneous to CJ. The problem with note taking is that teachers tend to slip back into a marking mindset rather than staying in the judging zone. However, you are right to say this is not an everyday activity – but really suited to mock exams or baseline or induction testing.


    1. Hi Chris, thanks for taking the time to comment. I’ve given some thought to your points (and David made some very similar ones), and you’ll find my response to his comment. On the subject of making notes, some colleagues commented that they found it slowed them down. However, I would disagree quite strongly that judging should be instinctive: judging a literary argument should require deep engagement and reflection on the ideas put forward.


  3. Jon hide says:

    I, like you, was interested to see how CJ worked. I’m a maths teacher with a Grade 10 class which recently completed a “real-life” problem, answered as a short report. I used the students themselves as the judges. Each time they made a judgement they had to write two bullet points as to why they made the judgement they did. Of course, they occasionally had to compare their own work with another person’s work and I encouraged them to be as honest as they could. The infit results were good – with a very few exceptions. My biggest concern was that one or two judges took very little time to make each of their judgements. When I do this again I will endeavour to incentivise them somehow to do the job as honestly as possible. (It is possible, within the programme, to ignore judgements made by individual judges if their results give cause for concern.)


  4. Will – just had a chat with Chris Wheadon about your blog. He said, “looks like they were re-marking not judging. I won’t allow a judge to take longer than 30 seconds to make a decision.” Interesting, no?


    1. This is a really interesting point, but I would use ‘marking’ to mean applying a mark scheme, that is, arriving at a final mark out of whatever. Chris’ comment that no judgment should take longer than 30 seconds makes me wonder how attentively a judge is reading the work in front of them. For a literary essay at Sixth-Form level the reader has to let the writer make their case, reserving their opinion until they have read the whole essay. It often happens that a writer makes an apparently unfruitful point early on, only to return to it later and develop it in an interesting direction. I frequently found myself re-reading whole passages in order to make sure I’d fully understood what the writer was telling me (I’m also a great advocate of reading students’ work out loud, but that’s another blog post…)

      For me, the value in CJ came from the scrutiny of essays for and in themselves, not the sort of ‘marking’ process whereby one looks for AOs to tick and tries to find a band to put it in. Reading a student’s essay should be exhaustive and exhausting, especially when dealing with students who may be making very subtle points that require time and effort from us to fully absorb what they are saying about a text.


      1. “Reading a student’s essay should be exhaustive and exhausting, especially when dealing with students who may be making very subtle points that require time and effort from us to fully absorb what they are saying about a text.” That depends on why you’re reading the essay. If it’s to give formative feedback then you’re right of course. But if it’s to make a summative judgement this is unnecessary in CJ.

        But what a great discussion to have

