David Stickland: What's Wrong with the Freestyle Scoring?

Opinions

In her editorial of 29 December 2012, “Mechelen Music Massacre,” Astrid asks the question: “Why does the same panel of judges, whose score deviation normally … stays within a 5% limit in the Grand Prix, bounce all over the place with their scores when it comes to the Kur?” It's a good question and mathematics has some of the answers.

I hope you will struggle through the first few paragraphs of numbers so you see where the last paragraphs of conclusions come from.

If a Dressage test consisted of just one movement, then the precision of scoring would be 5% (0.5 points accuracy on a 10 point scale). In the standard Grand Prix there are 37 marks given. We can think of these as 37 independent measurements of the Dressage quality, and the more measurements you have, the more precisely that quality can be measured. If you ignore correlations between movements (there are of course many: for example, all the Passage marks are likely to be about the same if there are no mistakes, and the collective marks are normally correlated with the figure scores), then the final theoretical score precision for one judge can be shown to be about 0.8%, that is, 5% divided by the square root of 37.
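The 0.8% figure follows from the usual rule that the uncertainty of an average of N independent measurements shrinks as 1/sqrt(N). Here is a minimal sketch in Python, assuming (as the text does) fully independent marks and, as a further simplification, equal weight for every mark. The function name `score_precision` is purely illustrative.

```python
import math

def score_precision(per_mark_pct: float, n_marks: int) -> float:
    """Precision (in %) of the average of n_marks independent, equally weighted marks."""
    return per_mark_pct / math.sqrt(n_marks)

# One judge, 37 Grand Prix marks, each good to 5% (0.5 on a 10-point scale)
print(round(score_precision(5.0, 37), 2))  # 0.82
```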

In reality we measure more like 1.5% as the variation between judges when there are 5-7 judges in CDI events. The increase from 0.8% to 1.5% is probably due to a combination of factors: the actual correlations between figures (which reduce the number of genuinely independent marks from 37 to a smaller number); the judges' different views; and occasional judging mistakes. Anyway, with 1.5% precision per judge, a 5-judge event has a final score precision of about 0.7%. That's for the Grand Prix and Special.
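The step from 1.5% per judge to about 0.7% for the panel is the same square-root rule applied to judges instead of marks. This rough sketch assumes the judges' errors are independent of each other. The 7-judge case is included only for comparison, and `panel_precision` is a hypothetical helper name.

```python
import math

def panel_precision(per_judge_pct: float, n_judges: int) -> float:
    """Precision (in %) of the panel average, assuming independent judges."""
    return per_judge_pct / math.sqrt(n_judges)

for n_judges in (5, 7):
    print(n_judges, round(panel_precision(1.5, n_judges), 2))
# 5 0.67
# 7 0.57
```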

The important thing to note here is that the more marks that are given the more precise the final measurement could be.

So what happens in the Freestyle?

Well, for one thing there are far fewer marks to give out. In the Technical part there are 16 marks, and in the Artistic part there are 5 marks, and each of these parts is worth 50% of the final score. If we assume that the judges can do just as well per mark in the Freestyle as in the Grand Prix, then:

For the Technical score, the 1.5% single-judge precision would become 2.3%;
For the Artistic score, the precision would become about 4%.

These calculations are supported by the data, where the final score precision per judge is about 2-2.5% for freestyle tests. This has nothing to do with good or bad judging; it is a simple result of the fact that there are only 16 marks in the Technical side and 5 marks in the Artistic side to be given out with a 0.5 point precision per mark.
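The Freestyle numbers can be reproduced by rescaling the observed Grand Prix precision by the square root of the ratio of mark counts. The sketch below also combines the two halves into a final score. That last step assumes the Technical and Artistic errors are independent and add in quadrature, which is an assumption of this example rather than something stated above. All names are illustrative.

```python
import math

GP_MARKS = 37        # marks in the Grand Prix
GP_PRECISION = 1.5   # observed single-judge precision in %, from the text

def rescale(n_marks: int) -> float:
    """Single-judge precision (%) if per-mark quality matches the Grand Prix."""
    return GP_PRECISION * math.sqrt(GP_MARKS / n_marks)

technical = rescale(16)
artistic = rescale(5)
# Final score = 50% Technical + 50% Artistic; independent errors add in quadrature
combined = math.sqrt((0.5 * technical) ** 2 + (0.5 * artistic) ** 2)

print(round(technical, 1), round(artistic, 1), round(combined, 1))  # 2.3 4.1 2.3
```

The combined value of roughly 2.3% sits inside the 2-2.5% range seen in the data.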

Astrid asks: why are there so many differences bigger than 5% in Freestyles?

Well, 5% is a sort of special number for the Grand Prix because, when the precision is about 1.5%, we should get a 5% deviation less than 1% of the time (technically, it is a little more than a 3 standard deviation effect). That means that in every 20 score sheets (5 judges x 20 riders = 100 judgments), we should see 1 or fewer sheets where the judges are 5% or more in disagreement. In fact, it does happen about 1% of the time, so reality matches pretty well with theory.

But when judging precision becomes more like 2.5%, as it does in the Technical side of the Freestyle, we will see 5% deviations between judges about 5% of the time (a 2 standard deviation effect). That means we will see a judge disagreeing in 1 of every 20 judgments, i.e. once every 4 participants for a 5-judge event. The likelihood of the Artistic score being more than 5% different is MUCH bigger, more like 20% of the time, i.e. once per competitor for a 5-judge event! In Mechelen, with 8 riders in the Freestyle, 5 of them had at least one Artistic mark more than 5% from the average of the other judges. This should be expected from our current system; it is not bad judging.
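These frequencies are tail probabilities of a normal distribution. Here is a minimal sketch, assuming a judge's deviation is normally distributed with the single-judge precision as its standard deviation, and counting deviations in either direction. The function name is hypothetical; `math.erfc` is the standard-library complementary error function.

```python
import math

def prob_deviation_exceeds(threshold_pct: float, sigma_pct: float) -> float:
    """Two-sided probability that a normal deviation exceeds threshold_pct."""
    return math.erfc(threshold_pct / sigma_pct / math.sqrt(2))

N_JUDGES = 5
for label, sigma in [("Grand Prix", 1.5), ("Technical", 2.5), ("Artistic", 4.0)]:
    p = prob_deviation_exceeds(5.0, sigma)
    print(f"{label}: {100 * p:.1f}% per judgment, "
          f"{N_JUDGES * p:.2f} expected per rider")
```

Under these assumptions the per-judgment probabilities come out near 0.1%, 4.6% and 21%, which for a 5-judge panel means roughly one 5% deviation every four riders on the Technical side and about one per rider on the Artistic side.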

Is it judging quality?

We have just seen that 5% score differences in Artistic marks should be happening very often – and they do – and we have not had to call into question anything about the judging quality. It is a simple feature of the small number of marks assigned and of 0.5 point judging. It is inescapable in the current system of freestyle measurement.

In fact, IF judges did not disagree this much on average, then something would be wrong! They would be beating statistics, and you can’t do that all of the time. Simple mathematics does not allow it: if you do better, you must be in some way correlating your scores with your colleagues, for example by adjusting your Artistic scores to get the % that you want to give. This does not have to be in any way dishonest, but it does have to go against the principle that each score is measuring something different and that the final score is a simple combination of those different observations.

So how can freestyle measurement get more precise?

I’m not a judge and I’m not even a dressage rider and I’m certainly not a musician, so I’m not going to answer for them; I’m just going to answer from the mathematics. You would have to give more marks, or you would have to have even finer judging precision, say 1/10 points. But today there is nowhere near enough guidance to ask a judge to judge a figure with 0.1 point precision. The system is not structured enough to permit that. We have a big problem: with the measurement system we use we cannot do better, so the only possible option is to change the system.
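To get a feel for the "more marks" lever, the square-root rule can be inverted to ask how many marks a single judge would need to reach a given precision. This is only a back-of-the-envelope sketch: it assumes per-mark quality stays the same as in the Grand Prix (1.5% from 37 marks), and `marks_needed` is a hypothetical helper.

```python
import math

GP_MARKS = 37        # marks in the Grand Prix
GP_PRECISION = 1.5   # observed single-judge precision in %, from the text

def marks_needed(target_pct: float) -> int:
    """Marks required for a single judge to reach target_pct precision."""
    return math.ceil(GP_MARKS * (GP_PRECISION / target_pct) ** 2)

print(marks_needed(2.3))  # 16
print(marks_needed(1.5))  # 37
```

On these assumptions the Artistic side would need about 16 marks, rather than 5, just to match the Technical side of the Freestyle.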

Let’s examine the big picture again. The Task Force and the Aachen Test Event have allowed Dressage to put in place three key improvements: 1) 0.5 point scoring, 2) 7 judges in championships, 3) a Judges Supervisory Panel (JSP) at top events. These resulted in real improvements in London judging compared to Hong Kong for the Technical tests (GP, GPS). My personal observation is that at least the 0.5 point scoring is also trickling down into the lower levels of Dressage and having a similar positive impact. Not every event can afford 7 judges, but the knowledge that more judges improves the stability of the result is now generally accepted. The JSP in London was really effective; they genuinely helped the judges in the arena.

But the flagship test in the discipline has not really benefited from these changes. (For example, the JSP cannot really act in the Freestyle because the judgments are much more subjective.) The fact that 50% of the final score is decided by just 5 marks inexorably leads to large differences between judges. Compound this with the fact that some of these 5 marks are themselves not very rigorously defined, so that judges are forced to make their best personal interpretation of very subjective observations (Harmony, Music, Choreography, Degree of Difficulty), and you end up measuring the top event with tools that are not up to the task.

I want to reiterate that the defect in freestyle scoring is in the measurement system, not the judges:

Too few marked components;
A single mark precision of 0.5 points;
Inherently subjective criteria that are not precisely defined.

Let’s make a New Year’s resolution to talk about the measurement system, not the judging. Then as a community we can focus on how to change that measurement system to improve our sport. Between the last two Olympics the measurement of the Technical tests was radically improved; we should address the Freestyle before Rio and give the athletes clear goals they can confidently prepare for.

So my conclusion, arrived at in a quite different manner, is actually not so different from Astrid’s: if you want better Freestyle classification, you have to give judges more precise guidance so they can ascribe marks more precisely. The Freestyle judging system would need to be totally rewritten; I do not see how to fix it without radical change.

Happy 2013 to all my Dressage friends!

by David Stickland for Eurodressage