One of the most anticipated sessions of this year’s forum was David Stickland’s presentation of his mathematical analyses of dressage results. In 2007, nuclear physicist Stickland read a column on Eurodressage about the advantage of half marks in judging, which prompted him to investigate scoring patterns in depth. In 2008 the FEI asked him to analyse the scores from Hong Kong, but since then his research has been carried much further. The Princeton professor has gone back five years, analysing more than 13,000 CDI tests to create an objective tool for measuring judging. His session at the Forum was his first public speech about his findings.
Stickland noticed that there is, on average, a 1,6% difference in the final score between one judge and the average of the four other judges; in the Kur the difference is 2%. This implies that, overall, 72% of final rankings are changed by a single judge (especially in the middle of the ranking) and that 34% of podium placings are changed by one judge. These are huge numbers that prove the major influence a single judge has on a panel of five. In the top third of the ranking, 59% of placings are changed by one judge, even with a panel of four or more O-judges.
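For readers curious how such a figure is obtained, the measure can be sketched in a few lines of Python. This is a minimal illustration of the leave-one-out comparison described above, not Stickland's actual code, and the data layout and example scores are my own invention:

```python
# Leave-one-out deviation: for each judge, compare their final score for
# a ride with the average of the other judges' scores for the same ride,
# then average the absolute gap over all rides in the class.

def judge_deviations(scores):
    """scores: list of rides; each ride is a list of final percentages,
    one per judge, in the same judge order for every ride.
    Returns the mean absolute deviation per judge (percentage points)."""
    n_judges = len(scores[0])
    totals = [0.0] * n_judges
    for ride in scores:
        for j, own in enumerate(ride):
            others = [s for k, s in enumerate(ride) if k != j]
            totals[j] += abs(own - sum(others) / len(others))
    return [t / len(scores) for t in totals]

# Invented data: a three-ride class judged by a panel of five.
rides = [
    [71.0, 69.5, 72.0, 70.5, 68.0],
    [65.0, 66.5, 64.0, 67.0, 65.5],
    [74.5, 73.0, 75.5, 72.0, 74.0],
]
print(judge_deviations(rides))
```

Averaging these per-judge deviations over many classes gives a single number like the 1,6% quoted above.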
The 1,6% difference between one judge and his four colleagues is ultimate proof of biased judging. Stickland verified this by creating a "randomized test". He took the 2009 CDIO Aachen Grand Prix and created fictitious rides: he used the judges' real scores, but for each movement the scores came from a randomly chosen rider. A computer program took, for instance, Cornelissen's five points for the first movement (entry and halt), Carl Hester's scores for the proceed in collected trot, Fiona Bigwood's for the extended trot, and so on all the way down to the collective marks, each time with the computer picking the real scores of a random rider.
The result of these randomized tests is staggering: judges are much more accurate in their scoring when they don't know the rider. In Aachen they achieved an accuracy of only a 1,0% difference between one judge and the average of the other four. "Judges do better on fake tests than real tests," Stickland cleverly noted. Converted into plain terms, these statistics mean that judges judge three times worse when they know the combination. "Judges are less consistent when they know the rider because of 'combination bias'."
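The randomized test itself is simple to reconstruct in outline. The sketch below, with an assumed (and much simplified) data layout of my own devising and no coefficients or collective marks, shows the core idea: stitch together a fictitious ride by drawing each movement's whole panel of scores from a random real rider, then score it as usual:

```python
import random

# Build a fictitious ride: for every movement, take the full panel's
# real scores for a randomly chosen rider. Judges are then compared on
# rides where no single rider's identity can influence them.

def fictitious_ride(real_rides, rng):
    """real_rides: list of rides; each ride is a list of movements;
    each movement is a list of per-judge scores (0-10)."""
    n_movements = len(real_rides[0])
    return [rng.choice(real_rides)[m] for m in range(n_movements)]

def final_scores(ride):
    """Per-judge totals for a ride, as a percentage of the maximum."""
    n_judges = len(ride[0])
    maximum = 10 * len(ride)
    return [100 * sum(mv[j] for mv in ride) / maximum
            for j in range(n_judges)]

rng = random.Random(42)
# Toy data: two real rides, three movements, five judges each.
real = [
    [[7, 7.5, 7, 6.5, 7], [8, 7.5, 8, 8, 7.5], [6, 6.5, 6, 6.5, 6]],
    [[7.5, 8, 7.5, 7, 8], [6.5, 7, 6.5, 7, 6.5], [8, 8, 7.5, 8, 8]],
]
fake = fictitious_ride(real, rng)
print(final_scores(fake))
```

Running the leave-one-out deviation measure over many such fictitious rides, instead of real ones, is what produced the 1,0% figure.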
Judging is a team sport and every judge on the panel has an equal voice, which means that every judge needs to be at their peak. This implies that judges need excellent education to be at their best. Without combination bias, judges should be able to achieve a difference of only 1,0% between one another.
Stickland zoomed in on the judging at the 2009 European Championships in Windsor and concluded that at this event the judges were much more consistent than at any other event he had dug into. There was only a 1,1% difference between the judges in the Grand Prix, while at the 2009 CDIO Aachen it was 1,6%. For the Windsor Kur to Music Finals, the difference was 1,8%. Stickland also noted differences between individual judges: Dutch judge Francis Verbeek scored on average 1,1% lower than her colleagues, while Swedish judge Eric Lette scored 0,7% higher. Fortunately, neither of them changed the final ranking when taken out of the panel. These shocking numbers make one wonder what the difference would be at a lower level, such as the European Junior/Young Rider or Pony Championships. How accurate is the judging at youth riders' events?
One of the weirdest judging cases in Windsor was Imke Schellekens' Grand Prix ride on Sunrise: she scored 69% with Lette and 77% with the Polish judge Markowski, but the two judges cancelled each other out and did not change her average score. Stickland focused on such inconsistencies and identified three causes: a mistake of viewpoint (due to a poor view or a lapse in concentration), an error (the judge simply gave the wrong score, which in future would be corrected by the Judging Supervisory Panel), and bias, when big deviations occur.
With his analyses Stickland also detected patterns of nationalistic judging. Most judges do not upscore their own or a favoured country; instead they “downscore” the rival ones. At the European Championships in Windsor, for instance, Verbeek downscored Great Britain and Germany (the two other team medallists), while Clarke downscored Germany and Sweden. Lette, however, significantly upscored Sweden and Germany. Coincidence? Stickland stressed, however, that because the judging in Windsor was quite accurate, no individual judge changed the final ranking.
As solutions to inconsistent and inaccurate judging, Stickland brought to the fore the Judging Supervisory Panel to correct errors. The use of half points would increase accuracy tremendously, as would component judging (which was tested at the System Trials in Aachen but at the moment seems too radical to adopt in the near future). He also spoke of implementing a code of points based on the FEI Handbook. The code would describe exactly which score a certain execution of a movement would earn, even down to a decimal.
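The arithmetic behind the half-marks argument is easy to demonstrate. The toy example below is my own illustration, not from Stickland's talk: when a judge's "true" opinion of a movement falls between whole marks, rounding to the nearest whole mark loses up to 0,5 points per movement, while half marks lose at most 0,25:

```python
# Compare the rounding error of whole marks versus half marks for a
# handful of hypothetical "true" opinions (invented values).

def round_to(step, value):
    """Round value to the nearest multiple of step."""
    return round(value / step) * step

true_opinions = [6.8, 7.3, 7.9, 6.2, 8.6]
whole_error = sum(abs(v - round_to(1.0, v)) for v in true_opinions)
half_error = sum(abs(v - round_to(0.5, v)) for v in true_opinions)
print(whole_error, half_error)
```

Summed over the 30-plus scores of a Grand Prix test, this quantization error alone accounts for a meaningful fraction of the judge-to-judge differences discussed above.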
At the level of judges' development, it is important that judges get formal feedback on their performance. Training can be improved through global e-learning and training seminars for (aspiring) judges. Furthermore, there has to be proper testing of judges before they are appointed, as well as in-service training and testing. The Dressage Task Force even suggested demotion after repeated poor performances.
Stickland's session was very thought-provoking, and it was a relief that what so many riders, trainers, journalists and dressage lovers have been saying for years has finally been proven statistically: biased and nationalistic judging takes place. At least now the FEI can work constructively on strengthening its judges' corps.
Text and photos copyrighted Astrid Appels/Eurodressage.com
No Reproduction allowed without explicit permission