Now that the WTC has passed, there is much discussion about what the results mean for whether Warmachine All New War is balanced. In Mark 2 I estimated a matrix of caster-pair scores to use as the home-advantage weight. For this WTC I did not have the time or experience to guess the outcomes, so I used a blank pairing matrix, corresponding to no home advantage. This performed reasonably well (write-up to follow), and it also gave me the opportunity to see whether any casters were performing beyond expectation.
After registering the player ratings from 2015 with the cleaned results from 2016, I was able to estimate the probability of each player winning each game; for each caster of interest, I took the predicted win probability of the player using that caster. To check visually whether any casters were performing above or below expectation, I divided the results into five equally sized bins and compared the actual win rate in each bin with the expected value.
The key concept is that for binomial data like this, with two outcomes, it is impossible to measure prediction accuracy from a single trial. If a player is rated as having a 60% match-up against a certain opponent, I would want to observe 10 or 20 games to see whether my predicted estimate is in the right ballpark. This analysis therefore pools the game results of match-ups with a similar predicted outcome. If players in a bin win 60% of their games, there is no special effect for this caster and our predictions are accurate; the corresponding point in the plot would lie on the line. If the players win more than 60%, some effect, such as a powerful caster, may be influencing the outcome, and the point would lie above the line. If the players win less than 60% of their games, the caster may be weak, and the point would lie below the line.
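The binning-and-pooling procedure above can be sketched as follows. The data here are simulated (hypothetical); in the real analysis the predicted probabilities came from the player ratings.

```python
# A sketch of the binned calibration check: pool games with similar
# predicted outcomes and compare the observed win rate with the prediction.
import random

random.seed(0)
games = []
for _ in range(200):
    p = random.uniform(0.2, 0.8)            # predicted win probability for one game
    won = 1 if random.random() < p else 0   # simulated game outcome (1 = win)
    games.append((p, won))

# Sort by prediction and split into five equally sized bins.
games.sort()
bins = [games[i * 40:(i + 1) * 40] for i in range(5)]

for b in bins:
    expected = sum(p for p, _ in b) / len(b)   # mean predicted win rate in the bin
    actual = sum(w for _, w in b) / len(b)     # observed win rate in the bin
    # Well-calibrated predictions keep these two close; a persistent gap in a
    # bin suggests an effect (such as caster strength) beyond player skill.
    print(f"expected {expected:.2f}  actual {actual:.2f}")
```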
The graphic above shows the observed versus expected win rate, taking into account (estimated) player skill. If a caster is more powerful than average, the points are more likely to lie above the line of unity. If a caster is less powerful than average, the points are more likely to lie below the line.
Since each point contains a fifth of the data, trends in the points of less popular casters (such as Amon 1) could be explained by chance alone. For Amon, 8 games in which weaker players faced stronger players produced a better than 50% outcome. While this could be a real effect, it could equally be chance.
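To put a number on how little 8 games can tell us, here is a quick binomial calculation of how often a small sample drifts away from a true 50% win rate by chance alone:

```python
# Probability of seeing k or more wins in 8 games when the true
# win rate is exactly 50% (i.e. no caster effect at all).
from math import comb

n, p = 8, 0.5

def prob_at_least(k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Even 6 wins out of 8 (an observed 75% win rate) occurs by chance
# about 14% of the time, so a small bin proves very little.
print(prob_at_least(6))
```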
For the most popular casters (Wurmwood 1, Madrak 2, Haley 2), the tailing off in the last point reflects the outcomes of 14-20 games. Good players appear to be performing less well against weaker players than expected. This could be caused by the Elo system itself: the rating system assumes that the outcome of a game follows a normal distribution, but in reality the outcome of a one-sided game is less certain in favour of the strong player than a normal distribution would predict.
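The distributional point can be illustrated by comparing a normal-CDF expected score with the fatter-tailed logistic curve often used in Elo-style systems. The scaling below (matching the two curves' slopes at a zero rating difference) is an illustrative choice of mine, not the exact parameters of any particular rating system:

```python
# Normal vs logistic expected scores for a given rating difference.
from math import erf, sqrt, log, pi

def expected_logistic(diff: float) -> float:
    """Logistic expected score on the usual 400-point scale."""
    return 1 / (1 + 10 ** (-diff / 400))

def expected_normal(diff: float) -> float:
    """Normal-CDF expected score, sigma chosen so both curves have the
    same slope at diff = 0 (an illustrative matching, my assumption)."""
    sigma = 1600 / (log(10) * sqrt(2 * pi))
    return 0.5 * (1 + erf(diff / (sigma * sqrt(2))))

# For large rating gaps the normal curve is more confident in the stronger
# player than the logistic curve, so it over-predicts one-sided results.
for diff in (100, 400, 800):
    print(diff, round(expected_logistic(diff), 3), round(expected_normal(diff), 3))
```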
The other signals in the data away from the line may be caused by only examining the effect of one caster at a time. In my previous analyses, I estimated the relationship between all casters at once. Powerful casters may have very good as well as very bad match-ups, and the detail of the individual games may reveal that a particular warcaster or warlock has a smaller or larger penalty.
As with any analysis, many assumptions have to be made to gain an understanding of the system. At first glance, however, it appears that these most commonly played casters are well balanced, and the effect on the average game outcome beyond player skill is not completely off-kilter.
Edit: These plots can be summarised by the area under the trendline of the observed-versus-expected curve. A balanced caster would score 0.5; a caster which gives players a better than 50% chance of winning would score above 0.5. By this metric, Baldur 2, Caine 2 and Vyros 2 were the most advantageous casters, and Madrak 2 was the least advantageous of the 16 most popular casters.
| Rank | Caster | Score |
|---|---|---|
| 3 | High Reclaimer 1 | 0.56 |
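The area-under-trendline summary can be sketched with a simple trapezoidal rule. The bin values below are hypothetical, and closing the curve with the endpoints (0, 0) and (1, 1) is my assumption:

```python
# Area under the observed-vs-expected trendline, via the trapezoidal rule.
def trendline_auc(expected, observed):
    """Trapezoidal area under the observed-vs-expected points,
    closed with the assumed endpoints (0, 0) and (1, 1)."""
    xs = [0.0] + list(expected) + [1.0]
    ys = [0.0] + list(observed) + [1.0]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

# On the line of unity the area is exactly 0.5 (a balanced caster);
# points above the line push the score over 0.5.
print(trendline_auc([0.3, 0.5, 0.7], [0.3, 0.5, 0.7]))  # balanced caster
print(trendline_auc([0.3, 0.5, 0.7], [0.4, 0.6, 0.8]))  # advantageous caster
```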