I previously presented calibration plots for the 16 most popular casters taken to the WTC. I cut the list to the top 16 to make sure that there was enough data in each bin. I was convinced by Privateer Press forum user Fluffiest to try running the same analysis on the factions instead. I was not expecting much to leap out from the aggregated data.
Due to the larger number of games observed in each group, I split the results into 10 bins each. I wanted to summarize these plots in a single, easily digestible metric. To rank the factions I used the linear-model trendline to calculate the area under the curve within the one-by-one box that makes up each plot; this area can be computed directly from the gradient and intercept of the line of best fit. If the number is greater than 0.5, players in that faction are winning more games than expected; if it is less than 0.5, they are winning fewer games than expected. I would consider a faction balanced if the area under its Wins versus Predicted plot is exactly 0.5. The results of this metric are somewhat unexpected.
Rank | Faction | Score |
---|---|---|
1 | Retribution | 0.56 |
2 | Circle | 0.55 |
3 | Legion | 0.55 |
4 | Cygnar | 0.54 |
5 | Convergence | 0.51 |
6 | Mercenaries | 0.50 |
7 | Protectorate | 0.49 |
8 | Minions | 0.48 |
9 | Khador | 0.48 |
10 | Cryx | 0.45 |
11 | Skorne | 0.45 |
12 | Trollbloods | 0.41 |
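The area calculation above is simple in practice: for a trendline with gradient g and intercept c, the area under it over the unit box is the integral of gx + c from 0 to 1, which is g/2 + c. A minimal sketch, using made-up binned values rather than the WTC data:

```python
import numpy as np

# Hypothetical binned calibration data for one faction: predicted win
# probability (bin centre) vs observed win rate in that bin.
# These numbers are illustrative only, not the WTC results.
predicted = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
observed  = np.array([0.10, 0.18, 0.24, 0.40, 0.47, 0.60, 0.68, 0.74, 0.88, 0.93])

# Fit the linear trendline: observed ~= gradient * predicted + intercept.
gradient, intercept = np.polyfit(predicted, observed, 1)

# Area under the trendline within the one-by-one box:
# integral from 0 to 1 of (g*x + c) dx = g/2 + c.
score = gradient / 2 + intercept
print(round(score, 2))  # → 0.52
```

A score above 0.5 means the fitted line sits mostly above the diagonal, i.e. the faction's players win more often than the ratings predict.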
While widely acclaimed faction Retribution is right at the top, Legion performed well above what would be expected from player experience alone, and while reviled faction Skorne is near the bottom, favoured faction Trollbloods is in dead last.
Of course I cannot claim that these results definitively show that the balance is off for Trollbloods, but it is a startling result, and one that may require further investigation.
Edit: The calibration plot for the entire WTC dataset explains the misspecification shown by many of the caster and faction calibration plots. There is an overall discrepancy between the predicted and observed win rate for the largest ratings differences. This is likely due to the small training dataset available (6-12 games for most players), as well as needing to impute around 30% of the field.
While this shows that the ratings still need more training data, I believe that this approach will be useful for considering the effectiveness of casters relative to player skill.
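For reference, a calibration plot of the kind used throughout can be built by binning games on predicted win probability and comparing each bin's prediction to its observed win rate. A minimal sketch with simulated games (the data and seed are made up; only the 10-bin scheme comes from the analysis above):

```python
import numpy as np

# Simulated games: a predicted win probability per game and the actual
# outcome (1 = win, 0 = loss). Purely illustrative, not WTC data.
rng = np.random.default_rng(0)
predicted = rng.uniform(0, 1, 500)
outcomes = (rng.uniform(0, 1, 500) < predicted).astype(int)

# Ten equal-width bins over [0, 1], as in the faction plots.
edges = np.linspace(0, 1, 11)
bin_idx = np.digitize(predicted, edges[1:-1])

# Observed win rate per bin; a well-calibrated model tracks the diagonal,
# so large gaps at the extreme bins are the misspecification noted above.
observed_rates = [outcomes[bin_idx == b].mean() if (bin_idx == b).any() else float("nan")
                  for b in range(10)]
```

Plotting `observed_rates` against the bin centres gives the Wins versus Predicted plot that the area metric summarizes.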