Battle Reporter and the WITC 2016


I was excited to discover Battle Reporter while listening to This Deathclock has 60 Minutes. The site is hosted by the Trollblood Scrum and the Deff Head Dice blog. This is a database interface that allows anyone to record their Warmachine games. And best of all, the data provided is available to download as a live-updated CSV text file. The results so far are interesting, with entries from more than 450 unique names, including Warmachine celebrity MenothJohn, and two WTC players.

This file currently contains records of over 1300 games. If you wish to use the website to track your own games, you can do so by getting in touch with the administrator. I will certainly be using it from now on; I urge you to use it also. Currently 164 casters have been recorded, so there is a little way to go. There have been zero games recorded with Bradigus, for example. Of the casters observed, there are 13366 unique combinations, so many more games are needed to get a complete representation of caster rating.


Bradigus Thorle the Runecarver by Treetownpainting

As Deff Head Dice already has descriptive statistics, I thought I’d use the dataset to initialise some caster ratings for WTC. Players can self assess their own ability. However, as players are not invited to assess their opponents’ ability I will not take this into account. On average, players most likely face opponents of similar skill, and if not, with large enough records unequal games should average out.

Players can also state what kind of game it was; casual, local tournament or national tournament. I chose to give a slightly higher weighting to the small number of tournament games as these may more closely align with how the casters will be played at the WTC.

About a dozen records had no opponent caster. Interestingly, all of these games had been entered by Retribution players (multiple users!). Presumably these haughty elves care not for their human prey. I discarded these results along with draws.

I added all games for each caster pairing to a new blank matrix for Mark 3, adding wins and subtracting losses, and recorded the number of games observed. At this point I wanted to scale this table of win and loss so that it corresponded to player ratings.
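A minimal sketch of this tallying step in R, under stated assumptions: the column names, the example rows and the tournament weight of 1.5 are placeholders of mine (the real CSV headers and the weight actually used are not given here), and filling the mirror entry with the opposite sign is also an assumption.

```r
# Hypothetical game records; the column names are assumptions,
# not the actual Battle Reporter CSV headers.
games <- data.frame(
  caster     = c("Helynna1", "Helynna1", "Kozlov1", "Madrak2"),
  opponent   = c("Agathia1", "Mordikaar1", "Helynna1", "Butcher3"),
  won        = c(TRUE, FALSE, FALSE, FALSE),
  tournament = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

casters <- sort(unique(c(games$caster, games$opponent)))
blank <- matrix(0, nrow = length(casters), ncol = length(casters),
                dimnames = list(casters, casters))
score <- blank   # wins minus losses, row caster versus column caster
n_obs <- blank   # number of games observed for the pairing

tournament_weight <- 1.5  # assumed value; "slightly higher" is all that is stated

for (i in seq_len(nrow(games))) {
  a <- games$caster[i]
  b <- games$opponent[i]
  w <- if (games$tournament[i]) tournament_weight else 1
  s <- if (games$won[i]) w else -w
  score[a, b] <- score[a, b] + s
  score[b, a] <- score[b, a] - s   # assumed: mirror entry takes the opposite sign
  n_obs[a, b] <- n_obs[a, b] + 1
  n_obs[b, a] <- n_obs[b, a] + 1
}
```

Keeping the count matrix alongside the score matrix makes it easy to see later how many games sit behind each pairing.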


The Wisconsin Team Tournament is a WTC-style team tournament hosted by Privateer Press Judges Nathan Hoffman (from the Crippled System podcast) and Travis Marg. The organizers have made a very similar dataset to the WTC available as an HTML table. I was able to scrape the data from this using the R package rvest. Twelve teams entered and played out over 4 rounds. Eight of the 60 players were American WTC players for whom I had ratings. For the others I had no information, other than that they had not previously played in the WTC. I arbitrarily assigned these players as having a rating of 1400 as they are likely of lower skill than players that had passed through the selection process to join a team (initialised at 2200 in my initial analysis).

To allow the caster ratings to be scaled, I needed to add some information to these naive player ratings. I ran two rounds of the team tournament with the unscaled caster ratings to start to split up the players. I then optimized the results of the third and fourth rounds by scaling the caster ratings to maximise r-squared (a measure of correlation) and minimize the difference of the calibration gradient from 1 for the predicted wins against the proportion of wins observed (Gist). The scaling number was 9.3, so most games were worth 9.3 points for each win.

This is a very… hmm… let’s kindly say heuristic… approach which has given me some approximate caster ratings. I will test these caster ratings on the WTC data. Still this approach is much more direct than my previous method, and can be performed in the absence of CP and AP information. If these penalties tally with broader player experience, that would be reassuring. The following penalty plots show penalties for the caster named in the title against each of a range of casters. The line shows the position of the rating estimate, and the coloured bands give an estimate of the number of records observed. Narrower bands mean more games have been recorded for that pairing. Plots are only shown here for casters with more than three games against more than three casters.

For reference, if two equally skilled players played a game using list1 against list2 with a penalty of -50, player 1 would win 44.7% of such games. If they played with penalties 0, 10, 20 and 50, player 1 would win 50.0%, 51.1%, 52.1% and 55.3% of their games respectively. This means that the process described here has observed 3 wins for Helynna1 against Agathia1, and proposed that this corresponds to a ~5 percentage point advantage. The 4 observed defeats at the twisted hands of Mordikaar1 correspond to a ~6 percentage point disadvantage. Kozlov1 was observed defeating Syntherion1, but was defeated by Helynna1. Madrak2 had a good record against Tanith1, but lost several games against Butcher3.
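As a rough illustration of how a penalty maps to a win percentage, here is a hedged sketch using a normal curve. The function name and the scale of 375 are my assumptions: that value was picked only because it reproduces the percentages quoted above, and a logistic curve with a different scale could fit them equally well.

```r
# Convert a rating difference (penalty) into a win probability.
# scale_sd = 375 is an assumed value chosen to match the quoted
# percentages; it is not taken from the original analysis.
win_prob <- function(penalty, scale_sd = 375) {
  pnorm(penalty / scale_sd)
}

# Win percentages for the penalties mentioned in the text.
pct <- round(100 * win_prob(c(-50, 0, 10, 20, 50)), 1)
pct
```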

Of course, these values are driven by relatively small numbers within a caster pair, but this is a relatively objective approach to allow me to initialise the caster penalties. In the meantime, keep reporting your games!

Rate My Captain WTC 2015

Previously I looked at the performance of simple strategies in a simplified version of the WTC Pairing Game. The game is played by teams A and B. Team B reveals one player initially. The other team reveals two players, and team B selects a pairing. Each team then takes it in turns to reveal an additional player and select a pairing. It is widely considered an advantage to be team A since their final choice also determines the fifth pairing. However, I was not able to find any broadly applicable effect in favour of team A.

While the pairing game information was not collected for WTC 2015, the Enter the Crucible stream included footage of the pairing. So while this has not yet allowed me to identify strategies that the 12 captains were using, I did spend some time looking for patterns. One fruitful line of investigation was to check which pairings were possible. Five players in each of two teams can result in 120 pairing combinations. Since I have estimates of player ratings and Mark 2 caster penalties, I was able to estimate the likely outcome for each player permutation. Given that each captain only has themselves and their four players to put into matchups, we should not assess them on probability of victory, but on the quantile of the probability of victory. Were captains able to gain matchups that were better than 50% of the possible player combinations?
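The enumeration can be sketched in a few lines of base R. This is a simplified illustration, not the simulation used for the plots: the probability matrix is random placeholder data, and the expected number of team A wins stands in for the simulated probability of winning the round.

```r
# All orderings of a vector (base R, recursive).
permutations <- function(x) {
  if (length(x) <= 1) return(matrix(x, nrow = 1))
  out <- lapply(seq_along(x), function(i) cbind(x[i], permutations(x[-i])))
  do.call(rbind, out)
}

# Row k gives the team B player faced by team A players 1 to 5.
perms <- permutations(1:5)
nrow(perms)  # 5! = 120 possible pairings

# Hypothetical win probabilities: p[i, j] is the chance that team A
# player i beats team B player j (random placeholders).
set.seed(1)
p <- matrix(runif(25, 0.3, 0.7), nrow = 5)

# Expected number of team A wins for each of the 120 pairings.
expected_wins <- apply(perms, 1, function(b) sum(p[cbind(1:5, b)]))

# Quantile of one particular pairing among all possibilities
# (here simply the first row, standing in for the pairing actually played).
chosen <- 1
pairing_quantile <- mean(expected_wins <= expected_wins[chosen])
```

A quantile above 0.5 would mean the pairing was better for team A than half of the pairings that could have happened.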

The following graphics display two types of information. Team A are listed on the right-hand side of the plot. Team B are listed along the bottom. The x-axis consists of 120 small panels corresponding to each possible combination of the 10 players. The 5 rows of bars correspond to team A. The colours from purple to light green correspond to team B. For each permutation there is an estimated probability of outcome from 0 (team A certain to lose) to 1 (team A certain to win). The fat black line is the median estimated probability of the outcome. The thin black lines are the 10th and 90th percentiles of the estimated probability. These probabilities were calculated by simulation from the player ratings for random list selections for each player. Since the estimates are the average of the best and worst list combinations, list rating will have the biggest effect when there is an advantage between two players over both of their opponent’s lists. The permutations were then sorted from lowest to highest probability of team A winning. We can visually look for patterns in these plots that indicate pairings that would more likely lead to success for a team of interest.

In Round 1, Australia Platypus was playing Ireland Craic as favourites. Team B (Ireland Craic) will do best when Philip Johnston is paired into Dyland Simmer or Jeff Galea.  Team A (Australia Platypus) will do best when Dyland or Jeff are paired into Dan or Mike Porter.


In Round 2, Belgium Blonde was playing England Lions as underdogs. Team A (Belgium Blonde) need to get Laurens Tanguy into Paul North. Team B (England Lions) need to get Paul into Dirk Pintjens.


In Round 3, USA Stripes was playing Germany Dichter & Denker. Team A (USA Stripes) need to get Jay Larsen into Robin Maukisch. Team B (Germany Dichter & Denker) need to get Sascha Maisel into Jeremy Lee.


In Round 4, Australia Wombat was playing Canada Goose as favourites. Team A (Australia Wombat) should be aiming to get Chandler Davidson into Aaron Thompson or Ben Leeper. Team B (Canada Goose) should be aiming to get Charles Soong into Ben Leeper.


In Round 5, Australia Platypus was playing Finland Blue. Team A (Australia Platypus) should be aiming to get Jeff Galea into Jaakko Uusitupa and Sheldon Pace into Henry Hyttinen. Team B (Finland Blue) should be aiming to get Jaakko into David Potts.


In Round 6, Australia Wombat was playing Finland Blue. Team A (Australia Wombat) should be aiming to get Joshua Bates into Jaakko and James Moorhouse into Pauli Lehtoranta or Tatu Purhonen. Team B (Finland Blue) want to get Henry into Joshua.


Given these estimates, how did the captains perform? The following plot shows the 10th, 50th and 90th percentiles of the median estimated probability of team A winning. The purple blobs show the quantile of the permutation selected at the event. If the blob is to the right of the median (50th percentile), team A performed better than the opposing captain. If the blob is to the left of the median, team B outplayed team A in the pairing game. For rounds 1, 3 and 6, the pairings are consistent with equally skilled captains or no advantage for team A. For rounds 2, 4 and 5, the captains for team B outplayed the captains for team A. This information suggests that for these six pairing games there is no evidence that team A is advantaged.


All of this analysis is based on crude estimates of caster rating and player rating. Many assumptions are made, and the subset of the data presented here is very small. However, I consider that these results are additional evidence that it is not advantageous to be team A. It is my intuitive belief that although it appears that team A are choosing more matchups, in fact, the choice is limited in its degrees of freedom. While team A may on occasion be able to choose the optimum combination of two pairings, they may have already sacrificed the ability to have good matchups in the other three selections. Team B are in the position of selecting first, and make two selections before team A make their final selection. The additional gain imposed by the rules of the WTC packet of being able to pick tables makes me believe that there is likely a real advantage to being team B.

Balancing Factions

I previously presented calibration plots for the 16 most popular casters taken to the WTC. I cut to the top 16 to make sure that there was enough data in each bin. I was convinced by Privateer Press forum user Fluffiest to try running the same analysis on the factions instead. I was not expecting much to leap out from the aggregated data.


Proportion of observed wins for each Warmachine and Hordes factions in WTC 2016, plotted against the expected outcome given relative player skill as estimated by Elo rating (number of observations in brackets)

Due to the larger number of games observed in each group I split the results into 10 bins each. I wanted to summarize these plots in a single easily digestible metric. To rank the factions I used the linear model trendline to calculate the area under the curve within the one by one box that makes up each plot. If the number is greater than 0.5, players in that faction are winning more games than expected. If it is less than 0.5 they are winning fewer games than expected. The area under the curve of the line of best fit can be calculated from the gradient and intercept. I would consider a faction to be balanced if the area under the Wins versus Predicted plot is 0.5. The results of this metric are somewhat unexpected.
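For a straight line the area has a simple closed form, intercept + gradient / 2, provided the line stays inside the unit box. A hedged sketch with made-up bin values (the function name and the data are illustrative assumptions only):

```r
# Area under y = intercept + gradient * x for x in [0, 1].
# Assumes the fitted line stays inside the unit box over that range.
trendline_auc <- function(intercept, gradient) {
  intercept + gradient / 2
}

# Hypothetical binned calibration data for one faction (10 bins).
predicted <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)
observed  <- c(0.10, 0.20, 0.28, 0.40, 0.52, 0.58, 0.70, 0.78, 0.85, 0.90)

# Line of best fit through the binned points.
fit <- lm(observed ~ predicted)
auc <- trendline_auc(unname(coef(fit)[1]), unname(coef(fit)[2]))
auc
```

Equivalently, the score is the height of the trendline at a predicted win rate of 0.5, which is why a perfectly calibrated line (intercept 0, gradient 1) scores exactly 0.5.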

Faction Score
1 Retribution 0.56
2 Circle 0.55
3 Legion 0.55
4 Cygnar 0.54
5 Convergence 0.51
6 Mercenaries 0.50
7 Protectorate 0.49
8 Minions 0.48
9 Khador 0.48
10 Cryx 0.45
11 Skorne 0.45
12 Trollbloods 0.41

While widely acclaimed faction Retribution is right at the top, Legion performed well above what is expected based on player experience, and while reviled faction Skorne is at the bottom, favoured faction Trollbloods is in dead last.

While I cannot claim that these results definitively show that the balance is off for Trollbloods, this is a startling result, and one that may require further investigation.

Edit: The calibration plot for the entire WTC dataset explains the misspecification shown by many of the caster and faction calibration plots. There is an overall discrepancy between the predicted and observed win rate for the largest ratings differences. This is likely due to the small training dataset available (6-12 games for most players), as well as needing to impute around 30% of the field.


Proportion of observed wins for all games in WTC 2016, plotted against the expected outcome given relative player skill as estimated by Elo rating (number of observations in brackets)

While this shows that the ratings still need more training data, I believe that this approach will be useful for considering the effectiveness of casters relative to player skill.

Scoring WTC Forecast Performance

Last week I made some predictions for WTC 2016’s Teams based solely on Elo-style rankings calculated from previous years’ data. I previously posted a metric for scoring rankings based on difference from the true position per prediction. My R implementation of this is available through my package WTCTools. This metric allows different numbers of predictions to be compared, although when increasing the number of rankings, the likelihood of a low (good) score falls. My implementation also allows pundits to predict country only, in which case the score for that position is the average of all teams from that country. The difficulty of this ranking is similar, and so comparable to that of selecting individual teams.
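To make the metric concrete, here is a small sketch of the idea. It is not the WTCTools implementation: the function names, the placeholder team names and the rule of matching a country by the start of the team name are all assumptions for illustration.

```r
# Mean absolute distance between predicted and actual finishing positions.
# predicted: team names in predicted order (1st, 2nd, ...).
# actual:    all team names in their true finishing order.
score_ranking <- function(predicted, actual) {
  true_pos <- match(predicted, actual)
  mean(abs(seq_along(predicted) - true_pos))
}

# Country-only picks: the distance for a position is averaged over every
# team whose name starts with that country.
score_country <- function(predicted_country, actual) {
  dists <- vapply(seq_along(predicted_country), function(i) {
    pos <- which(startsWith(actual, paste0(predicted_country[i], " ")))
    mean(abs(i - pos))
  }, numeric(1))
  mean(dists)
}

# Placeholder results: countries "A", "B", "C" with named teams.
actual <- c("A Red", "B Blue", "C Red", "A Blue")
score_ranking(c("B Blue", "A Red"), actual)  # both picks are one place out
score_country(c("A", "B"), actual)
```

A lower score is better, and a perfect set of picks scores 0.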

Last year I scored a mean distance of 9.8 places for 50 teams. This year my ratings are based on more years of Mark 2 data, but do not have any information about caster strength.

I also found this article by Klaw. He had collected predictions from some of the finest minds in Warmachine. These players know the field and may have even played games against some of their rivals. Their knowledge of player skill at the top of the field should give them great intuitive insight into teams well placed to win. They also picked dark horse teams, which I did not include, as I suspect that these were considered under-rated teams, rather than 7th placed teams. Jeff Galea only presented four picks, everyone else presented 6. Martin Hornacek only presented nationalities, so was scored against all matching teams. How do my predictions compare to these illustrious competitors?

Name Top4 Top6
1 Tom Guan 1.00 3.33
2 Jeff Galea 1.00 NA
3 Rickard Nilsson 1.75 2.17
4 Billy Cruickshanks 2.00 1.83
5 Sascha Maisel 2.75 4.33
6 Lacerto 3.00 6.33
7 Matthieu Vega 5.00 6.00
8 Ryan Evans 5.00 6.83
9 Hermanni Raatevaara 6.75 5.33
10 Marcin Mycek 6.75 5.33
11 Don Martin Hornacek 15.50 17.33

For a top 4 pick I was in the middle of the pack. For top 6 I was in the bottom half. My score for all 64 teams was 8.84. Definitely room for improvement, but not an embarrassing showing either. If I am to improve my forecasts I need to keep track of more tournaments and attempt to update my ratings where possible. I can also use this year’s data, plus perhaps some estimates for list performance.

Balancing Casters

Now that the WTC has passed there is much discussion about what the results mean for whether Warmachine All New War is balanced. In Mark 2 I estimated a matrix of caster pair scores to use as the home advantage weight. For this WTC I did not have the time or experience to guess the outcomes, so used a blank pairing matrix, corresponding to no home advantage. This performed reasonably well (write-up to follow), but also gave me the opportunity to see whether any casters were performing beyond expectation.

After registering the player ratings from 2015 with the cleaned results from 2016, I was able to estimate the probability of each player winning. I used the probability of each player with the caster of interest winning each game. To visually check whether any casters were performing above or below expectation I divided the results into five equally sized bins and compared the actual result of these games with the expected value.


Proportion of observed wins for the 16 most played Warcasters and Warlocks from WTC 2016, plotted against the expected outcome given relative player skill as estimated by Elo rating (number of observations in brackets)

The key concept is that for binomial data like this with two outcomes, it is impossible to measure prediction accuracy from one trial. If a player is rated as having a 60% match-up against a certain opponent, I’d want to observe 10 or 20 games to see if I’m in the right ball park with my predicted estimate. This analysis pools the game results of match-ups with a similar predicted outcome. If players in this bin win 60% of their games, there is no special effect for this caster and our predictions are accurate. The corresponding point in the plot would be on the line. If the players win more than 60%, there may be some effect, such as a powerful caster, influencing the outcome. The corresponding point would be above the line. If the players win less than 60% of their games, the caster may be weak. The point would be below the line.
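The pooling step can be sketched as follows. The data here are simulated stand-ins (well calibrated by construction), not WTC results, and the variable names are my own.

```r
set.seed(42)
# Simulated stand-in data: a predicted win probability for each game and a
# 0/1 result drawn from it.
predicted <- runif(200)
won <- rbinom(200, size = 1, prob = predicted)

# Five bins with (roughly) equal numbers of games, split at the quintiles.
breaks <- quantile(predicted, probs = seq(0, 1, by = 0.2))
bin <- cut(predicted, breaks = breaks, include.lowest = TRUE)

# Expected versus observed win rate per bin, with the number of games.
calibration <- data.frame(
  expected = as.numeric(tapply(predicted, bin, mean)),
  observed = as.numeric(tapply(won, bin, mean)),
  n        = as.numeric(table(bin))
)
calibration
```

Plotting observed against expected gives one point per bin; points near the diagonal indicate that the predictions match the results for that group of games.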

The graphic above shows the observed versus expected win rate, taking into account (estimated) player skill. If a caster is more powerful than average, the points are more likely to lie above the line of unity. If a caster is less powerful than average, the points are more likely to lie below the line.

Since each point contains a fifth of the data, we can assume that the less popular casters (such as Amon 1) could have trends in their data points explained by chance alone. For Amon, 8 games where weaker players played stronger players resulted in a better than 50% outcome. While this could be a real effect, it could equally be caused by chance.

For the most popular casters (Wurmwood 1, Madrak 2, Haley 2) the tailing off in the last point reflects the outcomes of 14-20 games. Good players appear to be performing less well against weaker players than expected. This could be caused by the Elo system itself. The rating assumes that the outcome of a game follows a normal distribution. In reality, game outcomes for one-sided games are less certain in favour of the strong player than a normal distribution would predict.

The other signals in the data away from the line may be caused by only examining the effect of one caster at a time. In my previous analyses, I have estimated the relationship between all casters at once. Powerful casters may have very good as well as very bad match-ups. The detail of the individual games may reveal that a particular warcaster or warlock has a smaller or larger penalty.

As with any analysis, many assumptions have to be made to attempt to gain an understanding of the system. However, at first glance, it appears that these most commonly played casters are well balanced, and the effect on the average game outcome beyond player skill is not completely off kilter.

Edit: A summarization of these plots could be taken as the area under the curve for the trendline for these results. A balanced caster would have a score of 0.5. A caster which gives players a better than 50% chance of winning would be greater than 0.5. Based on this metric, Baldur 2, Caine 2 and Vyros 2 were the most advantageous casters. Madrak 2 was the least advantageous of the 16 most popular casters.

Warcaster Score
1 Baldur 2 0.66
2 Vyros 2 0.64
3 High Reclaimer 1 0.56
4 Ossyan 1 0.56
5 Vayl 2 0.55
6 Amon 1 0.55
7 Wurmwood 1 0.52
8 Haley 2 0.50
9 Severius 2 0.49
10 Lylyth 3 0.44
11 Irusk 2 0.44
12 Karchev 1 0.43
13 Butcher 3 0.41
14 Issyria 1 0.39
15 Madrak 2 0.38
16 Caine 2 0.35

WTC 2016 Predictions

I’ve been looking forward to forecasting the WTC results all year. But I also tried to script clean the WTC team rosters. I did not succeed, although I did make some progress. I used the troop-creator army data JSON files from the project GitHub account. Grabbing War Room or Conflict Chamber format was fine, but picking out player names was tricky. In addition, not everyone submitted their lists in War Room format, so there would need to be some manual intervention to get the complete lists. If this is worth picking up next year I would brute force pattern match on War Room entries, then manually sort through the unmatched elements. Validating lists would probably be an additional amount of effort.

Due to the lack of time, I hammered this analysis out on the night prior to the WTC. For this forecast I registered this year’s player names with ratings from 2013, 2014 and 2015. I also calculated the 35th percentile rating for each competing country. I selected this value arbitrarily. My rationale was that first timers are likely less experienced at high levels of play compared to previous players. This was then used to initialize new players to the WTC. For countries which have never entered before I assumed that there would be fewer top-level players to choose from, and so gave them a rating of 1800 (lower than my earlier initialization of 2200).
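The initialisation rule can be sketched like this, with placeholder countries and ratings (none of these values come from the WTC data, and the helper name is hypothetical):

```r
# Hypothetical ratings of returning players.
ratings <- data.frame(
  country = c("A", "A", "A", "B", "B", "B", "B"),
  rating  = c(2100, 2250, 2300, 1900, 2000, 2200, 2400)
)

# 35th percentile rating per country, used to initialise that country's
# first-time players.
init_by_country <- sapply(split(ratings$rating, ratings$country),
                          quantile, probs = 0.35, names = FALSE)

# Countries with no previous entrants fall back to a fixed rating of 1800.
init_rating <- function(country) {
  if (country %in% names(init_by_country)) {
    init_by_country[[country]]
  } else {
    1800
  }
}

init_rating("A")  # percentile of returning players from country A
init_rating("Z")  # unseen country, falls back to 1800
```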

Since casters have certainly changed in power level since 2015, I decided to not use any home advantage, even for high rated casters such as Madrak 2 and Wurmwood 1.

Based only on the rating of individual players and with new players weighted by the past performance of their countries’ teams:

Rank Team Score.Moment
1 1 Australia Koala 9.65
2 2 Australia Echidna 10.78
3 3 Poland Wisents 14.71
4 4 Scotland Irn 16.23
5 5 Finland Väinämöinen 16.31
6 6 USA Blue 18.22
7 7 Italy Michelangelo 18.60
8 8 Australia Wallaby 19.45
9 9 Sweden Nobel 21.26
10 10 Germany Black 21.31
11 11 Denmark Jotunheim 23.33
12 12 Canada Goose 24.13
13 13 England Knights 26.39
14 14 England Lions 27.24
15 15 USA White 28.59
16 16 Austria Schnitzel 30.63
17 17 England Roses 31.07
18 18 Sweden Bofors 32.34
19 19 Poland Storks 32.81
20 20 Sweden Dynamite 32.82
21 21 Belgium Prinzesschen 34.63
22 22 USA Red 34.88
23 23 Poland Marmots 34.96
24 24 Germany Gold 37.64
25 25 Canada Moose 39.51
26 26 Ireland Ceol 40.39
27 27 France Obelix 40.81
28 28 Austria Apfelstrudel 41.02
29 29 Germany Red 41.91
30 30 Italy Leonardo 45.88
31 31 Finland Joukahainen 47.26
32 32 France Asterix 49.92
33 33 Denmark Asgaard 50.73
34 34 Finland Ilmarinen 52.22
35 35 Netherlands VanGogh 52.76
36 36 Russian Bears 54.17
37 37 Spain Red 56.72
38 38 Portugal Primal 58.96
39 39 Norway Red 59.73
40 40 Scotland Bru 62.43
41 41 Ireland Craic 62.94
42 42 Netherlands Rembrandt 63.46
43 43 Wales Dant 64.05
44 44 Netherlands Vermeer 66.38
45 45 Greece Prime 67.60
46 46 Hungary 70.36
47 47 Middle East 72.36
48 48 Norway Blue 72.83
49 49 Wales Crafanc 73.40
50 50 Belgium Victorious 75.89
51 51 Greece Epic 76.04
52 52 Spain Rogue 76.21
53 53 Switzerland Cheese 76.34
54 54 Russian Wolves 78.98
55 55 Switzerland Chocolate 79.35
56 56 Czech Rep White 80.87
57 57 Latvia 81.19
58 58 Portugal Prime 82.08
59 59 Czech Rep Red 82.77
60 60 China Baizhan 87.82
61 61 Northern Ireland 88.16
62 62 China Egg Roll 88.30
63 63 Slovenia 94.01
64 64 UAE 96.21

You can follow all of the action here.

The Pairing Game

Exciting news for Warmachine and Hordes players. A new edition will be announced in a matter of weeks, bringing in a more balanced and streamlined gaming system. This is awesome for gamers but horrible for my WTC analysis. While I will still have good player rankings (assuming that players remain at the same skill ranking in the new edition), it seems likely that the caster pairing advantage will be dramatically changed. Until data becomes available for the new edition, it will be tricky to make sound predictions.

For now, I’ve been investigating the team pairing process itself. While it is difficult to predict how a team will behave, it is worthwhile to consider how different team behaviours will affect round outcomes.


The WTC pairing process between two teams, A and B, for a round is as follows:

  1. team B reveals their first player,
  2. team A reveals two players,
  3. team B selects one of team A’s revealed players to match with their revealed player,
  4. team B reveals their next two players,
  5. team A selects one of team B’s revealed players to match with their revealed player,
  6. team A reveals their next two players,
  7. team B selects one of team A’s revealed players to match with their revealed player,
  8. team B reveals their next two players,
  9. team A selects one of team B’s revealed players to match with their revealed player,
  10. the remaining two players are matched.

The widely held opinion is that being team A is an advantage, since this team makes the selections that define the outcomes in three of the five matchups. But how much effect does this really have?

In my initial look at this I’m going to simplify this to a player being a number, and a game being won by the player with the highest number. Each team will have players with rating 1, 2, 3, 4, and 5. When a player with rating 1 plays a player with rating 2, the latter player always wins. When a player with rating 1 plays a player with rating 1, the result is always a draw. If a team can engineer their opponents’ strongest players to play their weaker players, they will have opportunities to win the match by having the advantage in multiple games.

In this analysis:

  • there is exactly one value for each player,
  • both teams know the value of all players,
  • when two players are matched, the player with the highest score wins.

This simple model is not taking into account the fact that each player has two lists; therefore the matrix of possible outcomes is a little more complex. This is also not attempting to model the rock-paper-scissors of individual list pairings. Nonetheless, the results for this simple system are interesting.

Each of these picking strategies follows slightly different rules:

  • Pick First means select the first player presented. This is a naive approach which
    does not react in any way to the behaviour of the other team.
  • Pick Random means select any of the players at random. This is also a naive approach,
    and for randomly sorted players, should behave in the same way as picking first.
  • Pick Max means select the player which wins by the most, unless a player cannot
    win, in which case have the player defeated by the largest margin.
  • Pick Just Max means select the player which wins by the least, unless a player cannot
    win, in which case have the player defeated by the largest margin.
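As a rough illustration of the two reactive rules, here is a sketch of the selection step only (not the reveal step). The function names and the handling of ties are my assumptions: a tie is treated as "cannot win", which the descriptions above do not spell out, and the throwdown package may behave differently.

```r
# 'mine' is the value of the selecting team's revealed player;
# 'options' are the values of the two players revealed by the other team.
# Both functions return the value of the opponent chosen for the pairing.
pick_max <- function(mine, options) {
  margin <- mine - options
  if (any(margin > 0)) {
    options[which.max(margin)]   # win by the most
  } else {
    options[which.min(margin)]   # cannot win: lose by the most
  }
}

pick_just_max <- function(mine, options) {
  margin <- mine - options
  if (any(margin > 0)) {
    max(options[margin > 0])     # win by the least
  } else {
    options[which.min(margin)]   # cannot win: lose by the most
  }
}

pick_max(3, c(1, 2))       # takes the 1, winning by two
pick_just_max(3, c(1, 2))  # takes the 2, winning by one
```

Losing by the largest margin when no win is available sacrifices a weak player into the opponent's strongest revealed player.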

Which do you think is the most effective strategy? If you were to play this game, what strategy would you play? One of these, or some other? To determine the best, I implemented each of these strategies, and the pairing game itself, in R and have published this code as the package throwdown on GitHub.

Using these algorithms I was able to simulate 100,000 pairing games between teams with player value 1 to 5. A team won when it had more games where the player value was greater. A game was a draw (could have gone either way) when two players of the same value are matched. A match is a draw when both teams have the same number of wins.

Since the player values were randomly sorted, Pick First and Pick Random should perform in the same manner. It is commonly stated that team A has an advantage in the pairing game, since they are able to dictate the last two matchups. When teams pick randomly, this advantage is lost, since the teams are as likely to pick the unfavourable matchups as the best matchups. The following tables show the proportion of 100,000 games drawn, and the proportion of non-drawn games won by team B. The Pick First and Pick Random strategies each have a 50% likelihood of winning against themselves or the other algorithm, whether as Team B or A. This suggests that extra benefits given to team B may be unnecessary. But what about following a more reactive strategy?

Team B Team A Team B Wins Draws
pickerFirst pickerFirst 0.51 0.37
pickerRandom pickerFirst 0.50 0.38
pickerFirst pickerRandom 0.49 0.38
pickerRandom pickerRandom 0.50 0.38
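The random-versus-random rows can be checked with a much simpler simulation than the full pairing game. This sketch assumes that when both reveals and selections are random, the final pairing is just a uniformly random matching of the two teams; it does not use the throwdown package.

```r
set.seed(2016)
n_sims <- 100000

# Each simulation randomly matches team A's values 1..5 against team B's.
result <- replicate(n_sims, {
  a <- 1:5
  b <- sample(1:5)
  sign(sum(b > a) - sum(a > b))  # 1: team B wins, 0: draw, -1: team A wins
})

p_draw   <- mean(result == 0)               # proportion of drawn matches
p_b_wins <- mean(result[result != 0] == 1)  # team B share of non-drawn matches
c(draws = p_draw, team_b_wins = p_b_wins)
```

Counting the 120 possible matchings directly gives 46 drawn matches, or about 0.38, which agrees with the Draws column above, and by symmetry the non-drawn matches split evenly.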

Things look different when one of the teams plays the Pick Max strategy. Assuming that draws are 50% in favour of each player, Team B has only a 40% chance of winning. This is because the Pick Max strategy is unnecessarily aggressive when revealing the first player. When Team A plays the Pick Max strategy, they only attempt to beat Team B when it is possible to. When one team is playing the Max strategy, and the other the Random or First strategy, it is preferred to be Team A.

Team B Team A Team B Wins Draws
pickerMax pickerFirst 0.35 0.43
pickerMax pickerRandom 0.37 0.42
pickerFirst pickerMax 0.35 0.43
pickerRandom pickerMax 0.38 0.41

The outcomes go entirely crazy when you look at the Pick Just Max strategy. When team B uses the Pick Just Max strategy against a team playing the Pick First strategy, the Pick Just Max team wins about 70% of non-drawn matches. When team A uses the Just Max strategy against a team picking First, they win about 95% of non-drawn matches. Again, being team A is preferred. When team B uses the Pick Just Max strategy against a team playing the Pick Random strategy, the Pick Just Max team only wins 30% of non-drawn matches. When team A uses the Just Max strategy against a team picking Randomly, they only win 55% of non-drawn matches. In this instance, the team using the Just Max strategy prefers to be team A, but their choices leave them open to defeat against a Random selection.

Team B Team A Team B Wins Draws
pickerJustMax pickerFirst 0.72 0.53
pickerJustMax pickerRandom 0.30 0.53
pickerFirst pickerJustMax 0.06 0.40
pickerRandom pickerJustMax 0.45 0.49

Just Max is performing very well against Pick First, but poorly against Pick Random. At the reveal step, Just Max is revealing the player which just beats the other team’s choice first, and then either the player which beats that choice more thoroughly, or if no others win, which loses by the most. This means that Pick First is always selecting the best option for Pick Just Max, whereas Pick Random is sometimes winning, sometimes losing. For the Pick Just Max strategy, Pick First is losing optimally, hence the far worse performance than Pick Random. The default opening for Pick Just Max is to reveal the strongest player. This is because the strongest player has the most possible wins. Perhaps this is a weakness, and a lower, or the lowest ranked player should open. That would provide fewer occasions where the other team can play the weakest player into the strongest player, and offer fewer winning combinations.

Playing combinations of Max and Just Max is perhaps the least interesting. Team B picking the Max rated player available to them against Team A picking the Max rated player always results in a win for Team B. This is because team B picks 3 games, selecting the Max victory, and loses 2 games.

When Team B picks Just Max and A picks Max, every game is a draw, since Team B selects a draw between the 5 rated players, then each team wins games in turn as a player is left exposed while the other team has a higher rated player in hand.

When Team B picks Max and A picks Just Max, team A wins every game, with no draws. At each reveal, B is too greedy, allowing A to win 3 games each match.

When both teams play the Just Max strategy, all games result in a draw, as both teams are cautious with how they reveal their players.

Team B          Team A          Team B Wins   Draws
pickerMax       pickerMax       1.00          0.00
pickerJustMax   pickerMax                     1.00
pickerMax       pickerJustMax   0.00          0.00
pickerJustMax   pickerJustMax                 1.00

These simulations show that the behaviour of these simple strategies can be unexpected, even for this very simple representation of the pairing game. So far I have not found evidence that team A is advantaged. While I will be investigating this system further, these early results suggest that the organizers should not give too much benefit to team B, as while it may appear that A has an advantage, it may not be possible for them to capitalize on it.

These simulations also show that it is not trivial to create a strategy that is better than picking matchups randomly, particularly as the opposing team’s strategy may not be known, or may evolve during the tournament or even during the pairing process.

world team championships 2015 part 4


Excitement is already building about WTC 2016! With the extra data released by the WTC team, there’s a great opportunity to improve predictions for next year. To do so, I first need to assess how well I predicted this year’s results. I was very proud when Italy Michelangelo beat Poland Grunts in round 1, but I had winners Finland Blue as one of the lowest rated teams.

First off, I ran out of time when making predictions previously. After generating rank distributions for each team, I simply used the position of the maximum of each distribution to order the teams. This does not take into account the shape of the distribution profiles. A better estimate of the typical value of a rank distribution is its moment, in which each rank is weighted by how often it occurred. This gives a better summary of the simulation data.
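To illustrate why the choice of summary matters, here is a minimal sketch using two made-up rank profiles (profile_a and profile_b are hypothetical, not simulation output). It compares the most likely finishing place with a mean finishing place. The helper mean_rank is an illustration only: it divides by sum(x) so that its values sit on the rank scale, which is an assumption of this sketch and differs from the scaling of the moment function used below.

```r
# Two hypothetical rank profiles: probability of finishing in places 1 to 5
profile_a <- c(0.30, 0.10, 0.10, 0.10, 0.40)
profile_b <- c(0.10, 0.10, 0.10, 0.35, 0.35)

# most likely finishing place (position of the maximum of the profile)
mode_rank <- function(x) which.max(x)

# probability-weighted mean finishing place (normalised, an assumption here)
mean_rank <- function(x) sum(x * seq_along(x)) / sum(x)

c(a = mode_rank(profile_a), b = mode_rank(profile_b))
c(a = mean_rank(profile_a), b = mean_rank(profile_b))
```

Ordered by the maximum, team b is placed ahead of team a (4th against 5th). Ordered by the mean finishing place the order reverses (3.2 against 3.75), because team a also has a substantial chance of finishing first.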




R Code for Summarizing Rank Profile Moment
moment <- function(x) {
    # weight each rank position by the profile value at that rank
    mean(x * seq_along(x))
}

mmax <- apply(X = res, MARGIN = 1L, FUN = moment)
reso <- res[order(mmax), ]
mmax <- mmax[order(mmax)]

#           England Lions                USA Stars 
#                 14.7988                  14.8010 
#      Italy Michelangelo         Australia Wombat 
#                 15.3058                  15.4640 

Rank  Team                      Score.Moment
1     England Lions             14.80
2     USA Stars                 14.80
3     Italy Michelangelo        15.31
4     Australia Wombat          15.46
5     Sweden Nobel              19.94
6     Poland Grunts             21.25
7     Poland Leaders            28.15
8     Scotland Irn              28.97
9     Sweden Dynamite           31.17
10    Germany Dichter & Denker  33.49
11    Ireland Craic             33.54
12    Australia Platypus        35.61
13    Germany Bier & Brezel     38.68
14    England Roses             38.73
15    Belgium Blonde            39.92
16    Netherlands Lion          44.30
17    Canada Goose              45.94
18    Finland Blue              48.45
19    Canada Moose              49.32
20    Denmark Red               50.20
21    Finland White             52.47
22    USA Stripes               52.93
23    China                     52.97
24    Belgium Brown             53.75
25    Denmark White             54.19
26    Ireland Ceol              54.46
27    Greece Prime              55.21
28    Greece Epic               55.49
29    Spain North               55.52
30    Middle East               55.87
31    Spain South               56.22
32    United Nations            56.77
33    Wales Storm               57.01
34    Wales Fire                57.15
35    Italy Leonardo            61.20
36    Ukraine                   62.63
37    Scotland Bru              63.44
38    France Asterix            64.29
39    Russia Wolves             64.58
40    Russia Bears              66.00
41    France Obelix             66.84
42    Northern Ireland 1        69.35
43    Netherlands Hero          71.47
44    Portugal Prime            72.55
45    Norway Blue               72.58
46    Northern Ireland 2        73.64
47    Norway Red                73.82
48    Czech Republic            78.65
49    Switzerland Red           80.90
50    Portugal KULT             89.98




Next up is to create a statistic that summarizes the goodness of a rank prediction. I created a function to score a guessed sequence. The score is the mean distance of each guessed position from the actual result, that is, the sum of the absolute differences between the predicted and actual positions, divided by the number of guesses. So, for example, guessing exactly right has a score of 0. Guessing the first three teams in reverse order scores (2+0+2)/3 = 1.33. I also added partial matching to allow guesses just for countries (for example if the team names change), in which case the average difference for each team is returned.

> library(WTCTools)
> scoreSequence(guess = letters[1:3], result = letters)
[1] 0
[1] 3
> scoreSequence(guess = c("c", "b", "a"), result = letters)
[1] 1.333333
[1] 3
> scoreSequence(guess = "Australia", result = leaderboard15$Team)
[1] 2
[1] 1
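The core of the score can be sketched in a few lines of base R. This is a minimal illustration of the mean absolute distance described above, not the WTCTools implementation: score_guess is a hypothetical helper, it assumes every guess appears exactly once in the result, and it leaves out the partial matching on country names.

```r
# mean absolute distance between guessed and actual positions
# (hypothetical helper; assumes every guess is present in result)
score_guess <- function(guess, result) {
    actual <- match(guess, result)        # where each guessed team really placed
    guessed <- seq_along(guess)           # where it was guessed to place
    mean(abs(guessed - actual))
}

score_guess(guess = letters[1:3], result = letters)
score_guess(guess = c("c", "b", "a"), result = letters)
```

The first call scores 0 and the second (2+0+2)/3, matching the worked example in the text.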

The probability of getting a certain score given a number of guesses can be estimated by simulation.




R Code for Estimating Probability Density
> # get all n
> NN <- 2e3
> scor <- seq(from = 0, to = 50, by = 0.1)
> rres <- matrix(0, nrow = NN, ncol = 50)
> dens <- matrix(0, nrow = length(scor), ncol = 50)
> for (j in 50:1) {
+      for (i in seq_len(NN)) {
+           # score random guess
+           gg <- c(sample(lead$Team, size = j), rep("", times = 50 - j))
+           rres[i, j] <- scoreSequence(guess = gg[order(ind[i, ])], 
+                result = lead$Team)
+      }
+      # sequence of scores
+      rres[, j] <- sort(rres[, j])
+      # cumulative probability
+      # percentage of scores that are less than score threshold
+      for (qu in seq_along(scor)) {
+           dens[qu, j] <- 100 * sum(rres[, j] < scor[qu]) / NN
+      }
+ }
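The loop above relies on objects (lead and ind) that are defined elsewhere. As a self-contained sketch of the same idea, the code below scores purely random guesses for 50 teams and compares the spread of scores for 1 and 25 guesses. The names random_score and iqr_for are hypothetical, and the way a partial guess is laid out (random teams placed at random positions) is an assumption of this sketch rather than a reproduction of the original loop.

```r
set.seed(1)
n_teams <- 50
n_sim <- 2000

# score one random guess of n_guess teams (assumed layout, see above)
random_score <- function(n_guess) {
    actual <- sample(n_teams, size = n_guess)    # true places of the guessed teams
    guessed <- sample(n_teams, size = n_guess)   # places at which they were guessed
    mean(abs(guessed - actual))
}

# range containing the middle 50% of simulated scores
iqr_for <- function(n_guess) {
    scores <- replicate(n_sim, random_score(n_guess))
    quantile(scores, probs = c(0.25, 0.75))
}

iqr_1 <- iqr_for(1)
iqr_25 <- iqr_for(25)
rbind(one_guess = iqr_1, twentyfive_guesses = iqr_25)
```

Averaging over more guesses narrows the middle 50% range, which is the behaviour the cumulative probability traces are meant to show.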




These probabilities can be visualized as a series of traces. When guessing just one team’s rank, the score could be any value between 1 and 39, with a 50% chance of being in the range 7-24. As the number of guesses increases, the score is more likely to be a middle value. So when making 5 guesses, there is a 50% chance of being in the range 13-20. When making 25 guesses, there is a 50% chance of being in the range 15-18.





R Code for Plotting Cumulative Probability
> library(colorspace)
> # one colour per number of guesses
> cols <- diverge_hsv(n = 50)
> matplot(x = scor, y = dens, 
+     xlab = "Score", ylab = "Cumulative Probability (%)", 
+     col = cols, type = "l", lty = 1)
> legend(x = "bottomright", legend = c(1, 10, 40, 50), 
+     col = cols[c(1, 10, 40, 50)], lty = 1)




The benchmark for place prediction is using the 2014 ranking to place the 2015 teams. Of the 52 teams in WTC2014, 17 teams returned with the same name, and 43 nation teams returned. Guessing this year’s team result based on last year’s performance gives a score of between 9.8 and 10, depending on which two non-present teams are excluded. My score for 2015 based on my posted prediction is 11.3, which is a poor showing! Using the moment rather than the outcome distribution maximum gives me a score of 9.8, as good as the best prediction based on past performance (and for 49 teams, rather than 42). This is a relief, but obviously can be improved!

So why did my analysis rate Finland Blue so poorly? My analysis was based solely on the 2014 results. Of the three players in Finland Blue who played in 2014, Jaakko had a good record (5-1), but Tatu and Mikko had an average record (3-3). With the two unrated players, Henry and Pauli, also assumed to be average players, this put the Finnish team right in the middle of the pack (18/50). However, all three veteran players had a good record in 2013. Including this data shows that the team was stronger than their 2014 performance suggested. They were not the favourites, although their caster selection looks strong. With three years’ WTC data I believe my predictions for 2016 will score better!

world team championships 2015 part 3

The Elo rating gave some predictiveness for game outcomes. But the small number of games present, even in three years of WTC data, means that there was not sufficient sensitivity to describe the players. Elo made a number of assumptions for ease of computation. Some of these assumptions can be relaxed thanks to the power of a desktop PC. One assumption is that the variability in a player’s rating is the same for all players. This assumption is relaxed by the Glicko and Stephenson methods; the Stephenson method is implemented by the steph function in R.

Using three years’ data we can apply the pair lookup method for predicting the WTC outcome. Since the earlier analysis I have further restricted how far each pair penalty can move in each updateLookup call. To recreate this analysis, create the ‘scorefrac’ column as before.

> # create scorefrac as before
> wtc <- na.omit(wtc)
> wtc$allrounds <- wtc$round + (wtc$year - 2013) * 6 
> head(wtc)
   game_id round TP year victory_condition        scenario
1       1     1  1 2013       Caster Kill Into the Breach
2       2     1  0 2013       Caster Kill Into the Breach
3       3     1  1 2013       Caster Kill Into the Breach
4       4     1  1 2013       Caster Kill Into the Breach
5       5     1  0 2013          Scenario Into the Breach
6       6     1  1 2013       Caster Kill Into the Breach
               player1            team1  list1
1   Pär-Ola Nilsson Team Epic Sweden  Vlad2
2 Robert Willemstein Team Netherlands Feora2
3      Christian Aas Team Epic Sweden  Vayl2
4      Jeppa Resmark Team Epic Sweden Haley2
5      Harm Kleijnen Team Netherlands  Vayl2
6     Alexander Grob Team Austria Red Skarre
                  faction1 CP1        AP1             player2
1                 Khador   0 0.57377049          Aat Niehot
2 Protectorate of Menoth   0 0.08196721         Joakim Rapp
3   Legion of Everblight   0 0.91803279         Tom Starren
4                 Cygnar   0 0.90163934      Casper Jellema
5   Legion of Everblight   1 0.26229508 Christoffer Wedding
6                   Cryx   4 0.98360656          David Kane
               team2    list2       faction2 CP2        AP2
1 Team Netherlands    Borka     Trollblood   2 0.03278689
2 Team Epic Sweden  Bartolo    Mercenaries   4 0.06557377
3 Team Netherlands Butcher2         Khador   1 0.01639344
4 Team Netherlands Krueger2 Circle Orboros   0 0.00000000
5 Team Epic Sweden Krueger2 Circle Orboros   5 0.08196721
6    Team Scotland    Vlad2         Khador   1 0.01639344
   scorefracCP scorefracAP scorefrac allrounds
1   0.3000000   0.9459459 0.6229730         1
2   0.1000000   0.5555556 0.3277778         1
3   0.4000000   0.9824561 0.6912281         1
4   0.5000000   0.9508197 0.7254098         1
5   0.1666667   0.7619048 0.4642857         1
6   0.8000000   0.9836066 0.8918033         1

From this dataset we can perform the analysis for each year. To compare this method with the Elo rating, we can perform three cycles of analysis.

> pairLookup <- initializeLookup(
+   data = unique(c(wtc$list1, wtc$list2)))
> wtc2013 <- wtc[wtc$year == 2013, ]
> # sequentially optimize restricted 
> pairLookup13 <- updateLookup(data = wtc2013, 
+     pairlookup = pairLookup, 
+     penalty = 10, result = "TP")
> # ratings based on pairings with selected caster pairings
> rating2013 <- steph(
+   x = wtc2013[, c("round", "player1", "player2", "TP")], 
+       gamma = getMatrixVal(
+            list1 = wtc2013[, "list1"], 
+            list2 = wtc2013[, "list2"], 
+            x = pairLookup))
> rating2013$ratings[1:8, 
+     c("Player", "Rating", "Deviation", "Games", "Win", "Loss")]
                  Player   Rating Deviation Games Win Loss
1  Andrzej Kasiewicz 2647.358  165.8147     5   5    0
2         Will Pagani 2610.678  172.8383     5   5    0
3      Moritz Riegler 2586.560  175.4116     5   5    0
4 Keith Christianson 2585.981  170.8883     5   5    0
5       Johan Persson 2570.921  174.0227     5   5    0
6             Enno May 2512.887  168.4785     5   4    1
7         Tomek Tutaj 2510.662  163.5944     5   4    1
8       Lewis Johnson 2509.211  187.4594     5   5    0

Each year the pair lookup table is updated first, and then the player ratings are calculated. The lookup table and ratings are then used to initialize the analysis for the following year.

> wtc2014 <- wtc[wtc$year == 2014, ]
> pairLookup14 <- updateLookup(data = wtc2014, 
+      pairlookup = pairLookup13, 
+      penalty = 10,
+      result = "TP")
> rating2014 <- steph(
+   x = wtc2014[, c("allrounds", "player1", "player2", "TP")], 
+      status = rating2013$ratings,
+      gamma = getMatrixVal(
+            list1 = wtc2014[, "list1"], 
+           list2 = wtc2014[, "list2"], 
+           x = pairLookup14))
> rating2014$ratings[1:8, 
+      c("Player", "Rating", "Deviation", "Games", "Win", "Loss")]
            Player   Rating Deviation Games Win Loss
1           Brian White 2663.619  126.0339    11  10    1
2           Will Pagani 2656.168  132.2770    11  10    1
3  Andrzej Kasiewicz 2648.529  132.5830    11  10    1
4            Colin Hill 2624.061  161.8688     6   6    0
5            Ben Leeper 2613.997  152.6823     6   6    0
6 Keith Christianson 2585.981  170.8883     5   5    0
7       Johan Persson 2570.921  174.0227     5   5    0
8      Jake Van Meter 2564.071  125.5282    11   9    2

This approach means that at the start of 2015, many players already have a rating, and matchups already have a home advantage.

> wtc2015 <- wtc[wtc$year == 2015, ]
> pairLookup15 <- updateLookup(data = wtc2015, 
+       pairlookup = pairLookup14, 
+      penalty = 10,
+      result = "TP")
> rating2015 <- steph(
+   x = wtc2015[, c("allrounds", "player1", "player2", "TP")], 
+     status = rating2014$ratings,
+      gamma = getMatrixVal(
+            list1 = wtc2015[, "list1"], 
+            list2 = wtc2015[, "list2"], 
+            x = pairLookup15))
> rating2015$ratings[1:8, 
+     c("Player", "Rating", "Deviation", "Games", "Win", "Loss")]
                   Player   Rating Deviation Games Win Loss
1             Brian White 2639.599  112.2389    17  15    2
2              Jay Larsen 2638.713  151.8721     6   6    0
3            Sheldon Pace 2635.978  149.9138     6   6    0
4              Colin Hill 2624.061  161.8688     6   6    0
5 William Cruickshanks 2599.129  125.5863    12  11    1
6              Aaron Wale 2598.380  156.4778     6   5    1
7       Jaakko Uusitupa 2597.534  108.0831    17  15    2
8             Peter Bates 2596.015  181.0864     6   6    0

This new rating penalizes Jaakko for playing all of his games with a high advantage caster, Haley2, dropping him down to seventh place. If Jaakko and Brian play a game, this method predicts that Jaakko has a 42% chance of winning if Brian plays Harbinger and a 46% chance of winning if he plays High Reclaimer. If Jaakko plays the player rated lowest by the Elo method, he is estimated as having a 98% chance of winning.

> unique(c(wtc2015$list1[wtc2015$player1 == "Jaakko Uusitupa"],
+      wtc2015$list2[wtc2015$player2 == "Jaakko Uusitupa"]))
[1] "Haley2"
> unique(c(wtc2015$list1[wtc2015$player1 == "Brian White"],
+      wtc2015$list2[wtc2015$player2 == "Brian White"]))
[1] "Harbinger"      "High Reclaimer"
> # Harbinger as list2 is slightly favoured into Haley2
> gm <- getMatrixVal(x = pairLookup15, 
+      list1 = "Haley2", 
+      list2 = c("Harbinger", "High Reclaimer"))
> gm
[1] -21.38652  18.10739
> predict(object = rating2015, 
+      newdata = data.frame(c(19, 19), 
+     "Jaakko Uusitupa", "Brian White"), 
+      gamma = gm)
[1] 0.4188713 0.4691343

I performed the method calibration as for the Elo rating, by not including the final round from 2015 in the pair lookup calculation, or for creating the player ratings. This is a much better prediction of the final round results. Each point is the average of 12 or 13 results. The points are not completely on the line partly because these are the average of a small number of results. This is a good result and suggests that this method is worth pursuing further.

> ggplot() + 
+       geom_point(data = data.frame(
+             Predicted_Quantile = pQuantiles, 
+             Result_Quantile = rQuantiles),
+       aes(x = Result_Quantile, y = Predicted_Quantile)) + 
+       geom_abline() +
+       theme_economist()

For comparison with the Elo rating, here are the top 16 players from 2015. This method has allowed players with fewer games to place higher up the results table.

> players2015 <- unique(
+      unlist(
+            wtc2015[, c("player1", "player2")]))
> head(rating2015$ratings[
+      rating2015$ratings$Player %in% players2015, 
+      c("Player", "Rating", "Deviation", "Games", "Win", "Loss")], n = 16)
Player                 Rating    Deviation  Games  Win  Loss
Brian White            2639.599  112.2389   17     15   2
Jay Larsen             2638.713  151.8721   6      6    0
Sheldon Pace           2635.978  149.9138   6      6    0
William Cruickshanks   2599.129  125.5863   12     11   1
Aaron Wale             2598.38   156.4778   6      5    1
Jaakko Uusitupa        2597.534  108.0831   17     15   2
Peter Bates            2596.015  181.0864   6      6    0
Sascha Maisel          2580.45   155.1466   6      6    0
Bubba Dalton           2554.847  172.07     6      6    0
Konrad Sosnowski       2552.253  130.6567   12     11   1
Ben Leeper             2536.705  123.9089   12     10   2
Alessandro Montagnani  2528.729  126.5368   12     11   1
Adam Bell              2517.283  156.1071   6      5    1
Tomek Tutaj            2515.555  112.9234   17     14   3
Patrick Dunford        2513.166  125.5094   12     10   2
David Potts            2512.103  124.0504   12     10   2

world team championships 2015 part 2

The 2015 data from the Warmachine World Team Championships was recently made available from the WTC blog. I’ve done a little cleanup of the dataset from the raw file and added it to WTCTools.

> library(WTCTools)
> head(wtc, n = 3)
  game_id round TP year victory_condition        scenario
1       1     1  1 2013       Caster Kill Into the Breach
2       2     1  0 2013       Caster Kill Into the Breach
3       3     1  1 2013       Caster Kill Into the Breach
             player1            team1  list1
1   Pär-Ola Nilsson Team Epic Sweden  Vlad2
2 Robert Willemstein Team Netherlands Feora2
3      Christian Aas Team Epic Sweden  Vayl2
                faction1 CP1 AP1     player2
1                 Khador   0  35  Aat Niehot
2 Protectorate of Menoth   0   5 Joakim Rapp
3   Legion of Everblight   0  56 Tom Starren
             team2    list2    faction2 CP2 AP2
1 Team Netherlands    Borka  Trollblood   2   2 
2 Team Epic Sweden  Bartolo Mercenaries   4   4 
3 Team Netherlands Butcher2      Khador   1   1 

Since I want to provide the best description of game outcome, let’s first baseline predictability against the original statistically based measure of competitive game rankings, the Elo rating. The elo function from PlayerRatings calculates the Elo rating in R.

> library(PlayerRatings)
> wtc$allrounds <- wtc$round + (wtc$year - 2013) * 6
> ratingElo <- elo(
+      x = wtc[, c("allrounds", "player1", "player2", "TP")])
> players2015 <- unique(
+      unlist(
+          wtc[wtc$year == 2015, c("player1", "player2")]))
> head(ratingElo$ratings[
+      ratingElo$ratings$Player %in% players2015, 
+     c("Player", "Rating", "Games", "Win", "Loss")], n = 16)

Player                 Rating  Games  Win  Loss
Jaakko Uusitupa        2347.5  17     15   2
Brian White            2343.2  17     15   2
Tomek Tutaj            2317.3  17     14   3
William Cruickshanks   2314.9  12     11   1
Alessandro Montagnani  2311.9  12     11   1
Konrad Sosnowski       2308.4  12     11   1
Robin Maukisch         2300.4  17     13   4
Patrick Dunford        2297.0  12     10   2
Christoffer Wedding    2296.0  17     13   4
Will Pagani            2291.4  17     13   4
Ben Leeper             2291.4  12     10   2
Matt Goligher          2289.4  12     10   2
David Potts            2288.9  12     10   2
Sietse Sommeling       2284.7  12     10   2
Andrzej Kasiewicz      2283.8  17     13   4
Marcin Mycek           2281.1  17     11   6

The default initial rating for this function is 2200. The Elo rating calculation has a tuning parameter, K, which determines how quickly players’ scores change. The elo function has a default K of 27 (the kfac argument). Increasing K could allow players who had not played in previous years to move up the table faster.
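To make the role of K concrete, here is a minimal sketch of the textbook Elo update on the usual 400-point scale. The helpers elo_expected and elo_update are hypothetical names for illustration, and the sketch is not claimed to match the internals of the elo function exactly.

```r
# expected score of player A against player B (textbook Elo, 400-point scale)
elo_expected <- function(rating_a, rating_b) {
    1 / (1 + 10^((rating_b - rating_a) / 400))
}

# new rating for player A after a game; result is 1 for a win, 0 for a loss
elo_update <- function(rating_a, rating_b, result, k = 27) {
    rating_a + k * (result - elo_expected(rating_a, rating_b))
}

# two new players, both on the initial rating of 2200; player A wins
elo_update(2200, 2200, result = 1)          # 2213.5
elo_update(2200, 2200, result = 1, k = 54)  # 2227
```

Doubling K doubles the step taken after each game, so a new player on a winning run climbs faster, at the cost of noisier ratings for everyone.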

The difference in rating between the two players is used to estimate the probability of player 1 winning. For example, Jaakko Uusitupa is estimated as having a 55% chance of beating Brian White. Jaakko has an 84% chance of beating the lowest rated player.

> predict(object = ratingElo, 
+   newdata = data.frame(19, "Jaakko Uusitupa", "Brian White"), 
+   gamma = 0)
[1] 0.5454877

The Elo ratings can be tested by seeing how predictive they are in the final round.

> # exclude final round
> ratingEloTest <- elo(
+       x = wtc[wtc$allrounds != 18, 
+           c("allrounds", "player1", "player2", "TP")])
> # predict probability of player 1 winning
> wtc$pplayer1 <- predict(object = ratingEloTest, 
+      newdata = wtc[, c("allrounds", "player1", "player2")], 
+      tng = 0, gamma = 0)
> # plot calibration
> lastRound <- wtc[wtc$allrounds == 18, ]
> # split into even sized groups 
> pBins <- quantile(
+      x = lastRound[["pplayer1"]], 
+      probs = seq(from = 0, to = 1, by = 0.1), 
+      na.rm = TRUE)
> lastRound$bins <- cut(lastRound$pplayer1, breaks = pBins)
> # summarize actual results
> rQuantiles <- tapply(
+      X = lastRound[["TP"]], 
+      INDEX = lastRound$bins, 
+      FUN = mean)
> # compare with predictions
> pQuantiles <- tapply(
+      X = lastRound[["pplayer1"]], 
+      INDEX = lastRound$bins, 
+      FUN = mean)

If we are predicting perfectly, the mean results should fall on a straight line with the mean prediction (i.e. we expect 40% of all games predicted as 40% chance of victory for player 1 to be a victory for player 1).

> library(ggplot2)
> library(ggthemes)
> ggplot() + 
+      geom_point(data = data.frame(
+           Predicted_Quantile = pQuantiles, 
+           Result_Quantile = rQuantiles),
+      aes(x = Result_Quantile, y = Predicted_Quantile)) + 
+      geom_abline() +
+      theme_economist()

The Elo rating is not giving us a very wide range of probabilities, so we are overestimating low probability outcome games and underestimating high probability outcome games. Nevertheless, the Elo rating is providing some estimate of the match outcome. This is the benchmark that we need to beat to improve our predictions of match outcomes.