
A better way to measure game balance using game theory

Game theory offers an excellent tool for thinking about game balance: the Nash Equilibrium. Nash Equilibria tell us how often players should pick each possible strategy in a competitive game if they are playing optimally.

But there’s a problem. If you describe how the different options in a game (characters, civilizations, etc.) perform against each other in a game theory matrix and then find the Nash Equilibrium, the result is almost always too “picky” to match even high-stakes competitive play: it will say that no one should ever choose options that people do in fact choose, and even win tournaments with!

I think there’s a simple reason why Nash Equilibria seem to fail in these real-world examples: we need to take into account that individual players of a game have unique strengths and weaknesses. Games aren’t best represented by one overall game theory matrix; they’re best represented by a bunch of slightly different game theory matrices, one for each player. We will see that taking this into account gives us a much more realistic picture of game balance.

How game theory measures game balance

Imagine you’re playing rock, paper, scissors against a friend. If you only ever played paper, your friend would pick scissors and beat you indefinitely. If your friend only ever chose between rock and paper, you could always play paper, winning whenever they picked rock and never losing. But if you both randomly choose rock, paper, and scissors without favoring any of them, neither you nor your friend could alter your approach to gain an advantage. Whenever neither player would be better off changing their strategy, game theory tells us we have a Nash Equilibrium.

Thankfully, most games are a fair bit more complicated than rock, paper, scissors. But the same principle applies: if players have a set of options that perform differently against each other, then there is some rate at which each option should be played if every competitor is playing optimally to win.

These Nash Equilibria give us a picture of how balanced a game is. If we know the matchups between all the options the game presents, the Nash Equilibrium will tell us how often each option should be picked in optimal play. Or at least, it should.

The Useless Character Problem

Even though the Nash Equilibrium is the right kind of thing to measure game balance in a lot of scenarios, it tends not to match what we actually see. It’s especially likely to declare options unviable when they are perfectly viable in the real metagame.

Despite Pokemon types having gone through a couple of rounds of balance improvements, the Nash Equilibrium for the type chart leaves out 7 of the 18 types (and has two others well under 1%), meaning fewer than half of the types get any real play in the Nash Equilibrium strategy. The situation is much worse in the original Super Smash Bros. and its sequel, Super Smash Bros. Melee, whose matchup charts result in only one or two characters being deemed competitively viable by their Nash Equilibria.

Nash Equilibria tend to be unwelcoming places for anything but the top options. They completely eliminate many or even most options, suggesting that no one should pick them in a competitive environment. Taken literally, they predict that the metagames of a lot of games should be heavily dominated by a small number of top options.

But the metagames we actually see are almost always much more diverse than the Nash Equilibria suggest, even at the highest competitive levels; real competitive metagames are rarely as dominated by the top options as Nash Equilibria are. Measuring game balance with Nash Equilibria has a “Useless” Character Problem: it designates some options as useless in competitive settings when in real life they aren’t useless at all. Hundreds of Super Smash Bros. tournaments have been won by characters that the Nash Equilibrium suggests no one should ever play competitively. Nash Equilibria are missing what they’re supposed to capture.

Can we fix this?

Alex Jaffe offers one solution (the nitty-gritty details of which are in Chapter 8 of his dissertation): manually set the usage rate for a top character to a realistic value. For instance, if the Nash Equilibrium says the top character should be played 67% of the time, but they only actually get played 25% of the time, we just manually set that character’s play rate to 25% and then find the best possible strategy given this constraint. This lets us immediately tamp down on the Nash Equilibrium’s overzealous love of top-tier options and get something closer to a realistic picture of the usage of different options.
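To make the idea concrete, here is a minimal sketch of that kind of constrained optimization in Python. This is my own illustration, not Jaffe’s implementation: it assumes a zero-sum payoff matrix (each entry is a win rate minus 50%, as described in the math section below), pins one option’s play rate, and finds the mix that maximizes the guaranteed payoff.

import numpy as np
from scipy.optimize import linprog

def constrained_strategy(payoff, pinned_option, pinned_rate):
    # Best worst-case mix of options when one option's play rate is pinned.
    # payoff[i, j] is option i's expected payoff against option j (0 = even matchup).
    A = np.asarray(payoff, dtype=float)
    m, n = A.shape
    # Variables: m play rates followed by v, the guaranteed payoff; we maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0  # linprog minimizes, so we minimize -v
    # Against every opposing option j, our mix must earn at least v.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Play rates must sum to 1 (the trailing 0 leaves v out of this constraint).
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, 1)] * m + [(None, None)]
    bounds[pinned_option] = (pinned_rate, pinned_rate)  # e.g. force 25% usage
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return result.x[:m]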

This solution can be very helpful, but it has a couple of limitations. It is completely ad hoc; we aren’t changing our approach in a way that naturally gets us a more accurate answer. We can set the top strategy (or strategies) to realistic values, but this can only accurately reflect the metagame if we already know what these values should be. This means that its ability to predict usage from a matchup chart or identify whether some strategies are being over- or underutilized is limited.

Here’s an alternative approach.

Measuring game balance better

The primary insight that motivated my approach is that not everyone is playing the same game.

Well, in a literal sense they may be, but each player has their own strengths, weaknesses, and experiences. People differ in their ability to react quickly, time inputs, take many actions in a short period of time, track multiple sources of information, accurately make precise movements, quickly readjust to an opponent’s playstyle, and a whole host of other small skills.

If the different options the game provides reward different skillsets, then not everyone is actually faced with the same choice. But matchup charts describe how favorable each matchup is on average, not how favorable it is from the perspective of individual players.

For a fighting game, Juan might be better at playing slow, defensive characters that try to wait out opponents and capitalize on their mistakes, while Will might be better at playing lightning-fast characters that overwhelm their opponents with a string of attacks. Whatever the Nash Equilibrium of the game’s matchup chart says, they would probably each be best off using different characters.

This means that the matchup chart that describes the metagame overall doesn’t actually describe the game each individual player is playing. For some people, playing with a character that doesn’t get any play in the Nash Equilibrium strategy for the overall matchup matrix might actually be their best strategy because that character works well with their skillset.

A game theoretic approach to measuring game balance should take into account that not every player is exactly average. Each player has their own strengths and weaknesses, so the right matchup chart differs for each player. A character that suits a reckless, all-out attacking playstyle will look better in Will’s matchup chart than Juan’s. Crucially, characters that get no play in the Nash Equilibrium for the overall matchup chart might get play in the Nash Equilibria for individual players. Taking into account that each player is unique can solve the Useless Character Problem.

A simple way to make matchup charts reflect the different game each player experiences is to randomize the matchup chart so that the overall matchup chart is just the average of a bunch of unique individual matrices. We can then find the Nash Equilibrium for each individual’s matrix and average all of these Nash Equilibria to get an estimate of how often each option should be picked in a competitive context. By giving each player a unique matchup chart, we can capture the diversity of the player base.

In simpler terms, we’ll assume that games are made up of a bunch of unique players whose ideal strategic choices may differ, then we’ll see what the overall metagame looks like. The matchup chart is the average of what the game looks like for different players, but it doesn’t necessarily perfectly describe what the game looks like for any individual player.

The mathematical details, briefly

Don’t get too hung up on this if math isn’t your thing.

The matchup chart is converted into a zero sum game matrix by subtracting 50% from each value in the chart, so an even 50-50 matchup has an expected value of zero. We could find the Nash Equilibrium of this game to get an idea of how well balanced the game is for an exactly average player.
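As a concrete sketch (in Python with NumPy, using a made-up three-option chart rather than any real game’s numbers), the conversion is just a subtraction:

import numpy as np

# Hypothetical matchup chart: entry [i, j] is option i's win rate (in %)
# against option j, so the diagonal is all 50s (mirror matches are even).
matchup_chart = np.array([
    [50, 60, 35],
    [40, 50, 65],
    [65, 35, 50],
])

# Subtract 50% so an even matchup has an expected value of zero.
payoff = (matchup_chart - 50) / 100.0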

We then generate randomized versions of this game by randomizing each individual value in the game matrix. Each value is generated from a normal distribution (think Bell curve) where the center of the distribution is the value we just adjusted from the matchup chart and the standard deviation of the distribution (its width) is chosen by us. The bigger the standard deviation, the more individual randomized games will differ from each other.

So if the matchup chart was:

1 0 -1
0 -1 1
-1 1 0

A randomized matchup chart will look like:

normrnd(1,σ)    normrnd(0,σ)    normrnd(-1,σ)
normrnd(0,σ)    normrnd(-1,σ)   normrnd(1,σ)
normrnd(-1,σ)   normrnd(1,σ)    normrnd(0,σ)

Where normrnd(µ,σ) is a randomly generated value from a normal distribution with a mean of µ and a standard deviation of σ.
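Continuing the Python sketch from above, generating one player’s personal matchup chart is a single line of NumPy; normrnd(µ,σ) becomes a call to rng.normal:

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible
sigma = 0.05                         # the standard deviation we choose

# One player's personal chart: the average chart plus normally distributed noise,
# i.e. normrnd(µ, σ) applied to every entry of the payoff matrix.
personal_payoff = payoff + rng.normal(loc=0.0, scale=sigma, size=payoff.shape)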

We then find the Nash Equilibria for each of these randomized games. Because they are all two-player, zero-sum games, every matrix will have a Nash Equilibrium strategy for both players, telling each of them how often they should choose each option. We can average all of these Nash Equilibrium strategies together to get an overall usage rate for each option in the matchup chart.
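Here’s one way those last two steps might look in code, again as a rough Python sketch rather than the exact code linked at the end of the post. The equilibrium of each zero-sum game is found with the textbook linear programming formulation, solved with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

def nash_strategy(payoff):
    # Row player's Nash Equilibrium mix for a zero-sum game, via the classic
    # LP trick: shift all payoffs positive, then minimize sum(x) subject to
    # (A.T @ x) >= 1; normalizing x gives the equilibrium play rates.
    A = np.asarray(payoff, dtype=float)
    A = A - A.min() + 1.0  # shifting every entry doesn't change the strategy
    m, n = A.shape
    result = linprog(c=np.ones(m), A_ub=-A.T, b_ub=-np.ones(n),
                     bounds=[(0, None)] * m)
    x = result.x
    return x / x.sum()

def average_equilibrium(payoff, sigma, n_players=1000, seed=0):
    # Average the equilibrium mixes of many simulated players, each of whom
    # sees their own noisy version of the matchup chart.
    rng = np.random.default_rng(seed)
    mixes = []
    for _ in range(n_players):
        noisy = payoff + rng.normal(loc=0.0, scale=sigma, size=payoff.shape)
        mixes.append(nash_strategy(noisy))
    return np.mean(mixes, axis=0)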

An example: The Super Smash Bros 64 matchup chart

Let’s look at the matchup chart of the classic Nintendo 64 platform fighter, the original Super Smash Bros. With only 12 characters and a clear best character, this game gives us a good opportunity to show off this approach to measuring game balance.

Here’s Super Smash Bros’ matchup chart, from the wiki (each row represents a character’s likelihood of winning each matchup):

Pikachu dominates this game. His worst matchup is against himself (where he obviously has a 50% win rate), and his second worst matchup is against Captain Falcon, which he wins 55% of the time. If this were true for every player who gets good at the game, everyone should pick Pikachu any time they want to win a competitive game. Fortunately for anyone invested in the competitive balance of a 23-year-old game, that’s not what actually happens.

If we add noise, our average Nash Equilibrium starts to suggest a more interesting metagame. With a standard deviation of just .05 (5%), we already see the makings of a much more balanced game. The average Nash Equilibrium (1000 simulations) looks like this:

Pikachu is still the top option, but he should only be played just over half of the time. Even with a fairly small standard deviation, we have a much more diverse metagame. If we use a standard deviation of .1 (10%), every character gets some play, though the worst few characters should be pretty rare.
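If you want to reproduce numbers like these, the call is short. In this sketch, smash64_chart is a placeholder for the wiki’s 12×12 matchup chart (not reproduced here), entered as win percentages with characters in the same order as the chart:

# smash64_chart: the wiki's 12x12 matchup chart as win percentages (placeholder).
payoff = (np.asarray(smash64_chart) - 50) / 100.0

usage_5_percent = average_equilibrium(payoff, sigma=0.05, n_players=1000)
usage_10_percent = average_equilibrium(payoff, sigma=0.10, n_players=1000)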

Now we would just need to see how close to the actual competitive usage stats we can get by adjusting the standard deviation. My impression from a quick perusal of tournament results is that using a standard deviation somewhere between 5% and 10% might get pretty close, though the matchup chart hasn’t been updated since 2011 and an updated one would almost certainly do better. (If someone has competitive usage stats for characters in this game or any other similar game with a roughly agreed upon matchup chart, I’d love to hear from you.)

How different is a game for each player?

Bigger standard deviations represent games differing more from player to player. For instance, a fighting game with very similar characters might be best represented using a small standard deviation, because characters will mostly benefit from the same skillset and experience, so the overall matchup chart will be pretty close to the right matchup chart for each individual player. However, a fighting game with diverse characters with widely differing playstyles would likely be best represented by picking a larger standard deviation.

A larger standard deviation naturally leads to a more diverse metagame. We can see this with the Super Smash Bros. example; increasing the standard deviation leads to a more evenly balanced metagame simply because different players will have different counterplay options to the same strategy. If they’re playing Super Smash Bros., the reactive Juan might be best off choosing Jigglypuff against Captain Falcon, whereas the lightning-fast Will might be best off choosing Fox.

We now have a nice mathematical demonstration of how giving different players options that reward their strengths and cover their weaknesses can lead to a better balanced and more diverse metagame. The more variance among matchup charts for different players, the more evenly balanced the game will tend to be, assuming the average matchup chart stays the same.

Of course, there are tradeoffs to making a game that rewards a wide variety of skillsets. It’s hard, you have to do a lot of work building mechanics that reward different players, and particular matchups might become extremely difficult to balance. The game may also suffer in its appeal to both novice and expert players: learning the game can become harder for new players, and the balance of the game can vary wildly based on skill level. Giving players options that reward different individual strengths and weaknesses can put game designers in a position where they have to choose between making the game balanced for low-level players at the cost of balance for high-level players, or vice versa. Even Age of Empires 2, a ridiculously well-balanced game after over 20 years of updates, still has this last problem: the Goths tend to be overpowered at low skill levels and relatively bad at high skill levels, and the Chinese tend to be the opposite.

Some things to consider about measuring game balance

Try it for yourself!

If you want to use this method to measure the balance of a game or just to tinker around with Super Smash Bros. 64’s matchup chart, the code is here. And if you find it useful or have any questions, I’d love to hear from you.

Top image is from SSC 2022 Grand Finals.
