Level 8: Metrics and Statistics

Readings/Playings

See “additional resources” at the end of this blog post for a number of supplemental readings.

This Week

One of the reasons I love game balance is that different aspects of balance touch all these other areas of game development. When we were talking about pseudorandom numbers, that’s an area where you get dangerously close to programming. Last week we saw how the visual design of a level can be used as a game reward or to express progression to the player, which is game design but just this side of art. This week, we walk right up to the line where game design intersects business.

This week I’m covering two topics: statistics, and metrics. For anyone who isn’t familiar with what these mean, ‘metrics’ just means measurements, so it means you’re actually measuring or tracking something about your game; leaderboards and high score lists are probably the best-known metrics because they are exposed to the players, but we also can use a lot of metrics behind the scenes to help design our games better. Once we collect a lot of metrics, once we take these measurements, they don’t do anything on their own until we actually look at them and analyze them to learn something. ‘Statistics’ is just one set of tools we can use to get useful information from our metrics. Even though we collect metrics first and then use statistics to analyze them, I’m actually going to talk about statistics first because it’s useful to know how your tools work before you decide what data to capture.

Statistics

People who have never done statistics before think of it as an exact science. It’s math, math is pure, and therefore you should be able to get all of the right answers all the time. In reality, it’s a lot messier, and you’ll see that game designers (and statisticians) disagree about the core principles of statistics even more than they disagree about the core principles of systems design, if such a thing is possible.

What is statistics, and how is it different from probability?

In probability, you’re given a set of random things, and told exactly how random they are and what the nature of that randomness is, and your goal is to try to predict what the data will look like when you set those random things in motion. Statistics is kind of the opposite: here you’re given the data up front, and you’re trying to figure out the nature of the randomness that caused that data.

Probability and statistics share one important thing in common: neither one is guaranteed. Probability can tell you there’s a 1/6 chance of rolling a given number on 1d6, but it does not tell you what the actual number will be when you roll the die for real. Likewise, statistics can tell you from a bunch of die rolls that there is probably a uniform distribution, and that you’re 95% sure, but there’s a 5% chance that you’re wrong. That chance never goes to zero.

Statistical Tools

This isn’t a graduate-level course in statistical analysis, so all I’ll say is that there are many more tools out there, and they’re outside the scope of this course. What I’m going to put down here is the bare minimum I think every game designer should know to be useful when analyzing metrics in their games.

Mean: when someone asks for the “average” of something, they’re probably talking about the mean average (there are two other kinds of average that I know of, and probably a few more that I don’t). To get the mean of a bunch of values, you add them all up and then divide by the number of values. This is sort of like “expected value” in probability, except that you’re computing it based on real-world die-rolls and not a theoretically balanced set of die-rolls. Calculating the mean is incredibly useful; it tells you what the ballpark expected value is of something in your game. You can think of the mean as a Monte Carlo calculation of expected value, except you’re using real-world playtest data rather than a computer simulation.

Median: this is another kind of average. To calculate it, take all your values and sort them from smallest to largest, then pick the one in the center. So, if you have five values, the third one is the median. (If you have an even number of values so that there are two in the middle rather than one, you’re supposed to take the mean of those, in case you’re curious.) On its own, the median isn’t all that useful, but it tells you a lot when you compare it with the mean, about whether your values are all weighted to one side, or if they’re basically symmetric. For example, in the US, the median household income is a lot lower than the mean, which basically means we’ve got a lot of people making a little, and a few people making these ridiculously huge incomes that push up the mean. In a classroom, if the median is lower than the mean, it means most of the students are struggling and one or two brainiacs are wrecking the curve (although more often it’s the other way around, where most students are clustered around 75 or 80 and then you’ve got some lazy kid who’s getting a zero which pulls down the mean a lot). If you’re making a game with a scoreboard of some kind and you see a median that’s a lot lower than the mean, it probably means you’ve got a small minority of players that are just obscenely good at the game and getting these massive scores, while everyone else who is just a mere mortal is closer to the median.

Standard deviation: this is just geeky enough to make you sound like you’re good at math if you use it in normal conversation. You calculate it by taking each of your data points, subtracting the mean from it, squaring the result (that is, multiplying the result by itself), adding all of those squares together, dividing by the total number of data points, then taking the square root of the whole thing. For reasons that you don’t really need to know, going through this process gives you a number that represents how spread out the data is. As a rule of thumb (one that works best when the data is roughly bell-shaped), about two-thirds of your data is within a single standard deviation of the mean, and nearly all of your data is within two standard deviations, so whether an SD counts as “big” is relative to the size of your mean. A mean of 50 with an SD of 25 looks a lot more spread out than a mean of 5000 with an SD of 25. A relatively large SD means your data is all over the place, while a really small SD means your data is all clustered together.
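
If you’d rather let code do the arithmetic, here’s a minimal sketch in Python; the playtest data is made up for illustration, and the standard library does all three calculations for you:

    import statistics

    # Made-up playtest data: minutes each tester took to finish a level
    level_times = [4.5, 5.0, 5.5, 6.0, 6.0, 6.5, 7.0, 2.0, 12.0]

    mean = statistics.mean(level_times)      # add them all up, divide by the count
    median = statistics.median(level_times)  # sort, take the middle value
    sd = statistics.pstdev(level_times)      # population SD: divides by N, as described above

    print(f"mean={mean:.2f}, median={median:.2f}, sd={sd:.2f}")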

Examples

To give you an example, let’s consider two random variables: 2d6, and 1d11+1. Like we talked about in the week on probability, both of these will give you a number from 2 to 12. But they have a very different nature; the 2d6 clusters around the center, while the 1d11+1 is spread out among all outcomes evenly. Now, statistics operates on actual data rather than theoretical distributions, so let’s just assume that I happen to roll the 2d6 thirty-six times and get one of each result, and I roll the 1d11+1 eleven times and get one of each result… which is wildly unlikely, but it does allow us to use statistical tools to analyze probability.

The mean of both of these is 7, which means if you’re trying to balance either of these numbers in your game, you can use 7 as the expected value. What about the range? The median is also 7 for both, which means you’re just as likely to be above or below the mean; that makes sense, because both of these distributions are symmetric. However, you’ll see the standard deviations are a lot different: for 2d6, the SD is about 2.4, meaning that most of the time you’ll get a result in between 5 and 9; for 1d11+1, the SD is about 3.2, so you’ll get about as many rolls in the 4 to 10 range here, as you did in the 5 to 9 range for 2d6. Which doesn’t actually sound like that big a deal, until you start rolling.
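
You can check these numbers yourself. Here’s a quick Python sketch that enumerates every equally likely outcome of each roll (the same “one of each result” assumption as above) and runs all three tools on them:

    import statistics
    from itertools import product

    rolls_2d6 = [a + b for a, b in product(range(1, 7), repeat=2)]  # all 36 outcomes
    rolls_1d11_plus_1 = [n + 1 for n in range(1, 12)]               # all 11 outcomes

    for name, rolls in (("2d6", rolls_2d6), ("1d11+1", rolls_1d11_plus_1)):
        print(name,
              "mean:", statistics.mean(rolls),        # 7 for both
              "median:", statistics.median(rolls),    # 7 for both
              "SD: %.2f" % statistics.pstdev(rolls))  # 2.42 versus 3.16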

As a different example, maybe you’re looking at the time it takes playtesters to get through your first tutorial level in a video game you’re designing. Your target is that it should take about 5 minutes. You measure the mean at 5 minutes, median at 6 minutes, standard deviation at 2 minutes. What does that tell us? Most people take between 3 and 7 minutes, which might be good or bad depending on just how much of the level is under player control; in a lot of games the tutorial is meant to be a pretty standardized, linear experience, so this would actually feel like a pretty huge range. The other cause for concern is the high median, which suggests most people actually take longer than 5 minutes; you just have a few people who get through the level really fast, and they bring down the mean. This is good news in that you know you’re not having anyone taking four hours to complete it or whatever (otherwise the mean would be a lot higher than the median instead!), but it’s potentially bad news in that some players might have found an unintentional shortcut or exploit, or else they’re just skipping through all your intro dialogue (which is going to get them stuck and frustrated in level 2), or something else entirely.

This suggests another lesson: statistics can tell us that something is happening, but it can’t tell us why, and sometimes there are multiple explanations for the why. This is one area where statistics is often misused or flat out abused, by finding one logical explanation for the numbers and ignoring that there could be other explanations as well. In this case, we have no way of knowing why the median is longer than the mean, or its implications for game design… but we could spend some time thinking about all the possible answers, and then we could collect more data that would help us differentiate between them. For example, if one fear is that players are skipping through the intro dialogue, we could actually measure the time spent reading dialogues in addition to the total level time. We’ll come back to this concept of metrics design later today.

There’s also a third lesson here: I didn’t tell you how many playtesters it took to get this data! The more tests you have, the more accurate your final analysis will be. If you only have three tests, these numbers are pretty meaningless for predicting general trends. If there were a few thousand tests, that’s a lot better. (How many tests are required to make sure your analysis is good enough? Depends what “good enough” means to you. The more you have, the more sure you can be, but it’s never actually 100% no matter how many tests you do. People who do this for a living use “confidence intervals,” where they’ll give you a range of values and then say something like they’re 95% sure that the actual mean in reality is within such-and-such a range. This is a lot more detail than most of us need for our day-to-day design work.)
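
For the curious, here’s roughly what that confidence-interval calculation looks like, sketched in Python. This uses the common normal approximation with z = 1.96 for 95% confidence; a real statistician would reach for a t-distribution with small samples, which is exactly the kind of detail I’m waving my hands over:

    import math
    import statistics

    def mean_confidence_interval(data, z=1.96):  # z = 1.96 covers about 95%
        m = statistics.mean(data)
        # Standard error: the sample SD shrinks with the square root of the
        # sample size, which is why more playtests make you more confident.
        se = statistics.stdev(data) / math.sqrt(len(data))
        return (m - z * se, m + z * se)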

Outliers

When you have a set of data with some small number of points that are way above or below the mean, the name for those is outliers (pronounced like the words “out” and “liars”). Since these tend to throw off your mean a lot more than the median, if you see the mean and median differing by a lot it’s probably because of an outlier.

When you’re doing a statistical analysis, you might wonder what to do with the outliers. Do you include them? Do you ignore them? Do you put them in their own special group? As with most things, it depends.

If you’re just looking for normal, usual play patterns, it is generally better to discard the outliers because by definition, those are not arising from normal play. If you’re looking for edge cases then you want to leave them in and pay close attention; for example, if you’re trying to analyze the scores people get so you know how to display them on the leaderboards, realize that your top-score list is going to be dominated by outliers at the top.

In either case, if you have any outliers, it is usually worth investigating further to figure out what happened. Going back to our earlier example of level play times, if most players take 5 to 7 minutes to complete your tutorial but you notice a small minority of players that get through in 1 or 2 minutes, that suggests those players may have found some kind of shortcut or exploit, and you want to figure out what happened. If most players take 5 to 7 minutes and you have one player that took 30 minutes, that is probably because the player put it on pause or had to walk away for a while, or they were just having so much fun playing around in the sandbox that they didn’t care about advancing to the next level or whatever, and you can probably ignore that if it’s just one person. But if it’s three or four people (still in the vast minority) who did that, you might investigate further, because there might be some small number of people who are running into problems… or players who find one aspect of your tutorial really fun, which is good to know as you’re designing the other levels.
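
One common quick-and-dirty way to flag outliers is to look for anything more than a couple of standard deviations from the mean. Here’s a sketch with made-up tutorial times; note the caveat in the comments, because this method is far from perfect:

    import statistics

    def split_outliers(values, num_sds=2.0):
        m = statistics.mean(values)
        sd = statistics.pstdev(values)
        normal = [v for v in values if abs(v - m) <= num_sds * sd]
        outliers = [v for v in values if abs(v - m) > num_sds * sd]
        return normal, outliers

    # Two speed-runners, one 30-minute AFK, everyone else around 5-7 minutes:
    times = [5, 6, 6, 7, 5, 6, 7, 1.5, 2, 30]
    normal, weird = split_outliers(times)  # weird == [30]
    # Caveat: the 30 inflates the SD so much that the 1.5- and 2-minute
    # speed-runs slip through the filter. There's no substitute for looking
    # at your data and investigating the strange points by hand.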

Population samples

Here’s another way statistics can go horribly wrong: it all comes down to what and who you’re sampling.

I already mentioned one frequent problem, which is not having a large enough sample. The more data points you have, the better. I’ll give you an example: back when I played Magic: the Gathering regularly, this one time I put together a tournament deck for a friend, for a tournament that I couldn’t play in but they could. To tell if I had the right ratio of land to spells, I shuffled and dealt an opening hand and played a few mock turns to see if I was getting enough. I’d do this a bunch of times going through most of the deck, then I’d take some land out or put some in depending on how many times I had too much or too little, and then I’d reshuffle and do it again. At the time I figured this was a pretty good quick-and-dirty way to figure out how much land I needed. But as it happened, the land was very evenly distributed rather than clustered, so most of the time it seemed like I was doing okay by the end… but I never actually stopped to count. After the tournament, which my friend lost badly, they reported to me that they were consistently not drawing enough land, and when we actually went through the deck and counted, there were only 16 lands in a deck of 60 cards! I took a lot of flak from my friend for that, and rightly so. The real problem here was that I was trying to analyze the number of lands through statistical methods, but my sample size was way too small to draw any meaningful conclusions.

Here’s another example: suppose you’re making a game aimed at the casual market. You have everyone on the development team play through the game to get some baseline data on how long it takes to play through each level and how challenging each level is. Problem: the people making the game are probably not casual gamers, so this is not really a representative sample of your target market. I’m sure this has happened somewhere, at some point.

A more recent example: in True Crime: Hong Kong, publisher Activision allegedly demanded that the developers change the main character from female to male, because their focus group said they preferred a male protagonist. The problem: the focus group was allegedly made up entirely of males, or else the questions were inherently biased by the person setting up the group as a deliberate attempt to further an agenda, rather than to find out the real-world truth. Activision denies all of this, of course, but that hasn’t stopped it from being the subject of many industry conversations… not just about the role of women in games, but about the use of focus groups and statistics in game design. You also see things like this happening in the rest of the world, particularly in governmental politics, where a lot of people have their own personal agenda and are willing to warp a study and use statistics as a way of proving their point.

Basically, when you’re collecting playtest data, you want to do your best to recruit playtesters who are as similar as possible to your target market, and you want to have as many playtests as possible so that the random noise gets filtered out. Your analysis is only as good as your data!

Even if you use statistics “honestly,” there are still problems every game designer runs into, depending on the type of game.

  • For video games, you are at the mercy of your programmers, and there’s nothing you can do about that. The programmers are the ones who need to spend time coding the metrics you ask for. Programming time is always limited, so at some point you’ll have to make the call between having your programming team implement metrics collection… or having them implement, you know, the actual game mechanics you’ve designed. And that’s if the decision isn’t made for you by your producer or your publisher. This is easier in some companies than others, but in some places “metrics” falls into the same category as audio, and localization, and playtesting: tasks that are pushed off towards the end of the development cycle until it’s too late to do anything useful.
  • For tabletop games, you are at the mercy of your playtesters. The more data points you collect, the better, of course. But in reality, a video game company can release an early beta and get hundreds or thousands of plays, while you might realistically be able to do a fraction of that with in-person tabletop tests. With a smaller sample, your playtest data is a lot more suspect.
  • For any kind of game, you need to be very clear ahead of time what it is you need measured, and in what level of detail. If you run a few hundred playtests and only find out afterwards that you need to actually collect certain data from the game state that you weren’t collecting before, you’ll have to do those tests over again. The only thing to do about this is to recognize that just like design itself, playtesting with metrics is an iterative process, and you need to build that into your schedule.
  • Also for any kind of game, you need to remember that it’s very easy to mess things up accidentally and get the wrong answer, just like probability. Unlike probability, there aren’t as many sanity checks to make the wrong numbers look wrong, since by definition you don’t always know exactly what you’re looking for or what you expect the answer to be. So you need to proceed with caution, and use every method you can find of independently verifying your numbers. It also helps if you try to envision in advance what the likely outcomes of your analysis might be, and what they’ll look like.

Correlation and causality

Finally, one of the most common errors with statistics happens when you notice some kind of correlation between two things. “Correlation” just means that when one thing goes up, another thing tends to go up (a positive correlation) or down (a negative correlation) at the same time. Recognizing correlations is useful, but a lot of the time people assume that because two things are correlated, one must cause the other, and that is something you cannot tell from statistics alone.

Let’s take an example. Say you notice when playing Puerto Rico that there’s a strong positive correlation between winning, and buying the Factory building; say, out of 100 games, in 95 of them the winner bought a Factory. The natural assumption is that the Factory must be overpowered, and that it’s causing you to win. But you can’t draw this conclusion by default, without additional information. Here are some other equally valid conclusions, based only on this data:

  • Maybe it’s the other way around, that winning causes the player to buy a Factory. That sounds odd, but maybe the idea is that a Factory helps the player who is already winning, so it’s not that the Factory is causing the win, it’s that being strongly in the lead causes the player to buy a Factory for some reason.
  • Or, it could be that something else is causing a player both to win and to buy a Factory. Maybe some early-game purchase sets the player up for buying the Factory, and that early-game purchase also helps the player to win, so the Factory is just a symptom and not the root cause.
  • Or, the two could actually be uncorrelated, and your sample size just isn’t large enough for the Law of Large Numbers to really kick in. We actually see this all the time in popular culture, where two things that obviously have no relation are found to be correlated anyway, like the Redskins football game predicting the next Presidential election in the US, or an octopus that predicts the World Cup winner, or a groundhog seeing its shadow supposedly predicting the remaining length of Winter. As we learned when looking at probability, if you take a lot of random things you’ll be able to see patterns; one thing is that you can expect to see unlikely-looking streaks, but another is that if you take a bunch of sets of data, some of them will probably be randomly correlated. If you don’t believe me, try rolling two separate dice a few times and then computing the correlation between those numbers; I bet it’s not zero!
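
If you want to try that two-dice experiment without digging out physical dice, here’s a sketch in Python, with Pearson’s correlation coefficient computed by hand so you can see the moving parts:

    import random

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    die_a = [random.randint(1, 6) for _ in range(10)]  # two dice with no
    die_b = [random.randint(1, 6) for _ in range(10)]  # relationship at all
    print(pearson(die_a, die_b))  # ...and yet it's almost never exactly zero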

Statistics in Excel

Here’s the good news: while there are a lot of math formulas here, you don’t actually need to know any of them. Excel will do this for you; it has all of these formulas built in. Here are a few useful ones:

  • AVERAGE: given a range of cells, this calculates the mean. You could also take the SUM of the cells and then divide by the number of cells, but AVERAGE is easier.
  • MEDIAN: given a range of cells, this calculates the median, as you might guess.
  • STDEV: given a range of cells, this gives you the standard deviation.
  • CORREL: you give this two ranges of cells, not one, and it gives you the correlation between the two sets of data. For example, you could have one column with a list of final game scores, and another column with a list of scores at the end of the first turn, to see if early-game performance is any kind of indicator of the final game result (if so, this might suggest a positive feedback loop in the game somewhere). The number Excel gives you from the CORREL function ranges from -1 (perfect negative correlation) through 0 (uncorrelated) to +1 (perfect positive correlation).

Is there any good news?

At this point I’ve spent so much time talking about how statistics are misused, that you might be wondering if they’re actually useful for anything. And the answer is, yes. If you have a question that can’t be answered with intuition alone, and it can’t be answered just through the math of your cost or progression curves, statistics let you draw useful conclusions… if you ask the right questions, and if you collect the right data.

Here’s an example of a time when statistics really helped a game I was working on. I worked for a company that made this online game, and we found that our online population was falling and people weren’t playing as many games, because we hadn’t released an update in a while. (That part was expected. With no updates, I’ve found that an online game loses about half of its core population every 6 months or so; at least, that was my experience.)

But what we didn’t expect was that one of our programmers got bored one day and made a trivia bot: just this little script that would log into our server with its own player account, send a trivia question every couple of minutes, and then parse the incoming public chat to see if anyone said the right answer. And it was popular, as goofy and stupid and simple as it was, because it was such a short, immediate, casual experience.

Now, the big question is: what happened to the player population, and what happened to the actual, real game that players were supposed to be playing (you know, the one where they would log in to the chat room to find someone to challenge, before they got distracted by the trivia bot)?

Some players loved the trivia bot. It gave them something to do in between games. Others hated the trivia bot; they claimed that it was harder to find a game, because everyone who was logged in was too busy answering dumb trivia questions to actually play a real game. Who was right? Intuition failed, because everyone’s intuition was different. Listening to the players failed, because the vocal minority of the player base was polarized, and there was no way to poll those who weren’t in the vocal minority. Math failed, because the trivia bot wasn’t part of the game, let alone part of the cost curve. Could we answer this with statistics? We sure could, and we did!

This was simple enough that it didn’t even require much analysis. Measure the total number of logins per day. Measure the total number of actual games played. Since our server already tracked every player login, logout and game start, we had this data; all we had to do was some very simple analysis, tracking how these things changed over time. As expected, the numbers had all been falling gradually since the last real release, but the trivia bot actually caused a noticeable increase in both total logins and number of games played. It turned out that players were logging in to play with the trivia bot, but as long as they were there, they were also playing games with each other! That was a conclusion that would have been impossible to reach in any kind of definitive way without analysis of the hard data. And it taught us something really important about online games: more players online, interacting with each other, is better… even if they’re interacting in nonstandard ways.
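
To give you a sense of how little “analysis” this took: if your server log is a list of timestamped events, the whole job is a sketch like this (the event names are hypothetical):

    from collections import Counter

    def daily_counts(events, event_type):
        """events: iterable of (timestamp, event_type) pairs from the server log."""
        return Counter(ts.date() for ts, etype in events if etype == event_type)

    # logins_per_day = daily_counts(events, "login")
    # games_per_day  = daily_counts(events, "game_start")
    # Then plot both counts over time and look at what happens around the
    # day the trivia bot went live.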

Metrics

Here’s a common pattern in artistic and creative fields, particularly ones like archaeology or art preservation or psychology or medicine, where the work requires a certain amount of intuition but there is still a “right answer” or “best way” to do things. The progression goes something like this:

  1. Practitioners see their field as a “soft science”; they don’t know a whole lot about best principles or practices. They do learn how things work, eventually, but it’s mostly through trial and error.
  2. Someone creates a technology that seems to solve a lot of these problems algorithmically. Practitioners rejoice. Finally, we’re a hard science! No more guesswork! Most younger practitioners abandon the “old ways” and embrace “science” as a way to solve all their field’s problems. The old guard, meanwhile, sees it as a threat to how they’ve always done things, and eyes it skeptically.
  3. The limitations of the technology become apparent after much use. Practitioners realize that there is still a mysterious, touchy-feely element to what they do, and that while some day the tech might answer everything, that day is a lot farther off than it first appeared. Widespread disillusionment occurs as people no longer want to trust their instincts because theoretically technology can do it better, but people don’t want to trust the current technology because it doesn’t work that great yet. The young turks acknowledge that this wasn’t the panacea they thought; the old guard acknowledge that it’s still a lot more useful than they assumed at first. Everyone kisses and makes up.
  4. Eventually, people settle into a pattern where they learn what parts can be done by computer algorithms, and what parts need an actual creative human thinking, and the field becomes stronger as the best parts of each get combined. But figuring out which parts go best with humans and which parts are best left to computers is a learning process that takes a while.

Currently, game design seems to be just starting Step 2. We’re hearing more and more people anecdotally saying why metrics and statistical analysis saved their company. We hear about MMOs that are able to solve their game balance problems by looking at player patterns, before the players themselves learn enough to exploit them. We hear of Zynga changing a font color from red to pink and generating dramatically more click-throughs from players trying out other games. We have entire companies that have sprung up solely to help game developers capture and analyze their metrics. The industry is falling in love with metrics, and I’ll go on record predicting that at least one company that relies entirely on metrics-driven design will fail, badly, by the time this whole thing shakes out, because they will be looking so hard at the numbers that they’ll forget that there are actually human players out there who are trying to have fun in a way that can’t really be measured directly. Or maybe not. I’ve been wrong before.

At any rate, right now there seem to be three schools of thought on the use of metrics:

  • The Zynga model: design almost exclusively by metrics. Love it or hate it, 60 million monthly active unique players laugh at your feeble intuition-based design.
  • Rebellion against the Zynga model: metrics are easy to misunderstand, easy to manipulate, and are therefore dangerous and do more harm than good. If you measure player activity and find out that more players use the login screen than any other in-game action, that doesn’t mean you should add more login screens to your game out of some preconceived notion that if a player does it, it’s fun. If you design using metrics, you push yourself into designing the kinds of games that can be designed solely by metrics, which pushes you away from a lot of really interesting video game genres.
  • The moderate road: metrics have their uses, they help you tune your game to find local “peaks” of joy. They help you take a good game and make it just a little bit better, by helping you explore the nearby design space. However, intuition also has its uses; sometimes you need to take broad leaps in unexplored territory to find the global “peaks,” and metrics alone will not get you there, because sometimes you have to make a game a little worse in one way before it gets a lot better in another, and metrics won’t ever let you do that.

Think about it for a bit and decide where you stand, personally, as a designer. What about the people you work with on a team (if you work with others on a team)?

How much to measure?

Suppose you want to take some metrics in your game so you can go back and do statistical analysis to improve your game balance. What metrics do you actually take – that is, what exactly do you measure?

There are two schools of thought that I’ve seen. One is to record anything and everything you can think of, log it all, mine it later. The idea is that you’d rather collect too much information and not use it, than to not collect a piece of critical info and then have to re-do all your tests.

Another school of thought is that “record everything” is fine in theory, but in practice you either have an overwhelming amount of extraneous information, in which the needle of useful insight is buried in a haystack of noise; or, potentially worse, you mine the heck out of this data mountain to the point where you’re finding all kinds of correlations and relationships that don’t actually exist. By this way of thinking, you should instead figure out ahead of time what you’re going to need for your next playtest, measure that and only that, and that way you don’t get confused by looking at the wrong stuff in the wrong way later on.

Again, think about where you stand on the issue.

Personally, I think a lot depends on what resources you have. If it’s you and a few friends making a small commercial game in Flash, you probably don’t have time to do much in the way of intensive data mining, so you’re better off just figuring out the useful information you need ahead of time, and adding more metrics later if a new question occurs to you that requires data you aren’t tracking yet. If you’re at a large company with an army of actuarial statisticians with nothing better to do than find data correlations all day, then sure, go nuts with data collection and you’ll probably find all kinds of interesting things you’d never have thought of otherwise.

What specific things do you measure?

That’s all well and good, but whether you say “just get what we need” or “collect everything we can,” neither of those is an actual design. At some point you need to specify what, exactly, you need to measure.

Like game design itself, metrics is a second-order problem. Most of the things that you want to know about your game, you can’t actually measure directly, so instead you have to figure out some kind of thing that you can measure that correlates strongly with what you’re actually trying to learn.

Example: measuring fun

Let’s take an example. In a single-player Flash game, you might want to know if the game is fun or not, but there’s no way to measure fun. What correlates with fun, that you can measure? One thing might be if players continue to play for a long time, or if they spend enough time playing to finish the game and unlock all the achievements, or if they come back to play multiple sessions (especially if they replay even after they’ve “won”), and these are all things you can measure. Now, keep in mind this isn’t a perfect correlation; players might be coming back to your game for some other reason, like if you’ve put in a crop-withering mechanic that punishes them if they don’t return, or something. But at least we can assume that if a player keeps playing, there’s probably at least some reason, and that is useful information. More to the point, if lots of players stop playing your game at a certain point and don’t come back, that tells us that point in the game is probably not enjoyable and may be driving players away. (Or if the point where they stopped playing was the end, maybe they found it incredibly enjoyable but they beat the game and now they’re done, and you didn’t give a reason to continue playing after that. So it all depends on when.)

Player usage patterns are a big deal, because whether people play, how often they play, and how long they play are (hopefully) correlated with how much they like the game. For games that require players to come back on a regular basis (like your typical Facebook game), the two buzzwords you hear a lot are Monthly Active Uniques and Daily Active Uniques (MAU and DAU). The “Active” part of that is important, because it makes sure you don’t overinflate your numbers by counting a bunch of old, dormant accounts belonging to people who stopped playing. The “Unique” part is also important, since one obsessive guy who checks FarmVille ten times a day doesn’t mean he counts as ten users. Now, normally you’d think Monthly and Daily should be equivalent, just multiply Daily by 30 or so to get Monthly, but in reality the two will be different based on how quickly your players burn out (that is, how much overlap there is between different sets of daily users). So if you divide MAU/DAU, that tells you something about how many of your players are new and how many are repeat customers.

For example, suppose you have a really sticky game with a small player base, so you only have 100 players, but those players all log in at least once per day. Here your MAU is going to be 100, and your average DAU is also going to be 100, so your MAU/DAU is 1. Now, suppose instead that you have a game that people play once and never again, but your marketing is good, so you get 100 new players every day but they never come back. Here your average DAU is still going to be 100, but your MAU is around 3000, so your MAU/DAU is about 30 in this case. So that’s the range, MAU/DAU goes between 1 (for a game where every player is extremely loyal) to 28, 30 or 31 depending on the month (representing a game where no one ever plays more than once).

A word of warning: a lot of metrics, like the ones Facebook provides, might use different ways of computing these numbers so that one set of numbers isn’t comparable to another. For example, I saw one website that listed the “worst” MAU/DAU ratio in the top 100 applications as 33-point-something, which should be flatly impossible, so clearly the numbers somewhere are being messed with (maybe they took the Dailies from a different range of dates than the Monthlies or something). And then some people compute this as a %, meaning on average, what percentage of your player pool logs in on a given day, which should range from a minimum of about 3.33% (1/30 of your monthly players logging in each day) to 100% (all of your monthly players log in every single day). This is computed by taking DAU/MAU (instead of MAU/DAU) and multiplying by 100 to get a percentage. So if you see any numbers like this from analytics websites, make sure you’re clear on how they’re computing the numbers so you’re not comparing apples to oranges.
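
If you’re computing this yourself from raw login data rather than trusting an analytics site, the calculation is simple enough to sketch; this assumes your log is a list of (date, player_id) pairs covering one calendar month:

    from collections import defaultdict

    def mau_dau_ratio(logins):
        """logins: iterable of (date, player_id) pairs for one calendar month."""
        players_by_day = defaultdict(set)
        all_players = set()
        for day, pid in logins:
            players_by_day[day].add(pid)  # "unique": a player counts once per day
            all_players.add(pid)          # ...and once per month
        avg_dau = sum(len(s) for s in players_by_day.values()) / len(players_by_day)
        return len(all_players) / avg_dau  # ~1 = everyone plays daily; ~30 = no one returns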

Why is it important to know this number? For one thing, if a lot of your players keep coming back, it probably means you’ve got a good game. For another, it means you’re more likely to make money on the game, because you’ve got the same people stopping by every day… sort of like how if you operate a brick-and-mortar storefront, an individual who just drops in to window-shop may not buy anything, but if that same individual comes in and is “just looking” every single day, they’re probably going to buy something from you eventually.

Another metric that’s used a lot, particularly on Flash game portals, is to go ahead and ask the players themselves to rate the game (often in the form of a 5-star rating system). In theory, we would hope that higher ratings mean a better game. In theory, we’d also expect that a game with high player ratings would also have a good MAU/DAU ratio, that is, that the two would be correlated. I don’t know of any actual studies that have checked this, though I’d be interested to see the results, but if I had to guess I’d assume that there is some correlation but not a lot. Users who give ratings are not a representative sample; for one thing, they tend to have strong opinions or else they wouldn’t bother rating (seriously, I always had to wonder about those opinion polls that would say something like 2% of poll respondents said they had no opinion… like, who calls up a paid opinion poll phone line just to say they have no opinion?), so while actual quality probably falls along a bell curve, you tend to have more 5-star and 1-star ratings than 3-star, which is not what you’d expect if everyone rated the game fairly. Also, there’s the question of whether player opinion is more or less meaningful than actual play patterns; if a player logs into a game every day for months on end but rates it 1 out of 5 stars, what does that mean? Or if a player admits they haven’t even played the game, but they’re still giving it 4 out of 5 stars based on… I don’t know… its reputation or something? Also, players tend not to rate a game while they’re actively playing, only (usually) after they’re done, which probably skews the ratings a bit (depending on why they stopped playing). So it’s probably better to pay attention to usage patterns than player reporting, especially if that reporting isn’t done from within the game itself in a way that you can track.

Now, I’ve been talking about video games, in fact most of this is specific to online games. The equivalent in tabletop games is a little fuzzier, but as the designer you basically want to be watching people’s facial expressions and posture to see where in the game they’re engaged and where they’re bored or frustrated. You can track how these correlate to certain game events or board positions. Again, you can try to rely on interviews with players, but that’s dangerous because player memory of these things is not good (and even if it is, not every playtester will be completely honest with you). For video games that are not online, you can still capture metrics based on player usage patterns, but actually uploading them anywhere is something you want to be very clear to your players about, because of privacy concerns.

Another example: measuring difficulty

Player difficulty, like fun, is another thing that’s basically impossible to measure directly, but what you can measure is progression, and failure to progress. Measures of progression are going to be different depending on your game.

For a game that presents skill-based challenges like a retro arcade game, you can measure things like how long it takes the player to clear each level, how many times they lose a life on each level, and importantly, where and how they lose a life. Collecting this information makes it really easy to see where your hardest points are, and if there are any unintentional spikes in your difficulty curve. I understand that Valve does this for their FPS games, and that they actually have a visualizer tool that will not only display all of this information, but actually plot it overlaid on a map of the level, so you can see where player deaths are clustered. Interestingly, starting with Half-Life 2 Episode 2 they actually have live reporting and uploading from players to their servers, and they have displayed their metrics on a public page (which probably helps with the aforementioned privacy concerns, because players can see for themselves exactly what is being uploaded and how it’s being used).
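
I don’t know what Valve’s tool looks like on the inside, but the data side of a death heatmap is simple enough to sketch: log an (x, y) position for every player death, bucket the positions into grid cells, and count. Something like:

    from collections import Counter

    def death_heatmap(death_positions, cell_size=64):
        """death_positions: (x, y) coordinates logged at each player death."""
        heat = Counter()
        for x, y in death_positions:
            heat[(x // cell_size, y // cell_size)] += 1
        return heat  # draw each cell over the level map; darker = more deaths

    hot_spots = death_heatmap([(100, 200), (110, 210), (900, 50)]).most_common()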

Yet another example: measuring game balance

What if instead you want to know if your game is fair and balanced? That’s not something you can measure directly either. However, you can track just about any number attached to any player, action or object in the game, and this can tell you a lot about both normal play patterns, and also the relative balance of strategies, objects, and anything else.

For example, suppose you have a strategy game where each player can take one of four different actions each turn, and you have a way of numerically tracking each player’s standing. You could record each turn, what action each player takes, and how it affects their respective standing in the game.

Or, suppose you have a CCG where players build their own decks, or a Fighting game where each player chooses a fighter, or an RTS where players choose a faction, or an MMO or tabletop RPG where players choose a race/class combination. Two things you can track here are which choices seem to be the most and least popular, and also which choices seem to have the highest correlation with actually winning. Note that this is not always the same thing; sometimes the big, flashy, cool-looking thing that everyone likes because it’s impressive and easy to use is still easily defeated by a sufficiently skilled player who uses a less well-known strategy. Sometimes, dominant strategies take months or even years to emerge through tens of thousands of games played; the Necropotence card in Magic: the Gathering saw almost no play for six months or so after release, until some top players figured out how to use it, because it had this really complicated and obscure set of effects… but once people started experimenting with it, they found it to be one of the most powerful cards ever made. So, both popularity and correlation with winning are two useful metrics here.
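
Here’s a sketch of tracking both of those at once, assuming you log one (choice, won) record per player per game; the data format is made up, but the point is that pick rate and win rate are computed separately:

    from collections import Counter

    def popularity_and_win_rate(records):
        records = list(records)  # (choice, won) pairs, one per player per game
        picks = Counter(choice for choice, _ in records)
        wins = Counter(choice for choice, won in records if won)
        return {choice: {"pick_rate": picks[choice] / len(records),
                         "win_rate": wins[choice] / picks[choice]}
                for choice in picks}

    # A high pick rate with a mediocre win rate: flashy but beatable.
    # A low pick rate with a high win rate: a sleeper, like early Necropotence.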

If a particular game object sees a lot more use than you expected, that can certainly signal a potential game balance issue. It may also mean that this one thing is just a lot more compelling to your target audience for whatever reason – for example, in a high fantasy game, you might be surprised to find more players creating Elves than Humans, regardless of balance issues… or maybe you wouldn’t be that surprised. Popularity can be a sign in some games that a certain play style is really fun compared to the others, and you can sometimes migrate that into other characters or classes or cards or what have you in order to make the game overall more fun.

If a game object sees less use than expected, again that can mean it’s underpowered or overcosted. It might also mean that it’s just not very fun to use, even if it’s effective. Or it might mean it is too complicated to use, it has a high learning curve relative to the rest of the game, and so players aren’t experimenting with it right away (which can be really dangerous if you’re relying on playtesters to actually, you know, playtest, if they leave some of your things alone and don’t play with them).

Metrics have other applications besides game objects. For example, one really useful area is in measuring beginning asymmetries, a common one being the first-player advantage (or disadvantage). Collect a bunch of data on seating arrangements versus end results. This happens a lot with professional games and sports; for example, I think statisticians have calculated the home-field advantage in American Football to be about 2.3 points, and depending on where you play, the first-move advantage in Go is valued at 6.5 or 7.5 points (the half point is there to prevent tie games). Statistics from Settlers of Catan tournaments have shown a very slight advantage to playing second in a four-player game, on the order of a few hundredths of a percent; normally we could discard that as random variation, but the sheer number of games that have been played gives the numbers some weight.

One last example: measuring money

If you’re actually trying to make money by selling your game, in whole or part, then at the end of the day this is one of your most important considerations. For some people it’s the most important consideration: they’d rather have a game that makes lots of money but isn’t fun or interesting at all, than a game that’s brilliant and innovative and fun and wonderful but is a “sleeper hit” which is just a nice way of saying it bombed in the market but didn’t deserve to. Other game designers would rather make the game fun first, so one thing for each of you to consider is, personally, which side of the fence you’re on… because if you don’t know that about yourself, someone else is going to make the call for you some day.

At any rate, money is something that just about every commercial game should care about in some capacity, so it’s something that’s worth tracking. Those sales tell you something related to how good a job you did with the game design, along with a ton of other factors like market conditions, marketing success, viral spread, and so on.

With traditional games sold online or through retail, this is a pretty standard curve: big release-day sales that fall off over time on an exponentially decreasing curve, until they get to the point where the sales are small enough that it’s not worth it to sell anymore. With online games you don’t have to worry about inventory or shelf space so you can hold onto it a bit longer, which is where this whole “long tail” thing came from, because I guess the idea is that this curve looks like it has a tail on the right-hand side. In this case the thing to watch for is sudden spikes, when those are, and what caused them, because they don’t usually happen on their own.

Unfortunately, that means sales metrics for traditional sales models aren’t all that useful to game designers. We see a single curve that combines lots of variables, and we only get the feedback after the game is released. If it’s one game in a series it’s more useful because we can see how the sales changed from game to game and what game mechanics changed, so if the game took a major step in a new direction and that drastically increased or reduced sales, that gives you some information there.

If instead your game is online, such as an MMO, or a game on a Flash portal or on Facebook, the pattern can be a bit different: sales start slow (higher if you do some marketing up front), then if the game is good it ramps up over time as word-of-mouth spreads, so it’s basically the same curve but stretched out a lot longer. The wonderful thing about this kind of release schedule is that you can manage the sales curve in real time: make a change to your game today, measure the difference in sales for the rest of the week, and keep modifying as you go. Since you have regular incremental releases that each have an effect on sales, you’re getting constant feedback on the effects that minor changes have on the money your game brings in. However, remember that your game doesn’t operate in a vacuum; there are often outside factors that will affect your sales. For example, I’d bet that if there’s a major natural disaster making international headlines, most Facebook games will see a temporary drop in usage because people are busy watching the news instead. So if a game company made a minor game change the day before the Gulf oil spill and then noticed a sudden decrease in usage from that geographical area, the designers might mistakenly think their game change was a really bad one if they weren’t paying attention to the real world.

Ideally, you’d like to control for these outside factors, so you know exactly what you’re measuring. One way of doing this, which works in some special cases, is to actually have two separate versions of your game that you roll out simultaneously to different players, and then you compare the two groups (this is often called A/B testing). One important thing about this is that you do need to select the players randomly (and not, say, give one version to the earliest accounts created on your system and the other version to the most recent adopters). Of course, if the actual gameplay itself is different between the two groups, that’s hard to do without some players getting angry about it, especially if one of the two groups ends up with an unbalanced design that can be exploited. So it’s better to do this with things that don’t affect balance: banner ads, informational popup dialog text, splash screens, the color or appearance of the artwork in your game, and other things like that. Or, if you do this with gameplay, do it in a way that is honest and up front with the players; I could imagine assigning players randomly to a faction (like World of Warcraft’s Alliance/Horde split, except randomly chosen when an account is created) and having the warring factions as part of the backstory of the game, so it would make sense that each faction would have some things that are a little bit different. I don’t know of any game that’s actually done this, but it would be interesting to see in action.
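
For what it’s worth, the random-assignment part is the easy bit. One hedged sketch: hash each account ID, so every player stays in the same group forever without the group having anything to do with account age (the bias warned about above). The function name is hypothetical:

    import hashlib

    def assign_variant(account_id: str, variants=("A", "B")) -> str:
        """Stable, effectively random assignment: same account, same group, always."""
        digest = hashlib.sha256(account_id.encode("utf-8")).digest()
        return variants[digest[0] % len(variants)]

    # assign_variant("player_12345") always returns the same group for that
    # player, but across many players the split comes out close to 50/50.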

For games where players can either play for free or pay – this includes shareware, microtransactions, subscriptions, and most other kinds of payment models for online games – you can look at not just how many users you have, or how much money you’re getting total, but also where that money is coming from on a per-user basis. This is very powerful, but there are also a lot of variables to consider.

First, what counts as a “player”? If some players have multiple accounts (with or without your permission) or if old accounts stay around while dormant, the choice of whether to count these things will change your calculations. Typically companies are interested in looking at revenue from unique, active users, because dormant accounts tend to not be spending money, and a single player with several accounts should really be thought of as one entity (even if they’re spending money on each account).

Second, there’s a difference between players who are playing for free and have absolutely no intention of paying for your game ever, versus players who spend regularly. Consider a game where you make a huge amount of money from a tiny minority of players; this suggests you have a great game that attracts and retains free players really well, and that once players can be convinced to spend any money at all they’ll spend a lot, but it also says that you have trouble with “conversion” – that is, convincing players to take that leap and spend their first dollar with you. In this case, you’d want to think of ways to give players incentive to spend just a little bit. Now consider a different game, where most people that play spend something but that something is a really small amount. That’s a different problem, suggesting that your payment process itself is driving away players, or at least that it’s giving your players less incentive to spend more, like you’re hitting a spending ceiling somewhere. You might be getting the same total cash across your user base in both of these scenarios, but the solutions are different.

Typically, the difference between them is shown with two buzzwords, ARPU (Average Revenue Per User) and ARPPU (Average Revenue Per Paying User). I wish we called them players rather than users, but it wasn’t my call. At any rate, in the first example with a minority of players paying a lot when most people play for free, ARPPU will be really high; in the second case, ARPPU will be really low, even if ARPU is the same for both games.
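
The arithmetic itself is trivial; the sketch below (with made-up numbers) shows how two games with identical ARPU can have wildly different ARPPU, which is exactly the diagnostic difference described above:

    def arpu_arppu(revenue_by_player):
        """revenue_by_player: {player_id: total revenue}, 0.0 for free players."""
        total = sum(revenue_by_player.values())
        payers = sum(1 for v in revenue_by_player.values() if v > 0)
        arpu = total / len(revenue_by_player)
        arppu = total / payers if payers else 0.0
        return arpu, arppu

    whales = {f"p{i}": 100.0 if i < 2 else 0.0 for i in range(100)}  # conversion problem
    nickels = {f"p{i}": 2.0 for i in range(100)}                     # spending-ceiling problem
    # arpu_arppu(whales)  -> (2.0, 100.0)   same ARPU...
    # arpu_arppu(nickels) -> (2.0, 2.0)     ...very different ARPPU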

Of course, total number of players is also a consideration, not just the average. If your ARPU and ARPPU are both great but you’ve got a player base of a few thousand when you should have a few million, then that’s probably more of a marketing problem than a game design problem. It depends on what’s happening to your player base over time, and where you are in the “tail” of your sales curve. So these three things, sales, ARPU and ARPPU, can give you a lot of information about whether your problem is with acquisition (that is, getting people to try your game the first time), conversion (getting them to pay you money the first time), or retention (getting players to keep coming back for more). And when you overlap these with the changes you make in your game and the updates you offer, a lot of times you can get some really useful correlations between certain game mechanics and increased sales.

Another interesting metric to look at is the graph of time-vs-money for the average user. How much do people give you on the day they start their account? What about the day after that, and the day after that? Do you see a large wad of cash up front and then nothing else? A decreasing curve where players try for free for a while, then spend a lot, then spend incrementally smaller amounts until they hit zero? An increasing curve where players spend a little, then a bit more, then a bit more, until a sudden flameout where they drop your game entirely? Regular small payments on a traditional “long tail” model? What does this tell you about the value you’re delivering to players in your early game → mid-game → late game → elder game progression?

While you’re looking at revenue, don’t forget to take your costs into account. There are two kinds of costs: up-front development, and ongoing costs. The up-front costs are things like development of new features, including both the “good” ones that increase revenue and also the “bad” ones that you try out and then discard; keep in mind that your ratio of good-to-bad features will not be perfect, so you have to count some portion of the bad ideas as part of the cost in developing the good ones (this is a type of “sunk cost” like we discussed in Week 6 when we talked about situational balance). Ongoing costs are things like bandwidth and server costs and customer support, which tend to scale with the number of players. Since a business usually wants to maximize its profits (that is, the money it takes in minus the money it spends) and not its revenue (which is just the money it takes in), you’ll want to factor these in if you’re trying to optimize your development resources.

A word of warning (gosh, I seem to be giving a lot of warnings this week): statistics are great at analyzing the past, but they’re a lot trickier if you try to use them to predict the future. For example, a really hot game that just launched might have what initially looks like an exponentially-increasing curve. It’s tempting to assume, especially if it’s a really tight fit with an exponential function, that the trend will continue. But common sense tells us this can’t continue indefinitely: the human population is finite, so if your exponential growth is faster than human population growth it has to level off eventually. Business growth curves are usually not exponential, but instead what is called “S-shaped” where it starts as an exponentially increasing curve and eventually transitions to a logarithmically (that is, slowly) increasing curve, and then eventually levels off or starts decreasing. A lot of investors get really burned when they mistake an S curve for an exponential increase, as we saw (more or less) with the dot-com crash about 10 years ago. Illegal pyramid schemes also tend to go through this kind of growth curve, with the exception that once they reach the peak of the “S” there’s usually a very sudden crash.

A Note on Ethics

This is the second time this Summer when talking about game balance that I’ve brought up an issue of professional ethics. It’s weird how this comes up in discussions of applied mathematics, isn’t it? Anyway…

The ethical consideration here is that a lot of these metrics look at player behavior, but they don’t actually look at the value added to (or removed from) the players’ lives. Some games, particularly those on Facebook which have evolved to make some of the most efficient use of metrics of any games ever made, have also been accused (by some people) of being blatantly manipulative, exploiting known flaws in human psychology to keep their players playing (and paying) against their will. Now, this sounds silly when taken to the extreme, because we think of games as something inherently voluntary, so the idea of a game “holding us prisoner” seems strange. On the other hand, any game you’ve played for an extended period of time is a game you are emotionally invested in, and that emotional investment does have cash value. If it seems silly to you that I’d say a game “makes” you spend money, consider this: suppose I found all of your saved games and put them in one place. Maybe some of these are on console memory cards or hard disks. Maybe some of them are on your PC hard drive. For online games, your “saved game” is on some company’s server somewhere. And then suppose I threatened to destroy all of them… but not to worry, I’d replace the hardware. So you get free replacements of your hard drive and console memory cards, a fresh account on every online game you subscribe to, and so on. And then suppose I asked you how much you would pay me not to do that. I bet when you think about it, the answer is more than zero, and the reason is that those saved games have value to you! And more to the point, if one of these games threatened to delete all your saves unless you bought some extra downloadable content, you would at least consider it… not because you wanted to gain the content, but because you wanted to not lose your save.

To be fair, all games involve some kind of psychological manipulation, just like movies and books and all other media (there’s that whole thing about suspending our disbelief, for example). And most people don’t really have a problem with this; they still see the game experience itself as a net value-add to their life, by letting them live more in the hours they spend playing than they would have lived had they done other activities.

But just like difficulty curves, the balance between value added and value taken away is not constant; it differs from person to person. This is why we have things like MMOs that enhance the lives of millions of subscribers, while also causing horrendous harm to a small minority who lose their marriage and family to their game obsession, or who play for so long without attending to basic bodily needs that they keel over and die at the keyboard.

So there is a question of how far we can push our players to give us money, or just to play our game at all, before we cross an ethical line… especially in the case where our game design is being driven primarily by money-based metrics. As before, I invite you to think about where you stand on this, because if you don’t know, the decision will be made for you by someone else who does.

If You’re Working on a Game Now…

If you’re working on a game now, as you might guess, my suggestion is to ask yourself which game design questions could best be answered through metrics:

  • What aspects of your design (especially relating to game balance) do you not know the answers to, at this point in time? Make a list.
  • Of those open questions, which ones could be solved through playtesting, taking metrics, and analyzing them?
  • Choose one question from the remaining list that is, in your opinion, the most vital to your gameplay. Figure out what metrics you want to use, and how you will use statistics to draw conclusions. What are the different things you might see? What would they mean? Make sure you know how you’ll interpret the data in advance.
  • If you’re doing a video game, make sure the game has some way of logging the information you want (see the sketch after this list for one way to do it). If it’s a board game, run some playtests and start measuring!
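
For that last item, here’s a minimal sketch of what event logging might look like in Python. The file name, event names, and fields are all hypothetical; a real game would more likely send these records to a server than append to a local file:

    import json
    import time

    LOG_PATH = "playtest_metrics.log"   # hypothetical local log file

    def log_event(event_name, **details):
        """Append one timestamped gameplay event as a line of JSON."""
        record = {"t": time.time(), "event": event_name, **details}
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Log exactly the things your open question needs, for example:
    log_event("level_complete", level=3, deaths=2, seconds=184)
    log_event("purchase", item="healing_potion", gold_spent=50)

One JSON record per line keeps the log trivial to parse later, and passing details as keyword arguments means you don’t have to redesign the logger every time you add a new metric.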

Homework

This is going to be more of a thought experiment than practical experience, because I couldn’t think of any way to force you to actually collect metrics on a game that isn’t yours.

Choose your favorite genre of game: maybe an FPS, RTS, CCG, tabletop RPG, Euro board game, or whatever. Now choose what you consider to be an archetypal example of that genre, one that you’re familiar with and preferably that you own.

Pretend you’ve been given the rights to do a remake of this game (not a sequel); that is, your intention is to keep the core mechanics basically the same, while possibly making some minor changes for the purpose of game balance. Think of it as a “version 2.0” of the original. You might have some areas where you already suspect, from your designer’s instinct, that the game is unbalanced… but let’s assume you want to actually prove it.

Come up with a metrics plan. Assume that you have a ready supply of playtesters, or else existing play data from the initial release, and it’s just a matter of asking for the data and then analyzing it. Generate a list:

  • What game balance questions would you want answers to, that could be answered with statistical analysis?
  • What metrics would you use for each question? (It’s okay if there is some overlap here, where several questions use some of the same metrics.)
  • What analysis would you perform on your metrics to get the answers to each question? That is, what would you do to the data (such as taking means, medians, and standard deviations, or looking for correlations; see the sketch after this list)? If your questions are “yes” or “no,” what would a “yes” or “no” answer look like once you analyzed the data?
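
For the analysis step, here’s a minimal sketch using Python’s standard statistics module on invented playtest data: level completion times in seconds, paired with each playtester’s death count. (Note that statistics.correlation requires Python 3.10 or later.)

    import statistics

    times  = [142, 151, 149, 160, 155, 148, 410, 152, 157, 146]
    deaths = [  2,   3,   2,   4,   3,   2,  11,   3,   4,   2]

    print("mean:  ", statistics.mean(times))    # 177.0, skewed by the 410 outlier
    print("median:", statistics.median(times))  # 151.5, robust to the outlier
    print("stdev: ", statistics.stdev(times))

    # Pearson correlation between completion time and death count:
    print("correlation:", statistics.correlation(times, deaths))

A gap between the mean and median like the one here is itself an answer: at least one playtester had a wildly different experience, and you’d want to understand why before trusting any averages.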

Additional Resources

Here are a few links, in case you didn’t get enough reading this week. Much of what I wrote was influenced by these:

http://chrishecker.com/Achievements_Considered_Harmful%3F

and

http://chrishecker.com/Metrics_Fetishism

Game designer Chris Hecker gave a wonderful GDC talk this year called “Achievements Considered Harmful?”, which discusses a different kind of metric – the Achievements we use to measure and reward player performance within a game – and why this might or might not be such a good idea. In the second article, he talks about what he calls “Metrics Fetishism,” basically going into the dangers of relying too much on metrics and not enough on common sense.

http://www.gamasutra.com/view/news/29916/GDC_Europe_Playfishs_Valadares_on_Intuition_Versus_Metrics_Make_Your_Own_Decisions.php

This is a Gamasutra article quoting Playfish studio director Jeferson Valadares at GDC Europe, suggesting when to use metrics and when to use your actual game design skills.

http://www.lostgarden.com/2009/08/flash-love-letter-2009-part-2.html

Game designer Dan Cook writes on the many benefits of metrics when developing a Flash game.

http://www.gamasutra.com/features/20070124/sigman_01.shtml

This one is by the same guy who did the “Orc Nostril Hair” probability article, this time giving a basic primer on statistics rather than probability.

2 Responses to “Level 8: Metrics and Statistics”

  1. Darius K. Says:

    Hi Ian, here’s a bit of a clarification on an aside in this (great) article.

    “For example, I saw one website that listed the “worst” MAU/DAU ratio in the top 100 applications as 33-point-something, which should be flatly impossible, so clearly the numbers somewhere are being messed with (maybe they took the Dailies from a different range of dates than the Monthlies or something).”

    The reason you can get a 33+ MAU/DAU is that they are calculating from Facebook’s numbers. It’s widely known among FB devs that Facebook’s published MAU and DAU numbers are just plain wrong. We calculate our own internal MAU and DAU from our server logins, and Facebook’s numbers can sometimes be off by as much as 50%-100%; they also don’t track the real numbers very well in terms of rises and falls.

    Here’s a typical example of 3 days of tracking these numbers, real (internal) vs Facebook’s reported values:

    Real MAU/DAU
    Mon: 100/25 = 4
    Tue: 125/25 = 5
    Wed: 150/50 = 3

    Facebook MAU/DAU
    Mon: 160/35 = 4.57
    Tue: 150/28 = 5.35
    Wed: 220/70 = 3.14

    I’ve made up the numbers but the basic trend is that FB’s reported DAU and MAU numbers are regularly off by 10%-50% and often decrease when real numbers increase and vice versa. But the FB MAU/DAU ratios are only off by 5%-14%, and they tend to track up/down with the real ratios.

    This greater degree of reliability is one of the main reasons people use ratios of MAU and DAU. But like I said, in my experience ratios calculated from FB’s numbers can be off by as much as 15%, which means it’s not impossible that a MAU/DAU ratio calculated from FB numbers could be as bad as 34!

    Anyway, you wrote a great article. Hopefully this provides a little more context for the error you saw.

  2. Loopholes in Game Design » devmag.org.za Says:

    […] Metrics are a powerful tool for finding loopholes, especially if you have a large test group. A simple statistic such as number of players that choose a certain character may point to a balance issue: if a disproportionate number of players choose a certain character or combination, you know there is a problem. The course blog Game Balance Concepts gives an extensive overview of the use of metrics in game design. […]
