Darts, Dice, and Coins: Sampling from a Discrete Distribution

Last Major Update: December 29, 2011

Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question:

"You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability $\frac{1}{6}$ to each of the sides of the die, or to simulate a fair coin by simulating a two-sided die where each side has probability $\frac{1}{2}$ of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in $O(1)$ time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better-known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.

This writeup is my attempt to give a quick survey of various approaches for simulating a loaded die, ranging from simple techniques that are highly impractical to the very optimized and efficient alias method. My hope here is to capture different intuitions about the problem and how each highlights some new aspect of simulating loaded dice. For each approach, my goal is to explore the motivating idea, core algorithm, correctness proof, and runtime analysis (in terms of time, memory, and randomness required).

Preliminaries

Before I go into any of the specific details of the different techniques, let's first standardize our notation and terminology.

In the introduction to this writeup, I used the term "loaded die" to describe a general scenario where there is a finite set of outcomes, each of which has some associated probability. Formally, this is termed a discrete probability distribution, and the problem of simulating the loaded die is called sampling from a discrete distribution. To describe our discrete probability distribution (loaded die), we will assume that we are given a set of n probabilities $p_0, p_1, ..., p_{n - 1}$ associated with outcomes $0, 1, ..., n - 1$. Although the outcomes can be anything (heads/tails, numbers on a die, colors, etc.), for simplicity I'll assume that the outcome is some positive natural number that corresponds to the given index.

Dealing with real numbers on a computer is a bit of a computational gray area. There are many fast algorithms whose speed is derived purely from the ability to, in constant time, compute the floor function of an arbitrary real number, and numerical inaccuracies in floating-point representations can entirely ruin certain algorithms. Consequently, before we embark on any discussion of algorithms for working with probabilities, which enter the dark world of real numbers, I should clarify what I assume the computer can and cannot do.

In what follows, I will assume that all of the following operations can be done in constant time:

  1. Generating a uniformly-random real number in the range $[0, 1)$.
  2. Adding, subtracting, multiplying, dividing, and comparing arbitrary real numbers.
  3. Computing the floor of an arbitrary real number.

It is worth asking whether it's reasonable to assume that we can do all of these operations efficiently. In practice, we rarely have probabilities specified to a level of precision where the rounding error inherent in an IEEE-754 double will cause significant problems, and so we can get all of the above for free by just doing everything with IEEE doubles. However, if we are in an environment where the probabilities are specified exactly as high-precision rational numbers, then these constraints may be unreasonable.

Simulating a Fair Die

Before we tackle the general case of rolling an arbitrarily loaded die, let's begin with a simpler algorithm that will serve as a building block for the later algorithms: simulating a fair, n-sided die. For example, we may be interested in rolling a fair 6-sided die to play Monopoly or Risk, or flipping a fair coin (a 2-sided die), etc.

In this special case, there is a simple, elegant, and efficient algorithm for simulating the outcome. The idea behind the algorithm is as follows. Suppose that we can generate truly random, uniformly-distributed real numbers in the range $[0, 1)$. We can visualize this range as follows:

The range [0, 1)

Now, if we want to roll an $n$-sided die, one way to do so would be to segment the range $[0, 1)$ into $n$ evenly-sized smaller regions, each of which has length $\frac{1}{n}$. This looks as follows:

The range [0, 1), partitioned into four groups.

At this point, if we generate a randomly-chosen real number in the range $[0, 1)$, it will fall into exactly one of these smaller regions. From there, we can read off what the outcome of the die roll is by seeing what range it falls in. For example, if our randomly-chosen value fell at this location:

The range [0, 1), partitioned into four groups, with a star somewhere in the third partition.

We would say that the die rolled a 2 (assuming that we zero-index our die).

Graphically, it's easy to see which bucket the random value fell into, but how can we encode this as an algorithm? This is where we use the fact that the die is a fair die. Since all of the ranges have equal size, namely $\frac{1}{n}$, we can find the largest value of $i$ such that $\frac{i}{n}$ is no greater than the randomly-generated value (call that value x). One observation is that if we are trying to find the maximum value of $i$ such that $\frac{i}{n} \le x$, this is equivalent to finding the maximum value of $i$ such that $i \le xn$. But, by definition, this means that $i = \lfloor xn \rfloor$, the largest natural number no larger than xn. Consequently, this gives us the following (very simple) algorithm for simulating a fair, n-sided die:

Algorithm: Simulating a Fair Die

  1. Generate a uniformly-random value $x$ in the range $[0, 1)$.
  2. Return $\lfloor xn \rfloor$.

Using our computational assumptions from above, this algorithm runs in $O(1)$ time.
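To make this concrete, here is a minimal Python sketch of the algorithm (the function name fair_die_roll is just for illustration; random.random() plays the role of our uniform generator over $[0, 1)$):

    import random

    def fair_die_roll(n):
        # Step 1: generate a uniformly-random value x in [0, 1).
        x = random.random()
        # Step 2: return floor(x * n); int() truncates, which equals floor for nonnegative values.
        return int(x * n)

    # Example: fair_die_roll(6) returns one of 0, 1, ..., 5, each with probability 1/6.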

This section has two takeaway points. First, we can segment the range $[0, 1)$ into pieces such that a uniformly-random real number in that range maps naturally back to one of the many discrete choices available to us. We will exploit this technique extensively throughout the rest of this writeup. Second, it can be complicated to determine which range a particular random value has fallen into, but if we know something about the partitions (in this case, that they all have the same size), it can be mathematically easy to determine which partition a particular point is in.

Simulating a Loaded Die with a Fair Die

Given an algorithm for simulating a fair die, can we adapt the algorithm to simulate a loaded die? The answer, interestingly, is yes, but it will come at a space premium.

The intuition of the previous section suggests that in order to simulate a roll of a loaded die, all we need to do is split the range $[0, 1)$ into pieces, then determine which piece we have fallen into. However, in general this can be much more difficult than it might seem. For example, suppose that we have a four-sided die with side probabilities of $\frac{1}{2}, \frac{1}{3}, \frac{1}{12},$ and $\frac{1}{12}$ (we can confirm that this is a legal probability distribution since $\frac{1}{2} + \frac{1}{3} + \frac{1}{12} + \frac{1}{12} = \frac{6}{12} + \frac{4}{12} + \frac{1}{12} + \frac{1}{12} = \frac{12}{12}$). If we partition the range $[0, 1)$ into four pieces of these given sizes, we get the following:

The range [0, 1), partitioned into four groups of unequal size.

Unfortunately, at this point we're stuck. Even if we knew a random number in the range $[0, 1)$, there is no simple mathematical trick we can use to automatically determine which partition that number falls into. This is not to say that it's extremely difficult to do so - as you'll see, there are many great tricks we can use - but none of them necessarily has the mathematical simplicity of the algorithm for rolling a fair die.

However, we can adapt the technique for fair dice to work in this case. Let's take this particular case as an example. The probability of the sides of the dice coming up are $\frac{1}{2}, \frac{1}{3}, \frac{1}{12},$ and $\frac{1}{12}$. If we rewrite this such that all the terms have a common denominator, we get the values $\frac{6}{12}, \frac{4}{12}, \frac{1}{12},$ and $\frac{1}{12}$. We can therefore think about this problem as follows: rather than rolling a four-sided die with weighted probabilities, why not instead roll a 12-sided fair die where the faces have duplicate labels on them? Since we know how to simulate a fair die, this would be equivalent to cutting up the range $[0, 1)$ into twelve pieces like this:

The range [0, 1), partitioned into twelve equal groups.

Then assigning them to the different outcomes like this:

The range [0, 1), partitioned into twelve equal groups and assigned to different sides of the die.

Now, simulating a die roll is extremely simple - we just roll this new fair die, then see what side comes up and read off the value it contains. This first step can be accomplished using the algorithm from above, which will give us back an integer in the range $0, 1, ..., 11$. To map that integer back to one of the sides of the original loaded die, we will store an auxiliary array of size twelve that maps each of these numbers back to the original outcome. Graphically, we can see this as follows:

The range [0, 1), partitioned into twelve equal groups and assigned to different sides of the die.  Additionally, there is an array describing how the partitions map back to the original sides of the die.

To formalize this as an algorithm, we will describe both the initialization step (coming up with the table) and the generation step (simulating throws of the random die). Both of these steps are important to consider in this algorithm and the algorithms that follow, since the setup time might be great.

In the initialization step, we begin by finding the least common multiple of the denominators of all of the probabilities we are given for the sides of the die (in our example, this was 12). The LCM is useful here because it corresponds to the smallest common denominator we could use for all of the fractions, and therefore the number of sides on the new, fair die that we will be rolling. Once we have this LCM (let's call it L), we need to determine how many sides of the new die will be distributed to each of the sides of the original, loaded die. In our example, the side with probability $\frac{1}{2}$ got 6 sides on the new die, since $\frac{1}{2} \times 12 = 6$. Similarly, the side with probability $\frac{1}{3}$ got 4 sides, since $\frac{1}{3} \times 12 = 4$. More generally, if L is the LCM of the denominators and $p_i$ is the probability of side $i$ of the die coming up, we would allocate $L \cdot p_i$ sides of the fair die to side $i$ of the original loaded die.

Here is pseudocode for the above algorithm:

Algorithm: Simulating a Loaded Die with a Fair Die

  Initialization:
    1. Find the least common multiple of the denominators of the probabilities $p_0, p_1, ..., p_{n - 1}$; call it $L$.
    2. Allocate an array $A$ of size $L$ mapping rolls of the fair die back to sides of the loaded die.
    3. For each side $i$ of the loaded die, in any order, set $L \cdot p_i$ entries of $A$ to have value $i$.

  Generation:
    1. Roll a fair, $L$-sided die using the earlier algorithm; call the result $S$.
    2. Return $A[S]$.
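For concreteness, here is a sketch of both steps in Python, under the assumption that the probabilities arrive as exact fractions.Fraction values and that Python 3.9+ is available (for math.lcm); the function names are illustrative, not part of any standard API:

    import random
    from fractions import Fraction
    from math import lcm

    def make_table(probs):
        # probs: a list of Fractions p_0, ..., p_{n-1} summing to 1.
        L = lcm(*(p.denominator for p in probs))   # sides on the new fair die
        table = []
        for side, p in enumerate(probs):
            table.extend([side] * int(p * L))      # side i gets L * p_i entries
        return table

    def roll_loaded_die(table):
        # Roll the fair len(table)-sided die, then map the result back to a side.
        return table[int(random.random() * len(table))]

    # Example: make_table([Fraction(1, 2), Fraction(1, 3), Fraction(1, 12), Fraction(1, 12)])
    # produces a 12-entry table with six 0s, four 1s, one 2, and one 3.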

This algorithm may be simple, but how efficient is it? The actual generation of die rolls is quite fast - each die roll requires $O(1)$ work to generate a random die roll using the earlier algorithm, plus an extra $O(1)$ work for the table lookup. This gives a total work requirement of $O(1)$.

However, the initialization step may be extremely costly. In order for this algorithm to work, we need to allocate space for an array as large as the LCM of the denominators of all of the input fractions. For our example ($\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$), this was 12, but for other inputs it can be pathologically bad. As an example, consider the fractions $\frac{999999}{1000000}$ and $\frac{1}{1000000}$. The LCM of the denominators is one million, so our table would require one million entries!

Unfortunately, it gets worse than this. In the previous example, we could at least say that we "expected" the algorithm to use a lot of memory, since the denominators of the fractions were each one million. However, we may end up with a set of probabilities for which the LCM is substantially greater than any individual denominator. As an example, consider the probabilities $\frac{1}{15}, \frac{1}{10}, \frac{5}{6}$. Here, the LCM of the denominators is 30, which is larger than any of the denominators themselves. The construction here works because $15 = 3 \times 5$, $10 = 2 \times 5$, and $6 = 2 \times 3$; in other words, each denominator is the product of two primes chosen out of a pool of three. Their LCM is therefore the product of all of those primes together, since each denominator has to divide the LCM. If we generalize this construction and consider any set of $k$ primes, then if we pick one fraction for each pairwise product of those primes, the LCM may be much larger than any individual denominator. In fact, one of the best upper bounds we can derive on the LCM would be $O(\prod_{i = 0}^n{d_i})$, where $d_i$ is the denominator of the $i$th probability. This precludes this algorithm from being used in any practical setting where the probabilities are not known in advance, since the memory usage required to hold a table of size $O(\prod_{i = 0}^n{d_i})$ can easily be beyond what can be held in RAM.

That said, in many cases this algorithm is well-behaved. If the probabilities are all identical, then all the probabilities we are given as input are $\frac{1}{n}$ for some $n$. The LCM of the denominators is then $n$, so the fair die we end up rolling will have $n$ sides and each of the sides of the original die will correspond to one side of the fair die. The initialization time is thus $O(n)$. Graphically, this would look as follows:

A similar partitioning, but with evenly-sized probabilities

This gives the following information about this algorithm:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ / $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$

Another important detail about this algorithm is that it assumes that we're given the probabilities nicely and conveniently as fractions with well-behaved denominators. If the probabilities are specified as IEEE-754 doubles, then this approach is likely to fail catastrophically due to small rounding errors; imagine if we have 0.25 and 0.250000000001 as probabilities! Thus this approach is probably best not attempted except in special cases where the probabilities are known to be well-behaved and specified in a format conducive to operations on rational numbers.

Simulating a Biased Coin

Our explanation of one simple random primitive (the fair die) led to a simple but potentially disastrously inefficient algorithm for simulating a loaded die. Perhaps exploring other simple random primitives might shed some more light on different approaches for solving this problem.

A simple but surprisingly useful task is simulating a biased coin using a random generator. Given a coin with probability $p_{heads}$ of coming up heads, how would we simulate a flip of the biased coin?

One of the intuitions we developed earlier on was that we can partition the range $[0, 1)$ into a series of buckets such that when we pick a random value in the range, it ends up in some bucket with probability equal to the size of the bucket. To simulate a biased coin using a uniformly random value in the range $[0, 1)$, we could consider splitting the range $[0, 1)$ like this:

A partitioning of [0, 1) into two regions for a biased coin toss.

And then generating a uniformly-random value in the range $[0, 1)$ to see what bucket it's contained in. Fortunately, since there is only one splitting point, it's quite easy to determine which bucket the point is contained in; if the value is less than $p_{heads}$, then the coin came up heads, and otherwise it came up tails. As pseudocode:

Algorithm: Simulating a Biased Coin

  1. Generate a uniformly-random value $x$ in the range $[0, 1)$.
  2. If $x < p_{heads}$, return "heads."
  3. If $x \ge p_{heads}$, return "tails."

Since we can generate a uniformly-random value in the range $[0, 1)$ in $O(1)$ time and can do a real-valued comparison in $O(1)$ as well, this algorithm runs in $O(1)$ time.
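For reference, here is the same algorithm as a tiny Python helper (the name biased_coin is just for illustration); the later sketches in this writeup will assume something equivalent is available:

    import random

    def biased_coin(p_heads):
        # Returns True ("heads") with probability p_heads, and False ("tails") otherwise.
        return random.random() < p_heads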

Simulating a Fair Die with Biased Coins

From our earlier discussion, we know that it's possible to simulate a loaded die with a fair die, assuming that we're willing to pay a potential premium in space usage. Since we can think of a biased coin as a loaded two-sided die, this means that it's possible to simulate a biased coin using a fair die. Interestingly, it's also possible to do the opposite, and we can simulate a fair die with a biased coin. The construction is straightforward, elegant, and can easily be generalized to simulate a loaded die using a set of biased coins.

The construction for simulating a biased coin worked by partitioning the range $[0, 1)$ into two regions - a "heads" region and a "tails" region - based on the probability of the coin coming up heads. We have already seen a similar trick used above to simulate a fair, $n$-sided die that worked by splitting the region $[0, 1)$ into $n$ evenly-sized regions. For example, when rolling a four-sided die, we ended up with this partitioning:

The range [0, 1), partitioned into four groups.

Now, suppose that we were interested in simulating a roll of this fair die, given that we have available to us a collection of biased coins. One way that we might think about this would be to imagine marching across these buckets from left to right, at each step asking whether we want to stop in the bucket we're currently in or to keep moving. For example, let's suppose that we want to pick one of these four buckets randomly. Beginning in the leftmost bucket, we would flip a biased coin that would indicate whether we should stop in this bucket or continue moving forward. Since we want to pick all of these buckets uniformly with probability $\frac{1}{4}$, we could do this by flipping a biased coin that comes up heads with probability $\frac{1}{4}$. If it comes up heads, we stop in this bucket. Otherwise, we move on to the next bucket.

If the coin comes up tails, we end up in the second bucket and again want to ask whether or not we should choose this bucket or keep moving. Initially, you might think that we should flip another coin that comes up heads with probability $\frac{1}{4}$ to do this, but this would actually be incorrect! One way to see the flaw in this reasoning is to carry it to the extreme - if in each bucket we flip a coin that comes up heads with probability $\frac{1}{4}$, then there is a small chance that in each bucket the coin will come up tails and we will end up rejecting each bucket. Somehow, we need to keep increasing the probability of the coin coming up heads as we march across the buckets. At the extreme, if we end up in the last bucket, we need the coin to come up heads with probability $1$, since if we've rejected every preceding bucket the correct decision is to stop in the last bucket.

To determine the probability that our biased coin should come up heads once we've skipped the first bucket, notice that if we have indeed skipped the first bucket, there are only three buckets left. Since we're rolling a fair die we would want each of these three buckets to be chosen with probability $\frac{1}{3}$. Consequently, intuitively it seems like we should have the second coin come up heads with probability $\frac{1}{3}$. Using a similar argument, if we flip tails on the second bucket, then the coin we toss for the third bucket should come up heads with probability $\frac{1}{2}$, and the coin we toss for the final bucket should come up heads with probability $1$.

This intuition leads us to the following algorithm. Note that we have not argued why this algorithm is correct or even that it is correct; we'll do that in a second.

Algorithm: Simulating a Fair Die with Biased Coins

  1. For $i = 0$ to $n - 1$:
    1. Flip a biased coin with probability $\frac{1}{n - i}$ of coming up heads.
    2. If it comes up heads, return $i$.
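As a quick sketch, the loop above might look like this in Python (the helper biased_coin from earlier is repeated here so the snippet is self-contained):

    import random

    def biased_coin(p_heads):
        return random.random() < p_heads

    def fair_die_with_coins(n):
        # Walk across the n buckets; in bucket i, stop with probability 1 / (n - i).
        for i in range(n):
            if biased_coin(1.0 / (n - i)):
                return i
        # Never reached: the coin for the last bucket comes up heads with probability 1.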

This algorithm is simple and in the worst-case runs in $O(n)$ time. But how do we know that it's actually correct? To see this, we will need the following theorem:

Theorem: The above algorithm outputs side $i$ with probability $\frac{1}{n}$ for any choice of $i$.

Proof: Consider any fixed $n \ge 0$. We prove by strong induction that each of the $n$ sides has probability $\frac{1}{n}$ of being chosen.

As our base case, we show that side $0$ of the die has probability $\frac{1}{n}$ of being chosen. But this is immediate from the algorithm - we choose side 0 if a biased coin with probability $\frac{1}{n}$ of coming up heads comes up heads, which means that we pick it with probability $\frac{1}{n}$.

For the inductive step, assume that sides $0, 1, 2, ..., k - 1$ are each chosen with probability $\frac{1}{n}$ and consider the probability that side $k$ is chosen. Side $k$ will be chosen iff the first $k$ sides are not chosen and the coin for side $k$, which comes up heads with probability $\frac{1}{n - k}$, comes up heads. Because each of the first $k$ sides has probability $\frac{1}{n}$ of being chosen, and since only one side is ever chosen, the probability that one of the first $k$ sides is chosen is given by $\frac{k}{n}$. This means that the probability that the algorithm does not pick one of the first $k$ sides is given by $1 - \frac{k}{n} = \frac{n}{n} - \frac{k}{n} = \frac{n - k}{n}$. This means that the probability that we choose side $k$ is given by $\frac{n - k}{n} \frac{1}{n - k} = \frac{1}{n}$ as required, completing the induction. Thus each side of the die is chosen uniformly at random.

Of course, this algorithm is fairly inefficient - we can simulate a roll of the fair die in $O(1)$ using our earlier technique! - but this algorithm can be used as a stepping stone into a reasonably efficient algorithm for simulating a loaded die with biased coins.

Simulating a Loaded Die with a Biased Coin

The above algorithm is interesting in that it gives a simple framework for simulating a die with a set of coins. We begin by flipping a coin to determine whether to pick the first side of the die or to move on to the remaining sides. In doing so, we have to be careful to scale up the remaining probabilities.

Let's see how we might use this technique to simulate a roll of a loaded die. Let's use our example from above, with probabilities $\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$. This, if you'll recall, splits the range $[0, 1)$ as follows:

The range [0, 1), partitioned into four groups of unequal size.

Now, let's think about how we might simulate this loaded die using biased coins. We could start by flipping a coin that has probability $\frac{1}{2}$ of coming up heads to determine whether or not we should output side 0. If this coin comes up heads, great! We're done. Otherwise, we need to flip another coin to determine whether or not to pick the next side. As before, even though the next side has probability $\frac{1}{3}$ of coming up, we do not want to flip a coin that comes up heads with probability $\frac{1}{3}$, since half of the probability mass has been discarded when we didn't choose the $\frac{1}{2}$ side. In fact, since half of the probability mass is gone, if we renormalize the remaining probabilities at this point, we end up with the updated probabilities $\frac{2}{3}, \frac{1}{6}, \frac{1}{6}$. Thus the second coin should come up heads with probability $\frac{2}{3}$. If this coin also comes up tails, then we have to choose between the two $\frac{1}{12}$ sides. Since at that point $\frac{5}{6}$ of the probability mass will be gone, the two $\frac{1}{12}$ sides renormalize to probability $\frac{1}{2}$ each, so the third coin would have probability $\frac{1}{2}$ of coming up heads. The final coin, if it's ever flipped, would have to come up heads with probability $1$, since it's the very last bucket.

To recap, the probabilities for the coins would be

  1. First flip: $\frac{1}{2}$
  2. Second flip: $\frac{2}{3}$
  3. Third flip: $\frac{1}{2}$
  4. Fourth flip: $1$

Although it may make intuitive sense where these numbers come from, to turn this into an algorithm we are going to need to come up with a formal construction for choosing the probabilities. The idea is as follows - at each step, we remember how much of the probability mass remains to be used up. At the beginning, before flipping any coins, this is $1$. After flipping the first coin, it's $1 - p_0$. After flipping the second coin, it's $1 - p_0 - p_1$. More generally, after flipping $k$ coins, the remaining probability mass is $1 - \sum_{i = 0}^{k - 1}{p_i}$. Whenever we flip a coin to determine whether or not to pick bucket $k$, we end up flipping a coin that comes up heads with probability equal to the fraction of the remaining probability occupied by the probability $p_k$, which is given by $\frac{p_k}{1 - \sum_{i = 0}^{k - 1}{p_i}}$. This gives us the following algorithm for simulating a loaded die with a set of biased coins (again, we'll prove correctness and runtime in a second):

Algorithm: Loaded Die from Biased Coins

  Initialization:
    1. Store the probabilities $p_0, p_1, ..., p_{n - 1}$ for later use.

  Generation:
    1. Set $mass = 1$.
    2. For $i = 0$ to $n - 1$:
      1. Flip a biased coin with probability $\frac{p_i}{mass}$ of coming up heads.
      2. If it comes up heads, return $i$.
      3. Otherwise, set $mass = mass - p_i$.
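Here is a minimal Python sketch of this procedure (the names are illustrative, and exact real arithmetic is approximated with floats):

    import random

    def loaded_die_with_coins(probs):
        # probs: the probabilities p_0, ..., p_{n-1}, summing to 1.
        mass = 1.0                          # probability mass not yet ruled out
        for i, p in enumerate(probs):
            if random.random() < p / mass:  # biased coin with P(heads) = p_i / mass
                return i
            mass -= p
        # With exact arithmetic the loop always returns; this guards against
        # floating-point rounding leaving a sliver of mass at the end.
        return len(probs) - 1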

While this may make some amount of intuitive sense, is it mathematically correct? Fortunately, the answer is yes due to a generalization of the above proof, which is given here:

Theorem: The above algorithm outputs side $i$ with probability $p_i$ for any choice of $i$.

Proof: Consider any fixed $n \ge 0$. We prove by strong induction that each of the $n$ sides has probability $p_i$ of being chosen.

As our base case, we show that side $0$ of the die has probability $p_0$ of being chosen. We choose side $0$ if the very first coin flip comes up heads, which occurs with probability $\frac{p_0}{mass}$. Since $mass$ is initially $1$, this probability is $\frac{p_0}{1} = p_0$, so side 0 is chosen with probability $p_0$ as required.

For the inductive step, assume that sides $0, 1, ..., k - 1$ are chosen with probabilities $p_0, p_1, ..., p_{k-1}$ and consider the probability that side $k$ is chosen. Side $k$ will be chosen iff the first $k$ sides are not chosen and the coin for side $k$, which comes up heads with probability $\frac{p_k}{mass}$, comes up heads. Because each of the first $k$ sides is chosen with the correct probability, and since only one side is ever chosen, the probability that one of the first $k$ sides is chosen is given by $\sum_{i = 0}^{k - 1}{p_i}$. This means that the probability that the algorithm does not pick one of the first $k$ sides is given by $1 - \sum_{i = 0}^{k - 1}{p_i}$. Now, the probability that the coin for side $k$ comes up heads is given by $\frac{p_k}{mass}$, and after $k$ iterations we can see by a quick induction that $mass = 1 - \sum_{i = 0}^{k - 1}{p_i}$. This means that the total probability that we pick side $k$ is given by $(1 - \sum_{i = 0}^{k - 1}{p_i})\frac{p_k}{1 - \sum_{i = 0}^{k - 1}{p_i}} = p_k$ as required, completing the induction.

Now, let's consider the runtime complexity of this algorithm. The initialization time will be $\Theta(1)$ if we keep a shallow copy of the input probability array, but is more likely to be $\Theta(n)$ so that we can store our own version of the array (in case the caller wants to change it later on). Actually generating the result of the die roll might require us to flip $\Theta(n)$ coins in the worst case, but only requires a single flip in the best case.

However, with a bit of thought it becomes clear that the input distribution heavily influences how many coin flips we will need. In the absolute best case, we have a probability distribution where all of the probability mass is centered on the first side of the die and the remaining probabilities are all zero. In that case, we need just one coin flip. In the absolute worst-case, all the probability mass is centered on the very last side of the die and is zero everywhere else, in which case we'll need to flip $n$ coins to find the outcome.

We can precisely and mathematically characterize the expected number of coin flips for this algorithm. Let's think of a random variable $X$ that represents the number of coin flips in any execution of this algorithm on some specific distribution. That is, $\mathbb{P}[X = 1]$ is the probability that the algorithm flips just one coin before terminating, $\mathbb{P}[X = 2]$ is the probability that the algorithm flips just two coins, etc. In this case, the expected number of coin flips for our algorithm is given by the expected value of $X$, denoted $\mathbb{E}[X]$. By definition, we have that $$\mathbb{E}[X] = \sum_{i = 1}^n{i \cdot \mathbb{P}[X = i]}$$

So what is the value of $\mathbb{P}[X = i]$? Well, the algorithm will terminate after it chooses some side of the die. If it chooses side $0$, it will flip one coin to determine to stop there. If it chooses side $1$, it will flip two coins - one to recognize that it doesn't want to pick side $0$, and one to recognize that it does want to pick side $1$. More generally, if the algorithm chooses side $i$, it will flip $i + 1$ coins: $i$ to decide not to pick the previous $i$ sides, and one to decide to pick side $i$. Combined with the fact that we know that side $i$ is chosen with probability $p_i$, this means that $$\mathbb{E}[X] = \sum_{i = 1}^n{i \cdot \mathbb{P}[X = i]} = \sum_{i = 1}^n{i \cdot p_{i - 1}} = \sum_{i = 1}^n{((i - 1) p_{i - 1} + p_{i - 1})} = \sum_{i = 1}^n{((i - 1) p_{i - 1})} + \sum_{i = 1}^n{p_{i - 1}}$$

In the last simplification, notice that the first term is equivalent to $\sum_{i = 0}^{n-1}{i \cdot p_i}$, which is equal to $\mathbb{E}[p]$, the expected outcome of the die roll! Moreover, the second term is $1$, since it's the sum of all the probabilities. This means that $\mathbb{E}[X] = \mathbb{E}[p] + 1$. That is, the expected number of coin flips is one plus the expected value of the die roll!
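As a quick check with our running example (a computation added here for concreteness), the probabilities $\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$ give $$\mathbb{E}[X] = \mathbb{E}[p] + 1 = \left(0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{3} + 2 \cdot \frac{1}{12} + 3 \cdot \frac{1}{12}\right) + 1 = \frac{3}{4} + 1 = \frac{7}{4}$$ so on average we flip 1.75 coins per roll.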

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ / $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ / $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$ / $\Theta(n)$

Generalizing Biased Coins: Simulating Loaded Dice

In the above example, we were able to efficiently simulate a biased coin because there was only one partition point that we needed to consider. How efficiently can we generalize this idea up to loaded dice, where the number of sides may be arbitrarily large?

If you'll notice, a biased coin is exactly the same as a loaded die that just has two sides. Consequently, we can think of a biased coin as just a special case of the more general problem we're interested in solving. When solving the biased coin problem, we split the range $[0, 1)$ into two regions - one for "heads" and one for "tails" - and then used the fact that there was just one splitting point to find the bucket we belong to. If we have an n-sided die, then we will have multiple buckets and, therefore, multiple splitting points. For example, suppose that we have a seven-sided die with probabilities $\frac{1}{4}, \frac{1}{5}, \frac{1}{8}, \frac{1}{8}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}$. If we were to partition the range $[0, 1)$ into seven pieces, we would do so as follows:

A partitioning of [0, 1) into seven different regions.

Notice where these splits are located. The first partition begins at $0$ and ends at $\frac{1}{4}$. The second partition begins at $\frac{1}{4}$ and ends at $\frac{1}{4} + \frac{1}{5} = \frac{9}{20}$. More generally, if the probabilities are $p_0, p_1, ..., p_{n - 1}$, the partitions would be the ranges $[0, p_0), [p_0, p_0 + p_1), [p_0 + p_1, p_0 + p_1 + p_2)$, etc. That is, bucket $i$ is delimited by the range $$[\sum_{j = 0}^{i - 1}{p_j}, \sum_{j = 0}^{i}{p_j})$$ Notice that the difference between these two values is $p_i$, so the size of the bucket is $p_i$ as required.
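Spelling this out for the seven-sided example (arithmetic added here for concreteness), the running totals give the bucket endpoints $$\frac{10}{40}, \frac{18}{40}, \frac{23}{40}, \frac{28}{40}, \frac{32}{40}, \frac{36}{40}, \frac{40}{40}$$ so, for instance, bucket 3 is the range $[\frac{23}{40}, \frac{28}{40})$.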

Now that we know where the partitions are, if we were to pick a uniformly-random value $x$ in the range $[0, 1)$, how would we determine which range it fell into? Using our biased coin algorithm as a starting point, one idea would be as follows: starting with the endpoint of the first bucket, continuously walk upward across the partitions until we find an endpoint greater than the value of $x$. If we do this, we will have located the bucket containing the point $x$, and we have our value. For example, if we picked the random value $x = \frac{27}{40}$, we would execute the following search:

A linear search for the region containing 27/40.

From which we could conclude that the die rolled a 3, zero-indexed.

This linear scan algorithm would give an $O(n)$-time algorithm for finding the side of the die that was rolled. However, we can dramatically improve this runtime by using the following observation: the sequence of endpoints of the buckets forms an ascending sequence (since we're always adding in more and more probabilities, none of which can be less than zero). Consequently, we are trying to answer the following question: given an ascending sequence of values and some test point, find the first value in the range strictly greater than the test point. This is a perfect spot to use a binary search! For example, here's an execution of binary search over the above array to find what bucket the value $x = \frac{39}{40}$ belongs to:

A binary search for the region containing 39/40.

This gives us an $\Theta(\log n)$ algorithm for mapping a uniformly-random value in the range $[0, 1)$ to a side of the die that was rolled. Moreover, this requires only $\Theta(n)$ preprocessing time to build up the table of endpoints; we simply compute the partial sums of the probabilities all the way up.

This algorithm is sometimes called roulette wheel selection because it picks a random bucket using a technique reminiscent of a roulette wheel - throwing a ball into the range and seeing where it lands. In pseudocode, the algorithm looks like this:

Algorithm: Roulette Wheel Selection

  Initialization:
    1. Allocate an array $A$ of size $n$.
    2. Set $A[0] = p_0$.
    3. For each $i$ from $1$ to $n - 1$, set $A[i] = A[i - 1] + p_i$.

  Generation:
    1. Generate a uniformly-random value $x$ in the range $[0, 1)$.
    2. Using a binary search, find the index $i$ of the smallest element of $A$ strictly greater than $x$.
    3. Return $i$.
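In Python, the generation step maps naturally onto the standard bisect module (a sketch under the usual floating-point caveats; the function names are illustrative):

    import bisect
    import random

    def make_cumulative(probs):
        # A[i] holds p_0 + p_1 + ... + p_i.
        A, running = [], 0.0
        for p in probs:
            running += p
            A.append(running)
        return A

    def roulette_roll(A):
        x = random.random()
        # Index of the first cumulative total strictly greater than x; the min()
        # guards against rounding error pushing x past the final total.
        return min(bisect.bisect_right(A, x), len(A) - 1)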

The comparison between this algorithm and the earlier one is quite impressive:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ / $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ / $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$ / $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ / $\Theta(n)$ | $\Theta(\log n)$ / $\Theta(\log n)$ | $\Theta(n)$ / $\Theta(n)$

It is clear that we now have a much better algorithm than what we started with. Discretizing the probabilities may have initially seemed promising, but this new approach based on a continuous value and binary search appears to be much better. However, it is still possible to improve upon these bounds by using a clever set of hybrid techniques, as we'll see in a minute.

One interesting detail about this algorithm is that while using a binary search guarantees worst-case $O(\log n)$ time for random generation, it also eliminates the possibility of faster lookups; that is, the generation time is now $\Omega(\log n)$ as well. Is it possible to do better than this? It turns out that the answer is yes.

Suppose that we switch from using a simple binary search over the list of cumulative probabilities to using a binary search tree. For example, given the above set of probabilities, we might build the following binary search tree over their cumulative distribution:

A binary search tree for the above probabilities.

Now, if we wanted to simulate a roll of the die, we can generate a uniformly-distributed number in the range $[0, 1)$, then look up in which range it lies in this BST. Since this is a balanced binary search tree, the best-case lookup time is $O(1)$ and the worst-case lookup time is $O(\log n)$.

However, assuming that we know more about the probability distribution, it may be possible to do much better than this. For example, suppose that our probabilities are $\frac{99}{100}, \frac{1}{600}, \frac{1}{600}, \frac{1}{600}, \frac{1}{600}, \frac{1}{600}, \frac{1}{600}$. That is, the probability distribution is extremely skewed, with almost all of the probability mass concentrated on a single side. We could build a balanced BST for these probabilities, as shown here:

A (bad) binary search tree for the above probabilities.

While this binary search tree is perfectly balanced, it's not a very good binary search tree for our application. Since we know that 99 times out of 100 the random value is going to be in the range $[0, \frac{99}{100})$, it doesn't make any sense to store the node for that range where it's currently located. In fact, doing this would mean that almost all of the time we'd end up doing two unnecessary comparisons against the blue and yellow nodes. Since with very high probability we would want the node for the large range to be checked first, it makes sense to unbalance the tree in a way that makes the average case substantially better at the expense of the remaining cases. This is shown here:

A (better) binary search tree for the above probabilities.

Now, we are very likely to terminate the search after immediately finding the bucket we want on our first try. In the very unlikely event that the bucket we want to find is contained in the sliver $[\frac{99}{100}, 1)$, then we end up degrading gracefully to the rest of the tree, which is indeed well-balanced.

More generally, we are interested in solving this problem:

Given a set of probabilities, find the binary search tree of those probabilities that minimizes the expected lookup time.

Fortunately, this problem is extremely well-studied and is called the optimal binary search tree problem. There are many algorithms for solving this problem; it is known that an exact solution can be found in time $O(n^2)$ using dynamic programming, and there exist good linear-time algorithms that can find approximate solutions. Additionally, the splay tree data structure, a self-balancing binary search tree, can be used to get to within a constant factor of the optimal solution.

Interestingly, the best-case behavior for these optimized binary search trees occurs when the probability distribution is extremely skewed, since we can just move the nodes containing the bulk of the probability mass near the root of the tree; their worst case occurs when the distribution is uniform, since then no node can be favored and we are left doing ordinary balanced-BST lookups. This is the opposite of the behavior of the earlier algorithm that uses a fair die to simulate a loaded die!

In the best case, we have a loaded die in which one side always comes up (that is, it occurs with probability 1, and each other side occurs with probability 0). This is an extreme exaggeration of our earlier example, but would cause the search to always terminate after one lookup. In the worst case, all the probabilities are equal, and we have to do a standard BST lookup. This gives the following:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ / $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ / $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$ / $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ / $\Theta(n)$ | $\Theta(\log n)$ / $\Theta(\log n)$ | $\Theta(n)$ / $\Theta(n)$
Optimal Roulette Wheel Selection | $O(n^2)$ / $O(n^2)$ | $\Theta(1)$ / $O(\log n)$ | $\Theta(n)$ / $\Theta(n)$

Throwing Darts

So far, we have seen two primitives that have, in some way, helped us build algorithms for simulating loaded dice: the fair die and the biased coin. Using purely a fair die, we came up with an (albeit impractical) algorithm for simulating a loaded die, and beginning with the intuition for biased coins we were able to invent a fast algorithm for simulating loaded dice. Is it possible to combine the two approaches together to build an algorithm built on both fair dice and biased coins? The answer is a resounding "yes," and in fact the resulting algorithm is superior to both of the above approaches.

Up to this point, we have been visualizing the range $[0, 1)$ and the probabilities of the dice faces as a one-dimensional range. Both of the above algorithms work by picking some point in the range $[0, 1)$ and mapping it onto a line segment whose length corresponds to some probability. The longer we make our line segments, the higher the probability that this segment is chosen. But what if instead of thinking in one dimension, we think in two? What if instead of thinking of the probability $p_i$ as the length of a line segment, we think of it as the area of a rectangle?

Let's begin by returning to our older example of the probabilities $\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$. Let's visualize these probabilities as rectangles of width $w$ (for some arbitrary $w > 0$) and height $p_i$ (and thus total area $w \cdot p_i$):

Vertical bars encoding probabilities.

Notice that the total area of these rectangles is, collectively, $w$, since the area is $$\sum_{i = 0}^{n - 1}{w p_i} = w \sum_{i = 0}^{n - 1}{p_i} = w$$

Now, suppose that we draw a bounding box around these rectangles, whose width is $4w$ (because there are four rectangles) and whose height is $\frac{1}{2}$ (since the tallest rectangle has height $\frac{1}{2}$):

Vertical bars in a bounding box.

We can then think of this rectangle as being split into five regions - four regions corresponding to the different probabilities, and one region representing unused space. Given this partition, we can think of an algorithm for simulating random rolls of the die as a game of darts. Suppose that we throw a (perfectly uniformly-distributed) dart at this target. If it hits the unused space, we'll remove the dart and throw it again, repeating until we hit one of the rectangles. Since higher probabilities correspond to larger rectangles, the higher the probability that a particular face on the die comes up, the higher the probability that we eventually hit its rectangle. In fact, if we condition on the fact that we actually hit some rectangle, we have the following: $$\mathbb{P}[\mbox{hit rectangle for side i} | \mbox{hit some rectangle}] = \frac{\mbox{area of rectangle for i}}{\mbox{total rectangle area}} = \frac{w p_i}{w} = p_i$$

In other words, once we finally hit some rectangle with our uniformly-random dart, we will pick the rectangle for side $i$ of the loaded die with probability $p_i$, precisely the probability that we want it to have! Consequently, if we can find some efficient way of simulating throwing random darts at this rectangle, we will have an efficient way of simulating a roll of the loaded die.

One way we could think about throwing darts at this rectangle would be to pick two uniformly random values in the range $[0, 1)$, scale them to the appropriate width and height, then check what region lies under the dart. However, this poses the same problem we had earlier when trying to determine which one-dimensional bucket a random value fell into. Fortunately, there is a truly beautiful series of observations that makes it simple, if not trivial, to determine where we hit.

As a first observation, note that we've shown that the widths of these rectangles can be arbitrarily chosen, so long as they all have the same width. The heights, of course, depend on the probabilities of the dice faces. However, if we were to uniformly scale all of the heights by some positive real number $h$, then the relative areas of all of the rectangles would be the same. In fact, for any positive real number $h$, we have that the total area of all the rectangles, once their heights are scaled by $h$, is given by $$\sum_{i = 0}^{n - 1}{w h p_i} = w h \sum_{i = 0}^{n - 1}{p_i} = w h$$

So now consider the probability of choosing any individual rectangle, conditioning on the fact that we hit some rectangle at all. Using similar math as before, we get the following: $$\mathbb{P}[\mbox{hit rectangle for side i} | \mbox{hit some rectangle}] = \frac{\mbox{area of rectangle for i}}{\mbox{total rectangle area}} = \frac{w h p_i}{w h} = p_i$$

So, in fact, there is no change to the probability of choosing any individual rectangle as long as we scale them linearly and uniformly.

Since we can choose any positive scaling factor that we'd like, what if we scaled these rectangles so that the height of the bounding box is always 1? Since the height of the bounding box is defined by the maximum value of $p_i$ of the input probabilities, we could begin by scaling every one of the rectangles by a factor of $\frac{1}{p_{max}}$, where $p_{max}$ is the maximum of all the input probabilities. This makes the height of the tallest rectangle, and thus of the bounding box, equal to 1. Similarly, since we are allowed to pick any arbitrary width for the boxes, let's pick a width of 1. This means that given $n$ probabilities, the total width of the bounding box is $n$ and the total height is 1. This is shown here:

Vertical bars in a bounding box, having been scaled appropriately.

We're now ready to think about how we might throw a random dart at this rectangle and determine what we've hit. The key insight is that we can break the rectangle down so that instead of consisting of several smaller rectangles and an oddly-shaped open space, instead the region is cut apart into a collection of $2n$ rectangles, two for each of the $n$ input probabilities. This is shown here:

A target for darts, cut apart into multiple bars.

Notice how this rectangle is formed. For each side of the loaded die, we have one column of width 1 and height 1 cut into two spaces - a "yes" half-space corresponding to the rectangle for that side, and a "no" half-space corresponding to the remainder of the column.

Now, let's think about how we might throw a dart at this. A perfectly uniform dart tossed at this rectangle would have an $x$ and a $y$ component. Here, the $x$ component, which must be in the range $[0, 1)$, corresponds to which column the dart lands in. The $y$ component, which must also be in the range $[0, 1)$, corresponds to how high up the column we are. The choice of $x$ component influences which side of the loaded die we're considering, and the choice of the $y$ component determines whether or not we pick that side. But wait a minute - we've seen these two ideas before! Choosing the $x$ coordinate, which corresponds to a column, is equivalent to rolling a fair die to decide which column to pick. Choosing the $y$ coordinate corresponds to flipping a biased coin to determine whether to choose the side or roll again! This observation is so important that we should make it extra clear:

Choosing a random point in this rectangle is equivalent to rolling a fair die and flipping a biased coin.

In fact, this result can be thought of in a much more powerful sense. In order to simulate a loaded die, we build a set of biased coins, one for each side of the die, and then roll a fair die to determine which coin to flip. Based on the die roll, if the appropriate coin comes up heads, we pick the given side, and if the coin comes up tails, we roll the die again and repeat.

Let's recap the important points so far. First, the dimensions of these rectangles are as follows - for each side, the height of the "yes" rectangle is given by $\frac{p_i}{p_{max}}$, and the height of the "no" rectangle is given by $\frac{p_{max} - p_i}{p_{max}}$. This normalizes the total heights of the rectangles to be 1. Second, each rectangle has width $1$, though honestly this value doesn't matter. Finally, our algorithm is as follows: until we pick some outcome, roll the fair die to determine which column we are in (in other words, which biased coin to flip). Next, flip the appropriate biased coin. If it comes up heads, choose the outcome corresponding to the chosen column. Otherwise, repeat this process.

Algorithm: Fair Die/Biased Coin Loaded Die

  Initialization:
    1. Find the maximum of the probabilities $p_0, p_1, ..., p_{n - 1}$; call it $p_{max}$.
    2. Allocate an array $Coins$ of size $n$, and set $Coins[i] = \frac{p_i}{p_{max}}$ for each $i$.

  Generation:
    1. Until a value is returned, repeat:
      1. Roll the fair, $n$-sided die to choose an index $i$ (that is, choose which biased coin to flip).
      2. Flip a biased coin with probability $Coins[i]$ of coming up heads.
      3. If it comes up heads, return $i$.
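Here is a short Python sketch of this procedure (illustrative names; floats rather than exact reals):

    import random

    def make_coins(probs):
        # Coins[i] is the probability that the coin for column i comes up heads.
        p_max = max(probs)
        return [p / p_max for p in probs]

    def rejection_roll(coins):
        n = len(coins)
        while True:
            i = int(random.random() * n)      # roll the fair n-sided die: pick a column
            if random.random() < coins[i]:    # flip that column's biased coin
                return i                      # heads: output side i; tails: try again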

Let's analyze the complexity of this algorithm. In the initialization step, it takes time O(n) to find the maximum probability, and then an additional O(n) time to allocate and populate the array $Coins$, so the total initialization time is O(n). In the generation step, in the best case we end up flipping heads on our very first coin, terminating in O(1). But how many iterations are required on expectation? To find this value, let's compute the probability that we actually end up choosing some side after a single iteration. Since the coins don't all have the same probability of coming up heads, this will depend on which coin actually is chosen. Fortunately, since we pick each coin with identical probability (namely, $\frac{1}{n}$), the math becomes much easier. Moreover, since we only end up flipping one coin, the events of each coin being chosen and coming up heads are all mutually exclusive, so the total probability that some coin is chosen, flipped, and comes up heads is given by the sum of the probabilities of picking each individual coin and having that individual coin come up heads. Since the coin for side $i$ comes up heads with probability $\frac{p_i}{p_{max}}$, the total probability that some side is chosen is given by $$\sum_{i = 0}^{n - 1}{(\frac{1}{n} \frac{p_i}{p_{max}})} = \frac{1}{n}\sum_{i = 0}^{n - 1}{\frac{p_i}{p_{max}}} = \frac{1}{n \cdot p_{max}}\sum_{i = 0}^{n - 1}{p_i} = \frac{1}{n \cdot p_{max}}$$

If this is the probability that some coin is chosen on any one iteration, then the expected number of iterations that may occur is given by the reciprocal of this fraction, which is $n \cdot p_{max}$. But what exactly does this mean? This depends very much on the choice of $p_{max}$. At one extreme, $p_{max}$ might be equal to $1$ (that is, the die always comes up the same way every time). In this case, the expected number of iterations is equal to $n$, meaning that on expectation we would need to roll the fair die $n$ times. This makes sense, since the only way we would choose a side is if we were to pick the biased coin for the one side that always comes up heads, since each other side has a coin that never comes up heads at all. On the other hand, in the other extreme the minimum value of $p_{max}$ is $\frac{1}{n}$, since if it were any lower than this the total probability of all sides would be less than one. If $p_{max} = \frac{1}{n}$, then the expected number of flips is 1. This too makes sense. If $p_{max} = \frac{1}{n}$, then each side has the same probability of being chosen (namely, $\frac{1}{n}$), so when we normalize the probabilities of each side to 1, each side will have probability 1 of being chosen. Thus the die roll to choose which coin to flip will effectively be determining the outcome, since the coin always comes up heads and we never have to repeat ourselves.
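For instance, with our running example of $\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$ (a check added here for concreteness), we have $p_{max} = \frac{1}{2}$, so the expected number of iterations is $n \cdot p_{max} = 4 \cdot \frac{1}{2} = 2$, with each iteration costing one fair die roll and one coin flip.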

It's interesting that the expected number of flips depends solely on the value of $p_{max}$ and not on any of the other probabilities involved, but if we return to our graphical intuition this does make sense. The total area of the rectangle at which we're shooting darts is always $n$, since we normalize the heights to be 1. Moreover, the total area held by the rectangles representing "yes" answers is given by $\frac{1}{p_{max}}$, since each rectangle has width 1 and height normalized by multiplying by $\frac{1}{p_{max}}$. This means that the ratio of the total area of the "yes" rectangles to the total area of the overall rectangle is $\frac{1}{n \cdot p_{max}}$. In other words, the space used up by the "no" rectangles depends purely on the value of $p_{max}$. It can be spread around or distributed however we choose, but ultimately its area is the same and the odds of some dart hitting it are independent of how it's spread.

Comparing this algorithm to the others gives us this information:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ / $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ / $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$ / $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ / $\Theta(n)$ | $\Theta(\log n)$ / $\Theta(\log n)$ | $\Theta(n)$ / $\Theta(n)$
Optimal Roulette Wheel Selection | $O(n^2)$ / $O(n^2)$ | $\Theta(1)$ / $O(\log n)$ | $\Theta(n)$ / $\Theta(n)$
Fair Die/Biased Coin Loaded Die | $\Theta(n)$ / $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ (expected) | $\Theta(n)$ / $\Theta(n)$

In the best case, this algorithm is better than the binary search algorithm from above, requiring just one coin flip. However, its worst-case behavior is far worse, requiring $\Theta(n)$ die rolls and coin flips on expectation. Is it possible to eliminate this worst-case behavior?

The Alias Method

The previous technique has excellent best-case behavior, generating a random roll using a single fair die roll and coin flip. On expectation, though, its worst-case behavior is much worse, potentially requiring a linear number of die rolls and coin flips. The reason for this is that, unlike the previous techniques, the algorithm may "miss" and have to iterate repeatedly until it settles on an outcome. Graphically, this is because it works by throwing darts at a target that may contain a large amount of empty space not assigned to any outcome. If there were a way to eliminate all of that empty space such that every piece of the target was covered by a rectangle corresponding to some side of the loaded die, then we could just throw a single dart at it and read off the result.

A particularly clever insight we might have is to adjust the height of the rectangle such that instead of having the height match the greatest probability, it matches the average probability. Let's consider our above example. Here, our probabilities are $\frac{1}{2}, \frac{1}{3}, \frac{1}{12}, \frac{1}{12}$. Since there are four probabilities, the average probability must be $\frac{1}{4}$. What would happen if we tried normalizing the height of the box to $\frac{1}{4}$, the average, rather than $\frac{1}{2}$, the maximum? Let's see what happens. We begin with rectangles of width $1$ whose heights are equal to the initial probabilities:

The original vertical bars.

Now, we scale all of these rectangles so that a probability of $\frac{1}{4}$ would have height 1. This works by multiplying each probability by four, yielding this setup:

The original vertical bars, scaled by a factor of four.

At this point, let's draw a $1 \times 4$ rectangle on top of this image. This will represent the target we'll be shooting at:

The vertical bars in a smaller rectangle.

As you can see, this doesn't quite work out, since the rectangles for $\frac{1}{2}$ and $\frac{1}{3}$ don't neatly fit into the box. But what if we allowed ourselves to cut the rectangles into smaller pieces? That is, what if we cut some of the space for the $\frac{1}{2}$ rectangle off and move it into the empty space above the space for one of the $\frac{1}{12}$ rectangles? This would give this setup, which still has the vertical bars hanging over, but not by too much:

Rearranging the rectangles.

Now, we still have a bit of overhang, but not too much. One way to completely eliminate the overhang would be to move the extra pieces from the $\frac{1}{2}$ and $\frac{1}{3}$ bars into the empty space, but there's actually a better solution. Let's begin by moving enough of the $\frac{1}{2}$ bar out of the first column and into the third to completely fill in the remaining gap. This will leave a small gap in the first column, but will close the other gap:

Rearranging the rectangles.

Finally, we can tuck the extra overhang from the second column into the first, producing this final rectangle:

Rearranging the rectangles.

This final arrangement has several excellent properties. First, the total areas of the rectangles representing each side of the loaded die are unchanged from the original; all we've done is cut those rectangles into pieces and move them around. Since the original areas were proportional to the original probability distribution, the total area dedicated to each side of the die still encodes that side's probability. Second, notice that this new rectangle has no free space in it, meaning that any time we toss a dart at it, we are guaranteed to hit something that gives us a final answer, rather than empty space that requires another dart toss. A single dart toss therefore suffices to generate our random value. Finally, and most importantly, note that each column has at most two different rectangles in it. This means that we can retain our intuition from before - we roll a die to determine which biased coin to toss, then toss the coin. The difference this time is what the coin toss means. A toss of heads means that we pick the side of the die belonging to that column, and a toss of tails now means that we pick some other side of the die (rather than rolling again).

At a high level, the alias method works as follows. First, we create rectangles representing each of the different probabilities of the dice sides. Next, we cut those rectangles into pieces and rearrange them so that we completely fill in a rectangular target such that each column has a fixed width and contains rectangles from at most two different sides of the loaded die. Finally, we simulate rolls of the die by tossing darts randomly at the target, which we can do by a combination of a fair die and biased coins.

But how do we even know that it's possible to cut the original rectangles apart in a way that allows each column to contain at most two different probabilities? This does not seem immediately obvious, but amazingly it's always possible to do this. Moreover, not only can we cut the rectangles into pieces such that each column contains at most two different rectangles, but we can do so in a way where one of the rectangles in each column is the rectangle initially placed in that column. If you'll notice, in the above rectangle rearrangement, we always cut a piece of a rectangle and moved it into another column, and never entirely removed a rectangle from its original column. This means that each column in the final arrangement will consist of some rectangle corresponding to the probability initially assigned there, plus (optionally) a second rectangle pulled from some other column. This second rectangle is often called the alias of the column, since the remaining probability of the column is used as an "alias" for some other probability. The use of the term "alias" here gives rise to the name "alias method."

Before we go into the proof that it's always possible to distribute the probabilities in this way, we should take a quick second to sketch out how the algorithm actually works. Because each column of the resulting arrangement always contains some (potentially zero-height!) piece of the original rectangle from that column, to store the (potentially) two different rectangles occupying a column, implementations of the alias method typically work by storing two different tables: a probability table $Prob$ and an alias table $Alias$. Both of these tables have size $n$. The probability table stores, for each column, the probability within that column that the original rectangle will be chosen, and the alias table stores the identity of the second rectangle (if any) contained in that column. That way, generating a random roll of the die can be done as follows. First, using a fair die, choose some column $i$ uniformly at random. Next, flip a random coin with probability $Prob[i]$ of coming up heads. If the coin flips heads, output that the die rolled $i$, and otherwise output that the die rolled $Alias[i]$. For example, here are the probability and alias tables for the above configuration:

The completed alias setup.
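To make this generation procedure concrete, here is a minimal Java sketch of the sampling step; the class and field names are my own, and the $Prob$ and $Alias$ arrays are assumed to have already been filled in by one of the construction algorithms described below.

```java
import java.util.Random;

/** A minimal sketch of the alias-method generation step. The prob and alias
 *  arrays are assumed to have been built by one of the construction
 *  algorithms discussed later in this writeup. */
public final class AliasSampler {
    private final double[] prob;   // Prob[i]: chance that column i's coin comes up heads
    private final int[] alias;     // Alias[i]: the other die side stored in column i
    private final Random random = new Random();

    public AliasSampler(double[] prob, int[] alias) {
        this.prob = prob;
        this.alias = alias;
    }

    /** Simulates one roll of the loaded die in O(1) time. */
    public int next() {
        // Roll a fair n-sided die to choose a column uniformly at random.
        int column = random.nextInt(prob.length);
        // Flip a biased coin that comes up heads with probability Prob[column].
        boolean heads = random.nextDouble() < prob[column];
        // Heads: the column's own side. Tails: the column's alias.
        return heads ? column : alias[column];
    }
}
```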

Below is a front-end into a JavaScript implementation of the alias method that has the above probability and alias tables built into it. You can click on the "Generate" button to have it generate a fair die roll and biased coin toss to see what side of the die would be rolled.


Proving Alias Tables Exist

We now need to formally prove that it is always possible to construct the $Alias$ and $Prob$ tables from above. In order to prove that this is always possible, we need to show that it is possible to do the following:

  1. scale each of the $n$ probabilities by a factor of $n$, so that the resulting rectangle heights sum to $n$, and then
  2. cut those rectangles into pieces and distribute the pieces across $n$ columns of height $1$ in such a way that each column contains pieces of at most two different rectangles and the $i$th column contains at least one piece of the $i$th rectangle.

Before going into the proof that it's always possible to do this, let's work through an example. Suppose that we have the four probabilities $\frac{1}{2}, \frac{1}{3}, \frac{1}{12},\frac{1}{12}$ as above. This is a collection of four probabilities ($k = n = 4$) whose sum is $1 = \frac{4}{4}$. Although we saw above how to fill in the alias table by experimentation, let's instead try walking through this construction more explicitly by starting with a completely empty table and then filling it in. We begin by scaling all of these probabilities by a factor of four, giving us these probabilities and this empty table:

Setting up the alias table, part 1

Now, notice that of the four rectangles that we have to distribute, two of them ($\frac{1}{3}, \frac{1}{3}$) are less than 1. This means that they won't completely fill up a column and will need some other probability to fill in the remainder. Let's pick one of the two (say, the yellow one) and put it into its appropriate column:

Setting up the alias table, part 2

Now, we need to somehow make up the difference in the top of the column. To do this, we'll notice that two of the rectangles that have yet to be distributed have heights greater than 1 (namely, $2$ and $\frac{4}{3}$). Let's pick one of these two arbitrarily; here, let's use the $\frac{4}{3}$. We then distribute enough of the $\frac{4}{3}$ into the column to fill it completely; this ends up using $\frac{2}{3}$ of the $\frac{4}{3}$ up, as shown here:

Setting up the alias table, part 3

Now, notice what our setup looks like. We now have three rectangles whose total area is $3$ and three open columns, so it seems like it should be possible to distribute those rectangles into the three columns. To do so, we'll use the same intuition as before. Notice that there is at least one rectangle whose height is less than $1$, so we'll pick one arbitrarily (let's say that we grab the $\frac{2}{3}$ rectangle) and place it into its column:

Setting up the alias table, part 4

We now need to top off the column, so we'll pick some probability that's at least 1 and use it to make up the rest of the column. There's only one choice here (using the $2$), so we'll pull $\frac{1}{3}$ off of the $2$ and put it atop the column:

Setting up the alias table, part 5

And we're now down to two rectangles whose total area is two. We now repeat this process by finding some rectangle whose height is at most 1 (here, the $\frac{1}{3}$) and putting it into its column:

Setting up the alias table, part 6

And then we find some rectangle of height at least $1$ to top off the column, using our only choice of the $\frac{5}{3}$:

Setting up the alias table, part 7

Now we have just one rectangle remaining, and it has area 1. We can thus finish the construction by just putting that rectangle in its own column:

Setting up the alias table, part 8

And voilà! We've filled in the table.

Notice that the general pattern behind this construction is as follows:

  1. Find some rectangle whose (scaled) height is at most $1$ and place it into its own column.
  2. Find some rectangle whose height is at least $1$ and use just enough of it to top off that column, decreasing its height by the amount used.
  3. Repeat until every column is full; the last remaining rectangle will have height exactly $1$ and fills its own column.

Can we prove that this general construction is always possible? That is, can we show that we never get "stuck" when distributing probabilities this way? Fortunately, the answer is yes. The intuition is that we've scaled all of the probabilities so that the average of the new probabilities is now 1 (originally the average was $\frac{1}{n}$, and we multiplied everything by $n$). Since the minimum of the scaled probabilities can be no greater than the average and the maximum can be no less than the average, when we first start off there must be at least one element that is at most 1 (namely, the smallest of the scaled probabilities) and at least one element that is at least 1 (namely, the largest). We can pair these two elements together. But what about once we've removed them? When we do this, we remove one element from the collection and decrease the total sum of the scaled probabilities by exactly one, since the piece we shave off the larger element plus the smaller element together fill exactly one column. The average scaled probability therefore remains exactly 1, and we can repeat this procedure over and over again until eventually we've paired up all the elements.
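To see this invariant at work on our earlier example, the scaled heights start out as $2, \frac{4}{3}, \frac{1}{3}, \frac{1}{3}$, and after each pairing step the remaining heights still sum to the number of remaining elements: $2 + \frac{4}{3} + \frac{1}{3} + \frac{1}{3} = 4$, then $2 + \frac{2}{3} + \frac{1}{3} = 3$, then $\frac{5}{3} + \frac{1}{3} = 2$, and finally $1 = 1$, so the average stays pinned at $1$ throughout.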

We can formalize this argument in the following theorem:

Theorem: Given $k$ width-one rectangles of heights $h_0, h_1, ..., h_{k-1}$ such that $\sum_{i=0}^{k-1}{h_i} = k$, there is a way of cutting the rectangles and distributing them into $k$ columns, each of which has height 1, such that each column contains at most two different rectangles and the $i$th column contains at least one piece of the $i$th rectangle.

Proof: By induction. As a base case, if $k = 1$, then we have just one rectangle and its height must be 1. We can therefore assign it to the $0$th column. Thus each column has height 1, contains at most two rectangles, and the $0$th column contains at least one piece of the $0$th rectangle.

For the inductive step, assume that for some natural number $k$ the theorem holds, and consider any $k + 1$ rectangles of width $1$ and heights $h_0, h_1, ..., h_{k}$ such that $\sum_{i = 0}^{k}{h_i} = k + 1$. We first claim that there is some height $h_l$ with $h_l \le 1$ and some other height $h_g$ (with $g \ne l$) with $h_g \ge 1$. To see this, assume for the sake of contradiction that there is no $h_l$ with $h_l \le 1$; this would mean that $h_i > 1$ for all natural numbers $i$ in the range $0 \le i \le k$. But then we would have $k + 1 = \sum_{i = 0}^k{h_i} > \sum_{i=0}^k{1} = k + 1$, which is clearly impossible. Thus there is some index $l$ such that $h_l \le 1$. Now, suppose for the sake of contradiction that there is no other height $h_g$ (with $g \ne l$) such that $h_g \ge 1$. Then each of the other heights must satisfy $h_g < 1$, which would (by similar logic) mean that $\sum_{i=0}^{k}{h_i} < k + 1$, a contradiction. Consequently, there is some $h_l \le 1$ and some $h_g \ge 1$ with $g \ne l$.

Now, consider the following construction. Place $h_l$ into column $l$, and fill the remaining $1 - h_l$ space in the $l$th column with a piece of the rectangle $h_g$ (such space must exist, since $0 \le 1 - h_l \le 1$ and $h_g \ge 1$). This completely fills the column. We are now left with a collection of $k$ different pieces of rectangles whose total sum is $k$, since we removed $1$ total area from the rectangles, whose initial total sum was $k + 1$. Moreover, we have completely filled column $l$, so we will never try placing any more pieces of the rectangle there. Thus, by the inductive hypothesis, we can assign the remaining $k$ rectangles into $k$ columns while satisfying the above conditions. Combined with the fact that we have now filled column $l$, this means that we have a way of filling all the columns while satisfying the constraints. This completes the induction.

This is a constructive proof that says that not only can we always build the alias table, but that the above algorithm of finding a rectangle of height at most one and pairing it with a rectangle of height at least one will always succeed. From here, we can start devising faster and faster algorithms for computing alias tables.

Generating Alias Tables

Using just what we have said above, we can get a pretty good algorithm for simulating loaded die rolls using the alias method. The initialization works by scaling the probabilities by a factor of $n$, then repeatedly scanning the scaled values to find one that is at most 1 and one that is at least 1, combining them together to fill a column:

Algorithm: Naive Alias Method
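As a concrete illustration, here is one possible Java sketch of this naive construction; it is my own rendering of the description above rather than the author's pseudocode, and it simply rescans the remaining scaled probabilities on every iteration.

```java
import java.util.ArrayList;
import java.util.List;

/** A sketch of the naive O(n^2) alias-table construction: repeatedly scan the
 *  remaining scaled probabilities for one that is at most 1 and one that is at
 *  least 1, then pair them to fill a column. */
public final class NaiveAliasTable {
    public final double[] prob;
    public final int[] alias;

    public NaiveAliasTable(double[] probabilities) {
        int n = probabilities.length;
        prob = new double[n];
        alias = new int[n];

        // Scale each probability by n so that the average height becomes 1.
        double[] scaled = new double[n];
        for (int i = 0; i < n; i++) scaled[i] = probabilities[i] * n;

        // Indices of the columns that have not been filled yet.
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < n; i++) remaining.add(i);

        while (remaining.size() > 1) {
            // O(n) scan: the smallest remaining height is at most 1, and the
            // largest among the others is at least 1.
            int small = -1, large = -1;
            for (int i : remaining)
                if (small == -1 || scaled[i] < scaled[small]) small = i;
            for (int i : remaining)
                if (i != small && (large == -1 || scaled[i] > scaled[large])) large = i;

            // The small rectangle keeps its column; the large one tops it off.
            prob[small] = scaled[small];
            alias[small] = large;
            scaled[large] -= (1.0 - scaled[small]);
            remaining.remove(Integer.valueOf(small));   // O(n) removal
        }

        // The last remaining column is exactly full.
        int last = remaining.get(0);
        prob[last] = 1.0;
        alias[last] = last;
    }
}
```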

The generation step of this algorithm is exactly the same as the method described above, and runs in $\Theta(1)$. The initialization step, however, requires substantially more work. First, we need to spend $\Theta(n)$ time scaling each probability by a factor of $n$, and we need $O(n)$ time to allocate the two arrays. The inner loop executes $\Theta(n)$ times, and on each iteration does $O(n)$ work to scan the array, remove one of the array elements, and update the probabilities. This gives a total of $O(n^2)$ initialization work. If we consider this algorithm in context, we have the following:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ | $\Theta(\log n)$ | $\Theta(n)$
Optimal Roulette Wheel Selection | $O(n^2)$ | $\Theta(1)$ / $O(\log n)$ | $\Theta(n)$
Fair Die/Biased Coin Loaded Die | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ (expected) | $\Theta(n)$
Naive Alias Method | $O(n^2)$ | $\Theta(1)$ | $\Theta(n)$

Compared to the other efficient simulation techniques, this naive alias method has a large initialization cost, but can then simulate die rolls extremely efficiently. If we could somehow reduce the initialization cost to something lower (say, $O(n)$), then this technique would be strictly better than all of the other techniques employed here.

One simple way to reduce the initialization cost is to use a better data structure for storing the heights as we go. In the naive version, we use an unsorted array to hold all the probabilities, meaning that it takes $O(n)$ work to locate the two probabilities we want. A better alternative is to use a balanced binary search tree to hold the values. This way, we could locate the values $p_g$ and $p_l$ (the largest and smallest remaining scaled probabilities) in $O(\log n)$ time by finding the maximum and minimum values in the tree. Deleting $p_l$ can be done in $O(\log n)$ time, and updating the probability of $p_g$ can also be done in $O(\log n)$ time by simply removing it from the tree and reinserting it. This gives the following algorithm:

Algorithm: Alias Method
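Here is a sketch of the same construction with the linear scan replaced by a balanced binary search tree; a TreeSet of indices ordered by their current scaled height plays the role of the tree $T$ mentioned below. As before, the class and variable names are my own.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** A sketch of the O(n log n) alias-table construction: the indices are kept
 *  in a balanced BST ordered by their current scaled height, so the minimum
 *  and maximum can be found, removed, and reinserted in O(log n) time. */
public final class TreeAliasTable {
    public final double[] prob;
    public final int[] alias;

    public TreeAliasTable(double[] probabilities) {
        int n = probabilities.length;
        prob = new double[n];
        alias = new int[n];

        double[] scaled = new double[n];
        for (int i = 0; i < n; i++) scaled[i] = probabilities[i] * n;

        // Balanced BST of indices, ordered by height and then by index.
        TreeSet<Integer> tree = new TreeSet<>(
            Comparator.<Integer>comparingDouble(i -> scaled[i]).thenComparingInt(i -> i));
        for (int i = 0; i < n; i++) tree.add(i);

        while (tree.size() > 1) {
            int small = tree.pollFirst();   // smallest height: at most 1
            int large = tree.pollLast();    // largest height: at least 1

            prob[small] = scaled[small];
            alias[small] = large;

            // The large rectangle donates the gap 1 - scaled[small], then is
            // reinserted with its reduced height.
            scaled[large] -= (1.0 - scaled[small]);
            tree.add(large);
        }

        // The single remaining column is exactly full.
        int last = tree.pollFirst();
        prob[last] = 1.0;
        alias[last] = last;
    }
}
```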

Now, our algorithm's initialization is much faster. Creating $Alias$ and $Prob$ still takes $O(n)$ time each, and adding the probabilities to the BST $T$ will take $\Theta(n \log n)$ time. From there, we do $\Theta(n)$ iterations of filling in the table, each of which takes $O(\log n)$ work. This gives an overall runtime of $O(n \log n)$ for the initialization, as seen here:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ | $\Theta(\log n)$ | $\Theta(n)$
Optimal Roulette Wheel Selection | $O(n^2)$ | $\Theta(1)$ / $O(\log n)$ | $\Theta(n)$
Fair Die/Biased Coin Loaded Die | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ (expected) | $\Theta(n)$
Naive Alias Method | $O(n^2)$ | $\Theta(1)$ | $\Theta(n)$
Alias Method | $O(n \log n)$ | $\Theta(1)$ | $\Theta(n)$

However, there is an algorithm that runs even faster than this approach. It's remarkably simple, and is perhaps the cleanest of all of the algorithms for implementing the alias method. This algorithm was originally described in the paper "A Linear Algorithm For Generating Random Numbers With a Given Distribution" by Michael Vose, and has become the standard algorithm for implementing the alias method.

The idea behind Vose's algorithm is to maintain two worklists, one containing the elements whose height is less than 1 and one containing the elements whose height is at least 1, and to repeatedly pair the first elements of each worklist. On each iteration, we consume the element from the "small" worklist, and potentially move the remainder of the element from the "large" worklist into the "small" worklist. The algorithm maintains several invariants:

  1. The elements of the $Small$ worklist all have height less than 1.
  2. The elements of the $Large$ worklist all have height at least 1.
  3. The sum of the heights of the elements in the two worklists is always equal to the total number of elements they contain.

For simplicity, each worklist does not store the actual probability, but rather some pointer back to the original probability list saying which side of the loaded die is being referenced. Given these invariants, the algorithm is given below:

Algorithm: (Unstable) Vose's Alias Method

Caution: This algorithm suffers from numerical inaccuracies. A more numerically sound algorithm is given later.
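To make the structure of the algorithm, and in particular the role of the last loop discussed below, easier to follow, here is a minimal Java sketch written directly from the description above (the names are mine, not Vose's):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** A sketch of the worklist-based construction as described above. Like that
 *  description, it is numerically unstable; see the corrected version later
 *  in this section. */
public final class UnstableVoseAliasTable {
    public static void build(double[] probabilities, double[] prob, int[] alias) {
        int n = probabilities.length;
        double[] scaled = new double[n];
        Deque<Integer> small = new ArrayDeque<>();
        Deque<Integer> large = new ArrayDeque<>();

        for (int i = 0; i < n; i++) {
            scaled[i] = probabilities[i] * n;
            if (scaled[i] < 1.0) small.push(i); else large.push(i);
        }

        // Main loop: pair the top of Small with the top of Large.
        while (!small.isEmpty()) {
            int l = small.pop();
            int g = large.pop();                        // assumes Large is nonempty
            prob[l] = scaled[l];
            alias[l] = g;
            scaled[g] = scaled[g] - (1.0 - scaled[l]);  // rounding errors creep in here
            if (scaled[g] < 1.0) small.push(g); else large.push(g);
        }

        // The last loop: everything still in Large has height exactly 1.
        while (!large.isEmpty()) prob[large.pop()] = 1.0;
    }
}
```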

Given the three above invariants, the first part of this algorithm (everything except the last loop) should be reasonably self-explanatory: we continuously pair some small element from $Small$ with a large element from $Large$, then add the remainder of the large element to the appropriate worklist. The last loop in the algorithm requires some explanation. Once we have exhausted all of the elements from the $Small$ list, there will be at least one element left over in the $Large$ list (since if every element were in $Small$, the sum of the elements would have to be less than the number of remaining elements, violating the last invariant). Since every element of $Large$ is at least 1, and since the sum of the $k$ elements in $Large$ must be equal to $k$, every element in $Large$ must be exactly equal to 1, since otherwise the total would be too large. This final loop thus sets every large element's probability to 1, so that the columns containing those elements are completely filled.

In this algorithm, the type of worklist does not matter. Vose's original paper uses stacks for the worklist because they can be efficiently implemented using arrays, but we could use a queue instead if we'd like. For simplicity, though, we'll use a stack.

Before doing an analysis of the algorithm, let's first trace through an example to see how it works. Let's consider an example of using the seven probabilities $\frac{1}{4}, \frac{1}{5}, \frac{1}{8}, \frac{1}{8}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}$. To highlight the fact that the algorithm doesn't sort the probabilities or require them to be sorted, let's order them arbitrarily as $\frac{1}{8}, \frac{1}{5}, \frac{1}{10}, \frac{1}{4}, \frac{1}{10}, \frac{1}{10}, \frac{1}{8}$. The algorithm begins by adding these elements to two work stacks, as shown here:

Vose's algorithm, step 1.

We now place the top of the $Small$ stack into its slot, moving the magenta rectangle into its final position:

Vose's algorithm, step 2.

Now, we use the top of the $Large$ stack (the cyan rectangle) to fill in the rest of the column. Since $\frac{7}{4} - \frac{1}{8} = \frac{13}{8} \ge 1$, we leave the cyan block atop the $Large$ stack, as shown here:

Vose's algorithm, step 3.

We then repeat this process. We move the rectangle on top of the $Small$ stack into its column, then top off the difference with the top of the $Large$ stack:

Vose's algorithm, step 4.

And once more:

Vose's algorithm, step 5.

When we repeat this process next, we'll find that while we can use the cyan block to cover the slack space in the table, doing so ends up making the cyan block have height less than one. Consequently, we move the cyan block atop the small stack, as shown here:

Vose's algorithm, step 6.

Now, when we process the $Small$ worklist, we end up putting the cyan block in its place, then using the yellow block to fill in the slack:

Vose's algorithm, step 7.

We then process the $Small$ stack to put the orange block into position, topping it off with the yellow block:

Vose's algorithm, step 8.

And finally, since the $Small$ stack is empty, we put the yellow block into its own column and are done.

Vose's algorithm, step 9.

We now have a well-formed alias table for these probabilities.

A Practical Version of Vose's Algorithm

Unfortunately, the above algorithm, as written, is not numerically stable. On an idealized machine that can do arbitrary-precision real number computations it's fine, but if you were to try running it using IEEE-754 doubles, it may end up completely failing. There are two sources of inaccuracy that we need to deal with before moving on:

  1. The computation to determine whether or not a probability belongs in the $Small$ or $Large$ worklist may be inaccurate. Specifically, it may be possible that scaling up the probabilities by a factor of $n$ has caused probabilities equal to $\frac{1}{n}$ to end up being slightly less than $1$ (thus ending up in the $Small$ list rather than the $Large$ list).
  2. The computation that subtracts the appropriate probability mass from a larger probability is not numerically stable and may introduce significant rounding errors. This may end up putting a probability that should be in the $Large$ list into the $Small$ list instead.

The combination of these two factors means that we may end up with the algorithm accidentally putting all of the probabilities into the $Small$ worklist instead of the $Large$ worklist. As a result, the algorithm may end up failing because it expects the $Large$ worklist to be nonempty when the $Small$ worklist is nonempty.

Fortunately, fixing this ends up not being particularly difficult. First, we will update the inner loop of the algorithm so that it terminates as soon as either of the two worklists is empty, so we don't accidentally end up looking at nonexistent elements from the $Large$ worklist. Second, when one worklist is empty, we'll set the remaining probabilities of the elements in the other worklist to all be $1$, since, mathematically, this should only occur if all of the remaining probabilities are precisely equal to $1$. Finally, we'll replace the computation that updates the large probabilities with a slightly more stable computation. This is shown here:

Algorithm: Vose's Alias Method
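Here is one possible Java rendering of this corrected algorithm, incorporating the three fixes just described. The class and method names are mine, and this is a sketch rather than Vose's reference implementation; the update $p_g := (p_g + p_l) - 1$ stands in for the "slightly more stable computation" mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

/** A Java sketch of the numerically safer variant of the construction
 *  described above, together with the O(1) generation step. */
public final class VoseAliasMethod {
    private final double[] prob;
    private final int[] alias;
    private final Random random = new Random();

    public VoseAliasMethod(double[] probabilities) {
        int n = probabilities.length;
        prob = new double[n];
        alias = new int[n];

        double[] scaled = new double[n];
        Deque<Integer> small = new ArrayDeque<>();
        Deque<Integer> large = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            scaled[i] = probabilities[i] * n;
            if (scaled[i] < 1.0) small.push(i); else large.push(i);
        }

        // Stop as soon as either worklist empties so we never try to read
        // from an empty Large worklist.
        while (!small.isEmpty() && !large.isEmpty()) {
            int l = small.pop();
            int g = large.pop();
            prob[l] = scaled[l];
            alias[l] = g;
            // Algebraically equal to scaled[g] - (1 - scaled[l]), but less
            // prone to accumulating rounding error.
            scaled[g] = (scaled[g] + scaled[l]) - 1.0;
            if (scaled[g] < 1.0) small.push(g); else large.push(g);
        }

        // Mathematically, anything left over has height exactly 1; setting it
        // to 1 outright keeps rounding errors from causing trouble.
        while (!large.isEmpty()) prob[large.pop()] = 1.0;
        while (!small.isEmpty()) prob[small.pop()] = 1.0;
    }

    /** Simulates one roll of the loaded die in O(1) time. */
    public int next() {
        int column = random.nextInt(prob.length);
        return random.nextDouble() < prob[column] ? column : alias[column];
    }
}
```

For instance, constructing a VoseAliasMethod from the seven probabilities used in the earlier example (0.25, 0.2, 0.125, 0.125, 0.1, 0.1, 0.1) and then calling next() repeatedly simulates rolls of that loaded die.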

All that's left to do now is analyze the algorithm's complexity. Seeding the worklists takes a total of $\Theta(n)$ time, because we add each element to exactly one of the worklists. The inner loop does $\Theta(1)$ work per iteration, since it only needs to remove two elements from the worklists, update two arrays, and add one element back to a worklist. It can't execute more than $O(n)$ times, since each iteration decreases the number of elements (collectively) in the worklists by one by eliminating the smaller probability. The last two loops can each execute at most $O(n)$ times, since there are at most $O(n)$ elements in the $Large$ and $Small$ worklists. This gives a total runtime of $\Theta(n)$, which (as seen below) is as good as we're going to get:

Algorithm | Initialization Time (Best / Worst) | Generation Time (Best / Worst) | Memory Usage (Best / Worst)
Loaded Die from Fair Die | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$ | $\Theta(1)$ | $\Theta(n)$ / $O(\prod_{i = 0}^n{d_i})$
Loaded Die from Biased Coins | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ | $\Theta(n)$
Roulette Wheel Selection | $\Theta(n)$ | $\Theta(\log n)$ | $\Theta(n)$
Optimal Roulette Wheel Selection | $O(n^2)$ | $\Theta(1)$ / $O(\log n)$ | $\Theta(n)$
Fair Die/Biased Coin Loaded Die | $\Theta(n)$ | $\Theta(1)$ / $\Theta(n)$ (expected) | $\Theta(n)$
Naive Alias Method | $O(n^2)$ | $\Theta(1)$ | $\Theta(n)$
Alias Method | $O(n \log n)$ | $\Theta(1)$ | $\Theta(n)$
Vose's Alias Method | $\Theta(n)$ | $\Theta(1)$ | $\Theta(n)$

Concluding Thoughts

Phew! We've covered a lot of ground here! We've explored several different methods for simulating loaded dice, beginning with a very simple set of techniques and concluding with extremely fast and efficient algorithms. Each method shows off a different set of techniques, and I find the final version (Vose's alias method) to be one of the most interesting and elegant algorithms I have ever come across.

If you are interested in seeing code for Vose's alias method, including a quick summary of what complications arise in practice due to numerical inaccuracy, I have a Java implementation of the alias method available at the Archive of Interesting Code.

If you have any questions, comments, or corrections, please feel free to contact me at keith@keithschwarz.com.

Hope this helps!