From Practical Probabilistic Programming by Avi Pfeffer

This article discusses forward sampling.

Save 37% on Practical Probabilistic Programming. Just enter code fccpfeffer into the discount code box at checkout at

In forward sampling, you generate values for the variables in a probabilistic program one at a time. After you’ve generated a value for all the variables, you have a possible world, which constitutes a single sample. Each time you generate a value for a variable, you use the definition of the variable: how it depends on its parents using its functional form and numerical parameters. Therefore, the variable’s definition defines a probability distribution over this variable. You choose a value for the variable according to this distribution. Forward sampling proceeds in topological order: you always generate values for the parents of a variable before generating a value for the variable, so you always know exactly which distribution to use.

Here’s pseudocode for forward sampling. This code generates a single sample consisting of values for all the variables:

1.  Let O be a topological order over the variables.

2.  For each variable V in order O:

a.  Let Par be the parents of V.

b.  Let xPar be the previously generated values for Par.

c.  Draw xV from P(V | Par = xpar).

3.  Return x (A vector representing the value of xV for every variable V).

The forward sampling process is illustrated in figure 1. The probabilistic program being sampled is shown on the left. The middle presents a tree that shows all the possible ways of generating values for the variables in this program. The thick arrows show one particular sampling run. First, a value is generated for the variable rembrandt1, indicating whether the first of two paintings is by Rembrandt. Because this variable’s definition is Flip(0.4), the value true will be generated with probability 0.4, and false with probability 0.6. In the shown sampling path, the value true is generated. Next, a value is generated for rembrandt2.

This variable doesn’t depend on any other variables and is also defined by a Flip, so the process is similar. In the given run, the value false is generated for rembrandt2. Finally, samePainter, which represents whether the two paintings are by the same painter, is generated. (Only two possible painters are considered, so if they’re both not by Rembrandt, they’re both by the other painter.) The variable samePainter depends on both rembrandt1 and rembrandt2, which have already been generated by this time. Because rembrandt1 is true and rembrandt2 is false, samePainter is automatically false.


Figure 1 The forward sampling process. The tree shows all the possible ways of generating values for the given program, along with a particular run through the tree. On the right is the possible world resulting from that run. (The abbreviations r1 and r2 are used for rembrandt1 and rembrandt2 in the tree.)

In forward sampling, you run this process many times, each time generating a different possible world, or sample. Table 1 shows four samples for this program. Notice that the same sample can be generated multiple times; in this example, the second and fourth samples are the same. The probability of a possible world is estimated to be the fraction of samples equal to that world. For example, the probability of the world in which rembrandt1 is false, rembrandt2 is false, and samePainter is true is estimated to be ½, because two out of four of the samples are equal to this world.

Table 1 Four samples generated for the program of figure 1


First sample

Second sample

Third sample

Fourth sample
















The estimated sample distribution can also be used to answer queries. For example, what if you want to know the probability that the two paintings are by the same painter? You can see that three out of the four samples have the value true for samePainter. So your estimated answer to this query is ¾.


It’s important to emphasize that the set of samples is only one possible representation of a probability distribution over the possible worlds. Forward sampling is a randomized process, so it can generate a different result every time. The estimated distribution you get from different runs of forward sampling will usually be different. Furthermore, they’ll usually not be exactly the same distribution as the one defined by the probabilistic program. In fact, looking at the example, Flip(0.4) is false with probability 0.6, and Flip(0.2) is false with probability 0.8. The probability that they’re both false is 0.6 * 0.8 = 0.48. And in this case, Flip(0.4) === Flip(0.2) will always be true. So the correct probability of the possible world equal to the second and fourth samples in table 1 is 0.48, not ½. In this particular example, the value is close, but in general there’s no guarantee that the answer will be close, particularly if you don’t generate a lot of samples.

Why then, is sampling a good idea? This example has only three variables, and you can do all the necessary calculations directly by using simple multiplication and addition. Many programs, however, have many variables, and these variables can have rich data types with a large number of or infinitely many values. In those programs, it may be impossible even to create all the necessary factors to run a factored inference algorithm, let alone to perform all the necessary multiplications and additions. With sampling, on the other hand, you need to consider only a relatively small number of possibilities and you get an estimate of the probability distribution you’re seeking.

Here’s an example program to bring the point home:

val x = Apply(Normal(0.4, 0.3), (d: Double) => d.max(0).min(1)) val y = Flip(x)

Here, x represents a normally distributed variable that’s constrained to lie between 0 and 1.

The variable y is defined to be a compound Flip, which is equivalent to Chain(x, (d: Double) => Flip(d)). In figure 1, I showed the entire tree of possible sampling runs. In practice, however, this full tree is never created—only the subtree corresponding to values that are generated. In this example, the full tree is infinite, so you could never compute the exact probability distribution. Sampling computes only a finite fraction of the tree and uses it to estimate the distribution.

Figure 2 shows the tree generated by sampling this program. In each run, a real numbered value is generated for Apply. This is passed as an argument to Chain, resulting in a different Flip element each time. If four samples are generated, only the elements for the four specific Flips will be created; all the infinitely many rest of the elements will never even exist.


Figure 2  Partial tree generated by sampling

If the sampled distribution is only an estimate of the true distribution, what can be said about the result of sampling? Here are two important points, which you probably sense intuitively:

·   On average, the sample distribution will be equal to the true distribution. Any particular sample distribution will be different from the true distribution, but if you average out all the distributions you could take, the average distribution will be equal to the true distribution. Intuitively, this holds because each sample is drawn from the true distribution. The technical term for this property is that the sampling process is unbiased.

·   The more samples you use, the closer you expect the sample distribution to be to the true distribution. Another way of saying this is that if you use the samples to answer a query, the expected error in the answer decreases as you use more samples. This is only the expected error; there are no guarantees. You might be unlucky and find your error increasing with more samples, but on average, your error will decrease. The technical term for this property is that the variance of the sampling process decreases with more samples.

Figure 3 illustrates the typical behavior of a sampling algorithm. The graph shows the probability of a query estimated by sampling as the number of samples is increased over time. You see that at first, the estimate tends to move rapidly toward the true probability and converges to the true probability in the long run, but at times the estimate diverges, and the convergence might be slow.

These two properties are the key to why sampling is a useful approximate inference algorithm for probabilistic reasoning. According to these properties, if you use a sampling process, you’ll get an approximate distribution that’s centered around the true distribution on average, and you expect it to get closer to the true distribution the more samples you use. If you use enough samples, you can expect to get as close as you want to the true distribution.



There is no silver bullet. Inference is a hard problem, and sampling isn’t always a great solution. Sometimes you have to use a really large number of samples to get close enough to the true distribution.


Figure 3 Typical behavior of a sampling process over time. The probability estimated by sampling tends to converge to the true probability, but there are also periods where it diverges.

Once you’ve generated a set of samples, it’s easy to use them to answer queries. For example, to estimate the probability that a variable takes on a specific value, you compute the fraction of samples in which the variable has that value. To estimate the most likely value of the variable, you see which value is most common in the samples. So, to summarize why sampling algorithms are so appealing:

You can get as close as you want to the true distribution if you use enough samples, and it’s easy to answer queries using the samples.

If you want to learn more about the book, check it out on liveBook here.