From time to time we get asked why Barbecana’s Full Monte does not support Latin Hypercube Sampling (“LHS”). This is a fancy name for a technique intended to improve the accuracy obtained from a given number of iterations of a Monte Carlo simulation model, but in fact the benefits are very minor and not without attendant costs. This will be explained in Part II, but first we need to understand what it is and what the problem is that it attempts to solve.
A Monte Carlo simulation model has a number of random variable inputs whose distributions are assumed known and one or more outputs whose distribution is to be estimated by simulation. In the case of schedule risk analysis (“SRA”) the main inputs are typically task durations and the outputs of interest include the project finish date. For the sake of definiteness, let us assume that we are interested in the expected value of the finish date.
For any given iteration, the finish date will have a particular value which we can represent as:
x = m + e
where m is the true mean of the (unknown) distribution and e is a random error term.
If we do multiple iterations, it turns out that the average value of x (a.k.a. the sample mean) is an unbiased estimator of the true mean m. This is still just an estimate, but it can be shown that the standard deviation of the error (a.k.a. the standard error) is equal to the standard deviation of e divided by the square root of n, where n is the number of iterations. Consequently our estimate will be more accurate the more iterations we do.
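This is easy to confirm by brute force. The following minimal Python sketch (the helper name sample_mean and the particular sample sizes are our own illustration, not from any particular tool) draws many samples of a known distribution — a fair die, whose standard deviation is about 1.708 — and compares the observed spread of the sample means with the predicted standard error:

```python
import random
import statistics

# Sketch: check that the standard error of the sample mean is
# sigma / sqrt(n).  We repeatedly take a sample of n die throws,
# record its mean, and measure how those means vary.

def sample_mean(n):
    """Mean of n throws of a fair six-sided die."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

random.seed(1)
n = 100
means = [sample_mean(n) for _ in range(2000)]

observed_se = statistics.stdev(means)
predicted_se = statistics.pstdev(range(1, 7)) / n ** 0.5  # sigma / sqrt(n)
print(round(observed_se, 3), round(predicted_se, 3))
```

The two printed numbers should agree to within a couple of thousandths, and quadrupling n halves both.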
For any given number of iterations (n) LHS purports to reduce this standard error by “stratifying” the sample. Each input distribution is divided into a number of equally likely ranges (or “strata”) and sampling is conducted so that the number of samples from each stratum is the same. Proponents normally illustrate the benefits using the case where there is only one input distribution, which in the case of SRA would mean a one-task network, so let’s start there.
Duration distributions are normally assumed to be continuous, but we can save ourselves a lot of nasty calculus without altering the basic principles by using a discrete one. Let us suppose that the duration of our single task can be any integer number of days from 1 to 6, so can be represented by the throwing of a die. Let us further suppose that we are going to do 36 iterations and that the output we are interested in is the probability of getting the value 1.
In this example, we know the true mean to be 1/6 so on average we expect 6 1’s out of the 36 iterations. If we do regular random sampling, the number of 1’s can be anything from 0 (unlikely) to 36 (even more unlikely). It turns out that the standard deviation of this number is the square root of (n.p.q) where n is the number of iterations, p the chance of getting a 1 on any individual throw and q is (1-p). In our case this is sqrt(5), or about 2.236.
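The sqrt(n.p.q) figure is easy to verify empirically. The sketch below (the helper name count_ones_plain is our own) throws a fair die 36 times, counts the 1’s, and repeats the whole experiment many times to see how that count varies:

```python
import random
import statistics

# Sketch: plain (unstratified) random sampling of a fair die.
# Theory says the count of 1's in 36 throws has mean 6 and
# standard deviation sqrt(36 * (1/6) * (5/6)) = sqrt(5) ~ 2.236.

def count_ones_plain(iterations=36):
    """Number of 1's seen in the given number of die throws."""
    return sum(1 for _ in range(iterations) if random.randint(1, 6) == 1)

random.seed(2)
counts = [count_ones_plain() for _ in range(20000)]
print(round(statistics.mean(counts), 2))   # close to 6
print(round(statistics.stdev(counts), 2))  # close to sqrt(5) ~ 2.24
```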
If we were to divide the distribution into 6 strata, we would have exactly 6 1’s every time, so the error would be 0. Of course in this case there is no point in doing more than 6 iterations, and they just amount to a simple enumeration of all the possibilities, but remember that in the continuous case there will always be some error, however narrow the strata. To more closely resemble the continuous case, suppose that we divide the distribution into just three strata.
Then each stratum is sampled 12 times, so the number of 1’s and 2’s combined will always be 12, but that could include anything from 0 to 12 1’s. Again we can apply the sqrt(n.p.q) formula, this time with n = 12 and p = q = 1/2, getting sqrt(3) or about 1.732.
Finally consider dividing the distribution into just two strata, so that the combined number of 1’s, 2’s, and 3’s is always 18. Applying the same rule, we get a standard error of 2.
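All of these cases can be checked with one simulation. In the sketch below (the helper count_ones_stratified and the way the strata are built are our own illustration), the 6 die faces are split into k equally likely strata, each stratum is sampled exactly 36/k times, and we measure how the count of 1’s varies for k = 1 (plain sampling), 2, 3, and 6:

```python
import random
import statistics

# Sketch: stratified sampling of a fair die.  With k strata over
# 36 iterations, each stratum of faces is drawn exactly 36/k times;
# within a stratum a face is chosen uniformly at random.

def count_ones_stratified(k, iterations=36):
    """Count of 1's when the 6 faces are split into k equal strata."""
    strata = [list(range(1 + i * 6 // k, 1 + (i + 1) * 6 // k))
              for i in range(k)]          # e.g. k=3 -> [1,2], [3,4], [5,6]
    ones = 0
    for stratum in strata:
        for _ in range(iterations // k):
            if random.choice(stratum) == 1:
                ones += 1
    return ones

random.seed(3)
results = {}
for k in (1, 2, 3, 6):
    results[k] = statistics.stdev(
        count_ones_stratified(k) for _ in range(20000))
    print(k, round(results[k], 2))  # expect ~2.24, 2.00, 1.73, 0.00
```

The printed standard deviations reproduce the figures in the text: about 2.236 with no stratification, 2 with two strata, 1.732 with three, and exactly 0 with six.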
So, by stratifying progressively into 2, 3, or 6 strata we can reduce the standard error from 2.236 to 2, 1.732, or 0 respectively. Pretty impressive, but as we shall see in Part II this benefit does not carry over to the case where we have many inputs.