Estimation of distribution algorithms (EDAs) are powerful population-based searchers where the variation operations traditionally implemented via crossover and mutation in EAs are replaced by the process of random sampling from a probability distribution. The distribution is modified generation after generation, using information obtained from the fitter individuals in the population. The objective of these changes in the distribution is to increase the probability of generating individuals with high fitness.
Different EDAs use different models for the probability distribution that controls the sampling (see (Larrañaga, 2002; Larrañaga and Lozano, 2002) for more information). For example, population-based incremental learning (PBIL) (Baluja and Caruana, 1995) and the univariate marginal distribution algorithm (UMDA) (Mühlenbein and Mahnig, 1999a, b) assume that each variable is independent of the other variables. Consequently, these algorithms need to store and adjust only a linear array of probabilities, one for each variable. This works well for problems with weak interactions between variables. Since no relationship between the variables is stored or learned, however, PBIL and UMDA may have difficulties solving problems where the interactions between variables are significant.
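The univariate approach can be made concrete with a short sketch in the spirit of UMDA: each bit is sampled independently from its own probability, and after truncation selection the probabilities are reset to the marginal frequencies observed among the best individuals. The function and parameter names below are illustrative, not taken from any particular implementation.

```python
import random

random.seed(0)  # for reproducibility of this illustration

def univariate_eda(fitness, n_vars, pop_size=100, n_best=25, gens=50):
    # One probability per variable: no interactions are stored or learned.
    p = [0.5] * n_vars
    for _ in range(gens):
        # Sample each individual by treating every bit independently.
        pop = [[1 if random.random() < p[i] else 0 for i in range(n_vars)]
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        best = pop[:n_best]
        # UMDA-style update: marginal frequency of each bit among the best.
        p = [sum(ind[i] for ind in best) / n_best for i in range(n_vars)]
    return pop[0]

# Usage: maximise OneMax (the number of ones in the string).
best = univariate_eda(sum, n_vars=20)
```

On a separable problem like OneMax this converges quickly, which illustrates why the univariate model suffices when variables do not interact.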
Naturally, higher order models are possible. For example, the MIMIC algorithm of de Bonet, Isbell, and Viola (1997) uses second-order statistics. It is also possible to use flexible models where interactions of different orders are captured. The Bayesian optimisation algorithm (BOA) (Pelikan, Goldberg, and Cantú-Paz, 1999) uses Bayesian networks to represent generic sampling distributions, while the extended compact genetic algorithm (eCGA) (Harik, 1999) clusters genes into groups where the genes in each group are assumed to be linked but groups are considered independent. The sampling distribution is then taken to be the product of the distributions modelling the groups.
EDAs have been very successful. However, they are often unable to represent both the overall structure of the distribution and its local details, typically being more successful at the former. This is because EDAs represent the sampling distribution using models with an inevitably limited number of degrees of freedom. For example, suppose the optimal sampling distribution has multiple peaks, corresponding to different local optima, separated by large unfit areas. An EDA can then either represent only one peak, or represent all of them together with the unfit areas between them. If the EDA focuses on the wrong local peak, it may get stuck and never find the global optimum. Conversely, if it takes the wider view, it wastes many trials sampling irrelevant, poor solutions.
Consider, for example, a scenario where there are five binary variables, x1, x2, x3, x4 and x5, and two promising regions: one near the string of all zeros, i.e., (x1,x2,x3,x4,x5) = (0,0,0,0,0), and the other near the string of all ones, i.e., (x1,x2,x3,x4,x5) = (1,1,1,1,1). One option for a (simple) EDA is to focus on one of the two regions, e.g., setting the variables xi to 0 with high probability (say, 90%). This, however, fails to explore the other region, and risks missing the global optimum. The other option is to maintain both regions as possibilities by setting all the probabilities to 50%, i.e., each of the variables xi is as likely to be 0 as 1. These probabilities will generate samples in both of the promising regions. For example, the strings (0,0,0,0,0) and (1,1,1,1,1) will each be generated with a 3.125% probability. Also, simple calculations show that 31.25% of individuals generated by this distribution will be at Hamming distance 1 from either (0,0,0,0,0) or (1,1,1,1,1). So, both optimal regions are sampled reasonably often. However, it is clear that the majority (62.5%) of samples will be allocated to less promising regions, at Hamming distance 2 or 3 from both (0,0,0,0,0) and (1,1,1,1,1). This is a significant concern, which is why recently EDAs have often been used in combination with local search (e.g., see (?)).
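The percentages above can be checked by enumerating all 2^5 = 32 equiprobable strings under the uniform 50% model:

```python
from itertools import product

def hamming(a, b):
    # Number of positions in which two strings differ.
    return sum(x != y for x, y in zip(a, b))

zeros, ones = (0,) * 5, (1,) * 5
strings = list(product((0, 1), repeat=5))  # all 32 strings, each with prob 1/32

p_each = 1 / len(strings)  # probability of any single string: 1/32 = 3.125%
# Strings at Hamming distance 1 from the nearer of the two optima.
near = sum(1 for s in strings
           if min(hamming(s, zeros), hamming(s, ones)) == 1)
# Strings at distance 2 or 3 from both optima (the unpromising middle).
far = sum(1 for s in strings
          if min(hamming(s, zeros), hamming(s, ones)) in (2, 3))

print(p_each, near / 32, far / 32)  # 0.03125, 0.3125, 0.625
```

The counts are C(5,1) + C(5,4) = 10 strings near an optimum and C(5,2) + C(5,3) = 20 strings in the middle, matching the 31.25% and 62.5% figures quoted above.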
There have been several applications of probabilistic model-based evolution (EDA-style) in the areas of tree-based and linear GP. We review them in the rest of this chapter.