Starting in the early 1990s, researchers began to notice that in addition to progressively increasing their mean and best fitness, GP populations also showed certain other dynamics. In particular, it was noted that very often the average size (number of nodes) of the programs in a population, after a certain number of generations in which it was largely static, at some point would start growing at a rapid pace. Typically the increase in program size was not accompanied by any corresponding increase in fitness. The origin of this phenomenon, which is known as bloat, has effectively been a mystery for over a decade.
Note that there are situations where one would expect to see program growth as part of the process of solving a problem. For example, GP runs typically start from populations of small random programs, and it may be necessary for the programs to grow in complexity for them to be able to comply with all the fitness cases (a situation which often arises in continuous symbolic regression problems). So, we should not equate growth with bloat and we should define bloat as program growth without (significant) return in terms of fitness.
Bloat is not only surprising, it also has significant practical effects: large programs are computationally expensive to evolve and later use, can be hard to interpret, and may exhibit poor generalisation. For these reasons bloat has been a subject of intense study in GP. Over the years, many theories have been proposed to explain various aspects of bloat, and while great strides have been made, we still lack a single, universally-accepted unifying theory to explain the broad range of empirical observations. We review the key theoretical results on bloat in Section 11.3.1 .
While discussions on the causes of bloat were going on, practitioners have still had to face the reality of combating bloat in their runs. Consequently, a variety of effective practical techniques have been proposed to counteract bloat. We review these in Section 11.3.2 , where we will particularly focus on the parsimony pressure method (Koza, 1992; ?; ?; ?), which is perhaps the simplest and most frequently used method to control bloat in genetic programming.
11.3.1 Bloat in Theory
As mentioned above, there are several theories of bloat. Let us start by looking at three of the oldest ones: the replication accuracy theory, the removal bias theory and the nature of program search spaces theory.
Three Classic Explanations for Bloat
The replication accuracy theory (McPhee and Miller, 1995) states that the success of a GP individual depends on its ability to have offspring that are functionally similar to the parent. As a consequence, GP evolves towards (bloated) representations that increase replication accuracy.
The nodes in a GP tree can often be crudely categorised into two classes: active code and inactive code. Roughly speaking, inactive code is code that is not executed, or is executed but its output is then discarded. All remaining code is active code. The removal bias theory ( Soule and Foster, 1998a) observes that inactive code in a GP tree tends to be low in the tree, residing, therefore, in smaller-than-average-size subtrees. Crossover events excising inactive subtrees produce offspring with the same fitness as their parents. On average the inserted subtree is bigger than the excised one, so such offspring are bigger than average while retaining the fitness of their parent, leading ultimately to growth in the average program size.
Finally, the nature of program search spaces theory (Langdon and Poli, 1997; Langdon, Soule, Poli, and Foster, 1999) predicts that above a certain size, the distribution of fitnesses does not vary with size. Since there are more long programs, the number of long programs of a given fitness is greater than the number of short programs of the same fitness. Over time GP samples longer and longer programs simply because there are more of them.
Executable Models of Bloat
The explanations for bloat provided by these three theories are largely qualitative. There have, however, been some efforts to mathematically formalise and verify these theories. For example, Banzhaf and Langdon (2002) defined an executable model of bloat where only the fitness, the size of active code and the size of inactive code were represented (i.e., there was no representation of program structures). Fitnesses of individuals were drawn from a bell-shaped distribution, while active and inactive code lengths were modified by a size-unbiased mutation operator. Various interesting effects were reported which are very similar to corresponding effects found in GP runs. Rosca ( 2003) proposed a similar, but slightly more sophisticated model which also included an analogue of crossover. This provided further interesting evidence.
A strength of these executable models is their simplicity. A weakness is that they suppress or remove many details of the representation and operators typically used in GP. This makes it difficult to verify if all the phenomena observed in the model have analogues in GP runs, and if all important behaviours of GP in relation to bloat are captured by the model.
Size Evolution Equation
In (Poli, 2001b; Poli and McPhee, 2003b), a size evolution equation for genetic programming was developed, which provided an exact formalisation of the dynamics of average program size. The original equation was derived from the exact schema theory for GP, and expressed mean program size as a function of the size and selection probabilities of particular schemata representing program shapes. The equation has recently been simplified (Poli and McPhee, 2008b) giving:
where m( t + 1) is the mean size of the programs in the population at generation t + 1, E is the expectation operator, is a program size, and p(,t) is the probability of selecting programs of size from the population in generation t.
This equation can be rewritten in terms of the expected change in average program size as:
where F(,t) is the proportion of programs of size in the current generation. Both equations apply to a GP system with selection and any form of symmetric subtree crossover.1
Note that Equations ( 11.1 ) and ( 11.2 ) do not directly explain bloat. They are, however, important because they constrain what can and cannot happen size-wise in GP populations. Any explanation for bloat (including the theories summarised above) has to agree with Equations ( 11.1 ) and ( 11.2 ).
In particular, Equation ( 11.1 ) predicts that, for symmetric subtree-swapping crossover operators, the mean program size evolves as if selection only was acting on the population. This means that if there is a change in mean size (bloat, for example) it must be the result of some form of positive or negative selective pressure on some or all of the length classes . Equation ( 11.2 ) shows that there can be bloat only if the selection probability p(,t) is different from the proportion F(,t) for at least some . In particular, for bloat to happen there will have to be some small 's for which p(,t) < F(,t) and also some bigger 's for which p(,t) > F(,t) (at least on average).
Crossover Bias Theory of Bloat
We conclude this review on theories of bloat with a recent explanation for bloat called the crossover bias theory (Dignum and Poli, 2007; Poli et al., 2007). This is based on and is consistent with the size evolution equation (Equation 11.1 ).
On average, each application of subtree crossover removes as much genetic material as it inserts; consequently crossover on its own does not produce growth or shrinkage. While the mean program size is unaffected, however, higher moments of the distribution are. In particular, crossover pushes the population towards a particular distribution of program sizes, known as a Lagrange distribution of the second kind, where small programs have a much higher frequency than longer ones. For example, crossover generates a very high proportion of single-node individuals. In virtually all problems of practical interest, however, very small programs have no chance of solving the problem. As a result, programs of above average size have a selective advantage over programs of below average size, and the mean program size increases.
Because crossover will continue to create small programs, which will then be ignored by selection (in favour of the larger programs), the increase in average size will continue generation by generation.
11.3.2 Bloat Control in Practice
Numerous empirical techniques have been proposed to control bloat (Langdon et al., 1999; Soule and Foster, 1998b). We cannot look at them all. However, we briefly review some of the most important.
Size and Depth Limits
Rather naturally, the first and simplest method to control code growth is the use of hard limits on the size or depth of the offspring programs generated by the genetic operators.
Many implementations of this idea (e.g., (Koza, 1992)) apply a genetic operator and then check whether the offspring is beyond the size or depth limit. If it isn't, the offspring enters the population. If, instead, the offspring exceeds the limit, one of the parents is returned. Obviously, this implementation does not allow programs to grow too large. However, there is a serious problem with this way of applying size limits, or more generally, constraints to programs: parent programs that are more likely to violate a constraint will tend to be copied (unaltered) more often than programs that don't. That is, the population will tend to be filled up with programs that nearly infringe the constraint, which is typically not what is desired.
It is well known, for example, that depth thresholds lead to the population filling up with very bushy programs where most branches reach the depth limit (being effectively full trees). On the contrary, size limits produce populations of stringy programs which tend to all approach the size limit. See (Crane and McPhee, 2005; McPhee, Jarvis, and Crane, 2004) for more on the impact of size and depth limits, and the differences between them.
The problem can be fixed by not returning parents if the offspring violates a constraint. This can be realised with two different strategies. Firstly, one can just return the oversize offspring, but give it a fitness of 0, so that selection will get rid of it at the next generation. Secondly, one can simply declare the genetic operation failed, and try again. This can be done in two alternative ways: a) the same parent or parents are used again, but new mutation or crossover points are randomly chosen (which can be done up to a certain number of times before giving up on those parents), or b) new parents are selected and the genetic operation is attempted again.
If a limit is used, programs must not be so tightly constrained that they cannot express any solution to the problem. As a rule of thumb, one should try to estimate the size of the minimum possible solution (using the terminals and functions given to GP) and add some percentage (e.g., 50-200%) as a safety margin. In general, however, it may be hard to heuristically come up with good limits, so some trial and error may be required. Alternatively, one can use one of the many techniques that have been proposed to adjust size limits during runs. These can be both at the level of individuals and the population. See for example the work by Silva and Almeida (2003); Silva and Costa (2004, 2005a,b); Silva, Silva, and Costa ( 2005).
Anti-bloat Genetic Operators
One can control bloat by using genetic operators which directly or indirectly have an anti-bloat effect.
Among the most recent bloat-control methods are size fair crossover and size fair mutation (Crawford-Marks and Spector, 2002; Langdon, 2000). These work by constraining the choices made during the execution of a genetic operation so as to actively prevent growth. In size-fair crossover, for example, the crossover point in the first parent is selected randomly, as in standard crossover. Then the size of the subtree to be excised is calculated. This is used to constrain the choice of the second crossover point so as to guarantee that the subtree chosen from the second parent will not be "unfairly" big.
Older methods include several mutation operators that may help control the average tree size in the population while still introducing new genetic material. Kinnear ( 1993) proposes a mutation operator which prevents the offspring's depth being more than 15% larger than its parent. Langdon ( 1998) proposes two mutation operators in which the new random subtree is on average the same size as the code it replaces. In Hoist mutation (Kinnear, 1994a) the new subtree is selected from the subtree being removed from the parent, guaranteeing that the new program will be smaller than its parent. Shrink mutation (Angeline, 1996) is a special case of subtree mutation where the randomly chosen subtree is replaced by a randomly chosen terminal. McPhee and Poli (2002) provides theoretical analysis and empirical evidence that combinations of subtree crossover and subtree mutation operators can control bloat in linear GP systems.
Other methods which control bloat by exploiting the bias of the operators were discussed in Section 9.4 .
As clarified by the size evolution equation discussed in the previous section, in systems with symmetric operators, bloat can only happen if there are some longer-than-average programs that are fitter than average or some shorter-than-average programs that are less fit than average, or both. So, it stands to reason that in order to control bloat one needs to somehow modulate the selection probabilities of programs based on their size.
As we have discussed in Section 9.2.1 , recent methods also include the use of multi-objective optimisation to control bloat. This typically involves the use of a modified selection based on the Pareto criterion.
A recent technique, the Tarpeian method (Poli, 2003), controls bloat by acting directly on the selection probabilities in Equation ( 11.2 ). This is done by setting the fitness of randomly chosen longer-than-average programs to 0. This prevents them being parents. By changing how frequently this is done the anti-bloat intensity of Tarpeian control can be modulated. An advantage of the method is that the programs whose fitness is zeroed are never executed, thereby speeding up runs.
The well-known parsimony pressure method (Koza, 1992; ?; ?; ?) changes the selection probabilities by subtracting a value based on the size of each program from its fitness. Bigger programs have more subtracted and, so, have lower fitness and tend to have fewer children. That is, the new fitness function is f(x) - c × (x), where (x) is the size of program x, f(x) is its original fitness and c is a constant known as the parsimony coefficient.2 ? showed some benefits of adaptively adjusting the coefficient c at each generation but most implementations actually keep the parsimony coefficient constant.
The parsimony pressure method can be seen as a way to address the generalisation-accuracy tradeoff common in machine learning (Rosca and Ballard, 1996b; ?). There are also connections between this method and the Minimum Description Length (MDL) principle used to control bloat in (Iba, 1997; Iba et al., 1994; Iba, de Garis, and Sato, 1995a). The MDL approach uses a fitness function which combines program complexity (expressed as the number of bits necessary to encode the program's tree) and classification error (expressed as the number of bits necessary to encode the errors on all fitness cases). Rosca also linked the parsimony pressure method to his approximate evolution equations for rooted-tree schemata (Rosca, 1996, 1997; Rosca and Ballard, 1996b, 1999).
Controlling bloat while at the same time maximising fitness turns the evolution of programs into either a multi-objective optimisation problem or, at least, into a constrained optimisation problem. The parsimony pressure method effectively treats the minimisation of size as a soft constraint and attempts to enforce this constraint using the penalty method, i.e., by decreasing the fitness of programs by an amount that depends on their size. The penalty is typically simply proportional to program size. The intensity with which bloat is controlled is, therefore, determined by the parsimony coefficient. The value of this coefficient is very important: too small a value and runs will still bloat wildly; too large a value and GP will take the minimisation of size as its main target and will almost ignore fitness, thus converging towards extremely small but useless programs (Soule, 1998). However, good values of the parsimony coefficient are highly dependent on particulars such as the problem being solved, the choice of functions and terminals, and various parameter settings. Furthermore, with a constant parsimony coefficient the method can only achieve partial control over the dynamics of the average program size over time.
Recently, a theoretically sound method for setting the parsimony coefficient in a principled manner has been proposed (Poli and McPhee, 2008b). The covariant parsimony pressure method is based on an analysis of the size evolution Equation ( 11.1 ), and is easy to implement. It recalculates the parsimony coefficient c at each generation using c = Cov(,f)/Var( ), where Cov(,f) is the covariance between program size and program fitness f in the population, and Var() is the variance of program sizes. Note that c needs to be recalculated each generation because both Cov(,f) and Var() change from generation to generation. As shown in Figure 11.2 (in the portion labelled "Local"), using this equation ensures that the mean program size remains at the value set by the initialisation procedure (although there can be a small amount of drift). There is a variant of the method that allows the user to even decide what function the mean program size should follow over time. As shown in the figure this provides complete control over the population size dynamics.
Figure 11.2: Plots of the evolution average size over 500 generations for multiple runs of the 6-MUX problem with various forms of covariant parsimony pressure. The "Constant" runs had a constant target size of 150. In the "Sin" runs the target size was sin((generation+1)/50.0) ×50.0+150. For the "Linear" runs the target size was 150 + generation. The "Limited" runs used no size control until the size reached 250, then the target was held at 250. Finally, the "Local" runs used c = Cov(,f)/Var(), which allowed a certain amount of drift but still avoided runaway bloat (see text).