In GP, programs are usually expressed as syntax trees rather than as lines of code. For example Figure 2.1 shows the tree representation of the program max(x+x,x+3*y). The variables and constants in the program (x, y and 3) are leaves of the tree. In GP they are called terminals, whilst the arithmetic operations (+, * and max) are internal nodes called functions. The sets of allowed functions and terminals together form the primitive set of a GP system.
In more advanced forms of GP, programs can be composed of multiple components (e.g., subroutines). In this case the representation used in GP is a set of trees (one for each component) grouped together under a special root node that acts as glue, as illustrated in Figure 2.2 . We will call these (sub)trees branches. The number and type of the branches in a program, together with certain other features of their structure, form the architecture of the program. This is discussed in more detail in Section 6.1 .
It is common in the GP literature to represent expressions in a prefix notation similar to that used in Lisp or Scheme. For example, max(x+x,x+3*y) becomes (max (+ x x) (+ x (* 3 y))). This notation often makes it easier to see the relationship between (sub)expressions and their corresponding (sub)trees. Therefore, in the following, we will use trees and their corresponding prefix-notation expressions interchangeably.
How one implements GP trees will obviously depend a great deal on the programming languages and libraries being used. Languages that provide automatic garbage collection and dynamic lists as fundamental data types make it easier to implement expression trees and the necessary GP operations. Most traditional languages used in AI research (e.g., Lisp and Prolog), many recent languages (e.g., Ruby and Python), and the languages associated with several scientific programming tools (e.g., MATLAB1 and Mathematica2) have these facilities. In other languages, one may have to implement lists/trees or use libraries that provide such data structures.
In high performance environments, the tree-based representation of programs may be too inefficient since it requires the storage and management of numerous pointers. In some cases, it may be desirable to use GP primitives which accept a variable number of arguments (a quantity we will call arity). An example is the sequencing instruction progn, which accepts any number of arguments, executes them one at a time and then returns the value returned by the last argument. However, fortunately, it is now extremely common in GP applications for all functions to have a fixed number of arguments. If this is the case, then, the brackets in prefix-notation expressions are redundant, and trees can efficiently be represented as simple linear sequences. In effect, the function's name gives its arity and from the arities the brackets can be inferred. For example, the expression (max (+ x x) (+ x (* 3 y))) could be written unambiguously as the sequence max + x x + x * 3 y.
The choice of whether to use such a linear representation or an explicit tree representation is typically guided by questions of convenience, efficiency, the genetic operations being used (some may be more easily or more efficiently implemented in one representation), and other data one may wish to collect during runs. (It is sometimes useful to attach additional information to nodes, which may be easier to implement if they are explicitly represented).
These tree representations are the most common in GP, e.g., numerous high-quality, freely available GP implementations use them (see the resources in Appendix A , page 310 , for more information) and so does also the simple GP system described in Appendix B . However, there are other important representations, some of which are discussed in Chapter 7 .