Contents Top Previous Next

12.2 Curve Fitting, Data Modelling and Symbolic Regression

In principle, there are as many possible applications of GP as there are applications for programs--in other words, virtually infinite. However, before one can try to solve a new problem with GP, one needs to define an appropriate fitness function. In problems where only the side effects of a program are of interest, the fitness function usually compares the effects of the execution of a program in some suitable environments with a desired behaviour, often in a very application-dependent manner. However, in many problems the goal is to find a function whose output has some desired property, e.g., the function matches some target values (as in the example given in Section  4.1 ). This is generally known as a symbolic regression problem.

Many people are familiar with the notion of regression. Regression means finding the coefficients of a predefined function such that the function best fits some data. A problem with regression analysis is that, if the fit is not good, the experimenter has to keep trying different functions by hand until a good model for the data is found. Not only is this laborious, but also the results of the analysis depend very much on the skills and inventiveness of the experimenter. Furthermore, even expert users tend to have strong mental biases when choosing functions to fit. For example, in many application areas there is a considerable tradition of using only linear or quadratic models, even when the data might be better fit by a more complex model.

Symbolic regression attempts to go beyond this. It consists of finding a function that fits the given data points without making any assumptions about the structure of that function. Since GP makes no such assumption, it is well suited to this sort of discovery task. Symbolic regression was one of the earliest applications of GP (Koza1992), and continues to be widely studied (Cai, Pacheco-Vega, Sen, and Yang2006Gustafson, Burke, and Krasnogor2005Keijzer2004Lew, Spencer, Scarpa, Worden, Rutherford, and Hemez2006).

The steps necessary to solve symbolic regression problems include the five preparatory steps mentioned in Chapter  2 . We practiced them in the example in Chapter  4 , which was an instance of a symbolic regression problem. There is an important difference here, however: the data points provided in Chapter  4 were computed using a simple formula, while in most realistic situations each point represents the measured values taken by some variables at a certain time in some dynamic process, in a repetition of an experiment, and so on. So, the collection of an appropriate set of data points for symbolic regression is an important and sometimes complex task.

For instance, consider the case of using GP to evolve a soft sensor (Jordaan, Kordon, Chiang, and Smits2004). The intent is to evolve a function that will provide a reasonable estimate of what a sensor (in an industrial production facility) would report, based on data from other actual sensors in the system. This is typically done in cases where placing an actual sensor in that location would be difficult or expensive. However, it is necessary to place at least one instance of such a sensor in a working system in order to collect the data needed to train and test the GP system. Once the sensor is placed, one would collect the values reported by that sensor and by all the other real sensors that are available to the evolved function, at various times, covering the various conditions under which the evolved system will be expected to act.

Such experimental data typically come in large tables where numerous quantities are reported. Usually we know which variable we want to predict (e.g., the soft sensor value), and which other quantities we can use to make the prediction (e.g., the hard sensor values). If this is not known, then experimenters must decide which are going to be their dependent variables before applying GP. Sometimes, in practical situations, the data tables include hundreds or even thousands of variables. It is well known that in these cases the efficiency and effectiveness of any machine learning or program induction method, including GP, can dramatically drop as most of the variables are typically redundant or irrelevant. This forces the system to waste considerable energy on isolating the key features. To avoid this, it is necessary to perform some form of feature selection, i.e., we need to decide which independent variables to keep and which to leave out. There are many techniques to do this, but these are beyond the scope of this book. However, it is worth noting that GP itself can be used to do feature selection as shown in (Langdon and Buxton2004).

There are problems where more than one output (prediction) is required. For example, Table  12.1 shows a data set with four variables controlled during data collection (left) and six dependent variables (right). The data were collected for the purpose of solving an inverse kinematics problem in the Elvis robot (Langdon and Nordin2001). The robot is shown in Figure  12.1 during the acquisition of a data sample. The roles of the independent and dependent variables are swapped when GP is given the task of controlling the arm given data from the robot's eyes.

There are several GP techniques which might be used to deal with applications where multiple outputs are required: GP individuals including multiple trees (as in Figure  2.2 , page  26 ), linear GP with multiple output registers (see Section  7.1 ), graph-based GP with multiple output nodes (see Section  7.2 ), a single GP tree with primitives operating on vectors, and so forth.

Table 12.1: Samples showing the size and location of Elvis's finger tip as apparent to this two eyes, given various right arm actuator set points (4 degrees of freedom). Cf. Figure  12.1 . When the data are used for training, GP is asked to invert the mapping and evolve functions from data collected by both cameras showing a target location to instructions to give to Elvis's four arm motors so that its arm moves to the target.
continues for a total of 691 lines

Arm actuator
Left eye
Right eye

-376 -6261000 -3604410 29 -9 12 25
-372 -6221000 -38043 7 29 -9 12 29
-377 -627 899-359 43 9 33-2014 26
-385 -635 799-319 3816 27-1722 30
-393 -643 699-279 3624 26-2125 20
-401 -651 599-239 3232 25-2628 18
-409 -659 500-200 3235 24-2731 19
-417 -667 399 -159 3141 17-2836 13
-425 -675 299 -119 3045 25-2739 8
-433 -683 199 -79 3147 20-2743 9
-441 -691 99 -39 3149 16-2645 13
...  ...  ...  ...  ...  ...  ...  ...  ...  ...


Figure 12.1: Elvis sitting with its right hand outstretched. The apparent position and size of a bright red laser attached to its finger tip is recorded (see Table  12.1 ). The data are then used to train a GP to move the robot's arm to a spot in three dimensions using only its eyes.

Once a suitable data set is available, its independent variables must all be represented in the primitive set. What other terminals and functions are included depends very much on the type of the data being processed (are they numeric? are they strings? etc.) and is often guided by the information available to the experimenter and the process that generated the data. If something is known (or strongly suspected) about the desired structure of the function to be evolved, it may be very beneficial to use this information (or to apply some constraints, like those discussed in Section  6.2 ). For example, if the data are known to be periodic, then the function set should probably include something like the sine function.

What is common to virtually all symbolic regression problems is that the fitness function must measure how close the outputs produced by each program are to the values of the dependent variables, when the corresponding values of the independent ones are used as inputs for the program. So, most symbolic regression fitness functions tend to include summing the errors measured for each record in the data set, as we did in Section  4.2.2 . Usually either the absolute difference or the square of the error is used.

The fourth preparatory step typically involves choosing a size for the population (which is often done initially based on the perceived difficulty of the problem, and is then refined based on the actual results of preliminary runs). The user also needs to set the balance between the selection strength (normally tuned via the tournament size) and the intensity of variation (which can be varied by modifying the mutation and crossover rates, but many researchers tend to fix to some standard values).

Contents Top Previous Next