We have discussed the structure and dynamics of the brain, in a very crude way. And we have proposed that the brain shares two qualities of the ecosystem: a roughly hierarchical structure, and a dynamics strongly influenced by natural selection. These two hypotheses will be elaborated and unified in Chapter 6. First, however, we must add another ingredient into the mix.
In Chapter 3 we basically avoided genetics -- we talked about DNA and reproduction here and there, but most of our considerations were system-theoretic and therefore not concerned with the mechanics of reproduction and inheritance. This was no accident: the very concept of a system-theoretic analysis of natural selection relies on the assumption that most evolutionary processes can be understood without reference to the particulars of genetics.
In this chapter, however, a certain amount of genetics will enter the picture -- through the back door, to mix metaphors. We will need to discuss the details of sexual reproduction, because we will be exploring the hypothesis that the large-scale dynamics of evolving systems actually mimics the process of sexual reproduction. As we shall see in Chapter 6, this hypothesis follows from some very simple psychological assumptions. And, as we shall see in the present chapter, the theory of genetic algorithms developed by John Holland and his colleagues makes the idea seem much more plausible than it might at first appear.
5.1 BINARY GENETICS
Biologists have long been amazed at the ability of natural selection to solve difficult mathematical optimization problems. For example, Snyder (1977) showed that the hexagons in a bee's eye are precisely the size required to optimize resolution in bright light. And Krebs, Kacelnik and Taylor (1978) studied a group of great tits in an aviary containing two food dispensers. Each dispenser, when landed on by a bird, had a certain characteristic probability of releasing food. They found that the birds visited the two machines according to the optimal strategy dictated by Bayesian probability theory and Laplace's principle of indifference.
In Optima for Animals (1982), R. McNeill Alexander discusses these and a number of other cases in which animal morphology or behavior approximates the solution of a mathematical optimization problem. This type of work shows that the evolutionary process has a very high capacity for the optimization of mathematical functions. It was part of the inspiration for John Holland's invention of the genetic algorithm -- an ingenious technique which seeks to solve mathematical optimization problems by mimicking biological evolution. As hinted above, this algorithm is intimately bound up with the logic of neural self-organization.
The genetic algorithm rests upon a simple mathematical model of the genetic process, which I will call binary genetics. Binary genetics is an oversimplification, but it is an excellent computational tool, and it is also useful conceptually: it reveals aspects of the genetic process that tend to be obscured by the complex peculiarities of biological structures.
And these are not the only good reasons for looking at binary genetics. There is also the fact that conventional mathematical population genetics cannot yet deal effectively with multilocus evolution. Mathematical population genetics is most impressive when dealing with the evolution of traits which depend on only one gene. Results addressing even the relatively simple case where two genes cooperate to determine fitness involve unintuitive, incredibly complicated mathematical expressions (Ewens, 1979). Thus, the choice is either to use simplified models like binary genetics, or to study something besides the optimization properties of multilocus evolution.
This frustrating limitation of mathematical population genetics is not without philosophical interest. Strict Darwinism focuses on the evolution of specific traits, rather than on the interconnections between traits -- both within individual animals and on the ecosystem level. And mathematical population genetics reinforces this bias, via its complete inability to deal with the interaction of different genes to produce complexes of interacting traits. Connections between different fields are what hold science together, and they are the source of many breakthroughs, but sometimes they can also hold science back.
So, let us assume that the genetic material of any entity can be encoded in a straightforward way as a binary sequence. From the point of view of DNA it would be more natural to consider sequences over 4 different digits, and from the point of view of protein evolution it would be more natural to consider sequences over 20 different digits (Hagen, Macken and Perelson, 1991). But for the sake of simplicity, let us stick with binary sequences. Re-coding a base 4 or base 20 sequence in binary is not a particularly arduous task, and its algorithmic complexity is negligible.
Following Holland (1975), then, let us call a binary sequence of length N a genotype. Let A denote the space of phenotypes -- e.g. organisms or proteins, or, in the next section, neural clusters. Each genotype x denotes a certain probability distribution Px on A, where Px(a) indicates how likely it is that genotype x gives rise to phenotype a. For simplicity, it is often assumed that each genotype leads infallibly to exactly one phenotype (that, for each x, Px takes on one particular value with probability 1).
The individual bits of the binary sequences may be called "genes," and the values 0 or 1 are the "alleles," the possible values which the genes can take.
Each genotype x possesses a certain fitness. In general, the fitness of a phenotype at time t depends upon: 1) the phenotypes of other evolving entities at time t; and 2) the non-evolving environment at time t. We may write the fitness of x as F(Px,E), where E is the environment at time t, inclusive of other evolving organisms. In genetics, it is often assumed that F(Px,E)=F(Px), which is allowable if the effect of a phenotype on its environment is negligible. However, in Chapter 4 we considered a number of arguments against this assumption.
In the context of binary genetics, genetic mutations assume a particularly simple form: a genotype mutates in the ith position when its ith bit changes from 0 to 1 or from 1 to 0. And sexual reproduction, or crossover, is almost as simple. Sexual reproduction takes parts of the genotype of one parent and combines them with parts of the genotype of another. In order to correspond with real genetics, this should be done by splicing. The two parent genotypes, say x and y, should be partitioned into binary sequences p1,...,pk and q1,...,qk, so that the length of each pi equals the length of the corresponding qi:
x=p1p2...pk
y=q1q2...qk
An offspring of x and y is then any binary sequence which differs in only a few places from something of the form r1r2...rk, where ri is equal to either pi or qi. The reason why it is allowed to differ in a few places is, of course, that mutations may occur in the process of reproduction.
5.2 GENETIC FUNCTION OPTIMIZATION
If the effect of a phenotype on its environment is ignored, then the process of binary evolution yields an intriguing algorithm for function optimization. Although this is a woefully inadequate analysis of the evolutionary process, there is much to be learned from it anyway. In function optimization applications, k is usually fixed at 2, so that crossover involves taking the beginning of one sequence and splicing it onto the end of another. However, it is not clear that this serves any purpose other than simplification; in some cases a higher k may lead to more efficient optimization (Booker, 1987).
To make things concrete, let's write down an arbitrary example. Suppose one wants to find the maximum of a function f:[0,1]^10 -> R, say

f(x1,...,x10) = x1^1 + x2^2 + ... + x10^10
To solve this problem using the genetic algorithm, one first settles on a degree of accuracy. Say one requires six digits of accuracy. Then, in binary, this means 20 bits of accuracy. One can then encode each element x=(x1,...,x10) of [0,1]10, to within 20 bits of accuracy, as a binary sequence of length 200.
There are several ways of doing this encoding. The standard method is as follows: the first 20 bits of the sequence correspond to the first 20 bits of the binary expansion of x1, the second 20 bits correspond to the first 20 bits of the binary expansion of x2, and so on. In (Goertzel, 1992e) I proposed a different strategy, based on the theory of space-filling curves. For the purposes of the present discussion, however, any reasonable coding will do; let us stick with the standard technique.
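For concreteness, here is a minimal sketch of the standard encoding in Python; the function names and the truncation convention are my own illustrative choices, not part of any standard package.

BITS = 20   # bits of accuracy per coordinate
DIM = 10    # number of coordinates

def encode(x):
    """Map a point of [0,1]^DIM to a 200-bit string (standard encoding)."""
    chunks = []
    for xi in x:
        n = min(int(xi * 2 ** BITS), 2 ** BITS - 1)   # truncate; let 1.0 fit
        chunks.append(format(n, "0{}b".format(BITS)))
    return "".join(chunks)

def decode(s):
    """Map a 200-bit string back to a point of [0,1]^DIM."""
    return [int(s[i * BITS:(i + 1) * BITS], 2) / 2 ** BITS
            for i in range(DIM)]

print(decode(encode([0.5, 0.25] + [0.0] * 8))[:2])   # ~[0.5, 0.25]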
Each binary sequence of length 200 is a genotype, and the corresponding vector x is the corresponding phenotype. The value of f at the phenotype of a sequence is called the fitness of the sequence. In this case, the maximum fitness occurs at the phenotype (1,1,1,1,1,1,1,1,1,1), and the genotype which best approximates this value is 11111...11111.
In order to optimize the function f, one takes an initial population of, say, 50 sequences (following the biological analogy, these sequences will sometimes be called individuals). At each time step, the old population is turned into a new population according to the following three steps:
1) some individuals mutate;
2) some pairs of individuals reproduce sexually -- generally one puts a bias in the process, so that fitter pairs are more likely to reproduce;
3) a new population is selected, biased in some way -- say, toward the new offspring, or toward the most fit.
The details vary from implementation to implementation. For instance, Goldberg's (1989) "simple genetic algorithm" deals strictly with sexual reproduction; there is no Step 1. Mutations occur within the process of sexual reproduction, but not otherwise. Step 3 is trivial; only the new offspring formed in Step 2 are selected. But the trick is that Step 2 works as follows:
For i=1 to population size
2a) Select mate 1 from old population. The probability of selecting x is proportional to the fitness of x.
2b) Select mate 2 from old population. The probability of selecting x is proportional to the fitness of x.
2c) Mate 1 and mate 2 cross over to form the ith element of the new population
If a certain member of the old population is very fit, then it will often be selected in both Step 2a and Step 2b. Thus it will cross over with itself, yielding itself. In this way, the fittest members of the old population are likely to survive into the new population.
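Putting the pieces together for the example function above, a minimal Python sketch of this scheme might run as follows; the population size, mutation rate and generation count are arbitrary choices of mine, not prescriptions from Goldberg.

import random

N = 200        # genotype length: 10 coordinates x 20 bits
POP = 50       # population size
MUT = 0.005    # per-bit mutation rate (an arbitrary choice)

def fitness(s):
    # decode and evaluate f(x1,...,x10) = x1^1 + x2^2 + ... + x10^10
    xs = [int(s[i * 20:(i + 1) * 20], 2) / 2 ** 20 for i in range(10)]
    return sum(x ** (i + 1) for i, x in enumerate(xs))

def select(pop, fits):
    # Steps 2a/2b: probability of selection proportional to fitness
    return random.choices(pop, weights=fits)[0]

def crossover(x, y):
    # Step 2c: one-point crossover (k = 2), with mutation folded in
    cut = random.randrange(1, N)
    return "".join(b if random.random() > MUT else "10"[int(b)]
                   for b in x[:cut] + y[cut:])

pop = ["".join(random.choice("01") for _ in range(N)) for _ in range(POP)]
for generation in range(200):
    fits = [fitness(s) for s in pop]
    pop = [crossover(select(pop, fits), select(pop, fits)) for _ in range(POP)]
print(max(fitness(s) for s in pop))   # best fitness found; the maximum is 10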
On the other hand, if one eliminated Step 2, one would have a population of asexually reproducing individuals. In optimization terminology, this yields a "multistart search method," where the basic search operation is evolutionary mutation. Similar strategies have been studied by Salamon et al. (1990). The simple genetic algorithm has led to some fairly impressive results, as reported in, for example, Bethke (1981) and Brindle (1981). However, since the theory of genetic algorithms is still in the early stages of development, we have very little understanding of why the algorithm works as well as it does. The matter is not so simple as it might seem.
In ordinary genetics, crossover serves to create individuals that combine the strengths of both parents -- or the weaknesses of both parents. A strong but dull female and an intelligent but weak male may give rise to a wide variety of offspring: a child who is strong and intelligent, a child who is weak and dull, a child who is strong and dull, a child who is weak and intelligent, a child who is mediocre in both respects, etc. However, if the two parents were to reproduce by mutation, neither of them would be likely to produce a child both strong and intelligent, nor a child both dull and weak.
One might think that this sort of argument could explain the efficacy of crossover for function optimization. And it can, to a certain extent -- but not nearly so easily as one might expect.
Let's say the fittest genotype is 1111111111. And suppose the fitness function f is such that 1111100010 and 0010011111 are both fairly fit. Then it is reasonably likely that 1111100010 and 0010011111 will both occur in the population, and that they will cross over with each other. If they cross over at the midpoint, the result will be optimal: 1111111111. If they cross over at, say, the fourth position, then the result will be 1111011111 -- and if f is reasonably smooth this will be extremely fit. But the question is, how likely is it that f will have the special property required to make this kind of logic work? Which functions are amenable to crossover?
In order to address this question, let us introduce some of the formalism typically used in connection with genetic algorithms. Following Holland (1975), define a schema as a sequence consisting of *'s, 0's and 1's -- say, **0110001010***10001*1***1* or 11*101. Each schema is associated with a certain set of binary sequences -- all those sequences that can be obtained by filling in a certain pattern of 0's and 1's for the *'s. Define the size of a schema as the number of 0's and 1's in it, and define the fitness of a schema as the mean fitness of the set of sequences associated with the schema. The fitness of a schema, then, is the mean of its fitness distribution.
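In computational experiments, the fitness of a schema may be estimated by sampling: fill in the *'s at random and average. The following sketch is my own illustration, and works with any fitness function of a bit string, e.g. the one in the sketch above.

import random

def sample_schema(schema):
    """Return a random sequence associated with the schema."""
    return "".join(random.choice("01") if c == "*" else c for c in schema)

def schema_fitness(schema, fitness, trials=1000):
    """Monte Carlo estimate of the mean fitness of a schema."""
    return sum(fitness(sample_schema(schema)) for _ in range(trials)) / trials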
Finally, by "crossing over" one schema with another, I will mean crossing over a number of sequences associated with the one schema with a number of sequences associated with the other, at a variety of different cross points.
For a trivial example, consider the function f(x)=x^10 on the interval [0,1]. Assuming 20 bits of accuracy, the sequence of maximum fitness in this case is 11111111111111111111. A small, fit schema is 111*****************. Crossing this over at the third point with *00111**************, which is less fit but still reasonably fit, one obtains 111111**************: a larger schema with even higher fitness, and so on. In practice, the simple genetic algorithm tends to follow this sort of pattern for a while, arriving at a population full of sequences like 11111111111100100110. From this point on, it arrives at the maximum very slowly, because the schema 111111111111******** cannot be turned into, say, 111111111111111***** by crossover with any reasonably fit schema. The schema ***************11111 and **********1111111111, for example, have rather small fitnesses; in a population full of sequences associated with 111111111111********, it is unlikely that they will survive to mate.
The function x^10 is amenable to crossover in a very weak way. 111***************** and *00111************** can cross over to form something fitter than either. But this is not like a strong parent and an intelligent parent combining to form a strong, intelligent child. It is more like a strong parent and a slightly less strong parent combining to form a child slightly stronger than either. However, I suggest that in [0,1]^n, except in very special cases, this weak sort of amenability to crossover is just about the best one can hope for. In the context of genetic optimization of real-valued functions, the full intuitive power of the crossover operation is virtually never utilized.
But -- and here's the punchline -- one should not conclude from all this that genetic algorithms are ineffective for the maximization of real functions. On the contrary. In R there are several more effective algorithms; but for complicated functions from R^n to R, the genetic algorithm is competitive. What is striking is how well the genetic algorithm works without the full intuitive ("strong but stupid" + "weak but smart" = "strong and smart") power of the crossover operation. Essentially, the genetic algorithm is working as a sort of random search. Having found a fairly fit schema, it crosses this schema over with other things, thus arriving at new sequences that are generally rather evenly distributed over the space of all sequences associated with that schema. When this random search results in something particularly fit, a new schema is born.
The phrase "random search" should not be misconstrued. This sort of random search is very different from random search by point mutation, such as we find in the evolution of an asexually reproducing species. For one thing, genetic search is more robust with respect to fuzzy data (see Appendix 1). If one generates a fitness function by adding noise to a function with a fairly simple graph, algorithms based on point mutation will be slowed down by the small-scale bumps induced by the noise. But the genetic algorithm will hardly be impaired at all. This is not because such functions are amenable to crossover in the sense indicated above, but simply because the genetic algorithm incorporates a kind of half-global/half-local random search that is capable of overlooking local irregularity. It searches randomly over the whole spaces demarcated by fit schema.
A new mathematical framework for considering these matters is presented in Appendix 1, coauthored with Malwane Ananda. The dynamical-systems analysis given there suggests that the genetic algorithm will optimize f effectively if:
1) there are some fairly small schema with high fitnesses
2) some of these schema can be crossed over with other small schema -- perhaps slightly less fit -- to form larger schema with high fitnesses
3) some of the schema formed in Step 2 can be crossed over with each other, or with schema formed in Steps 1 or 2 -- perhaps slightly less fit -- to form larger schema with high fitnesses
4) some of the schema formed in Step 3 can be crossed over with each other, or with schema formed in Steps 1, 2 or 3 -- perhaps less fit -- to form larger schema with high fitnesses
5) etc.
Note that this intuitive prescription gives no indication of how likely it is for an arbitrary function to be amenable to crossover, or how amenable a given function will be. The probabilistic analysis given in the Appendix does not resolve these difficult questions, but it does reduce them to problems of bounding complicated algebraic formulas. And it leads to a handful of significant insights on peripheral topics.
For instance, a judicious application of n-dimensional calculus shows that, once the genetic algorithm gets near the solution to an optimization problem, it will tend to "dither" around rather than converging in the manner of a typical numerical method. This accords with biological reality: in any macroscopic species, there is a relatively large disparity between individual animals. And it accords with psychological reality as well. If ideas are indeed formed partly by "crossing over" parts of other ideas, it almost goes without saying that this process will never result in inexorable convergence to one optimal idea. It may result in a population of ideas that is in some sense "near" to the answer to the problem at hand, but that is the best it will ever do.
Genetic Image Compression
Let us consider one practical example for which the genetic algorithm may be of use: the problem of image compression. This is an instance of the genetic optimization of real-valued functions, but it will also serve us later, as an example of genetic programming.
Suppose one has a picture -- call it P -- and one wants a short, quickly running program that will generate it. One useful approach is to look at affine iterated function systems. Each affine IFS consists of a collection of affine transformations wi, each with its own probability pi. Therefore, an IFS consisting of N maps w1,...,wN can be represented as a point in R^7N (six affine coefficients plus one probability per map). And, given any prescribed accuracy 2^-e, a point in a bounded region of R^7N can be represented as a binary sequence of length 7Ne, with e bits per coordinate. Let B = B(e,7N) denote the set of all binary sequences obtained in this way.
For simplicity, let us assume that P is a black and white picture. The introduction of color complicates things technically, but requires no new ideas. If P is black and white, P may be conveniently modeled as a compact subset of the unit square [0,1]2.
For each element x of B, let P(x) = P(K,x) denote the picture obtained by applying K steps of the random iteration algorithm to the IFS encoded in x, i.e. the picture obtained by running the following program:
Random Iteration Algorithm

Pick x0 at random
For i = 1,...,K:
    Set n(i) equal to some j in {1,...,N}, selecting j with probability pj
    Set xi = wn(i)(xi-1)
    Plot xi
If the magnitude of the determinant of the linear part of each wi is less than 1, then this algorithm will almost certainly converge to a compact subset of the plane.
Let us fix K at some fairly large number (say, 50,000). Let d(·,·) denote some measure of the distance between two pictures. Then, IFS theory allows us to represent the problem of constructing a program to generate P as follows: find x within B that minimizes f(x) = d(P,P(x)).
In practice, one may carry this algorithm out by associating with each subset Q of [0,1]^2 a binary array A(Q) of some fixed size L x M, and fixing a constant m = m(K). A(Q)ij is set equal to 1 if there are at least m points of Q in the rectangle [(i-1)/L, i/L] x [(j-1)/M, j/M]. The distance d(Q,R) between two pictures may then be defined as the Hamming distance between the associated arrays: as the number of positions in which A(Q) and A(R) differ.
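To make the objective function concrete, here is a rough Python sketch of this procedure; the grid size, the threshold m, and the data layout (each map stored as six affine coefficients and a probability) are my own illustrative choices.

import random

L = M = 64   # grid dimensions (arbitrary)
m = 1        # minimum points per cell for a 1 entry

def render(ifs, K=50000):
    """ifs: list of (a, b, c, d, e, f, p) tuples -- six affine
    coefficients and a probability per map. Returns the L x M array A."""
    probs = [w[6] for w in ifs]
    x, y = random.random(), random.random()
    counts = [[0] * M for _ in range(L)]
    for _ in range(K):
        a, b, c, d, e, f, _p = random.choices(ifs, weights=probs)[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        if 0 <= x < 1 and 0 <= y < 1:
            counts[int(x * L)][int(y * M)] += 1
    return [[1 if n >= m else 0 for n in row] for row in counts]

def hamming(A, B):
    """The distance d: number of positions where the arrays differ."""
    return sum(a != b for rA, rB in zip(A, B) for a, b in zip(rA, rB))

# fitness of a candidate IFS relative to a target array A_P:
# f = hamming(A_P, render(candidate))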
So, image compression is an optimization problem on the space of length-n binary sequences. It can therefore be approached via binary genetics. The next step is to try to understand why the genetic algorithm might be especially effective for solving this particular optimization problem. First of all, imagine a picture composed of two distinct pieces: a square and a triangle. The square piece can be generated by four transformations, call them w1, w2, w3, w4. On the other hand, the triangular piece can be generated by three transformations, call them w5, w6, w7. So, in effect, the problem is "decomposable" into two subproblems. The genetic algorithm is capable of doing this decomposition automatically.
For instance, suppose the first 28e bits of x encode the transformations w1, w2, w3, w4 and their probabilities, and the last 21e bits of y encode the transformations w5, w6, w7 and their probabilities. Then if x and y cross over after the 28e'th bit, the result will be a sequence generating the target picture P. To a certain extent, the fitness of a portion of a sequence is determined independently of the rest of the sequence.
Of course, there is a problem if x is as above, but the first rather than the last 21e bits of y encode the transformations w5, w6, w7. Then crossing over x with y will never do a bit of good. For this reason, in this context it is best to augment the basic genetic algorithm with an additional operator: rearrangement. Each sequence x encodes a certain set of affine transformations and their probabilities; and each of these transformations, together with its probability, is encoded in a certain subsequence of x. Call this set of subsequences {x(1),...,x(N)}. Since the objective function f(x) is independent of the arrangement of the x(i), it makes sense for the subsequences of each sequence to be randomly rearranged at each generation of the genetic process.
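The rearrangement operator itself is trivial to implement; a sketch, assuming each transformation-plus-probability is encoded in a block of fixed length (7e bits per map here):

import random

def rearrange(x, block):
    """Randomly permute the map-encoding subsequences x(1),...,x(N);
    block is the length of each subsequence."""
    pieces = [x[i:i + block] for i in range(0, len(x), block)]
    random.shuffle(pieces)
    return "".join(pieces)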
By thus augmenting the basic genetic algorithm, one obtains an algorithm capable of automatically decomposing an image compression problem into two different problems. Now, this automatic decomposition is only part of the optimization problem: there is also the matter of recognizing the affine transformations corresponding to each individual region of the picture. But for this local part of the problem, the genetic algorithm can at least act as an effective random search method.
For functions f derived from real pictures, the amenability to crossover appears to be fairly high. Barnsley and Sloan (1988) have compressed many realistic-looking images using the IFS technique -- see Chapter 1. In most cases, a large number of transformations is required, and each transformation acts only in a limited region. Often the regions intersect substantially. But the point is that decomposition is a valuable heuristic, and the crossover operation can help to achieve it.
Genetic image compression is discussed in more detail in (Goertzel, 1993b). As pointed out there, there is as yet no highly effective strategy for the compression of real images. Genetic image compression tends to be very slow, despite the inherent amenability to crossover of the problem. However, all other known methods are also very slow. At Barnsley's company, Iterated Systems, they avoid this problem by maintaining a large computer library of images and IFS's. When supplied with a new image to compress, the company scientists begin by comparing pieces of it with pieces of the images in the library -- this gives them a good initial guess as to an appropriate IFS.
5.3 GENETIC PROGRAMMING
In this section I will discuss the application of the genetic algorithm to the problem of constructing a computer program to serve a certain purpose. John Koza (1991) has called this sort of implementation of the genetic algorithm "genetic programming." He has demonstrated the utility of genetic programming for solving various applied problems. Here, however, we will focus on general principles rather than practical applications.
Genetic programming is only loosely related to the process by which fitness is determined in biological systems. A more intuitively accurate model of this process -- called "enzymatic encoding" -- will be described a little later. However, genetic programming is mathematically, conceptually and computationally far more tractable than enzymatic encoding. Furthermore, it will be argued below that genetic programming is relevant to the dynamics of the brain.
So, without further ado, let us consider an objective function f that maps programs, rather than sequences, into real numbers. To apply the genetic algorithm to such a function f, one might simply encode each program as a sequence. But this would be both ineffective and unintuitive. Instead, it is better to look at a program as a collection of interconnected subroutines, and to represent a program as a process dependency graph.
We will divide a program S into a collection of subroutines, {S1,...,Sn}. In general, this sort of division is somewhat arbitrary: there are many different ways to break a program down into subroutines. Dividing a program into subroutines plays the same role in genetic programming that encoding an entity as a binary sequence plays in genetic function optimization.
We will associate with each Si an ordered set of numbers Ci, called the "list of connections." If j is in Ci, that means subroutine i sometimes calls on subroutine j. Given this formalism, the process dependency graph associated with S may be drawn as follows: vertex i of the graph represents subroutine Si, and there is an arrow from i to j if j is an element of Ci.
This graphical representation leads to a special interpretation of crossover. First of all, two programs may cross over by exchanging subroutines. Putting subroutine Tj in the place of subroutine Si means, simply: 1) every subroutine that previously called on Si now calls on Tj, and 2) the new list of connections for Tj is the list of connections Ci previously associated with Si. It is quite possible that Si has a different number of connections than Tj. In this case, any connections used by Si but not needed by Tj are ignored, and any excess input requests on the part of Tj are not fulfilled.
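As a minimal sketch of this bookkeeping, one might model a program as a dictionary mapping subroutine names to (code, connection list) pairs; everything here is my own illustrative convention, not a standard representation.

def swap_subroutine(S, i, T, j, needed):
    """Put T's subroutine j in the place of S's subroutine i.
    A program is a dict: name -> (code, connection list).
    needed: how many connections the incoming subroutine calls on."""
    code_j, _ = T[j]
    _, connections_i = S[i]
    new_S = dict(S)
    # callers of slot i are untouched; the old connection list Ci is
    # kept, with surplus connections ignored and any deficit unfulfilled
    new_S[i] = (code_j, connections_i[:needed])
    return new_S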
But this is only the simplest form of crossover. Recall that the definition of a "subroutine" is somewhat vague. A collection of subroutines is also a subroutine. Once one has drawn the process dependency graph of a program in a certain way, it is certainly reasonable to cross over programs by exchanging sets of subroutines from one program to the other.
This idea -- doing genetic optimization by exchanging subroutines rather than bits -- is simple and profound. However, its implementation presents a great number of difficulties. Let us consider once again the Random Iteration Algorithm. The structure of this program is very simple. One may draw a process dependency graph as follows: one main routine, which calls three subroutines in every iteration. One subroutine to choose a number j, one subroutine to apply wj to the previous point to obtain the current point, and one subroutine to plot the current point. Given this graph, it is very easy to genetically evolve a population of Random Iteration Algorithms. Pairs of programs may be combined to form new programs, by exchanging the subroutines that compute the wi. Note that this is substantially different from the genetic image compression algorithm of the previous section. In order to obtain that algorithm, one would have to generalize our definition of "subroutine" and consider a process dependency graph in which each bit of each wj is considered as a subroutine [Figure 21].
What makes the case of the Random Iteration Algorithm so easy? First of all, that one is looking only at programs with a certain fixed process dependency graph. Given the assumption that S and T have the same process dependency graph, crossover is simple: one exchanges Si and Tj only when node i in the graph of S occupies the same position as node j in the graph of T.
In general, however, just because S and T have the same process dependency graph, one cannot conclude that subroutine Si and subroutine Ti serve similar functions. S and T could be very different programs with similar control structures, in which case exchanging Si with Ti would probably result in garbage. In the case of the Random Iteration Algorithm, we were dealing with programs that shared not only the same process dependency graph but the same algorithmic logic.
In practice, one may improve the situation a little by associating a label with each edge of the process dependency graph. The label on the edge from i to j should denote the type of information that Subroutine i passes to Subroutine j. For instance, in the C programming language, subroutines can pass parameters of type "int," "float," "double," "char," and so on; each of these identifiers could serve as a label on the process dependency graph of a C program. But the introduction of labels is only an incremental improvement -- one can still have totally dissimilar programs with identical labeled process dependency graphs. The fact that S and T have the same process dependency graph, or the same labeled process dependency graph, does not mean that crossing S with T will ever yield anything interesting; it is only an indicator that the potential is there. It is too restrictive to consider only programs with one particular process dependency graph. But it is also unreasonable to expect the genetic algorithm to produce meaningful programs by crossing over unrelated programs with one another. The interesting question is whether there is some middle ground, whether one can impose a certain amount of structure on one's population of programs in such a way that crossover is meaningful and yet creative flexibility is possible.
I suggest that the answer to this question is yes. One partial solution is to imitate biology, and introduce "speciation." A person cannot mate with a gnat or a sheep, or even a monkey. Similarly, one can restrict a genetic algorithm so that it crosses S with T if and only if S and T are in some sense closely related -- for instance, if S and T have similar process dependency graphs. A similar strategy has been applied and found effective in the context of genetic function optimization (Goldberg, 1989). In the simple genetic algorithm with speciation, two sequences are unlikely to be crossed unless the Hamming distance between them is sufficiently small.
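A sketch of such a mate restriction, reusing the select routine from the earlier sketch; the threshold and the retry count are arbitrary choices of mine:

def hamming_seq(x, y):
    return sum(a != b for a, b in zip(x, y))

def select_mates(pop, fits, threshold, tries=20):
    """Fitness-proportional selection, re-drawing the second mate
    until it is within the given Hamming distance of the first."""
    mate1 = select(pop, fits)          # select() as in the sketch above
    mate2 = select(pop, fits)
    for _ in range(tries):
        if hamming_seq(mate1, mate2) <= threshold:
            break
        mate2 = select(pop, fits)
    return mate1, mate2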
5.4 MULTILEVEL PROGRAMS
Another partial solution to the problem of genetic programming may be found in the multilevel methodology -- a very simple idea, which may be found in Etzioni's (1968) sociology as well as the technical literature on multiextremal optimization (Goertzel, 1992e; Brandt, 1985).
To solve a problem by the multilevel methodology, one divides one's resources into a number of levels -- say, levels ...,3,2,1,0. Level 0 is the "bottom level," which contains a number of problem-solving algorithms. Each higher level contains a number of algorithms designed to manage problem-solving strategies on the level immediately below.
One example of the multilevel methodology is the hierarchical power structure of the large corporation. Level 0 consists of those employees who actually produce goods or provide services for individuals outside the company. Level 1 consists of foremen and other low-level supervisors. And so on. The highest level comprises the corporate president and the board of directors.
Another, more relevant, example is the problem of perception. One has a visual image P, and one has a large memory consisting of various images z1, z2,...,zM. One wants to represent the perceived image in terms of the stored images. This is a pattern recognition problem: one wants to find a pair of the form (y,z), where y*z=P and z is a subset of {z1,...,zM}. In this case, the multilevel methodology takes the form of a hierarchy of subroutines. Subroutines on the bottom level -- level 0 -- output simple patterns recognized in the input image P. And, for i>0, subroutines on level i output patterns recognized in the output of level i-1 subroutines. In some instances a subroutine may also instruct the subroutines on the level below it as to what sort of patterns to look for. It appears that this is one of the key strategies of the human visual system. Decades ago, Hubel and Wiesel demonstrated that the brain possesses specific neural clusters which behave as subroutines for judging the orientation of line segments (Hubel, 1988). Since that time, many other neural clusters executing equally specific visual "subroutines" have been found. As well as perhaps being organized in other ways, these clusters appear to be organized in levels.
At the lowest level, in the retina, gradients are enhanced and spots are extracted _ simple mechanical processes. Next come simple moving edge detectors. The next level up, the second level up from the retina, extracts more sophisticated information from the first level up from the retina _ and so on. Admittedly, little is known about the processes two or more levels above the retina. It is clear, however, that there is a very prominent hierarchical structure (Uhr, 1987), although it may be supplemented by more complex forms of parallel information processing (Rose and Dobson, 1985).
Thinking to mimic these neural structures, Levitan and his colleagues at the University of Massachusetts (1987) have constructed a three-level "pyramidal" parallel computer for vision processing. The bottom level deals with sensory data and with low-level processing such as segmentation into components. The intermediate level takes care of grouping, shape detection, and so forth; and the top level processes this information "symbolically", constructing an overall interpretation of the scene. The base level is a 512 x 512 square array of processors each doing exactly the same thing to different parts of the image; and the middle level is composed of a 64 x 64 square array of relatively powerful processors, each doing exactly the same thing to different parts of the base-level array. Finally, the top level contains 64 very powerful processors, each one operating independently according to LISP programs. The intermediate level may also be augmented by additional connections. This three-level perceptual hierarchy appears to be an extremely effective approach to computer vision.
To cross over two different programs S and T written for the UMASS pyramidal computer would be extremely easy. The process dependency graphs would not necessarily be identical, but they would have a characteristic hierarchical structure, as implied by Figure 22. One could exchange some of S's level 1 subroutines with some of T's level 1 subroutines, or some of S's level 2 subroutines with some of T's level 2 subroutines, and so on. There is a great deal of compatibility here. Level 1 subroutines output simple entities, e.g. components. Level 2 subroutines output shapes, lines, and so on. Level 3 subroutines output LISP statements. However, this compatibility does not verge on the uniformity which was present in our IFS image compression example.
In general, if one has a hierarchy of subroutines designed so that level i subroutines recognize patterns in the output of level i-1 subroutines, crossover is likely to be sensible. This sort of situation is not so easy to analyze mathematically as straightforward function optimization. However, it seems to me to provide at least as much insight into the nature of genetic operations.
The ideas of this section have not yet been implemented on any computer. Multilevel programs such as the UMASS architecture are rather complex, and to evolve an entire population of them would require a tremendous number of programmer-hours as well as a sizeable investment in hardware or supercomputer time. However, the potential payoff is great, not only from the point of view of the theory of genetic algorithms, but from the point of view of vision processing as well.
5.5 GENETIC PROGRAMMING WITH NEURAL NETWORKS
Now, let us return to Edelman's theory of Neural Darwinism. The concept of neuronal group selection is very impressive, both philosophically and biologically. However, I am a little bit skeptical that the simple process of mutation and selection is sufficient to generate, over a reasonable period of time, tremendously complex structures such as those found in the human mind. It seems likely that some more powerful generative mechanism is required.
I have no trouble with the basic premise of neuronal group selection: that the essence of mental function is to be found in the dynamics of maps. The goal of this section, however, is to put forth the hypothesis that neural maps reproduce by crossover as well as mutation. This is a highly radical proposition, and I admittedly have no physiological evidence in favor of it. However, so far as I can tell there is at present no physiological evidence against the hypothesis either -- and in the following chapter, I will give a logical and psychological argument which strongly supports the idea of map-level genetic programming.
Genetic Programming with One-Step Networks
In this subsection I will describe how a genetically evolving population of one-step neural networks may be used to solve the problem of constructing a computer program satisfying given characteristics. This construction is of little biological relevance; however, it presents the idea of genetic programming with neural networks in a particularly simple context.
Let f be a function which assigns a real number to each Boolean function. Where t = 0,1,2,3,..., let P(t) denote an ordered set, or population, of neural networks. The objective of our genetic programming algorithm is to evolve a network which, considered as a Boolean function mapping its inputs to its outputs, maximizes f.
Something should be said about this function f. One possibility is that for some fixed s and some fixed Boolean function F, f(N) is the average, over all input vectors x, of d(N(x,s),F(x)). To see the value of this, let us introduce a restricted class of one-step neural networks. Define a network to be hierarchical if each neuron has a unique level L, defined as follows. Level 1 neurons accept input only from external inputs, and for L>1 level L neurons accept input only from level L-1 neurons. If N is hierarchical, then the parameter s is most naturally set equal to the number of levels in N. Under this interpretation, f measures how well N computes the function F. Of course, this is not the only possibility. In a pattern recognition context, the output of N might be connected to a processor which interprets binary sequences as visual patterns. f(N) might be defined as, roughly, the structural complexity of the set of patterns which N outputs over a certain period of time. There are as many different forms for f as there are applications for neural networks.
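For illustration, here is a sketch of the first form of f for a hierarchical network of threshold neurons. I take d to be simple agreement between output vectors, so that higher f means N computes F more closely; this convention, like the data layout, is my own assumption.

from itertools import product

def run(network, x):
    """network: a list of levels; each neuron is (weights, threshold)
    and fires (outputs 1) when its weighted input meets the threshold."""
    for level in network:
        x = [1 if sum(w * v for w, v in zip(ws, x)) >= t else 0
             for ws, t in level]
    return x

def f(network, F, n_inputs):
    """Average agreement between N(x,s) and F(x) over all input vectors."""
    total = 0.0
    for x in product([0, 1], repeat=n_inputs):
        out, target = run(network, list(x)), F(x)
        total += sum(o == t for o, t in zip(out, target)) / len(target)
    return total / 2 ** n_inputs

# e.g. a one-level network computing AND with a single threshold neuron:
# f([[([1, 1], 2)]], lambda x: [x[0] & x[1]], 2) == 1.0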
So, we have a population of neural networks, and we are seeking to find the network that maximizes the real-valued function f. In order to apply genetic programming, it suffices to define the processes of mutation, selection and crossover. Mutation and selection are completely straightforward; as usual, the only problem is crossover.
We may proceed as above. Let us consider each neuron as a "subroutine" -- then the process dependency graph of a neural network is simply a schematic diagram of the connections of the network. Note that here, although different networks may have very different process dependency graphs, we do not have serious compatibility problems. One of the reasons that the IFS image compression example was so straightforward was that each of the subroutines being crossed over computed the same type of thing. Each computed an affine transformation and its associated probability. Here, similarly, the subroutines involved are all very similar: they are all formal neurons. Since neural networks may involve large numbers of neurons, one will generally have to exchange large sets of neurons to obtain significant effects from crossover.
The crossover operation is particularly simple in the case where N is a "tree network," in the sense that the process dependency graph of N is a tree. It is obvious that every tree network is a hierarchical network. If both N and M are tree networks with identical process dependency graphs, they may be crossed at any vertex i by exchanging the subtrees whose roots are node i.
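A sketch of this subtree exchange, representing a tree network as nested (neuron, children) pairs and assuming, as stated, identical process dependency graphs; the representation is my own choice.

import copy, random

def subtree_paths(tree, path=()):
    """tree: (neuron, [child trees]). Yield the path to every vertex."""
    yield path
    for i, child in enumerate(tree[1]):
        yield from subtree_paths(child, path + (i,))

def cross_trees(N, M):
    """Exchange the subtrees rooted at a randomly chosen vertex."""
    N, M = copy.deepcopy(N), copy.deepcopy(M)
    path = random.choice(list(subtree_paths(N)))
    if not path:                    # crossing at the root swaps everything
        return M, N
    np, mp = N, M
    for i in path[:-1]:             # walk to the parent of the chosen vertex
        np, mp = np[1][i], mp[1][i]
    i = path[-1]
    np[1][i], mp[1][i] = mp[1][i], np[1][i]
    return N, M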
Nonhierarchical and Multistep Neurons
For a neuron in a nonhierarchical network, there is no general way to distinguish the effect of the external input sequence at time t-r from that of the external input sequence at time t-s. The fact that a neuron fires at time t may be due to a combination of fragments of input sequences from various times.
To see the problem with this situation, consider Figure 23: a network containing 4 neurons, in which neuron 4 has two active synapses, one connected to the output of neuron 1, and one connected to the output of neuron 3. The two inputs of neuron 1 are directly connected to the first 2 bits of the input sequence; the two inputs of neuron 2 are directly connected to the second 2 bits of the input sequence; and the one input of neuron 3 is directly connected to the output of neuron 2. Then, when neuron 4 fires at time t, it is responding whenever the schema recognized by neuron 1 is present in the input sequence at time t-1, and the schema recognized by neuron 3 is present in the input sequence at time t-2.
The presence of multistep neurons in a network can have a similar effect. For instance, consider Figure 24 -- here neurons 1, 2 and 3 have relaxation periods of one time step, and neuron 1 has a period of latent addition of 4. The connection between 1 and 2 is inhibitory, the connection between 1 and 3 excitatory. Neuron 1 will fire at time t if neuron 3 has fired at times t, t-2, and t-4, and neuron 2 has not fired more than twice over the interval [t-4,t].
These observations imply that nonhierarchical or multistep neural networks are not useful for evolving specific Boolean functions. However, they may indeed be useful for evolving functions that map sets of sequences into sets of sequences. And this is absolutely essential, because the problems confronting real organisms often involve recognizing patterns that occur over time.
To take a physiologically absurd "toy" example, suppose that the brain contained a "rain-sensing" neuron which fired whenever, in the past few seconds, both thunder and lightning had been experienced. If one input of this neuron were connected to a "lightning-sensing" neuron, and the other input were connected to a "thunder-sensing" neuron, then it would be sensible for this neuron to have a period of latent addition somewhat greater than one, to account for the fact that when a rainstorm is coming thunder and lightning often occur at slightly different times. Say the period of latent addition were twenty. Then if lightning flashed at time 55 and thunder sounded at time 70, the rain neuron would fire at time 70.
In general, suppose that the input to an organism at time t is represented by a binary sequence. Then its input over a fixed period of time is a "window," as in (Goertzel, 1992c) -- a vector of r binary sequences, one for each time step in the interval (t-r,t). In order to optimize functions that map windows to real numbers, multistep or nonhierarchical neural networks are not merely useful but essential. This sort of problem is obviously of tremendous practical interest, but its connection with multistep, nonhierarchical neural networks has not yet been seriously explored.
Map-Level Genetic Programming
In the genetic programming algorithm discussed above, the "subroutines" were individual neurons or subtrees. The theory of neuronal group selection, however, suggests that neuronal groups themselves are a fundamental unit. This suggests the possibility of doing genetic programming on the level of neuronal groups, by defining a "subroutine" as a neuronal group.
The network of maps is inherently non-hierarchical and "multistep" -- it incorporates circular causation, and significant time lags. Therefore the discussion of the preceding subsection applies also to networks of maps. Networks of maps intermix the input of time t with the input of times t-1, t-2, etc.
In order to do map-level genetic programming, a brain would have to have the capacity to create new maps out of nothing. This presupposes that the brain has a tremendous ability to re-arrange itself. However, there is evidence that the human brain is indeed this flexible. For example, there is the case (Blakeslee, 1991) of the Italian woman who, before she was trained to be an Italian-English translator, held both English and Italian primarily in her left brain. After her translation training, her knowledge of English moved over to the right. The brain is evidently capable of moving whole networks of maps from one place to another. Therefore, it is not absurd to conjecture that it may be able to form new maps by combining pieces of old ones.
5.6 CROSSOVER AND EPIGENESIS
In Chapter 3 we briefly touched on the topic of embryology or epigenesis -- the process by which an organism is "decoded" from its DNA. It was suggested that this process involves a hierarchy of often structurally unstable processes, each process in the hierarchy making use of patterns formed by the processes below it in the hierarchy. In this section I will take up this suggestion again, pursuing it in more detail and relating it to the concept of genetic programming. In Chapter 6, this analysis will give us important new insights into the nature of mind -- and it will also serve us well in Chapter 7, when we discuss Brown's microgenetic law.
I must admit that I am not terribly happy with the model of epigenesis to be given in this section. It is biologically crude, and it is overly formal and abstract. Worse yet, it is so complex that it is difficult to say anything nontrivial about it. These flaws are fairly serious, and they might even be fatal -- were it not for the fact that there is no more tractable general model of the epigenetic process (Goodwin and Saunders, 1982). We need to talk about the relation between embryology and thought, and lacking a more graceful way to do so, we must simply bite the bullet. The faint of heart may wish to skim over this section lightly -- but I am convinced that there is something important to be learned.
Multiboolean Forms
First of all, we will need to define a certain hierarchy of function spaces. I will call the functions in these spaces simple multiboolean forms, in analogy to the multilinear forms of advanced calculus. Let S be some set of finite and/or infinite binary sequences. A 0-boolean form is an element of S, and a 1-boolean form is a computable function from S to S. Let Sk denote the set of all k-boolean forms. Then, for k>1, a k-boolean form is a map from S to Sk-1. Thus, 2-boolean forms generate boolean functions out of sequences, 3-boolean forms generate 2-boolean forms out of sequences, etc. Any entity which is a k-boolean form, for any k, may be called a simple multiboolean form; the number k is the order of the form.
Simple multiboolean forms are, appropriately, a simple idea. However, we will also need a slightly less natural variation on the theme of multiboolean forms. Fix an integer m. Define a directed 0-boolean form as a 0-boolean form, and define a directed 1-boolean form as a 1-boolean form together with a list of numbers drawn from {1,...,m}. In general, denote the space of directed k-boolean forms as Dk. Define a directed k-boolean form, for k>1, as a computable function from S to Dk-1 together with a list of numbers drawn from {1,...,m}. A directed multiboolean form, or DMF, is any entity which is a directed k-boolean form for some k.
Now, let us consider a binary sequence s which can be partitioned into disjoint subsequences s=s1s2...sm, so that at least one of the si encodes a DMF. A sequence of this form can be decoded in a natural way. Each of the subsequences that corresponds to a DMF comes equipped with a list of numbers drawn from {1,...,m}. Then, the first step in decoding is to have each DMF act upon the subsequences indicated by its list, and thus create new DMF's. For instance, suppose m=100, and s5 corresponds to the directed 6-boolean form (B5,{1,6,9}), where B5 takes sequences into directed 5-boolean forms. Then in this first phase of decoding, the directed 5-boolean forms B5(s1), B5(s6), and B5(s9) are created.
In the second phase of decoding, all the DMF's created in the first phase are allowed to act upon the subsequences indicated by their lists. For example, suppose the directed 5-boolean form B5(s1) was created in the first phase. Suppose B5(s1)=(B5,1,{80,6,19}), where B5,1 is a function that maps sequences into directed 4-boolean forms. Then, in this second phase, the directed 4-boolean forms B5,1(s80), B5,1(s6), and B5,1(s19) are created.
Continuing in this manner, in the rth phase of decoding the DMF's created in the (r-1)th phase are allowed to act upon the subsequences indicated by their lists. After a finite number of steps this process will terminate: there will be nothing but zeroth-order DMF's left. To be precise, the number of steps required is the order of the maximum-order DMF encoded in the original sequence. The result of the decoding may be defined as the collection of all zeroth-order forms created along the way.
Each zeroth-order form has a unique address, corresponding to the sequence of DMF's that led to it. For instance, if the sequence q was the product of applying the order 6 DMF encoded in s5 to s1, and then applying the order 5 DMF encoded in B5(s1) to s6, then the address of q will begin with 516.... One may collate all the zeroth-order forms created into one sequence by arranging them in "alphabetic" order by address. This collated sequence, finally, may be called the phenotype corresponding to the original sequence s. This process of encoding a phenotype I call hierarchical enzymatic encoding.
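The whole decoding process is easy to express as a sketch. Here a directed k-boolean form is modeled as a triple (k, B, targets), with B returning either another such triple (for k > 1) or a plain sequence (for k = 1); addresses are built as digit strings, so I assume m <= 9 for readability. The representation is mine, chosen only for illustration.

def decode(subseqs, root, root_label):
    """Hierarchical enzymatic decoding. subseqs: the list s1,...,sm.
    root: the single DMF present in the first phase, as (k, B, targets).
    Returns the phenotype: zeroth-order forms collated by address."""
    active = [(root, str(root_label))]
    leaves = {}
    while active:
        next_active = []
        for (k, B, targets), address in active:
            for t in targets:
                child = B(subseqs[t - 1])      # the DMF acts on s_t
                if k == 1:
                    leaves[address + str(t)] = child   # a plain sequence
                else:
                    next_active.append((child, address + str(t)))
        active = next_active
    return "".join(leaves[a] for a in sorted(leaves))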
GMF Dynamical Systems
This may seem to be a rather convoluted way of getting a phenotype out of a genotype. However, it is actually much simpler than the biological process of epigenesis. There is a certain similarity, in that DNA does work by creating simple components which act on each other to build more complex entities, continually referring to the DNA for guidance. Then these more complex entities act on each other to build yet more complex entities, still continually referring to the DNA for guidance, etc. Hierarchical enzymatic encoding is but a greatly oversimplified model of this process.
To get a more accurate model, one may define an (n,k)-boolean form as a map from n-boolean forms to k-boolean forms. If k exceeds n, then this sort of form is in a sense capable of generating complexity. These are generalized multiboolean forms, or GMF's.
If one has a sequence s composed of subsequences some of which encode (n,k)-boolean forms, the "decoding" process described above does not make sense, since it is not just subsequences on which the GMF's created at each phase act. One has to somehow specify what GMF's a given GMF will act on at each time, but there is no obvious way to do so, short of assigning each GMF an "affinity" for GMF's possessing certain characteristics. Real enzymes, for example, seem to possess affinities based on three-dimensional "positional information" -- they know what should be in a certain position, if everything has gone all right so far, so they act on whatever entity happens to be in that position.
Assigning "affinities" may lead to a highly complex discrete dynamical system, composed of entities which continually combine to produce new entities, all based ultimately on the information contained in the root sequence s. This sort of process might be might be called a GMF dynamical system. The state of the system, at each time, is a set of GMF's.
GMF dynamical systems are very epigenetic in spirit, and they also appear to be totally intractable from the mathematical point of view. Walter Fontana, a scientist at Los Alamos National Laboratory, is currently experimenting with a programming language AlChemy that naturally supports dynamical systems similar to GMF dynamical systems, in which various entities continually act on each other to create new entities. This language would be an excellent vehicle in which to experiment with the ideas of this section.
The process of continually decoding a sequence into a GMF dynamical system, I call enzymatic encoding. It is, I believe, an intuitively faithful model of the transition from genetic material to biological organism.
Simply by pondering the concept of a generalized multiboolean dynamical system, one sees the incredible gap between genetic function optimization and real evolutionary optimization. The structure of common mathematical objective functions has nothing to do with the structure of real biological fitness functions. And I suggest that the operation of crossover has everything to do with the peculiar structure of real biological fitness functions.
Amenability to Crossover
For the sake of relative tractability, let us return to the simpler case of hierarchical enzymatic encoding. The general case is of course of greater interest, but it is so complex that we are not at present able to say much about it. In the hierarchical case, we may portray the process of decoding as a tree, divided into a number of levels. The vertices of the tree correspond to the DMF's created during decoding, and there is an edge between two vertices if one of the two corresponding DMF's is created by the other at some point during decoding. At the kth level reside the DMF's created in the kth phase of decoding.
For simplicity, let us assume that in the first phase only one directed multiboolean form is present. This way all the final sequences occur at the same level, and the phenotype can be obtained by simply juxtaposing, in proper order, the sequences corresponding to the leaves of the tree.
The meaning of this tree is clearest if, at each node, one writes the number of the sequence that led to it. If the children of each vertex are written in increasing order, then the phenotype can be obtained by juxtaposing the sequences corresponding to the leaves, from left to right. An example tree is given in Figure 26.
To see the relation with the crossover operation, let us assume that the phenotype has nice properties with respect to fitness. Denote the sequences corresponding to the leaves by t1, t2,...,tN, and the phenotype by t = t1t2...tN. Suppose that fitness(t) = f1(t1) + f2(t2) + ... + fN(tN) + g(t1,...,tN), where at least some of the fi are competitive in norm with g. This means that the phenotype consists of components which contribute separately to fitness to a significant extent.
For an intuitive "toy" example, assume that a person has only three significant qualities: t1=strength, t2=intelligence, and t3=physical beauty. Each of these enhances a person's fitness on its own, to a certain extent; thus the factors f1(t1), f2(t2), f3(t3) are nonnegligible. But the combination of any two of the traits may yield a fitness above and beyond the sum of the fitnesses bestowed by the individual traits. For instance, a person both strong and intelligent may be a dozen times more likely to survive being stranded in the jungle for a year than a person who is merely strong, or a person who is merely intelligent. Thus the factor g(t1,t2,t3) is also nonnegligible.
Given all these assumptions, let us consider, for the sake of illustration, the totally unrealistic special situation in which each number appears on at most one vertex. This means that the different subsequences ti are created by distinct processes running in parallel. In this case, changing one of the sk in the original sequence will affect only one subsequence ti of the phenotype. Very roughly speaking, it seems clear that, on average, the higher up in the tree sk occurs, the more a change in sk will affect the ti which depends on it. This point will be important in Chapter 7, when we discuss the relation between epigenesis and natural history.
I suggest that, in this special case, what is necessary in order for the fitness function to be amenable to crossover is the following. Assume the vertex corresponding to sq(i) is a child of the vertex corresponding to sq(i-1) for i=1,...,M, where sq(M) corresponds to a leaf. Let r(i) denote the ith smallest of q(1),...,q(M). Then the fitness of the schema
s1 ... sr(1)-1 * sr(1)+1 ... sr(2)-1 * ... * sr(M)-1 * sr(M)+1 ... sm
(that is, the schema obtained from s1...sm by placing the "don't care" symbol * at the positions r(1),...,r(M), while sq(0) remains among the defined entries)
should not tend to decrease too rapidly as M (the order of the form determined by sq(0)) increases. That is, a subsequence si that determines a higher-order form ought to be a fairly good predictor of the fitness of the sequences that si helps to produce. Naturally it will never be a perfect predictor; otherwise the lower-order forms would serve no role.
For example, if the fitness of this schema were always bounded below by fi(ti)/M, where ti is the component corresponding to the leaf sq(M), this would be much more than enough. In this case it is very easy to see that the fitness function is amenable to crossover: the genes that build a good component then retain a significant share of that component's fitness contribution even when everything beneath them is left unspecified, so crossover is rewarded for reassembling them.
On the other hand, if this general property does not hold to a significant degree, then there is as little reason to expect amenability to crossover here as there is in Rn. Most likely, the only really fit schemata will be the ones corresponding to the numbers on the leaves.
Remember, we have been considering the special case in which each number appears on at most one vertex. In general, when each number appears in several different places, that is, when the epigenesis of different characters is interactive, a change in one of the sk will affect a number of the ti. This situation is considerably more complicated, and its analysis is an open problem. However, one would suspect that a similar condition holds.
These are very sticky matters, but I feel very strongly that they are worth poking into. What is at stake here is nothing less than the evolvability of organisms: the relation between the structure of an organism and the nature of its genetic code. In Chapter 3 we proposed that biological structural instability is often caused by hierarchies of slightly structurally unstable processes. That general idea fits in well with the more specific model given in this section. But here we have tried to go beyond vague generalities, and explore the question of how well sexual reproduction works at creating fitter organisms out of less fit organisms. One cannot answer this question without getting into the nitty-gritty of how genetic codes build organisms. We have proposed that genetic codes build organisms by encoding multiboolean forms that act on one another in certain prescribed orders. In this context one obtains trees of embryological process that can be crossed over with one another, just as process dependency graphs will be crossed over with one another in Chapter 6. And one can ask for conditions under which crossing over two fit trees might reasonably often tend to yield a fitter tree. We have suggested one such condition, admittedly under extremely restrictive assumptions. There is much work yet to be done.
5.7 EVOLUTIONARY MUTATION ON UNCORRELATED FITNESS LANDSCAPES
As a postscript to this chapter about sexual reproduction, let us consider some recent computer simulations of asexual reproduction. These results are a bit of a digression, but they suggest general principles of no small philosophical interest, and they also tie in with the concept of directed evolution, briefly discussed in Chapter 3.
For starters, let us consider a very simple model of evolution in the absence of crossover: the following "evolution game," which models a protein as a (say, binary or base 4 or base 20) sequence. At time i, the protein xi gives rise to one offspring, which is a mutant of itself. Then the protein and its offspring battle it out. Only one wins the battle, namely the fitter of the two. The winner is xi+1; the loser is gone.
Hagen, Macken and Perelson (1991) have studied this game mathematically, under the assumption that the fitnesses of different proteins are completely uncorrelated. They model a protein as a base-D sequence (their sequences are not necessarily binary), and assume that for each sequence x the fitness f(x) is drawn independently from a certain probability density g on [0,1]. Under this procedure, if x and y are unequal, f(x) and f(y) are statistically independent.
More precisely, they consider the following evolutionary process, beginning from a random initial sequence x0:
Evolutionary Mutation Algorithm
For i=0,1,2,3,...
1. Mutate xi in one place to obtain y
2. If f(y) > f(xi), then xi+1=y, else xi+1=xi
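For readers who like to experiment, here is a small Python sketch of this game. The uncorrelated landscape is implemented by memoizing an independent uniform fitness value for each sequence encountered; as noted below, the choice of the uniform density over any other g is immaterial. The alphabet size, sequence length, and step count are arbitrary choices of mine, and the fixed step count stands in for running the walk until a local maximum is reached.

    import random

    D, L = 4, 20       # alphabet size and sequence length (illustrative)
    _cache = {}

    def f(x):
        # Uncorrelated fitness: each distinct sequence gets an
        # independent uniform [0,1] value, memoized so that repeated
        # lookups of the same sequence agree.
        if x not in _cache:
            _cache[x] = random.random()
        return _cache[x]

    def mutate(x):
        # Change one randomly chosen position to a different symbol.
        i = random.randrange(len(x))
        new = random.choice([c for c in range(D) if c != x[i]])
        return x[:i] + (new,) + x[i+1:]

    x = tuple(random.randrange(D) for _ in range(L))  # random x0
    for step in range(100000):
        y = mutate(x)
        if f(y) > f(x):   # the fitter of parent and offspring survives
            x = y

    print("final fitness:", f(x))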
Because the function f is assumed to be uncorrelated, they are able to give a complete statistical analysis of this process.
One of the most interesting results of their study is that the nature of the probability density g is irrelevant. The statistical properties of evolutionary mutation are independent of g, not only qualitatively but numerically. This is because the algorithm makes no reference to the numerical values of fitnesses, but only to their order structure: it compares fitness values, but does no arithmetic with them.
Counterintuitively, even an uncorrelated function has some structure. For example, it is easy to see that for an uncorrelated function, a higher local maximum will tend to have a larger basin. If x is a local maximum, all the sequences that can be obtained by mutating x in one place must have a lower fitness than x. But then, the larger f(x) is, the larger, on average, will be the fitness of its one-mutation neighbors. And if a one-mutation neighbor y of x is fitter than average, then the probability that y is fitter than its other one-mutation neighbors, so that uphill paths through y lead to x, is larger than average.
Since a higher local maximum has a larger basin, it follows that the local maxima found by evolutionary mutation will, on average, be fitter than the average local maximum. This intuitive argument can be formalized, but it gives no numerical information. Hagen, Macken and Perelson use asymptotic formulae to estimate that, on average, evolutionary mutation will arrive at a local maximum approximately 38% fitter than the average local maximum.
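This prediction is easy to check numerically on a landscape small enough to enumerate. The following sketch plays the evolution game to convergence from many random starting sequences and compares the fitness of the maxima it finds with that of the average local maximum. The sequence length and trial count are arbitrary, and one should not expect the exact 38% figure, which comes from Hagen, Macken and Perelson's asymptotic analysis; the point is only the direction of the effect.

    import random, itertools

    L = 10   # binary sequences of length 10: small enough to enumerate
    fit = {x: random.random()
           for x in itertools.product((0, 1), repeat=L)}

    def neighbors(x):
        return [x[:i] + (1 - x[i],) + x[i+1:] for i in range(L)]

    def play(x):
        # The evolution game: a random one-mutation offspring replaces
        # the parent whenever it is fitter; stop at a local maximum.
        while any(fit[y] > fit[x] for y in neighbors(x)):
            y = random.choice(neighbors(x))
            if fit[y] > fit[x]:
                x = y
        return x

    maxima = [x for x in fit
              if all(fit[y] < fit[x] for y in neighbors(x))]
    found = [play(random.choice(list(fit))) for _ in range(2000)]

    print("average local maximum:  %.3f"
          % (sum(fit[x] for x in maxima) / len(maxima)))
    print("average maximum found:  %.3f"   # noticeably higher
          % (sum(fit[x] for x in found) / len(found)))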
Grace under Pressure: A Modified Evolution Game
Next, consider the following modification of Hagen, Macken and Perelson's "evolution game." At time i, the protein xi gives rise to one mutant offspring, and then the protein and its offspring battle it out. Only one wins the battle and becomes xi+1, but now it is not always the fitter one. The rules are as follows:
1) As in the simulations of Hagen et al, the offspring has an inherent advantage due to its youth, so that if the offspring is fitter, it will always prevail.
2) Let the positive parameter Ti quantify the "benevolence" of the environment. Then, if Ti is large enough _ so that the environment is very benevolent _ the offspring is reasonably likely to prevail even if it is less fit. The larger Ti is, the more likely this is.
3) The environment is getting less and less benevolent _ Ti is gradually decreasing as the time parameter i increases.
This modified evolution game may be interpreted as a simplistic simulation of the evolution of a single protein in an environment which grows gradually harsher. Initially, survival is not strongly correlated with fitness. But by the end of the time period, there are very few chances for a less fit sequence to survive.
It should be emphasized that this concept of "increasing harshness" makes no sense if fitness is defined tautologically in terms of survival, or rate of reproduction. But if one accepts a physical or informational definition of fitness, the concept is perfectly natural. In harsher environments, it will be more necessary for an organism to match the specific properties of the environment. In less harsh environments, natural selection will have less of an effect, because survival will not be so closely correlated with fitness.
More precisely, the benevolence indicator of an environment might be defined as the sum over all organisms in the environment of the quantity
[number of offspring]/[fitness].
This number has little meaning if one is speaking only of one environment, in isolation. But if one is comparing two different environments A and B, which contain mostly similar organisms, then it is meaningful to say that, if the benevolence indicator of A is greater than that of B, B is a harsher environment than A.
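In code, the comparison might look as follows. The two toy "environments" and all the numbers in them are invented; only the offspring-to-fitness ratio itself comes from the definition above.

    # Hypothetical illustration of the benevolence indicator: each
    # organism is a (number_of_offspring, fitness) pair.

    def benevolence(organisms):
        return sum(offspring / fitness
                   for offspring, fitness in organisms)

    env_A = [(4, 0.2), (3, 0.5), (5, 0.4)]  # many offspring despite
                                            # low fitness
    env_B = [(1, 0.2), (1, 0.5), (2, 0.4)]  # similar organisms, fewer
                                            # offspring

    print(benevolence(env_A))  # 38.5
    print(benevolence(env_B))  # 12.0 -> B is the harsher environment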
Mathematically, one convenient algorithmic representation of the modified evolution game is as follows, beginning from a randomly selected initial guess x0.
Modified Evolutionary Mutation Algorithm
For i=0,1,2,3,...
1. Mutate xi in one place to obtain y
2. If f(y)>f(xi), then let xi+1=y. Otherwise, let xi+1=y with probability exp( [f(y) - f(xi)] / Ti ), and let xi+1=xi with the remaining probability.
3. Quit when xi has been the same for kN steps
Ti quantifies the benevolence of the environment at the ith time step. For simplicity, it is assumed that the rate of decrease of benevolence is exponential, i.e. that Ti+1=bTi, where b<1.
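Here is a minimal Python sketch of the modified algorithm, reusing the f and mutate of the earlier sketch. The initial benevolence T0, the cooling factor b, and the use of a fixed "patience" count in place of the kN stopping rule are all illustrative guesses on my part.

    import math, random

    def modified_evolution_game(x0, f, mutate,
                                T0=1.0, b=0.999, patience=1000):
        # Modified evolutionary mutation with the exponential cooling
        # schedule T_{i+1} = b * T_i. 'patience' stands in for the kN
        # stopping rule: quit after that many steps with no change.
        x, T, unchanged = x0, T0, 0
        while unchanged < patience:
            y = mutate(x)
            if (f(y) > f(x)
                    or random.random() < math.exp((f(y) - f(x)) / T)):
                x, unchanged = y, 0   # the offspring prevails
            else:
                unchanged += 1        # the parent survives this round
            T *= b                    # the environment grows harsher
        return x

With the f and mutate defined earlier, a run is simply: x = modified_evolution_game(tuple(random.randrange(D) for _ in range(L)), f, mutate).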
The reader may recognize this algorithm; it is certainly not a new one. It is known in the optimization literature as "simulated annealing." The method originated from the equations governing the cooling (annealing) of a substance toward its low-energy ground state. In physical annealing, nature is minimizing energy; here we are maximizing a fitness function f.
Simulated annealing is reasonably effective for a variety of difficult optimization problems, and there are some very important problems for which it is extremely effective _ for example, the graph bipartitioning problem, which occurs in VLSI design. In the simulated annealing literature, the formula Ti+1=bTi is known as the "exponential cooling schedule". Much of the mathematical theory of simulated annealing deals with the cooling schedule Ti=1/log (ai+b), but the exponential schedule seems to be much more effective in practice.
Gregory Sorkin (1992), in an outstanding paper, has shown mathematically that simulated annealing with exponential cooling is effective for optimizing functions with "fractal" graphs. This is the first mathematical result backing up the practical effectiveness of the exponential schedule. Sorkin has done numerical experiments which demonstrate that graph bipartitioning problems, and a few other types of problems which simulated annealing is good at solving, tend to have roughly fractal graphs. Finally, he has made the conjecture that annealing is effective only for functions with fractal graphs.
Now, your typical uncorrelated fitness function f definitely does not have a fractal graph. The surprise is that, for certain fitness distributions, simulated annealing still works much better than simple evolutionary mutation. The requirement is that the distribution have a density g that looks like Figure 26; for example, g(x) = x^(-p) with .9 < p < 1 (suitably normalized) will do nicely. In this case, highly fit sequences are very rare. Figure 27 gives a few typical numerical results. For further numerical results, as well as a partial differential-equations analysis of the modified evolution game, see (Goertzel and Ananda, 1992a).
When highly fit sequences occur just as often as, or more often than, unfit or mediocre sequences, the annealing algorithm and the evolutionary mutation algorithm obtain just about the same final fitness. But the annealing algorithm takes a much longer time getting around to it. All the high-T steps are wasted. However, when unfit sequences are much more likely than fit ones, the extra effort put out by the annealing algorithm tends to pay off.
Biologically, what does all this mean? It suggests the following rather common-sensical maxim:
The Principle of Grace Under Pressure: Take three identical initial populations, and place one under constantly harsh conditions, one under constantly benevolent conditions, and one under increasingly harsh conditions. The third one will probably end up fitter (in the sense of being more closely fit to its environment).
Needless to say, this general maxim is not logically implied by our simple experiment, which takes place in the context of a simple two-protein evolution game, and relies upon the biologically false assumption of an uncorrelated fitness landscape. To demonstrate the maxim rigorously, one would have to consider those specific fitness functions induced by real ecosystems.
Combining Sexual and Asexual Reproduction
Finally, let us digress from our digression to briefly consider a different kind of computer simulation. In (Goertzel, 1992f) I describe the GSA algorithm, which combines the genetic algorithm (GA) with simulated annealing (SA), a point-mutation algorithm that can be interpreted as a model of the evolution of an asexually reproducing population. GSA maintains a population which, at each time step, gives rise to two offspring populations, one obtained by mutation, the other by crossover. The new population is then chosen by selecting elements from the union of the two offspring populations, with a bias toward those which are fitter.
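A sketch of one GSA generation, as I read the description above: the current population spawns a mutant population and a crossover population, and the next generation is sampled from their union with a fitness bias. The one-point crossover, the Boltzmann-style bias, and its temperature are my own stand-ins, not details taken from (Goertzel, 1992f).

    import math, random

    def crossover(a, b):
        # One-point crossover on equal-length tuples.
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def gsa_step(pop, f, mutate, temperature=1.0):
        # The population gives rise to two offspring populations, one
        # by mutation and one by crossover; the next population is
        # selected from their union, biased toward fitter elements.
        mutants = [mutate(x) for x in pop]
        children = [crossover(random.choice(pop), random.choice(pop))
                    for _ in pop]
        union = mutants + children
        weights = [math.exp(f(x) / temperature) for x in union]
        return random.choices(union, weights=weights, k=len(pop))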
Biologically, the GSA algorithm is a very peculiar construction. It concerns the evolution of a population that reproduces by both sexual and asexual reproduction. Our results show that this would probably not be a bad strategy, if it were physiologically feasible. Science-fictionally speaking, if the two forms of reproduction were properly balanced, they would provide us with a potentially endless supply of Einsteins, Babe Ruths and Winston Churchills _ as well as the genetic diversity required to generate such remarkable individuals.
The main result regarding GSA is that there are certain functions for which GSA will significantly outperform both SA and GA. And on the other hand, for any particular function, it is unlikely that GSA will take more than twice as long as either SA or GA.
These results raise the following question: is it possible that the fitness functions confronting asexually reproducing organisms are fundamentally different from those confronting sexually reproducing organisms? Perhaps more complex organism/environment interactions lead to qualitatively different fitness functions, say, "noisier" fitness functions, for which sexual reproduction works better. If this is indeed the case, the shift from asexual to sexual reproduction may conceivably have been a matter of adaptive optimization. Perhaps Nature shifted the algorithm to match the fitness landscape!
These questions are particularly important in the context of the hypothesis to be proposed a few sections later, that evolution in the mind/brain involves "asexual" and "sexual" reproduction simultaneously.
5.8 LEARNING TO WALK
In closing, let us consider a concrete psychological example of evolutionary optimization: the process of learning to walk.
For a parent, there are few events quite so exciting as an infant's first steps alone. But learning to walk is not only a behavioral milestone, it is a paradigm case for the study of child development. The theory of motor development is beset by many dilemmas: the role of maturation versus active learning, the meaning of regressions in development, the nature of developmental stages and transitions, and so on. And every last one of these difficulties may be addressed in the context of the simple question: by what process does independent upright locomotion emerge?
Over the last three years or so, several researchers have sought to approach this question from the point of view of dynamical systems theory. Dynamical systems analysis provides a very attractive alternative to the standard frequency-analysis techniques for interpreting developmental data. Furthermore, it has the advantage of relating motor development processes with processes of change in other contexts _ physics, chemistry, molecular biology, etc.
However, dynamical systems analysis provides rather little insight into the internal structures underlying the development process. For this reason, it seems useful to think about the development of upright locomotion from the point of view of discrete dynamics and discrete optimization.
Development and Dynamics: A Little Background
Over the last half-century, experimenters have accumulated a tremendous amount of data regarding the acquisition of independent upright locomotion. However, much less effort has been devoted to the question of the neural and psychological dynamics underlying this acquisition process. In the words of Thelen and Ulrich (1991), the great bulk of Western developmental science has "proceeded to make significant empirical gains documenting new behavioral forms and their precursors, successors and variants while remaining less concerned, and for the most part atheoretical, about issues of fundamental processes of change."
This aversion to speculation about dynamical processes has encouraged theories which explain development without invoking active learning. For example, the maturationist approach to the development of independent upright locomotion, pioneered by McGraw (1945), views learning to walk as a microgenetic process: as a new portion of the brain matures, a new component of walking skill is acquired. And this approach is not at all obsolete; it is alive and well in the work of Forssberg (1985), for example.
The most important exception to this anti-dynamics bias has been the work of Jean Piaget (1985). Piaget's constructivist point of view places a great deal of emphasis on underlying dynamical processes. Piaget's concept of equilibration, although somewhat vaguely defined, serves a very important role: it connects developmental psychology with the study of mathematical, physical and chemical dynamical systems (Prigogine and Stengers, 1984; Devaney, 1988).
The Piagetan perspective leaves the connection between developmental psychology and dynamical systems theory on a purely intuitive level. In recent years, however, a handful of researchers have striven to remedy this deficiency. For example, Bruncks (1992) has proposed to use nonlinear dynamics to explain the convergence of disparate early walking styles to a more standard "mature" walking style. And Thelen and Ulrich (1991) have provided an explicitly dynamical-systems-theoretic analysis of the development of treadmill stepping during the first year.
The work of Thelen and Ulrich must be classified as an unqualified success. Based on a study of nine infants, they were able to determine a specific "control parameter" of the dynamics underlying the process of learning to walk on a moving treadmill. However, in striking contrast to the Piagetan way of thinking, their approach was strictly "black box." They did not attempt to get at the neural or psychological processes underlying the dynamical patterns which they observed.
Thelen and Ulrich are certainly correct in studying developmental processes using the tools of dynamical systems theory. However, one suspects that their attitude toward structural explanations may be undesirably severe. The Piagetan concept of "developmental scheme" is a valuable one; and it does not contradict the presence of active learning and complex mathematical dynamics. It seems plain that, in order to really understand the processes underlying motor development, the viewpoint of dynamical systems theory must be integrated with the framework of developmental schemes. The computational psychology of The Structure of Intelligence provides many clues as to how to carry out this integration.
Learning to Walk as a Process of Discrete Optimization
The dynamical pattern observed by Thelen and Ulrich in the context of treadmill stepping is rather similar to the pattern observed by Bruncks (1992) in the context of independent upright locomotion. According to Bruncks's study, beginning walkers use a wide variety of styles. Then, little by little, as they become more adept, their walking styles converge to a relatively standard, "mature" walking style.
In (Goertzel, 1993b) it is proposed to explain these empirical observations by modeling the task of learning to walk as a discrete optimization problem. In this perspective, any possible strategy for learning to walk may be represented as a discrete optimization technique. And, conversely, every discrete optimization technique may be considered as a potential strategy for learning to walk. The problem is then to determine which discrete optimization algorithm is actually implemented in the brain.
The class of all discrete optimization techniques is a very large one. However, for reasons that will become clearer at the end of Chapter 6, it seems likely that one of two optimization strategies is at work here: either evolutionary mutation or genetic optimization. The main questions, from this point of view, are then:
1) which of these two algorithms is most relevant,
2) how much parallelism is involved in the implementation of these algorithms, and
3) what degree of "multilevelization" is involved in the implementation of these algorithms.
For the sake of brevity, we will not discuss the implications of the latter two questions here. Let us focus on the first one.
The evolutionary mutation algorithm, in its most general form, is as follows, where f is some real-valued function that measures the quality of various possible solutions for the problem at hand. For instance, in the case of walking styles, f(S) should measure the speed and stability of locomotion under the walking style S.
Evolutionary Mutation Algorithm: Most General Form
Given: a set P0 of N0 "initial guesses"
Let S1i, S2i, ... denote the Ni elements of Pi
For i=0,1,2,3,...
1. For j = 1,2,...,Ni
Mutate each component of Sji with probability pi to obtain Sji'
2. Select Ni+1 elements from the set
{S1i, S2i, ...} U {S1i', S2i', ...},
with a bias toward those elements of higher quality as measured by f
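In Python, the general form might be realized as follows. Guesses are tuples of components drawn from a finite alphabet, and the selection bias is implemented by small tournaments, which is my own choice; any fitness-biased selection would do. Population sizes are held constant for simplicity.

    import random

    def evolutionary_mutation(P0, f, alphabet, p, generations):
        # Each guess is a tuple of "components" from the given alphabet.
        pop = list(P0)
        for i in range(generations):
            # Step 1: mutate each component of each guess with
            # probability p.
            mutants = [tuple(random.choice(alphabet)
                             if random.random() < p else c
                             for c in S)
                       for S in pop]
            # Step 2: select the next population from parents plus
            # mutants, biased toward higher quality (3-way tournaments).
            union = pop + mutants
            pop = [max(random.sample(union, 3), key=f)
                   for _ in range(len(P0))]
        return max(pop, key=f)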
The important point here is that Step 1 requires each guess Sji to be decomposed into "components." In the case of walking styles, this means that every possible walking style must be associated with some finite list of "components." There are many ways of doing this. For example, in the context of treadmill stepping, one might simply take the continuous-variable characterization given by Thelen and Ulrich, and discretize each variable. Or one might seek a more neurologically natural representation.
Genetic optimization is a more complex algorithm; as discussed above, it involves taking various guesses and interbreeding them. It is certainly not inconceivable that genetic optimization plays a role in the selection of walking styles. However, a close look at the data suggests that learning to walk can quite likely be understood as a process of simple evolutionary mutation.
The dynamics of evolutionary mutation depend completely on the nature of the "quality" function f. In this case, however, the function f appears to be rather simply structured. The results of Bruncks, and the results of Thelen and Ulrich, both suggest that this function f is probably unimodal with small bumps.
More precisely, what is meant by this is as follows. Let d(,) be a metric on the set of all walking styles. Assume for simplicity that if S' is obtained from S by a mutation in one component, then d(S,S') = 1; and let D denote the diameter of the set of all walking styles. Note that, since f is a real-valued function whose domain is a finite set, f is necessarily Lipschitz continuous.
Proposition: The "quality" function f involved in upright locomotion has the following property: there exists some unimodal function g such that
max |f(S) - g(S)| < cK,
where the maximum is taken over all walking styles S, c << D, and K is the Lipschitz constant of f.
Back-of-the-envelope calculations, based on the representations used by Thelen and Ulrich, indicate that this proposition is indeed correct. However, the specific number c will of course depend on the particular representation involved.
This proposition implies that, for parameters suitably chosen, the evolutionary mutation algorithm will fairly quickly converge to the optimal walking style. In a nutshell, what this means is as follows. Suppose an infant is basically physically capable of walking, and simply needs to learn how. Then the learning process works by evolutionary mutation on the level of psychological schemes. Given a small set of schemes representing initial walk styles, the infant mutates these walk styles to obtain a larger pool of styles. Those styles which work best _ which give the fastest, most reliable locomotion _ are retained; the others are eliminated. Eventually, this process results in the evolution of an optimal "mature" style which permits rapid, stable locomotion. This mature style is the scheme for walking which will be retained into adulthood.
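As a toy check on this story, the following sketch runs population-based evolutionary mutation on a hypothetical space of walking styles, with a quality function that is unimodal up to small bumps. The six-component discretization, the "mature" style OPT, and the bump magnitude are all invented; the point is only that a diverse initial pool of styles converges to a single near-optimal style, as in Bruncks's observations.

    import random

    COMPONENTS, LEVELS = 6, 10   # hypothetical discretization of a style
    OPT = (7, 2, 5, 5, 3, 8)     # the "mature" style, invented here
    bump = {}                    # small fixed bump, one per style

    def f(S):
        # Roughly unimodal quality: negative distance to the mature
        # style, plus a small deterministic bump term (the "small
        # bumps" of the text).
        if S not in bump:
            bump[S] = random.uniform(-0.3, 0.3)
        return -sum(abs(a - b) for a, b in zip(S, OPT)) + bump[S]

    def mutate(S, p=0.2):
        return tuple(random.randrange(LEVELS)
                     if random.random() < p else c for c in S)

    # Diverse initial walking styles, as in the beginning walkers.
    pop = [tuple(random.randrange(LEVELS) for _ in range(COMPONENTS))
           for _ in range(20)]

    for generation in range(200):
        pool = pop + [mutate(S) for S in pop]
        pop = sorted(pool, key=f, reverse=True)[:20]  # keep what works

    print(pop[0], f(pop[0]))   # converges to (or very near) OPT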
What remains to be done is as follows:
1) to experiment with different representations of walking styles in terms of components
2) for each one of these representations, to estimate the rate of learning which it implies. If we assume that the Proposition holds, then this just means estimating the ratio c/D.
The ratio c/D is an interesting quantity. If we knew which representation the brain used, then we could use this quantity to estimate the rate at which infants, once they possess the requisite physical development, learn to walk. And, conversely, Darwinist thinking indicates that the representation used by the brain should be at least a local maximum of c/D. This suggests that, by experimenting with various representations, one might be able to make an educated guess as to the representation used by the brain.
Based on these preliminary ideas, it seems fairly likely that the discrete dynamics approach is capable of explaining much of the data on learning to walk. If this is indeed the case, then the present approach constitutes a significant improvement over the dynamical systems approach given by Thelen and Ulrich, inasmuch as it explains the same phenomena and yet harmonizes well with scheme-based and microgenetic analyses of the development of upright locomotion. The moral of the story is that discrete dynamics, applied appropriately, allows us to unify dynamical-systems thinking with procedure-oriented thinking.
This method of analysis is, of course, not limited to the study of walking. The development of independent upright locomotion is merely a convenient testing ground on which to refine the discrete dynamics approach to motor development. Once the analysis of walking has been completed, the next step will be more complex motion tasks. The only serious obstacle here is that our Proposition is about walking; it need not hold for general motion tasks. For instance, a task such as frisbee throwing, which may be done in many different ways, probably corresponds to a function f which is significantly multiextremal on all scales. Walking presents a fairly well-behaved objective function, and therefore it is a natural motion task with which to begin.