Steps Toward a Complex Systems Theory of Categorization and Similarity
Similarity and categorization are essential to thought. Dividing the world into categories underlies all forms of cognition. Similarity judgements guide the formation of analogies, and mold the structure of memory. Furthermore, it is the similarity and categorization of complex entities that is most crucial. Most of the entities dealt with by the mind are characterized by complex networks of vaguely defined, interlocking features; and the processes of similarity and categorization in such cases are correspondingly complex, mirroring the organization of the mental entities to which they relate.
There are many psychological theories of similarity and categorization. However, while these theories deal adequately with the data collected in simple laboratory experiments, they fall short of providing an intuitive picture of similarity and categorization as carried out in real-world situations. These theories generally work by projecting mental entities into vector spaces, and replacing interactions between mental entities with measurements of distance by standard metrics. This is a reasonable methodology, but its limitations are obvious: essentially, all one is doing is a fancy kind of curve-fitting; and, worse yet, a kind of curve-fitting that becomes impractical in all but the most elementary situations. The limitations of this theoretical model tend to guide empirical research: experimenters tend to focus on the simple situations to which the vector space approach applies best, rather than exploring more realistic and complex situations, for which new theoretical ideas might have to be devised.
In this paper, we will review the standard approaches to categorization and similarity, and then suggest the direction in which we believe a more psychologically realistic, complexity-oriented theory to lie. Unlike the standard approaches, the new "emergence-oriented" approach outlined here is motivated not by data analysis, but by fundamental psychological and computational theory. Detailed consideration of the empirical implications of the emergence-oriented approach is left for the future, but some basic ideas along these lines will be outlined.
In the emergence-based view, categorization is emergence-seeking, and similarity is intertransformability. In other words, we define a natural category as a collection of entities that gives rise to substantial emergent pattern. And we define two entities to be similar if they form a natural category of two, which turns out to be equivalent to stating that the two entities can be easily transformed into one another. These ideas are extremely general, but can be fleshed out by reference to the specific pattern-recognition and transformation operations accessible to the human mind. They provide a new and powerful conceptual framework for discussing similarity and categorization on the level of process dynamics.
The emergence-based perspective is new, but it fits in naturally with recent nonlinear-dynamical and complex-systems models of mind and behavior. And we will show that it also has important implications for computing. After presenting the new approach in a general, abstract way, we will then illustrate it by using it to derive a new algorithm for determining the natural categories within a large group of text documents.
There is a broad psychological literature on similarity and categorization, containing a variety of different models. Amidst this diversity, however, there are only a few foundational ideas.
First, there are two primary theories of categorization - the "Old Fashioned" view that categories are defined by necessary and sufficient conditions, and the "Modern" view that they are defined by family resemblances. A family resemblance is the similarity that different members of the same category naturally have with respect to each other. This similarity may reflect either direct comparison of one item or exemplar to another, or it may reflect comparison of each item to a common prototype, the best member of the category. These exemplar and prototype theories of family resemblance, together with rule-based theories of necessary and sufficient conditions, comprise the complete modern view of categorization mechanisms.
The family resemblance model depends explicitly on the computation of the similarity between items. A feature-based rule model does not; it depends only on identifying the component features of each item, so that the rule can be applied. Other rule models, such as the General Recognition Theory (Ashby, 1986), require that the items be represented spatially, so that discriminant functions can be computed.
So, in sum, categorization models can be taxonomized based on their assumptions about item representations (featural vs. spatial) and their mechanism for computing membership (similarity vs. rules). Leaving aside featural, rule-based categorization, what all of these mechanisms share is a reliance on the representation of individual stimuli as points (or distributions) in a multidimensional psychological space. This psychological space is defined by the similarities between items; hence the importance of a clear understanding of the nature of similarity is hard to overstate.
There are also two dominant models of the basis of similarity judgements: one is spatial and the other featural. The spatial model of similarity derives from nonmetric scaling models (Shepard, 1962), and assumes that items are decomposed into values along component dimensions. The similarity between items is a decreasing function of the distance between their values on these dimensions. The distance d between $n$-dimensional items $A$ and $B$ is just the lp metric, for some exponent p:

d_p(A,B) = [ |a_1 - b_1|^p + ... + |a_n - b_n|^p ]^(1/p)

where a_i and b_i are the values of $A$ and $B$ on the i'th dimension.
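For concreteness, the spatial model's distance computation can be sketched in a few lines of Python; the dimension values in the example are purely illustrative.

def lp_distance(a, b, p=2.0):
    """Distance between two items, represented as equal-length lists of
    dimension values, under the l_p (Minkowski) metric."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# p = 1 gives the city-block metric, p = 2 the Euclidean metric.
print(lp_distance([1.0, 0.5, 2.0], [0.0, 1.5, 2.0], p=1))  # 2.0
print(lp_distance([1.0, 0.5, 2.0], [0.0, 1.5, 2.0], p=2))  # about 1.414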
The featural model assumes that items are decomposed into lists of features, and that these features are compared to those of other items. The dissimilarity between two items is then a function of the features they share and the features that distinguish them.
The best-known example of this feature-set sort of model is the Contrast Model (Tversky, 1977). Other set-theoretic measures of similarity include Gregson's (1975) ratio measure. This measure is similar to a spatial city-block measure, but differs in detail (Gregson, 1984).
For similarity to be governed by either set-theoretic or distance measures, it must abide by certain laws. Namely: an item must be most similar to itself (minimality); the similarity between item A and item B must be the same as the similarity between item B and item A (symmetry); and the corresponding dissimilarities must satisfy the triangle inequality (d(A,C) <= d(A,B) + d(B,C)).
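Stated operationally, and assuming the dissimilarities are collected in a square matrix, these laws can be checked with a short Python sketch; the tolerance parameter is an assumption for numerical comparison only.

def obeys_metric_laws(d, tol=1e-9):
    """Check minimality, symmetry, and the triangle inequality for a
    dissimilarity matrix d, where d[i][j] is the rated dissimilarity
    between items i and j."""
    n = len(d)
    for i in range(n):
        for j in range(n):
            if d[i][i] > d[i][j] + tol:        # minimality: nothing is closer to i than i itself
                return False
            if abs(d[i][j] - d[j][i]) > tol:   # symmetry
                return False
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:   # triangle inequality
                    return False
    return True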
However successful these models are in accounting for performance in similarity rating and categorization experiments, they remain incomplete for the simple reason that people do not always obey the basic laws of similarity. One example of this can be seen in the existence of nonmonotonicities in similarity ratings. Normally, when the same feature is added to a pair of stimuli, the similarity between them increases (Tversky, 1982; Gregson, 1975). However, when two multidimensional, structured stimuli with a given similarity have an additional common feature added, the rated similarity between the items can actually decrease. Neither featural nor spatial models can account for this.
In a very interesting approach, Goldstone (1996) proposes that the decrease is due to a mismatch between the structure and the feature. If the common feature makes up a different part of one of the stimuli than the other, then it decreases similarity by forcing comparison between unlike components of the stimuli. This `alignment' model of similarity provides one of the only process accounts of the active construction of similarity ratings. It is also able to make predictions about reversals in similarity judgements as processing shifts from local to global (Goldstone, 1994). The notion of alignment, which requires effort, can be viewed as a special case of the more general notion of "similarity as intertransformability," to be presented below.
The SIAM model itself (Goldstone, 1994) is based on easily decomposed items with discrete feature values. These conditions are not general -- many complex items have fractal composition, many item pairs differ in their composition, and features (when identifiable) can of course differ continuously. However, it is not these aspects of the model that are important for its successes. It is the addition of an alignment process acting in concert with similarity evaluation that gives the model the dynamics necessary to produce nonmonotonicities. In the Goldstone model we see traditional psychological modelling moving slowly in the direction of complex systems ideas, of acknowledging and encompassing the full complexity of mind.
It must be noted that psychologists adhering to these standard approaches to similarity and categorization do not necessarily believe in them as mental models. One can use the theories to model data without believing that people really use metrics on n-dimensional space to compute similarities or to do categorization. The vector space view can be interpreted conservatively, as a fancy calculational tool which happens to agree well with whatever it is that people actually do. However, this conservative view very naturally blends into a more ambitious stance, which holds that the mind is actually concerned with mapping entities into vectors, and computing distances in vector space. And most importantly, it pushes research in certain directions: in the direction of analyzing experiments with simple stimuli, which can easily be projected into low-dimensional vector spaces.
There are a few theorists who have attempted to work outside the vector space paradigm, to an extent. The best examples are Robert Gregson's (1995) work on the gamma recursion, and Dave DeMaris's (1996) work on shape categorization and lattice dynamics. But these few exceptional research projects, which take nonlinearity seriously as a basis for understanding similarity and categorization, only apply to special cases. As yet the vector space approach is the only example we have of a broadly defined, precisely conceived approach to similarity and categorization.

The Need for a New Approach
Having surveyed existing approaches to categorization and similarity, we can now see the need for something new. The standard approaches work almost exclusively on the level of data modelling, and even there they are cumbersome except in very simple cases. Nothing is said about the mental processes underlying observed behaviors.
The work of Goldstone, Gregson and others moves in the direction of process modelling. However, this work, as currently expressed, deals only with special cases, and hence does not come anywhere near to capturing the full complexity of the phenomena involved. It seems definitely worthwhile to consider taking a more radical step, and constructing a new theoretical framework: one that places process dynamics at the center, rather than treating it as an add-on to the vector space approach.
Instead of vector spaces and clusters, the new approach presented here is based on concepts of process, transformation, pattern, and algorithmic information. For the sake of having a manageable term, we have called it the "emergence-based approach." The crucial idea is that natural categories are groups that give rise to emergent pattern; and entities are similar to the extent that they together form a natural category.
In most practical cases, the emergence-based approach will not yield drastically different results from the standard, vector-space approach: this is necessarily the case, because the standard approach is fairly effective, both in terms of explaining psychological data and in terms of dealing with computational problems. However, the new approach is broader than the standard approach: it deals easily with situations involving entities that cannot be mapped into vector spaces in any natural way. And it also leads to new computational algorithms, which, in some cases, will provide significantly superior performance.
In this section we will outline some general theoretical background needed for the appreciation of the emergence-based approach to similarity and categorization. In particular, we will discuss aspects of a general mathematical and conceptual construction called the psynet model of mind, developed by one of the authors in previous publications (Goertzel, 1994, 1997). The psynet model is highly abstract, but with knowledge and effort it can be applied to particular situations; the emergence-based theory of similarity and categorization, as developed here, is an example of such an application.
The psynet model views the mind as a collection of intercreating pattern-recognition processes. These processes give rise to a network of self-organizing attractor structures, including a "dual network" consisting of an hierarchical network in which patterns emerge from patterns which emerge from patterns, etc., and an heterarchical network in which processes interact with structurally and dynamically related processes. The general concepts of the psynet model are reviewed elsewhere in this volume, and so will not be repeated here. However, the emergence-based approach to similarity and categorization draws on one aspect of the psynet model in particular detail: the theory of pattern. Before moving on to discuss our new approach to similarity and categorization, a brief review of the formal theory of pattern will be required.
The Mathematical Concept of Pattern
In the psynet model, it is considered that the mind is not physical but rather relational: it is made of abstract structure, of pattern. There are many ways to formalize this abstract idea of pattern. For example, algorithmic information theory gives us a convenient way of studying pattern using the theory of universal Turing machines. However, the concept of pattern is arguably more basic than the theory of universal computation. In previous publications we have given a very simple mathematical definition of pattern, and used it to model numerous psychological and biological processes. Namely, one may characterize a pattern as a representation as something simpler. This is the definition that we will work with here.
In symbols, what this definition means is that a process p is a pattern in an entity X if
p is simpler than X
R(p), the result of carrying out process p, is a sufficiently good approximation of X
These two criteria need to be balanced against each other -- i.e., the definition of "sufficiently good" depends on how much simpler p is than X.
To make this characterization of pattern more rigorous, one needs to choose a particular set of "entities." In the Turing machine model, the "entities" X involved are binary sequences representing data, and the "processes" p are binary sequences representing programs, outputting binary sequences R(p).
In this context, let | | be a "simplicity function" mapping the union of the space of entities and the space of processes into the nonnegative real numbers; and let d be a metric on the space of entities, scaled so that d(R(p),X)/|X| = 1/c represents an unacceptably large difference. Then one reasonable definition of the intensity with which p is a pattern in X is given by the formula
[1 - c d(R(p),X)/|X|] [|X| - |p|] / |X|
The term [|X| - |p|]/|X| = 1-|p|/|X| gauges the amount of simplification or "compression" provided by using p to represent X. If p provides no compression, this yields 0; in the limit where p is entirely simple (|p|=0), the term yields its maximum value of 1 (100% compression). If, say, |p| = .5|X|, then one has 50% compression. Next, the term 1 - c d(R(p),X)/|X| allows for approximate matching or "lossy compression": it has a maximum of 1 when R(p), the result of carrying out process p, is exactly the same as X. The maximum intensity of a pattern, according to this formula, will be 1; and anything with an intensity greater than zero may be considered to be a pattern.
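A minimal computational sketch of this formula follows, with the simplicity values, the distance d(R(p),X), and the constant c supplied by the modeller; the example values are illustrative only.

def pattern_intensity(p_simplicity, x_simplicity, distance, c):
    """Intensity of p as a pattern in X:
    [1 - c*d(R(p),X)/|X|] * [|X| - |p|] / |X|.
    A value greater than zero means p counts as a pattern in X."""
    accuracy = 1.0 - c * distance / x_simplicity                 # lossy-compression term
    compression = (x_simplicity - p_simplicity) / x_simplicity   # simplification term
    return accuracy * compression

# Exact reproduction (distance 0) with |p| = 0.5*|X| gives intensity 0.5.
print(pattern_intensity(p_simplicity=50, x_simplicity=100, distance=0.0, c=2.0))  # 0.5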
The definition of a simplicity measure | | is a key point, but also a source of subjectivity. In the case of binary sequences, the simplest course is to define the simplicity of a sequence as its length. However, this is not the only possibility. One may also take into account other factors, such as the program's running time, and the "crypticity" of the program (the difficulty of discovering the program in the first place).
Finally, the relation between this general theory of pattern and conventional algorithmic information theory is not difficult to determine. The key is to restrict attention to the special case where the metric d is defined so that d(X,Y) is infinite unless X=Y. In this case, the first term in the definition of pattern intensity can be ignored. Also, in algorithmic information theory, it is conventional to assume that all computer programs are "self-delimiting," i.e. contain a segment specifying their own length. Given these assumptions, if one defines the simplicity of a binary sequence as its length, one concludes that the algorithmic information I(X) is the simplicity of the simplest pattern in X. A straightforward example is the binary sequence
X = 100100100100100100100100100100 ... 100100100100100100100100100100
consisting of 1000 repetitions of the string "100", so that |X| = 3000. Suppose that the program p = "Repeat '100' 1000 times," encoded in binary, comes to 200 bits. Then we have |p| = 200, while |X| = 3000, and so the intensity of p as a pattern in X comes out to [3000 - 200]/3000, or roughly .93.
Emergence and The Dual Network
Having established the concept of pattern, we can now move on to the crucially relevant concept of emergence. One may say that a process is emergent between X and Y to the extent that it is a pattern in the "union" of X and Y but not in either X or Y individually. Consider for instance the part of this sentence preceding the semicolon, and the part of this sentence following the semicolon; the meaning of the sentence does not exist in either part alone, but only emergently in the juxtaposition of the two parts. "Emergence" is a pragmatic version of the old idea of Gestalt; all the examples of gestalt in perception are also examples of emergent pattern.
The notion of emergence is, in the present view, essential to categorization and similarity. In the following section, we will define a natural category as a category that gives rise to much emergent pattern amongst its parts; and we will define two entities to be similar if they form a natural, emergence-producing category of two. Similarity and categorization are thus revealed as specialized manifestations of an underlying level of "pattern space dynamics."
Emergence has a crucial importance to the "dual network" structure, which the psynet model posits as a necessary component of mind. The dual network consists of an hierarchical network of complex processes emerging from and controlling simpler processes, superimposed on an heterarchical network in which processes connect to processes with common patterns. The hierarchical component of the dual network is characterized by the fact that a process that emerges from a group of other processes will often have control over those other processes. The heterarchical component suggests that these groups of processes under common control are likely to be intricately interrelated, giving rise to a great deal of emergent pattern on a pair-by-pair basis.
Basically, what the dual network suggests is that the mind's natural categories are those that give rise to a great deal of emergent pattern; and that mentally similar entities, which will tend to belong to the same natural categories, are characterized by the fact that they give rise to emergent pattern when considered together. The emergence-based view of similarity and categorization, as presented here, represents an abstraction of these conclusions drawn from the dual network aspects of the psynet model.
In the vector space approach, similarity is prior to categorization. Categorization proceeds by grouping similar entities together. (One might dispute this by observing that, technically speaking, categorization and similarity both depend on distance; but similarity and distance are merely scaled versions of one another, so we believe the point still holds.) In the emergence-based approach, on the other hand, it is most natural to begin with categorization. One defines what it means for a group of entities to be a natural category; and the similarity between two entities is then defined as the extent to which the two entities, in themselves, form a natural category.
We will begin by setting up some notation. Suppose one has a collection D={D[1], D[2], ..., D[N]} of entities. In the specific application to be considered below, these will be texts; but in general, they may be anything conceivable or perceivable by the mind: visual figures, sounds, smells, ideas, feelings, etc.
In principle, no assumptions are made regarding the space S in which the D[i] lie, except that S has a simplicity function on it, which gauges the "simplicity" of elements of S. Also, one must isolate a class of "acceptable" functions mapping S into S, so that this function class also has a simplicity function on it. These in-principle requirements are extremely general. In practice, however, it is highly convenient to assume that the D[i] are computationally specified, and that the acceptable functions involved are computable functions; in this case the simplicity function may be defined in terms of algorithmic information theory and its variants. For instance, in many computational applications, |X| may be very simply conceived as the length of the computer code representing X.
In the present discussion, we will refer throughout to one particular application of these ideas: the case where the D[i] are "text documents," i.e. sequences of words. In this case the computational framework is plainly applicable. The simplicity of a text is defined as its length; the simplicity of a computer program for manipulating text, if it needed to be considered, would be defined as a weighted sum of program length and program run-time. The emergence-based approach, we will show, leads to a new approach to the practical problem of finding the natural categories amongst a large collection D of text documents.
The starting-point for investigating categorization, in the approach taken here, is the notion of pattern. A categorization system is viewed as a special kind of pattern in a collection of entities.
Recall the definition: "A pattern is a representation as something simpler." In particular, a pattern was defined as a process p producing an entity R(p) that approximates another entity, X. Here we will look at a special case of this formalism, in which the process p consists of two parts: a "program" section f, and a "data" section Z. The program f will be assumed to operate in an approximately reversible manner: it transforms X into Z, and Z into an approximation of X. The "result" R(p) of the process p is then the approximation of X generated by applying f to Z.
In this framework, when one seeks a category within D, what one really seeks is: a program f that takes some entity Z as input, and outputs an approximation of some subset D' = {D[i(1)],...,D[i(k)]} of the original collection D, with the property that f and Z, taken together, are simpler than D' (as measured by the appropriate valuations). The quality of the approximation can be bad if the improvement in simplicity is high; or the improvement in simplicity can be slight if the quality of the approximation is excellent.
Exemplar-based categorization (as represented computationally by the k-means method) represents a low-accuracy, high-simplicity alternative. Suppose we divide D into categories, C[1],...,C[r], on the basis of r "exemplars" E[1],...,E[r]. Each entity is placed into the category whose exemplar it is closest to. The idea here is that each exemplar is a good approximation to everything in its category. Thus the exemplar list itself is an approximation of the entire collection of entities. It is not ordinarily a good approximation, but its simplicity is outstanding. So the list of exemplars is a pattern in the collection D.
This kind of categorization may be useful in some circumstances. It is at least as interesting, however, to explore higher-accuracy, lower-simplicity approaches. What we are going to propose now is, in fact, a potentially perfect-accuracy alternative: one that can operate entirely on the basis of exact patterns.
To move in this direction, suppose first that one has a particular transformation f for recognizing patterns in entities, and amongst groups of entities; and that one has a pair (f,Z) that is a pattern in the subcollection D'. The transformation f acts to transform D' into the simpler entity Z, and back again. In the text-categorization application, if D' is a collection of text documents, then f is a compression/decompression algorithm, and Z the compressed file produced by applying f to D'. (Note that in text applications the transformation f must do what it does without any loss of information. In applications such as image compression, on the other hand, one considers lossy compression and hence inexact pattern; it need not be possible to restore D' from Z in precise detail.)
The amount of emergent pattern obtained by using f to transform D' into Z may be quantified as
[ |f(D[i(1)])| + ... + |f(D[i(k)])| - |Z| ] / |D'|
In compression language, this is the extra compression obtained through compressing the collection as a whole, instead of compressing each document individually.
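As a rough numerical sketch, a generic byte-level compressor such as zlib can stand in for the transformation f; the choice of compressor, and the measurement of simplicity as byte length, are illustrative assumptions only.

import zlib

def compressed_size(text):
    """Byte length of text under zlib, standing in for |f(.)|."""
    return len(zlib.compress(text.encode("utf-8")))

def emergent_pattern(docs):
    """Extra compression gained by compressing the subcollection D' as a
    whole rather than document by document, normalized by the raw size
    of D' (following the formula above)."""
    individual = sum(compressed_size(d) for d in docs)
    joint = compressed_size("".join(docs))
    raw = sum(len(d.encode("utf-8")) for d in docs)
    return (individual - joint) / raw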
Next, suppose one wants to consider the effects of adding a new entity X to an existing subcollection D', obtaining a new collection D'+X. Here the quantity of interest is the "information added":
J[X|D'] = 1 - [ |f(D' + X)| - |f(D')| ] / |f(X)|
i.e., the extra compressibility for X that is obtained by considering X as a part of the class D'. The supremal value here is 1, which can never be achieved: J[X|D']=1 means that D' simplifies X down to nothing. On the other hand, J[X|D']=0 means that D' is of no use in compressing X. Conceivably, one might even have a negative value for this quantity, meaning that it is better to consider X on its own.
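Continuing the same sketch, with zlib again merely standing in for the scheme f, the information added can be estimated directly from compressed sizes:

import zlib

def information_added(x, docs):
    """Estimate J[X|D'] = 1 - ( |f(D'+X)| - |f(D')| ) / |f(X)|."""
    size = lambda s: len(zlib.compress(s.encode("utf-8")))
    joined = "".join(docs)
    incremental = size(joined + x) - size(joined)
    return 1.0 - incremental / size(x)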
Suppose, then, that instead of seeking categories based on similarity to exemplars, one seeks categories based on compressibility and emergent pattern. The basic principle of category-formation, in the emergence-based view, becomes:
A natural category is a collection D' so that, for X within D', the quantity J[X|D'] is large
The task of categorization, then, is to divide D up into mutually exclusive, or largely mutually exclusive, subcollections, with the property that each subcollection is a natural category.
We have intentionally phrased this principle in a vague way: it is a principle, not a mathematical definition. What is required is that some kind of average of J[X|D'] over X in D' is large. Whether one chooses to make this a maximum, an arithmetic mean, a p'th-power mean, etc., is not theoretically essential. From a psychological point of view, it is interesting to ask what kind of averaging humans implement here; and from a computational point of view, it is interesting to ask what kind of averaging gives the best results in a particular application area. However, these are subordinate technical questions, and should not be confused with the essential nature of the notion of a "natural category."
Computational Implementation
Computationally speaking, it is not difficult to see how this kind of categorization might be carried out. Most simply, one can envision an iterative categorization algorithm based on the k-means method, and roughly defined as follows:
Fix a number r of categories: call the categories C[1],...,C[r]
Assign all entities to arbitrary categories initially.
Repeat until satisfied: {
    For i = 1 to N {
        Place entity D[i] in the category C which maximizes J[D[i]|C]
    }
}

As in the k-means method, one may consider ways to vary the number of categories, to avoid local optima by swapping many entities at a time, etc. Alternately, far more sophisticated approaches are also simple to imagine: for instance, one might evolve categories by the genetic algorithm, using a fitness function FIT(D') defined as the average of J[X|D'] over all X in D' (or some variant thereof).
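A minimal runnable sketch of this iterative loop is given below, with zlib standing in for f, a random initial assignment, and a fixed iteration count standing in for "repeat until satisfied"; all of these are illustrative assumptions rather than part of the method as such.

import random
import zlib

def size(s):
    """Compressed byte length under zlib, standing in for |f(.)|."""
    return len(zlib.compress(s.encode("utf-8")))

def information_added(x, docs):
    """J[X|D'] as defined above, estimated via zlib."""
    joined = "".join(docs)
    return 1.0 - (size(joined + x) - size(joined)) / size(x)

def categorize(entities, r, iterations=10, seed=0):
    """Assign each entity to the category that maximizes J[entity|category]."""
    random.seed(seed)
    labels = [random.randrange(r) for _ in entities]
    for _ in range(iterations):
        for i, x in enumerate(entities):
            best_c, best_j = labels[i], float("-inf")
            for c in range(r):
                members = [e for j, e in enumerate(entities)
                           if labels[j] == c and j != i]
                if not members:        # skip empty categories in this sketch
                    continue
                j_val = information_added(x, members)
                if j_val > best_j:
                    best_c, best_j = c, j_val
            labels[i] = best_c
    return labels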
Connections with the Psynet Model
We have presented a formal model of categorization, and suggested how it might be implemented algorithmically. The psychological meaning of these ideas, in the context of the psynet model, was hinted at above; it is time to return to this topic in a little more detail.
The hierarchical network posited in the psynet model consists of process systems that recognize emergent patterns in their subordinate systems (while at the same time sending "control" messages to these systems). The recognition of categories is thus seen as a natural consequence of the dynamics of the hierarchical network. Category recognition is a consequence of the general drive toward emergent pattern creation/recognition that is present in the hierarchical network.
On the other hand, the heterarchical network posited in the psynet model consists of processes that interact with other "related" processes -- where the relatedness of X and Y is defined as the presence of emergent pattern between X and Y, recognized by other processes in the system. It is obvious that a natural category, whose components are dynamical processes, will be an effective subset of the heterarchical network. The natural dynamic of the heterarchical network, which seeks to maximize emergent pattern amongst neighbors, will automatically facilitate category formation.
We thus see that the emergence-based approach to categorization fits in naturally with a global system-theoretic picture of mind -- a picture of mind as a self-organizing, nonlinear, self-observing dynamical system. This contrasts with the vector space approach to categorization, which does not fit naturally into any well-articulated view of the mind as a whole, functioning system.
So much for categories. Now what about similarity?
In the emergence-based perspective, similarity follows immediately from categorization, rather than the other way around. The basic principle of similarity judgement in this context is that
The perceived similarity between two entities is a scaled version of the degree to which the two entities, taken together, form a natural category
In this context, the idea that categories are formed by grouping together mutually similar entities emerges as a kind of first-order approximation. Pairwise emergent patterns are only part of the story: natural categories are formed according to these patterns, and also according to patterns emergent among three entities, four entities, and so forth.
The similarity between two entities, then, is the extent to which each of the entities helps to simplify one's description of the other. Another way to say this is that two entities are similar if each of the entities is a transformed version of the other, under some "simple" transformation. If X is a simply-transformed version of Y, then knowing Y helps one to construct X: one needs merely to apply the transformation to Y, and presto, X appears. In a slogan, similarity is intertransformability. Or, more completely: similarity is intertransformability by psychologically simple and available operations.
In terms of the psynet model, this is the definition of similarity that is needed if one wishes to declare that the heterarchical network (the associative memory network) stores similar entities near to each other. The dependence on "psychologically simple and available operations" is crucial, for it implies that the similarity between two objects can never be understood in an objective, context-independent fashion. Similarity is part of the overall fabric of mind, arising from a substrate of emergent pattern recognition and complex transformation dynamics.
This is a new approach to similarity, and yet it is not without its resemblances to previous approaches. On a philosophical level, it is much the same as Goldstone's (1994, 1996) approach, mentioned above. In the context of similarity of visual forms, Goldstone considers the effort involved in aligning a feature on one entity with a feature on the other: what is this but a measure of the effort required to transform one entity into the other? Basically, while the traditional vector space view interprets "intertransformability" as the effort of transforming one feature vector into another, the Goldstone approach adds in an extra stage, having to do with the physical relations of different elements of the feature vector. This is a good idea, but somewhat awkward, due to the continued reliance on the vector-space model of mental entities. The perspective taken here acknowledges that the vector space approach captures important aspects of similarity and categorization, but places the emphasis where it belongs, on the underlying level of mental process dynamics.
This emergence-based framework for categorization and similarity judgement has implications in a variety of areas within psychology and computing. In this section we will describe one particular computational application, which served as the initial motivation for the more abstract ideas presented here: the problem of dividing a large collection of text documents into a collection of natural categories.
What is needed to make the emergence-based approach to categorization work, as an algorithm for text processing, is a rapid way to compute J[X|D'], the information added by compressing a document X relative to a collection of documents D'. This depends, obviously, upon the specific compression scheme f that one is implementing. The more sophisticated the compression scheme, the more difficult the computations will be, but the more useful the resulting categorizations. Here we will assume a standard compression method based on extraction of repeated strings. (In fact, all common text compression methods in use today are based on this same simple principle; they differ only in how they implement it; see Bell, Cleary and Witten, 1990.)
For simplicity we will assume in the present discussion that the most basic level of interest is the "word" level rather than the "letter" level. This involves the principle that in categorizing texts we are concerned primarily with syntactic and semantic information rather than morphological information. In practice, however, one might well want to explore letter frequencies as well.
Beginning on the word level, then, we will consider a compression algorithm that acts as follows. Suppose one has a text, or collection of texts, X. First, the frequencies of occurrence of all words in X are computed, along with the frequencies of all sequences of words of length up to k (k, a parameter, is the "order" of the compression scheme). Sequences of length L that do not occur at least L + m times are ignored (m is another parameter embodying "computational overhead"; m = 1 or 2 is reasonable). Then these words and sequences are sorted in order of frequency. Codewords are assigned, drawn from a fixed alphabet, with the principle that the most frequent words or sequences get the shortest codewords. The original text X is then replaced by a sequence of codewords. The compressed file Z consists of two parts: the sequence of codewords, and the codebook itself.
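A minimal sketch of this codebook construction at the word level follows; the use of integer ranks as stand-ins for variable-length codewords, and the exact threshold rule, are simplifying assumptions.

from collections import Counter

def build_codebook(words, k=3, m=2):
    """Count all word sequences of length up to k, keep a sequence of
    length L only if it occurs at least L + m times, and rank the
    survivors by frequency (rank 0 = shortest codeword)."""
    counts = Counter()
    for length in range(1, k + 1):
        for i in range(len(words) - length + 1):
            counts[tuple(words[i:i + length])] += 1
    kept = [seq for seq, n in counts.items() if n >= len(seq) + m]
    kept.sort(key=lambda seq: -counts[seq])
    return {seq: rank for rank, seq in enumerate(kept)}

One simple encoding choice, given such a codebook, is to greedily replace the longest matching sequences in the text with their codewords; the compressed file then consists of the codeword stream together with the codebook itself.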
This is compression of X in itself. How then does one compress X relative to another entity D'? This is simple enough. One computes the codebook corresponding to D'+X, and encodes X using this codebook. If X consists mainly of words and phrases that already occur frequently in D', then the codebook fragment of f(X) will be minimal, and so |f(D'+X)| - |f(D')|, the size of X compressed relative to D', will be much smaller than |f(X)| itself. On the other hand, if X has a lot of words that do not occur in D', or that do not occur frequently in D' (and hence have long codewords in D'), then the gain obtained by compressing X relative to D' may be small, or even nonexistent or negative.
If D' is large compared to X, then it may be inefficient to re-code D' every time one wants to consider the possibility of adding a new document X to it. A reasonable approximation in this case may be obtained by simply coding X in terms of the already-existing codebook for D', and making corrections for the additional terms in X that are not in this codebook (say: assuming a fixed code length for these additional terms). If it is indeed decided to add X to the document collection D', at this point the codebook of D' is rewritten.
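The approximation just described might be sketched as follows, with the fixed cost for words outside the codebook as an assumed parameter and a logarithmic, rank-dependent cost standing in for the length of a codeword:

import math

def approx_coded_size(words, codebook, unknown_cost=16):
    """Approximate cost, in bits, of coding the word sequence X against
    an existing codebook for D' (a dict mapping word tuples to ranks),
    charging a fixed cost for material the codebook does not cover."""
    max_len = max((len(seq) for seq in codebook), default=1)
    bits, i = 0, 0
    while i < len(words):
        match = 0
        # Greedily match the longest known sequence starting at position i.
        for length in range(min(max_len, len(words) - i), 0, -1):
            if tuple(words[i:i + length]) in codebook:
                match = length
                break
        if match:
            rank = codebook[tuple(words[i:i + match])]
            bits += math.ceil(math.log2(rank + 2))   # frequent entries get shorter codes
            i += match
        else:
            bits += unknown_cost
            i += 1
    return bits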
Finally, it is worth contrasting this emergent-pattern-based method with the vector-space approach, as it would be applied to the same problem. In the vector-space approach, one would begin by mapping each text into a vector. This is not difficult to do: lists of words and phrases, tagged by frequency, can be unravelled into vectors. A metric on multidimensional space is then used to determine the distance between texts; this is also simple enough, though there is a serious and largely arbitrary decision to be made as regards the weighting of differences between words versus differences between phrases of different lengths. Finally, a clustering method such as k-means is used to find natural categories amongst these vectors.
The vector-space approach misses all emergent patterns above the pair level. And even at the pair level, it is extremely problematic, due to the sterility of the vector representation. The distance between two vectors, according to standard vector space metrics, is not the same as the extent to which two texts enhance one another's compressibility. One can define a vector space metric corresponding to intercompressibility, and in this way force the vector space approach to be a simulacrum of the emergence-based approach, restricted to pairwise interactions. But this violates the whole spirit of the vector space approach. If one is going to move to a methodology where one adopts a special metric for each application, based on the structure of the original space of entities prior to the projection into multidimensional space, then one may as well abandon the vector space approach altogether.
The next step, regarding textual categorization, would be to go beyond mere statistics and deal with syntax. The repetition-based compression algorithms considered here are standard but simplistic, and superior performance can doubtless be achieved by methods that take into account the syntactic structure of texts. Transformational grammar comes into play here, for one is concerned precisely with transformations that map one sentence or text fragment into another.
We have outlined an entirely new approach to categorization and similarity judgement, based on emergent pattern and intercompressibility rather than on distance in vector spaces. We have shown how this new, emergence-based approach may be used to develop new computational techniques in one particular area, the determination of categories within a collection of texts. Numerous other applications within psychology and computing suggest themselves, and will be pursued in good time. In this final section, we will consider one particular gedankenexperiment, which illustrates the principles of the emergence-based approach quite well. This experiment has not yet been done, but experiments along these lines are planned for the near future.