Computer Technology Leads the Way to Post-Genomic Biology
June 6, 2001
As the excitement of having mapped the human genome fades into matter-of-fact acceptance, the genetics community is looking ahead to what’s called post-genomic biology. The next big challenge: figuring out how the genetic code actually does anything. How do these sequences of nucleotides decode themselves, making use of other molecules in their environment to create organisms like us, or even simpler one-celled organisms? The completion of the human genome project was one of those ends that was actually a beginning. It put us in a position to finally start asking the really interesting questions.
This is a very exciting area of research – and a tremendously difficult one as well. As yet there are no tales of grand triumph – only some minor victories, a lot of hard work and furious innovation, and the promise of countless victories to come. But the progress made so far has many lessons to teach – for example, regarding the remarkably tight interrelation between computer technology and biological research.
It’s hardly shocking that post-genomic biology is enabled by advanced computer technology every step of the way. After all, most branches of physical science have become thoroughly computerized – little of modern chemistry and physics could exist without computers. But it’s instructive to see just how many roles computers have played in the new genetics. Firstly, it’s only because of recent advances in experimental apparatus design, driven by computer engineering and robotics, that we are able to gather significant amounts of data about how genes build organisms. New “microarray” technologies like DNA chips (built like silicon chips) and spotted microarrays (built with robot arms) allow us to collect information regarding the expression of genes at different times during cell development. But this data is too massive and too messy for the human mind to fully grasp. Sophisticated AI software, used interactively by savvy biologists, is needed to analyze the results.
It’s not hard to see what the trend is here. Biological experiments, conducted using newfangled computer technology, are spinning us an increasingly detailed story of the microbiological world – but it’s a story that only increasingly advanced AI programs will be able to understand in full. Only by working with intelligent software will we be able to comprehend the inner workings of our own physical selves.
Gene therapy, the frontier of modern medicine, relies on the ability to figure out what combinations of genes distinguish healthy cells from diseased cells. This problem is too hard for humans to solve alone: it requires at the very least advanced statistical methods, and at most full-on computer cognition. The upshot? Rather than fearing AIs as movies like 2001 have urged us to do, we may soon be thanking AI programs for helping find the cure for cancer.
Artificial intelligence programs have never even come close to equaling humans’ common sense about the everyday world. There are two main reasons for this. First, most AI programs have been written to excel only in one specialized kind of intelligence – like playing chess, or diagnosing diseases – rather than to display general intelligence. And second, even if one does seek to create an AI program with general intelligence, it is still just a software program without any intuition for the human world. We homo sapiens sapiens have a special feeling for our physical and social environment – for simple things like the difference between a cup and a bowl, or between happiness and contentment. AI programs, even those that push towards general intelligence, can’t help lacking this intuition.
But the world of molecular biology is not particularly intuitive to human beings. In fact it’s complex and forbidding. It has much of the ambiguity of everyday life – there is not as much agreement as one would think about the meanings of various technical terms in genetics and molecular biology. But this ambiguity is not resolved by a simple tacit everyday understanding, only by a very advanced scientific intuition. The number of different patterns of genetic structure and activity boggles even the ablest human mind. In this domain, an artificial intelligence has much more to offer than in the world of everyday human life. Here in the microworld, human intuition is misleading as often as it is valuable. Artificial intuition can be tuned specifically to match the ins and outs of introns and exons, the turns and twists of DNA.
The new genetics has many aspects, but perhaps the most exciting of them all is the emerging study of gene and protein expression. The terminology here is both evocative and appropriate: Just as with a person, it’s not what a gene does when it’s just sitting there that’s interesting, it’s what a gene does when put in a situation where it can express itself!
At any given moment, most genes are quiet, doing nothing. But some are expressed, some are active. Now, using the new experimental tools, we can tell which. We can see how many genes are expressed at a given moment … and then a little later … and then a little later. In this way we can make a kind of map of genetic dynamics as it evolves. And by analyzing this map, using advanced computer software, we can extract a great deal of information about how genes go about their business. Which genes tend to stimulate which other genes. Which ones tend to act in groups. Which ones inhibit which other ones, preventing them from being expressed. And by applying the same analysis tools to proteins instead of genes, one can answer the same questions about proteins, the molecules that genes create and send around to do the actual business of building cells. These kinds of complex interactions between genes, and between genes and proteins, are the key to the decoding of genomes into organisms – which is, after all, what genomes are all about.
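The flavor of this kind of analysis can be caricatured in a few lines of code. The sketch below (gene names and expression values are entirely invented) correlates the time series of pairs of genes: a strong positive correlation hints at co-activation, a strong negative one at inhibition. Real inference from noisy microarray data requires vastly more sophisticated methods than a raw Pearson correlation, but the basic question being asked is the same.

```python
# Toy sketch: infer candidate gene-gene relationships by correlating
# expression time series. All data below is invented for illustration.
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

expression = {                      # gene -> levels at t0, t1, t2, t3
    "geneA": [0.1, 0.5, 0.9, 1.2],
    "geneB": [0.2, 0.6, 1.0, 1.3],  # rises with geneA: co-activated?
    "geneC": [1.1, 0.7, 0.3, 0.1],  # falls as geneA rises: inhibited?
}

for g1, g2 in combinations(expression, 2):
    r = pearson(expression[g1], expression[g2])
    print(f"{g1} vs {g2}: r = {r:+.2f}")
```

With real data the time series are thousands of genes long, riddled with noise, and the interesting relationships are often nonlinear and time-lagged, which is exactly why more intelligent software is needed.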
All this complexity is implicit in the genetic code itself, but we don’t know how to interpret the code. With microarray technology we can watch the genetic code interpret itself and create a cell, and by analyzing the data collected in this process, we can try to figure out exactly how this process of interpretation unfolds. And the potential rewards are great – the practical applications are tremendous, from drug development to disease diagnosis to genetic engineering and beyond.
It’s a straightforward enough idea, but the practical pitfalls are many. A huge host of tools from mathematics and computer science have been unleashed on the problem, both by researchers at major academic institutions, and by companies like Rosetta Inpharmatics (recently acquired by Merck, the major pharmaceutical firm) and Silicon Genetics, a gutsy and clever California start-up. New data analysis techniques come out every couple of months, each with its own strengths and weaknesses.
Massively Parallel Genomics
What is making this new genetics possible is a veritable revolution in the design of experimental apparatus for genetic analysis. And the same tools, with minor variations, also work for proteomic analysis, the study of protein expression. For the first time, with these new devices, biologists are able to study thousands or even millions of different molecules at once, and collect the results in a systematic way.
Chemists have long had methods for carrying out many simultaneous chemical reactions. Most simply, trays can be created with 96 or 384 wells, each containing a different chemical and a unique bar code. The last few years, however, have seen the development of methodologies that push far further in this direction – making possible experiments that scientists only a few years ago would have called impossible. The application of these new methodologies to the analysis of gene and protein data has led to a new area of research that may be called massively parallel genomics and proteomics.
Most of the work done so far has been in genomics; the extension to proteomic analysis is more recent. So I’ll talk about microarrays as used for genomic analysis; the proteomics case is basically the same from the point of view of data analysis, though vastly more difficult from the point of view of experimental apparatus design. (Many proteins are much harder than DNA to coax into sticking to the surfaces used in these instruments.)
There are several types of microarrays used in genomics, but they all embody a common methodology. Single stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate depending on the type of apparatus). The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands (“target molecules”) floating in the solution. (Hybridization refers to the formation of base pairs between complementary regions of two strands of DNA that were not originally paired).
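At bottom, hybridization is just complementary base pairing – A with T, G with C – between strands running in opposite directions. A minimal Python sketch of the idea (purely illustrative; it ignores partial matches, and RNA’s use of U in place of T):

```python
# Hybridization pairs each base with its complement (A-T, G-C),
# with the two strands running antiparallel to each other.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the strand that would hybridize perfectly with `strand`."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def hybridizes(probe: str, target: str) -> bool:
    """True only for an exact complementary match (real chemistry is fuzzier)."""
    return target == reverse_complement(probe)

print(reverse_complement("ATGC"))   # prints GCAT
```

As the article discusses below, real hybridization is not this clean: strands bind to near-complements too, which is a major source of noise in microarray data.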
Affymetrix’s technology, pioneered by Dr. Stephen Fodor, involves making DNA chips in a manner similar to the manufacture of semiconductor chips. A process known as “photolithography” is used to create a huge number of molecules, directly on a silicon wafer. A single chip measuring 1.28 cm X 1.28 cm can hold more than 400,000 “probe” molecules. The procedure of gene chip manufacture has been fully automated for a while now, and Affymetrix manufactures 5,000–10,000 DNA chips per month.
Affymetrix DNA chips have a significant limitation: the size of the molecules that can be affixed to them. So far they’re normally used with DNA/RNA segments of length 25 bases or less. Also, they are very expensive. It currently costs about $500,000 to fabricate the light masks for a new array design, so the technology is most appropriate when the same chip needs to be used again and again and again. The main example of this kind of use case is disease diagnosis.
On the other hand, spotted microarrays, first developed by Pat Brown at Stanford, are ordinary microscope slides on which robot arms lay down rows of tiny drops from racks of previously prepared DNA/RNA samples. At present this technology can lay down tens of thousands of probe molecules, at least an order of magnitude fewer than Affymetrix can manage. The advantage of this approach is that any given DNA/RNA probe can be hundreds of bases long, and can, in principle, be made from any DNA/RNA sample.
Note the key role of computer technology in both of these cases. Affymetrix uses a manufacturing technique derived from the computer hardware industry, which depends on thorough and precise computer control. Spotted microarrays depend as well on the inhuman precision of robot arms, controlled by computer software. Massively parallel genomics, like the mapping of the human genome itself, is a thoroughgoing fusion of biology and computer science – only here the emphasis is on computer engineering and hardware, whereas gene mapping relied upon fancy software algorithms.
There are other approaches as well. For instance, Agilent Technologies, a spin-off from HP, is manufacturing array makers using ink-jet printer technology. Their approach is interesting in that it promises to make practical the synthesis of a single instance of a given array design. Lynx Corporation is pursuing a somewhat Affymetrix-like approach, but circumventing Affymetrix’s patents by using addressable beads instead of a silicon wafer. And so forth. Over the next few years we will see a lot of radical computer-enabled approaches to massively parallel genomics, and time will tell which are most effective.
So how are these massively parallel molecule arrays used? Let’s suppose that, one way or another, we have a surface with a number of DNA/RNA molecules attached to it. How do we do chemical reactions and measure their results?
First, the target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified. The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region. In practical terms, what happens is that an image file is created – a photograph of the pattern of fluorescence emanating from the microarray itself. Typically the image file is then “gridded”, i.e. mapped into a pixel array with a region of pixels corresponding to each probe molecule. Then there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals.
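As a toy illustration of this quantification step (all pixel values below are invented), one crude approach is to take the median intensity of the pixels inside a spot and subtract a local background estimate taken from the pixels just outside it:

```python
# Toy spot quantification: median foreground minus median background.
# Real pipelines apply many more normalization steps, largely empirical.
from statistics import median

def spot_signal(patch, background_pixels):
    """Median foreground minus median background, floored at zero."""
    signal = median(patch) - median(background_pixels)
    return max(signal, 0.0)

patch = [812, 840, 795, 830, 825]        # pixels inside one spot
background = [102, 98, 110, 95, 105]     # pixels just outside it
print(spot_signal(patch, background))    # prints 723
```

The median is used here rather than the mean because it is less sensitive to the odd saturated or dead pixel, a small example of the noise-robustness the article keeps returning to.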
This data is very noisy, however. To get more reliable results, researchers generally work with a slightly more complex procedure. First, they prepare two related samples, each of which is colored with a different fluorescent substance (usually, one green, one red). They then compare the relative amounts of expressed DNA/RNA in the two samples. The ratio of green/red at a given location is a very meaningful number. Using this ratio is a way of normalizing out various kinds of experiment-specific “noise”, assuming that these noise factors will be roughly constant across the two samples.
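In practice the ratio is usually worked with on a logarithmic scale, which makes up- and down-regulation symmetric around zero. A minimal sketch (the intensity values are invented):

```python
# Two-channel ratio: dividing one channel by the other cancels noise
# factors common to both; log2 makes fold-changes symmetric around 0.
from math import log2

def log_ratio(red, green):
    """log2 fold-change of the red channel relative to the green one."""
    return log2(red / green)

print(log_ratio(800, 200))   # 4x higher in red:  2.0
print(log_ratio(200, 800))   # 4x lower in red:  -2.0
```

A gene equally expressed in both samples thus scores 0, and a doubling or halving scores +1 or -1, which is far easier to reason about than raw intensities.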
But even this ratio data is still highly noise-ridden, for a number of reasons beyond the usual risk of experimental error or manufacturing defects in the experimental apparatus. For one thing, there are many different factors influencing the strength of the bond formed between two single stranded DNA/RNA molecules, such as the length of the bonded molecules, the actual composition of the molecules, and so forth. Errors will occur due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match. This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones. Other laboratory conditions besides temperature can have similar effects. Another problem is that the “probe molecules” affixed to the surface may fold up and self-hybridize, thus rendering them relatively inaccessible to hybridization with the target.
All these issues mean that a single data point in a large microarray data set cannot be taken all that seriously. The data as a whole is extremely valuable and informative, but there are a lot of things that can go wrong and lead to spurious information. This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise. None of the data analysis methods in the standard statistical and mathematical toolkits pass muster, except in very limited ways. Much more sophisticated technology is needed – yes, even artificially intelligent technology, software that can build its own digital intuition as regards the strange ways of the biomolecular world.
The payoff for understanding this data, if you can do it, is huge. These data can be used for sequencing variants of a known genome, or for identifying a specific strain of a virus (e.g. the Affymetrix HIV-1 array, which detects a strain of the virus underlying AIDS). They can be used to measure the differences in gene expression between normal cells and tumor cells, which helps determine which genes may cause/cure cancer, or identify which treatment a specific tumor should respond to best. They can measure differences in gene expression between different tissue types, to determine what makes one cell type different than another. And, most excitingly from a scientific viewpoint, they can be used to identify genes involved in cell development, and to puzzle out the dynamic relationships between these genes during the development process.
We’ve seen that the actual experimental apparatuses being used all come in one way or another out of the computer industry. And that the analysis of large, noisy, complex data sets like the ones microarrays produce can only be carried out by sophisticated computer programs running on advanced machines – no human being has the mind to extract subtle patterns from such huge, messy tables of numbers. There is also another crucial dependency on computer technology here: the role of the Internet. The biology community has come to use the Net very heavily for data communication – without it, there is no way research could proceed at anywhere near its current furious pace.
Perhaps you’re a bit of a computer hacker and you want to try out your own algorithms on the data derived from microarray experiments on the yeast genome during cell development. Well, you’re in luck: the raw data from these experiments are available online at http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt. Download it and give it a try! Or check out Rosetta’s site, www.rii.com, and download some sample human genome expression data. Or, perhaps your interests are less erudite, and you’d simply like to view the whole human genome itself? No problem, check out the Genome Browser at http://genome.ucsc.edu/goldenPath/hgTracks.html.
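These files are plain tab-separated text, so getting one into a program takes only a few lines. Here’s a minimal sketch of loading such a file into a gene-to-expression-values dictionary – note that the column layout shown is invented for illustration, so check the real file’s header before relying on it:

```python
# Sketch: parse tab-separated expression data into {gene: [values]}.
# The sample below mimics the general shape of such files; the actual
# yeast gene names and column headers here are placeholders.
import csv
from io import StringIO

sample = (
    "gene\tt0\tt30\tt60\n"
    "YAL001C\t0.12\t0.85\t1.4\n"
    "YAL002W\t-0.3\t-0.1\t0.2\n"
)

def load_expression(text):
    reader = csv.reader(StringIO(text), delimiter="\t")
    next(reader)                     # skip the header row
    return {row[0]: [float(v) for v in row[1:]] for row in reader}

data = load_expression(sample)
print(data["YAL001C"])   # prints [0.12, 0.85, 1.4]
```

For a downloaded file you would pass `open(path)` to `csv.reader` instead of a `StringIO`; everything else stays the same.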
But gene sequence information, and the quantitative data from gene expression experiments, is only the beginning. There’s also a huge amount of non-numerical data available online, indispensable to researchers in the field. When biologists interpret microarray data, they use a great deal of background knowledge about gene function – more and more knowledge is coming out every day, and a huge amount of it is online for public consumption, if you know where to look. Current automated data analysis tools tend to go purely by the numbers, but the next generation of tools is sure to boast the ability to integrate numerical and non-numerical information about genes and gene expression. As preparation for this, biologists in some areas are already working to express their nonquantitative knowledge in unambiguous, easily computer-comprehensible ways.
This exposes the dramatic effect the Net is having on scientific language. Yes, the Net is hastening the establishment of English as the world’s second language, but something more profound than that is happening simultaneously. The Net demands universal intercomprehensibility. In biological science, this feature of Internet communications is having an unforeseen effect: it’s forcing scientists working in slightly different areas to confront the idiosyncrasies of their communication styles.
Compared to ordinary language, scientific language is fairly unambiguous. But it’s far from completely so. An outsider would understandably assume that a phrase like “cell development” has a completely precise and inarguable meaning – but biology is not mathematics, and when you get right down to it, some tribes of researchers use the term to overlap with “cell maintenance” more than others do. Where is the borderline between development and maintenance of a cell? This issue and hundreds of others like it have come up with fresh vigor now that the Internet is shoving every branch of biological research into the faces of researchers in every other branch. As a result of the Internetization of biological data, a strong effort is underway to standardize the expression of non-numerical genetic data.
One part of this is the Gene Ontology Project, described in detail at http://www.geneontology.org/. In the creation of this project, one thorny issue after another came up – a seemingly endless series of linguistic ambiguities regarding what would at first appear to be very rigid and solid scientific concepts. What is the relation between “development” and “maintenance”, what does “differentiation” really mean, what is the relation between “cell organization” and “biogenesis”, and so forth. The outcome of this quibbling over language? A much more precise vocabulary, a universal dictionary of molecular biology. Ambiguity can’t be removed from the language used to describe cells and molecules, but it can be drastically reduced through this sort of systematic effort. And the result is that genes from different species can be compared using a common unambiguous vocabulary. The fly, yeast, worm, mouse and mustard genomes have all been described to a significant extent in standardized Gene Ontology language, and the human genome can’t be far behind. Soon enough, every gene of every common organism will be described in a “Gene Summary Paragraph”, describing qualitative knowledge about what the gene does in carefully controlled language – language ideally suited for digestion by AI programs.
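To see why this matters for software, consider that a controlled vocabulary arranges its terms in a hierarchy, so a program can ask whether one annotation implies another. A toy sketch of the idea – the parent links below are invented for illustration and are not actual Gene Ontology term relationships:

```python
# Sketch: a controlled vocabulary as a hierarchy of terms, letting a
# program compute which broader annotations a specific one implies.
# These parent links are illustrative, not real GO data.
PARENTS = {
    "mitotic cell cycle": "cell cycle",
    "cell cycle": "cellular process",
    "cellular process": "biological process",
}

def ancestors(term):
    """All broader terms implied by annotating a gene with `term`."""
    result = []
    while term in PARENTS:
        term = PARENTS[term]
        result.append(term)
    return result

print(ancestors("mitotic cell cycle"))
```

A program with no understanding of English whatsoever can now deduce that a gene tagged “mitotic cell cycle” is also involved in the “cell cycle” – exactly the kind of mechanical inference that ambiguous free-text descriptions make impossible. (The real Gene Ontology is a directed acyclic graph, where a term can have several parents, but the principle is the same.)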
The standardization of vocabulary for describing qualitative aspects of genes and proteins is a critical part of the computerization of biological analysis. Now AI programs don’t have to have a sensitive understanding of human language to integrate qualitative information about gene function into their analyses of gene sequences and quantitative gene expression data. It’s only a matter of years – perhaps even months, in some cutting-edge research lab -- before the loop is closed between AI analysis of genomic data and the automated execution of biological experiments. Now, humans do experiments, use sophisticated algorithms to analyze the results, and then do new experiments based on the results the algorithms suggest. But before too long, the human will become redundant in many cases. Most of the experiments are predominantly computer-controlled already. The software will analyze the results of one experiment, then order another experiment up. After a few weeks of trial and error, it will present us humans with results about our own genetic makeup. Or, post the results directly to the Web, where other AI’s can read them, perhaps faster than humans can.
All this abstract, complicated technology conspires to provide practical solutions to some very real problems. Genetic engineering is one of the big potential uses. Once we understand how genes work to build organisms, we’ll be able to build new kinds of organisms: Frankenfoods at first, and eventually new kinds of dogs, cats and people – raising all kinds of serious ethical concerns.
But there are also applications that are ethically just about unquestionable. From an economic point of view, the main value of microarrays and related technologies right now is as part of that vast scientific-financial machine called the drug discovery process. The path from scientific research to the governmental approval of a new drug is a long, long, long one, but when it’s successfully traversed, the financial rewards can be immense.
Gene therapy is a new approach to curing diseases, and one that hasn’t yet proved its practical worth in a significant way. Although it hasn’t lived up to the over-impatient promises that were made for it 10 years ago, biologists remain broadly optimistic about its long-term potential – not only for curing "classic" hereditary diseases, but also widespread diseases such as cancer and cardio-vascular diseases. The concept is scientifically unimpeachable. Many diseases are caused by problems in an individual’s DNA. Transplanting better pieces of DNA into the cells of a living person should be able to solve a lot of problems. A great deal of research has been done regarding various methods to implant genes with the desired characteristics into body cells. Usually the injected gene is introduced within the cell membrane, but resides outside the nucleus, perhaps enmeshed in the endoplasmic reticulum. Fascinatingly, the result of this is that the gene is still expressed when the appropriate input protein signal is received through the receptors in the cell membrane, even though the gene is not physically there in the nucleus with the rest of the DNA.
Aside from the practical issues of how to get the DNA in there in various circumstances, though, there’s also the major issue of figuring out what DNA is responsible for various diseases, and what to replace it with. To understand this, in the case of complex diseases, requires understanding how DNA is decoded to cause cells of various types to form. And this is an understanding that has been very, very hard to come by. The availability of gene and protein expression data from microarray experiments, and sophisticated bioinformatics software, renders it potentially achievable, though still by no means trivial. More precise microarrays and more intelligent data analysis software may render the problem downright straightforward 5, 10 or 20 years from now. No one knows for sure.
One thing biologists do, in trying to discover gene therapies, is to compare the genetic material of healthy and disease-affected individuals. A key concept here is the “genetic marker” – a gene or short sequence of DNA that acts as a tag for another, closely linked, gene. Such markers are used in mapping the order of genes along chromosomes and in following the inheritance of particular genes: genes closely linked to the marker will generally be inherited with it. Markers have to be readily identifiable in the organism the DNA builds, not just in the DNA – some classic marker genes are ones that control phenomena like eye color.
Biologists try to find the marker genes most closely linked to the disease, the ones that occur in the affected individuals but not in the healthy ones. They narrow the markers’ locations down step by step. First they find the troublesome chromosome, then they narrow the search and try to find the particular troublesome gene within that chromosome….
It used to be that genetic markers were very hard to find, but now that the human genome is mapped and there are technologies like microarrays, things have become a good bit simpler. Some markers are now standard -- and Affymetrix sells something called the HuSNP Mapping Array, a DNA microarray with probes for many common markers across the human genome already etched on its surface, ready for immediate use. If you have samples of diseased tissue, you can use this microarray to find whether any of a large number of common markers tend to coincide with it. In the past this would have required thousands or millions of experiments, and in many cases it would have been impossible. Now it’s easy, because we can test in parallel whether any of a huge number of gene sequences is a marker for a given disease-related gene. Right now, scientists are using this approach to try to get to the bottom of various types of cancer, and many other diseases as well.
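As a caricature of this parallel marker screen (all data below is invented), one could simply count how often each marker turns up in affected versus healthy samples, and rank markers by the difference. Real studies use proper association statistics rather than a raw count difference, but the shape of the computation is the same:

```python
# Toy marker screen: which marker best separates affected from healthy?
# Sample data is invented; each set holds the markers found in one sample.
affected = [{"m1", "m3"}, {"m1", "m2"}, {"m1"}]
healthy  = [{"m2"}, {"m2", "m3"}, set()]

def marker_scores(affected, healthy):
    """Score each marker: occurrences in affected minus occurrences in healthy."""
    markers = set().union(*affected, *healthy)
    return {
        m: sum(m in s for s in affected) - sum(m in s for s in healthy)
        for m in markers
    }

scores = marker_scores(affected, healthy)
print(max(scores, key=scores.get))   # prints m1, the marker most tied to disease
```

The point of the HuSNP-style array is that the per-sample marker sets on the left come from a single parallel experiment rather than thousands of separate ones; the downstream analysis is then a (much more rigorous) version of this comparison.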
If a disease is caused by one gene in particular, then the problem is relatively simple. One has to analyze a number of tissue samples from affected and healthy people, and eventually one’s computer algorithms will find the one gene that distinguishes the two populations. But not all diseases are tied to one gene in particular – and this is where things get interesting. Many diseases are spread across a number of different genes, and have to do with the way the genes interact with each other. A disease may be caused by a set of genes, or, worse yet, by a pattern of gene interaction, which can come out of a variety of different sets of genes. Here microarrays come in once again, big time. If a disease is caused by a certain pattern of interaction, microarray analysis of cell development can allow scientists to find that pattern of interaction. Then they can trace back and find all the different combinations of genes that give rise to that pattern of interaction. This is a full-on AI application, and it pushes the boundaries of what’s possible with the current, very noisy microarray data. But there’s no doubt that it’s the future.
Gene therapy itself is still in its infancy, and so is microarray technology, and so is AI-driven bioinformatics. But all these areas are growing fast – fast like the Internet grew during the 1990s. Very fast. Exactly where it’s all going to lead, who knows. But it’s a pretty sure bet that the intersection of medicine, genetics, proteomics, computer engineering, AI software, and robotics is going to yield some fascinating things. We’re beginning to see the rough outlines of early 21st-century science.