Things seem to have gotten rather crazy in the world of genetics. Even the concept of the gene is being questioned. The gene: what could be more solid than that? But think about it: the definition of a gene has changed in a number of ways since it was introduced by Wilhelm Johannsen in 1909 as an abbreviation for the term “pangen,” which had been coined by Hugo de Vries 20 years earlier to describe the unit that controls the production of a single hereditary trait. After all, neither man knew that DNA was the material substance of such units. And it wasn't until the structure of DNA was discovered and explored in the mid-20th century that the idea of a gene as a sequence of nucleotides developed, along with the concept that the sequence was transcribed into mRNA and then translated into a protein sequence. Also, keep in mind that the idea of a gene was further elaborated with the discovery that only portions of the transcribed sequence (exons) were retained in the mature mRNA and translated.
So genes have had quite a checkered career already, and it shouldn't come as a big surprise that the gene concept is being renovated yet again. We humans need to find names for things even when we know very little about them, so the idea of hereditary elements was given a designation based on little real knowledge of what genes actually were. And our knowledge is still partial. The reason the gene's definition has been called into question lately is because of the rush of genomic data appearing in the last few years. Since we are again tempted to think we have the story straight, we think it's time for a definitional overhaul. The first inkling that this would be needed was when the human genome was published and the gene number was estimated to be around 25,000, meaning that most of the genome didn't seem to be coding for protein. If most DNA wasn't made up of genes, what was it doing, if anything? This question hasn't come close to being answered yet. What working out the human genome, and that of dozens of other species, has indicated is that knowing the sequence of nucleotides is not enough to figure out how genes are really behaving.
It's also becoming clear that we need information on the genomes of many species in order to understand any one of them. Just as humans in general understand something new by finding comparisons with what they already know (Lakoff & Johnson, 1980), biologists are learning about the human genome by comparing it to the genomes of the mouse, chimp, and even yeast. In addition, they are looking at many human genomes to discover the significance of individual differences. As another source of information, they are studying the human genome in finer detail through a project called ENCODE, which stands for “Encyclopedia of DNA Elements” (Pennisi, 2007). A pilot ENCODE program investigated 1% of the genome for four years and published the results in 2007. The project's approach involved mapping not only genes, but regulatory DNA and other features such as gene start sites. If genomic data are to become useful, this is just the kind of work that's essential. Researchers identified additional exons or coding regions for 80% of the approximately 400 genes in the study region. Many of these exons were found thousands of bases away from the rest of the gene's sequence, and some were even found within other genes.
Some exons actually belong to two genes. This issue of double duty, and more, comes up in Carl Zimmer's (2008) description of why it is now so difficult to define a gene. He cites the issue of alternative splicing, meaning that the same transcribed mRNA can be spliced, that is, have introns or areas that do not code for amino acids removed, in two different ways. Essentially, this results in the same gene sequence coding for more than one protein. Of course, it's been known for a long time that many proteins are coded for by more than one gene, hemoglobin being a favorite example: the genes for its two different polypeptide chains are actually on different chromosomes. But now it appears that the opposite is also true, that one gene can code for two or more proteins. This could help to explain why the apparent gene number for the human genome is so low: researchers were counting coding regions rather than what was encoded. Now the ENCODE group, composed of over 300 researchers, estimates that the average protein-coding region produces at least five different transcripts.
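To make the arithmetic of alternative splicing concrete, here is a minimal sketch in Python. It is only a toy: the exon names and sequences are invented for illustration, and it models just one kind of alternative splicing (skipping internal exons), assuming every combination is possible. In real cells splicing is regulated, and only some combinations actually occur.

```python
from itertools import combinations

# Hypothetical toy gene: exon names mapped to made-up sequences.
# None of these come from a real genome.
EXONS = {"E1": "ATGGCC", "E2": "GATTAC", "E3": "CCGGAA", "E4": "TTTTGA"}


def splice_variants(exons, first, last):
    """List transcripts that keep the first and last exon and either
    include or skip each internal exon, preserving genomic order."""
    internal = [name for name in exons if name not in (first, last)]
    variants = []
    for k in range(len(internal) + 1):
        # combinations() keeps the input order, so exons stay in order.
        for kept in combinations(internal, k):
            order = [first, *kept, last]
            sequence = "".join(exons[name] for name in order)
            variants.append((order, sequence))
    return variants


if __name__ == "__main__":
    for order, sequence in splice_variants(EXONS, "E1", "E4"):
        print("-".join(order), sequence)
```

With two skippable internal exons, this single toy "gene" yields four distinct transcripts, which is the sense in which counting coding regions undercounts what is encoded.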
The classical simplicity of the gene and of gene transcription and translation as described by the molecular biology pioneers of the 1960s has given way to a baroque complexity. Other seemingly simple and important principles are also getting shakier thanks to ENCODE. All biologists learn the “rule” that highly conserved gene sequences, those that change little over broad taxonomic categories, are important regions. The reasoning behind this idea is that even slight changes in the sequence would lead to some severe and detrimental change in a protein and, thus, its functioning. The opposite also appeared to be true: regions that were not conserved, that varied greatly across species, couldn't be very important because any old sequence seemed to be all right.
Now ENCODE has changed biologists' views on both these ideas (Check, 2007). One of the project's approaches is to compare long stretches of noncoding sequences in humans and other mammalian species, including mice and monkeys. There are many functional sequences, defined as causing abnormalities if they were deleted, that are not conserved across species. On the other hand, there are a number of highly conserved or constrained stretches that don't seem to do much of anything. This is all very puzzling, and as with every puzzle, humans try to make sense of the data, to come up with explanations for things they hadn't even considered possible before (Monroe, 2009). Of course, these new ideas are conjectural, but they at least provide fodder for further research. For example, the unconstrained sequences may be evolving neutrally, meaning that they change in ways that are neither good nor bad in terms of survival. Other researchers think these areas may be under weak rather than nonexistent genetic control. This difference of opinion is based in part on different views about how much of the genome is actually useful. Those estimating on the high end place the number at 20%. But any estimate makes me uneasy. The word “hubris” springs to mind. How can we even attempt a guess when we know so little? To me this is like geneticists talking about “junk” DNA back in the 1970s when our knowledge was much cruder. How can we even attempt to say what DNA is important when we understand so little about what it does?
The controversies I've just mentioned are at the sequence level, but things get even dicier when we delve into DNA's three-dimensional structure. I'm not talking about the double helix, but rather about how that helix is twisted and folded as it is being stored and used in a cell. Exploring these issues requires a combination of structural and functional research at the very limits of imaging and cell biology techniques. If the DNA sequence is getting to seem baroque, its structure is even more so, and not just in a metaphorical sense. Baroque architecture is known for its whirling curves and layers upon layers of ornament. One type of marble was not enough if they could find 10, and why not have columns and cornices and jutting projections, with a bunch of statues as well, in addition to paintings on the walls, of course.
The same seems true of DNA. First there is the helix; then the helix is twined around nucleosomes, structures made of eight histone proteins involved in controlling whether or not the DNA is transcribed. But now, thanks to ENCODE, control sequences long distances from the gene's coding regions have been identified; in some cases, they are even on different chromosomes. This means that the DNA had to fold in ways that allow these areas to come near each other. Also, it's becoming evident that sequence differences can lead to subtle shape differences and that transcription factor proteins can detect these subtle differences, allowing them to bind in the correct places. I love these ideas, and thinking about them in terms of baroque architecture makes me like that style more than I have before, proving once again that a metaphor changes one's perceptions of both its subjects.
Aesthetically, all this elaboration may be pleasing and exciting, but it does pose problems for those trying to keep track of genomic information. Michael Seringhaus and Mark Gerstein (2008) explore some of this complexity in an article on how “Genomics confounds gene classification.” Gerstein codirects the Computational Biology and Bioinformatics Program at Yale University, and Seringhaus was one of his Ph.D. students. Having completed his degree, Seringhaus is now a Yale Law School student. My first thought was that this was quite a switch; my second was that it's probably a smart career move in terms of patent law; but my third was that it would take a law student to write this article. It's about why gene names are now so hard to pin down, why genes are so hard to define. We know now that one sequence can code for more than one gene; that a gene's regulators may be at a wide remove from it; that different protein domains may be coded for by sequences on different chromosomes. A gene has become a slippery thing.
How to deal with this mess? Form a committee! The Gene Nomenclature Committee of the Human Genome Organization is tackling the issue. The authors also suggest a four-step process for classifying a gene, beginning with identification and naming, then a brief description based on functions and sequence studies. Third, decide on standardized key words to categorize it, and lastly, arrange these categories into a hierarchy. The last step is related to the fact that a gene and its protein product can no longer be classified under one heading: there are just too many interrelationships. This whole scheme looks messy, but it isn't nearly as bad as the current hodgepodge of names that Seringhaus and Gerstein describe. As more genomes are examined in finer detail, these problems will escalate, making the need for nomenclature standards that much more crucial.
Still other issues arise in considering epigenetics: how the DNA is used in a cell, how genes are expressed. Now geneticists are talking about not only the genome, but the epigenome, and they are looking at how not one gene but many genes are expressed and controlled. This investigation has begun to change what biologists mean by “heredity.” This used to be defined as the genes that are passed on from one generation to the next, but now it's becoming obvious that parents pass on more than DNA to their offspring. They pass on chromosomes that are encrusted with thousands of proteins, and the specifics of that adornment make a difference in the offspring's phenotype. The science historian Evelyn Fox Keller notes that DNA is “a far richer and more interesting molecule than we could have imagined when we first started studying it,” but she adds that “it doesn't do anything by itself” (Angier, 2008). DNA folding is directed by the proteins that attach to it, but their attachment is in part related to the DNA sequence, so there is an intricate interplay between the two. The third player here is, of course, RNA. Protein-DNA structures can make it more or less likely that the RNA polymerase complex will bond to a particular DNA region, thus allowing for transcription.
I definitely can't go into all the details here, so I'll just mention a couple of points to give some sense of the richness of the interplay of nucleic acids and proteins (Zimmer, 2008). The nucleosomes, structures of DNA wrapped around histones, can be rearranged by complexes called chromatin-remodeling proteins, which can broaden the space between nucleosomes. This can open up a stretch of DNA that is then more likely to be transcribed. In the opposite direction, nucleosomes can cluster to form large aggregates of “closed” chromatin or heterochromatin, where transcription is prevented. Proteins can modify or “mark” histones, which then interact differently with still other proteins. There is a complex orchestration here, with a particular combination of marks and proteins changing how likely it is that DNA will be transcribed. This seems very complicated, and it is. I wouldn't go into this with my non-science majors, but I can at least tell them that such complexities exist, and are necessary. After all, it is through this orchestration that gene expression in development and in cells in different tissues is controlled. There's also evidence that epigenetic marks can make cells more likely to become cancerous. This makes sense, because the marks could turn off essential control genes and also turn on genes that are normally turned off. Many cancer cells make proteins that are usually only made in the embryo.
It is not surprising that things can go wrong with a system this intricate and finely tuned. It's like an expensive sports car: great to drive when it's working well, but prone to be fussy and in constant need of adjustment. However, Stephen Baylin and Kornel Schuebel (2007) come up with a different analogy to explain what is going on epigenetically. They describe the work I've just mentioned as genome topography and predict that in the future “a complete view of the genome will require a further era of investigation, in which we generate maps of the genomic topography that characterize each of the many cell types of which we are constituted. Who knows? We are just at the beginning of exploring how a single genome can spawn multiple epigenomes” (p. 549).
Another aspect of epigenetics is RNA interference, one of the hot items in nucleic acid research in the past 10 years. This involves short sequences of double-stranded RNA (iRNA) that can block translation of mRNA into protein, and the first papers describing this process came out in 1998 (Fire et al., 1998). One reason progress has been so swift is that researchers quickly saw the medical implications of this process. For example, if too much of a particular protein, particularly a mutated one, is being produced in the body, iRNA could reduce its synthesis and thus its ill effects. Experimental drug trials have already shown that this approach works in reducing production of a protein related to high cholesterol levels (Pollack, 2008).
Hundreds of different micro-RNA sequences, 20–25 bases long, have been discovered that control translation of particular genes and, in some cases, many genes. Researchers estimate that more than half the genes in the human genome are affected by such RNA and that iRNA helps to fine-tune protein production. Some compare protein transcription factors to on-and-off switches that control transcription of DNA, whereas iRNA is like a dimmer switch. Almost every week in Science and Nature there are research articles reporting on the latest small RNA sequences and their amazing powers. Such a rapid research flow suggests that there's still a great deal to learn. As with information from ENCODE, I get the feeling that it's a very bad idea to state too categorically what is or is not going on here. I don't think we have a real clue yet; we are using a thimble to study the waters of a vast ocean.
ENCODE scientists estimate that while only about 1.2–1.5% of the human genome codes for protein, 93% of the genome produces RNA transcripts. Researchers are guessing as to how much of the more than 90% of the RNA that doesn't code for something has a function. Okay, some of it may be junk, left over from genes that have mutated to become pseudogenes and other such forms of debris, but here's a vast ocean that we have only tested with a toe. At least some of this RNA is regulatory, and the importance of gene regulation has come to be appreciated even more as more genomes are sequenced. For example, a recent search of the human gene catalogue yielded only 168 genes that do not have a close homologue in the mouse or the dog; that's out of over 20,000 genes. This is pretty scary! My sister thinks that my dog and I are very alike, and now I know why: she's been reading the genetics literature again. Equally impressive is the fact that the human genes we share with mice produce proteins with over an 80% sequence identity. These figures make it very apparent that the noncoding regions must be important, at least from the viewpoint of humans who like to think themselves a cut above dogs, let alone mice.
Regulatory sequences determine when and where proteins are produced. Evidence for this has been mounting for years, but as with so many ideas in biology, it's difficult to find cases where other variables aren't clouding the picture. Recently, researchers used a mouse model for Down syndrome to clarify things. These mice have a copy of human chromosome 21 along with a full complement of mouse chromosomes. The question the research team asked was: “Is regulation of the genes on human chromosome 21 in these mouse cells (Tc1 hepatocytes) determined by the human DNA sequence, or by the mouse cellular environment and transcriptional machinery?” (Coller & Kruglyak, 2008: p. 380). The expression pattern for human chromosome 21 in the mice turned out to be almost exactly what it is in humans, whereas the genes on the other chromosomes expressed mouse proteins as they normally would. It seems that regulatory sequences are at the heart of gene regulation, but that doesn't mean that they are easy to locate.
At this point, it is relatively simple to find genes for proteins by studying the gene products. This isn't the case with regulatory sequences. One approach being used is to scan genomes for conserved regulatory sequences. This is how researchers have recently found an interesting control sequence involved in limb development (Wray & Babbitt, 2008). The noncoding control region for this gene in humans is called HACNS1 or human-accelerated conserved noncoding sequence 1. Its function is to control early development in the forelimbs and later hindlimb development. This is a different pattern from that in macaques and chimps, where the similar sequence is expressed only during hindlimb development. Researchers were able to narrow down the relevant mutations to an 81-base-pair sequence with 13 substitutions that arose during human evolution. This is a beautiful example of species-specific regulation. Yes, it is just one example, but it sets the stage for other such investigations.
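The kind of comparison behind a figure like "13 substitutions in 81 base pairs" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two short sequences are made up (they are not the real HACNS1 region), the function name is my own, and the sequences are assumed to be already aligned and of equal length, which real comparative genomics cannot take for granted.

```python
def count_substitutions(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Made-up 10-base sequences, used purely for illustration.
human_like = "ACGTTGCAAG"
chimp_like = "ACGATGCTAG"

if __name__ == "__main__":
    print(count_substitutions(human_like, chimp_like), "differences in the toy pair")
    # The figures quoted in the text: 13 substitutions in an 81-base-pair stretch.
    print(f"{13 / 81:.1%} of positions changed")
```

By this arithmetic, roughly one position in six changed in that stretch, which gives some feel for why researchers describe the region as "accelerated."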
I have to start to wind down here and can't get into more of the regulatory picture, such as it is at the moment. But before closing, I want to get back to those 168 genes we don't share with mice. Though the article that mentioned them didn't get into what they code for or how they arose, another little piece I came across gives an intriguing possibility: a few of them might simply be “brand new.” Manyuan Long (2007) describes such a gene found in Drosophila melanogaster for which there is no homologous sequence in a species that diverged only 13 million years ago. That's too short a time for mutations to obscure related sequences, so they just aren't there. He cites another research group that found 16 such genes in flies. They speculate that these genes arose from noncoding regions; here are those noncoding regions again. Besides everything else they do, they may also be serving as a source of genetic novelty.
What I hope I have presented here is a picture of how dynamic and relational the genome is. This is hardly a neat picture, and I doubt that it ever will be. This is a very different depiction from the old one where molecules behaved “properly.” DNA controlled the show, RNA helped out, and proteins did the dirty work. Now there are enzymes telling DNA what to do, RNAs that act like enzymes and others that act like DNA. We have left the classical period of genetics, when order reigned, and entered a baroque world of complex control loops and chromosomes in knots. I seem to have used more metaphors in this article than I usually do, and perhaps that's because this territory is so new that to get my mind around it I have to associate it with things I know a little better.
When I was planning to write an article on genetics, I took out four folders of clippings that I thought would be relevant (DNA, RNA, proteins, and genomes), thinking that I would concentrate on one of them. I now realize that this approach was based on a very dated view of genetics: the old “central dogma” of DNA to RNA to protein (Judson, 1979). I ended up, instead, dipping into all these folders, and getting myself twisted into a very baroque knot, but a fascinating one. Biology is now well beyond dogma to a richly beautiful place where researchers are beginning, just beginning, to work out how all these elements fit together more intricately. Now we have to be patient and stop rushing to judgments based on minuscule amounts of information, and just sit back and watch the forms and functions unfold before us.
She earned a B.S. in biology from Marymount Manhattan College; an M.S., also in biology, from Boston College; and a Ph.D. in science education from New York University. Her major interests are in communicating science to the nonscientist and in the relationship between biology and art.