Mouse Genetics: Concepts & Applications

Copyright ©1995 Lee M. Silver

10. Non-Breeding Mapping Strategies

10.1 Linkage maps without breeding

10.1.1 Single sperm cell typing

10.1.2 Mitotic linkage maps

10.2 Chromosomal mapping tools

10.2.1 Conservation of synteny

10.2.2 In situ hybridization

10.2.3 Somatic cell hybrid genetics

10.3 Physical maps and positional cloning

10.3.1 Prerequisites to positional cloning

10.3.2 PFGE and long range genomic restriction maps

10.3.3 Large insert genomic libraries

10.3.4 Protocols for gene identification

10.4 The human genome project and the ultimate map

10.1 Linkage maps without breeding

10.1.1 Single sperm cell typing

In 1988, Arnheim and his colleagues (Li et al., 1988) reported the extraordinary finding that unique DNA sequences could be amplified from isolated sperm cells. Amplification from the single DNA target locus that is present in haploid sperm cells represents the theoretical limit in sensitivity that is possible with PCR. The implications of this experimental breakthrough were enormous, especially in the field of human genetics. For the first time, it became possible to consider the analysis of unlimited numbers of individual meiotic events derived from a single individual.

Linkage information could be derived by first identifying a male volunteer who was heterozygous at all loci of interest in a manner that could be distinguished with the use of PCR. The DNA within individual sperm cells donated by the volunteer would be subjected to co-amplification with primer pairs that define each of the loci. Allele determination at each locus in each cell could, in theory, be accomplished by any of the various PCR techniques described in section 8.3, including hybridization to allele specific oligonucleotides, restriction enzyme digestion, or SSCP. In contrast to typical genetic studies in humans, the typing of large numbers of single sperm cells from a single volunteer provides genetic information which is simple to interpret. Linkage distances and gene order can be derived directly by counting the number of cells with each allelic combination (Goradia et al., 1991), in a manner equivalent to that described in section 9.4 for the analysis of backcross data in the form of haplotypes. Since the number of sperm cells is essentially unlimited, the resolution of the map obtained is a function simply of the time and effort that the investigator wishes to expend in the typing of additional cells. Furthermore, with the use of a protocol for universal PCR amplification, Arnheim and his colleagues have been able to co-amplify the majority of sequences present in each sperm genome (Zhang et al., 1992). Thus, in theory, it should be possible to co-type many different loci within a single panel of sperm samples.
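The counting step can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of Goradia et al.: the function name, the allele labels, and the sperm counts are all hypothetical, and the sketch assumes error-free typing at two loci, with the two most frequent haplotype classes taken to be the parental (non-recombinant) ones.

```python
from collections import Counter

def recombination_fraction(haplotypes):
    """Estimate the recombination fraction between two loci from typed sperm.

    haplotypes: iterable of (allele_at_locus_1, allele_at_locus_2) tuples,
    one per sperm cell. Assumes error-free typing (hypothetical data).
    """
    counts = Counter(haplotypes)
    # The two most frequent classes are taken to be the parental haplotypes.
    parental = sum(n for _, n in counts.most_common(2))
    total = sum(counts.values())
    return (total - parental) / total

# Hypothetical panel of 200 sperm from a donor heterozygous A/a and B/b.
sperm = ([("A", "B")] * 93 + [("a", "b")] * 97 +
         [("A", "b")] * 6 + [("a", "B")] * 4)
r = recombination_fraction(sperm)
print(f"recombination fraction = {r:.3f} (~{100 * r:.1f} cM)")
```

With these made-up counts, 10 of the 200 sperm carry a recombinant haplotype, which corresponds to a distance of about 5 cM. In practice, typing errors and failed amplifications would have to be modeled as well.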

Although single sperm typing provides a significant new tool for mapping in humans and other large animals that also do not provide a sufficient number of offspring for typing, its importance to mouse genetics would appear to be more limited since numerous, well-defined backcross mapping panels are available for analysis. Furthermore, at the time of this writing, the typing of single cells is still so technically demanding that it has not been used in a general way even by the broader community of human geneticists.

10.1.2 Mitotic linkage maps

Classical linkage analysis is based on meiotic recombination events that occur in sperm and egg precursor cells. The products of these events are observed and counted in the offspring of heterozygous animals (or in sperm cells as described in the previous section). Meiotic recombination is a very frequent event — it occurs 30 times, on average, in each individual germ cell line — and it appears to have been selected by evolution to play two very different roles in the biology of higher eukaryotes. First, the physical event itself appears to be required to tether the homologs of each chromosome to each other so that they line up and segregate into opposite daughter cells during the first meiotic division. Second, the production of offspring with non-parental combinations of alleles provides a non-mutational means for the generation of diversity at the genotypic level, and this generation of diversity appears to be generally beneficial to the population as a whole.

Recombination has also been observed at the mitotic level in somatic cells (Rajan et al., 1983; Henson et al., 1991). In comparison to meiotic recombination, mitotic events are exceedingly rare, and they do not appear to have any biological function. It seems most likely that mitotic recombination events are simply accidents that happen in response to spontaneous nicks in the DNA molecule that allow migrating single strands to invade opposite homologs. Usually, mitotically-recombined cells will go entirely unnoticed among the millions of nearby cells having germline haplotypes. However, in individuals that are born heterozygous for null alleles at "tumor suppressor genes" such as retinoblastoma (RB), mitotic recombination can lead to the production of rare homozygous mutant cells that are released from growth control in the absence of the wild-type allele; uncontrolled division of these cells leads to tumor formation. It now appears that a large class of human tumors is caused by this homozygosing at a variety of tumor suppressor genes (Marshall, 1991; Weinberg, 1991).

If rare cells that undergo mitotic recombination could be identified and recovered in clonal form from a tissue culture line, a means for generating a linkage map that was not dependent on the breeding of animals would be possible. Such a ‘mitotic linkage map’ has been obtained for the proximal half of chromosome 17 (Henson et al., 1991). The generation of this particular map was dependent upon the ability to immuno-select cells that had undergone allele-loss at an H-2 complex gene. Selected cells were isolated and expanded into cultures that could be analyzed for various DNA markers that were heterozygous in the parent. All loci that map proximal to the clone-specific breakpoint will remain heterozygous; all that map distal will have become homozygous. Through the analysis of a large number of individual clonal lines, it becomes possible to construct a linkage map with gene order and an estimate of relative distances between loci. The mitotic linkage map constructed for the proximal half of chromosome 17 corresponds well with the meiotic linkage map (Henson et al., 1991). As expected, the gene order is identical, and there is some minor variation in the relative intergenic distances.
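The logic of ordering loci from clonal data can be sketched as follows. This is an illustrative sketch only, not the analysis performed by Henson et al.: the marker names (M1 to M3, and SEL for the selected locus), the clone counts, and the function names are hypothetical, and the sketch assumes that each clone arose from a single crossover, so that every marker distal to the breakpoint is homozygous and every proximal marker remains heterozygous.

```python
def mitotic_map(clones, markers):
    """Order markers proximal-to-distal and apportion breakpoints to intervals.

    clones: list of dicts mapping marker name -> True if the clone has become
    homozygous at that marker, False if it is still heterozygous.
    Assumes one crossover per clone (see lead-in).
    """
    hom = {m: sum(clone[m] for clone in clones) for m in markers}
    # Fewer homozygous clones means the marker lies closer to the centromere.
    order = sorted(markers, key=lambda m: hom[m])
    n = len(clones)
    intervals = [(a, b, (hom[b] - hom[a]) / n) for a, b in zip(order, order[1:])]
    return order, intervals

names = ["M1", "M2", "M3", "SEL"]  # hypothetical markers, proximal to distal

def make_clone(k):
    """Hypothetical clone whose breakpoint lies just proximal to marker k."""
    return {m: i >= k for i, m in enumerate(names)}

# 20 hypothetical clones, all selected for allele loss at SEL.
clones = ([make_clone(0)] * 3 + [make_clone(1)] * 5 +
          [make_clone(2)] * 8 + [make_clone(3)] * 4)

order, intervals = mitotic_map(clones, ["SEL", "M2", "M1", "M3"])
print(order)
for a, b, frac in intervals:
    print(f"{a} to {b}: {frac:.2f} of breakpoints")
```

The fraction of breakpoints falling in each interval plays the role that recombination frequency plays in a meiotic map; it gives relative, not absolute, distances.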

The construction of a mitotic linkage map in one chromosomal region was important for providing biological information concerning the relationship between the distribution of mitotic and meiotic crossover sites, and future experiments of this type may also provide clues to the mechanisms responsible for homologous recombination in somatic cells. However, the mitotic map did not provide any new information specific to chromosome 17, and this approach is not likely to play an important role in future gene mapping experiments for two reasons. First, with current technology, mitotic maps can only be constructed along chromosomal regions that are marked with genes encoding polymorphic cell surface antigens that are expressed co-dominantly in cultured cells, and are readily distinguishable from each other with specific monoclonal antibodies. The second reason is that the construction of traditional meiotic linkage maps at the same resolution is likely to be much faster and easier.

10.2 Chromosomal mapping tools

10.2.1 Conservation of synteny

As cloning and mapping of both the mouse and human genomes began in earnest during the 1980s, two important evolutionary facts became clear. First, nearly all human genes have homologs in the mouse and vice versa. Second, not only are the genes themselves conserved, but so, to a certain extent, is their order along the chromosome. In 1984, Nadeau and Taylor used linkage data obtained from 83 loci that had been mapped in both species to estimate the average length of conserved autosomal segments as 8.2 cM in the mouse (Nadeau and Taylor, 1984). In 1993, the same analysis was performed on linkage data obtained from 917 homologous loci mapped in both species to yield an average conserved chromosomal length of 8.8 cM, which is not significantly different from the earlier estimate. The major evolutionary implication of this result is that approximately 150 major rearrangements have occurred along the human or mouse lines as they diverged from a common ancestor that existed 65 million years ago.

The practical implication of conserved chromosomal segments is that the mapping of a gene in one species can provide a clue to the location of its homolog in the other species. One should be cautious, however, not to over-interpret synteny information. There are many examples of smaller genomic segments that have popped out of, or into, larger syntenic regions. Thus, even if a human gene maps between two human loci with demonstrated synteny in the mouse as well, there is still a small chance that it will have moved to another location in the mouse genome. Nevertheless, over 80% of the autosomal genomes of mice and humans have now been matched up at the subchromosomal level (Copeland et al., 1993). Thus, with map information for a gene in humans, it will often be possible to identify a corresponding mouse chromosomal segment of ~10 cM in length as a likely location to test first for linkage with nearby DNA markers.

10.2.2 In situ hybridization

10.2.2.1 Overview

The technique of in situ hybridization was conceived of by Gall and Pardue (1969) and John and his colleagues (1969). These workers demonstrated that the DNA within preparations of chromosomes attached to microscope slides could be denatured in a gentle manner so as not to disrupt the overall morphology of the chromosomes themselves. Target sequences within these chromosomes are then available for hybridization to labeled nucleic acid probes. Thus, in situ hybridization allows the mapping of cloned DNA sequences to specific chromosomal sites that can be visualized directly by light microscopy.

In early work, probes were labeled with radioactive isotopes and target sequences were identified by autoradiography. This method of labeling and detection limited both the sensitivity of the technique and its resolution (Lawrence, 1990). In particular, the original protocol only allowed the detection of tandemly repeated sequences such as the ribosomal genes and satellite DNA. By 1981, however, investigators had optimized the in situ protocol for use in mapping single copy mammalian sequences (Harper et al., ), and in 1984, an improved method was developed for better resolution of chromosome banding patterns (Cannizzarro and Emanuel, ). Nevertheless, the technique was still not ideal because with single copy radioactive probes, localization could not be determined within the chromosomes of a single cell; instead, it was necessary to perform a statistical analysis of silver grain distributions in 50 to 100 sets of metaphase chromosomes.

Two critical changes in the protocol now allow the detection of single copy sequences and their high resolution mapping through the direct observation of single chromosomes. The first change was in the nature of the label; with the substitution of fluorescent tags for radioactive ones, the physical resolution of the hybridization site was dramatically improved. The modified in situ protocol that utilizes fluorescent tags is referred to as FISH (for Fluorescent In situ Hybridization). The second change was in the nature of the hybridization cocktail. With the inclusion of a large excess of unlabeled total genomic DNA, it is possible to block dispersed repetitive sequences — present in essentially every genomic region larger than a few kilobases in length — from hybridization to their targets throughout the genome. This allows the use of whole phage or cosmid clones as probes leading to a substantial increase in signal strength which will be proportional to the length of single copy DNA in the clone. With these major changes in the protocol and other optimizations, it is now possible to use in situ hybridization to visualize the map position of any cloned locus within single chromosomes from any mammalian species (Lawrence, 1990; Trask, 1991).

Although in situ hybridization has played a pivotal role in the construction of the human gene map, its role in mouse gene mapping has been more limited for several reasons. First, a certain amount of specialized training and experience is required to perform this protocol, and thus, it is often not an option for independent investigators in the absence of a collaboration. Second, in humans, classical linkage analysis is not easily performed, and thus, alternative methods for human mapping are much more important. Third, whereas the human karyotype is highly amenable to direct cytogenetic analysis — chromosomes come in a variety of shapes and sizes and staining techniques yield excellent banding resolution — the mouse karyotype is a cytologist’s nightmare. All twenty chromosomes have the same shape with only a single visible arm and a centromere that appears to lie at one end (see figure 5.1). A continuum of chromosome lengths makes the identification of individual chromosomes more difficult, and finally, banding patterns are much less distinct and more difficult to resolve.

In the past, in situ hybridization had the advantage that it did not require the existence of variants between parental strains for mapping to be accomplished. However, with the advent of new methods for the detection of polymorphism discussed in chapter 8, it has become possible to quickly identify DNA variants at essentially all cloned loci. Consequently, in situ hybridization is now used most often only for specific experimental problems such as those described below.

10.2.2.2 Experimental usage

The power of in situ hybridization lies in the fact that it allows the direct localization of DNA sequences relative to all visible cytological landmarks such as centromeres, telomeres, and rearrangement breakpoints in aberrant chromosomes. In some instances, it will be important to localize a DNA marker relative to one or more of these landmarks. For example, an investigator may have a DNA marker that maps to the beginning or end of a linkage map associated with a particular chromosome. In situ hybridization can be used to determine how close to the centromere or telomere the DNA marker actually is (in physical terms); this information can serve to establish the size of the chromosomal region that is not contained within the associated linkage map. In another example, an investigator may have a DNA clone, from either a wild-type or mutant animal, that is believed to extend across a cytologically visible inversion or translocation breakpoint. If the clone was derived from a wild-type genome, the in situ results would show hybridization to two sites in the rearranged karyotype.

As discussed in section 7.3.2, in situ hybridization is also useful in the special case of mapping transgene insertion sites. The same DNA construct that was originally injected into the embryo can often be used directly as a probe. In another instance, in situ hybridization can be combined with classical linkage analysis using the M. spretus backcross system to follow the segregation of centromeres from one parental chromosome or the other as described in section 9.1.2 (Matsuda and Chapman, 1991; Matsuda et al., 1993). Finally, in situ hybridization is useful in experiments aimed at questions that go beyond the simple mapping of genes. For example, the technology has revealed the unexpected finding that both LINE and SINE sequences are non-randomly distributed among bands and interband regions of all chromosomes as described in section 5.4.4.

10.2.3 Somatic cell hybrid genetics

10.2.3.1 Overview of the classical approach

The ability to derive long-term cultures of mammalian cells was perfected during the 1950s. Cell cultures provided important experimental material for early biochemists and molecular biologists interested in molecules and processes that occur within mammalian cells, but they were of little use to geneticists since somatic cell genomes remain essentially unchanged during continual renewal through mitotic division. This situation changed dramatically during the early 1960s when investigators discovered and developed methods for the induction of cell fusion in culture (Ephrussi and Weiss, 1969).

Normal diploid cells from all species of mammals carry approximately the same amount of DNA in their nucleus (twice the haploid amount of 3,000 mb). Thus, after fusion between any two mammalian cells, the hybrid cell nucleus becomes, in effect, tetraploid, with a genome that is twice the normal size. The enlarged genomes of hybrid cells are inherently unstable. Presumably, the increased requirement for DNA replication acts to slow down the rate of cell division, and as a consequence, cells that lose chromosomes during mitotic segregation will divide more quickly and outgrow those cells that maintain a larger genome content. Eventually, after many events of this type, cells can reach a relatively stable genome size that is close to that normally found in diploid mammalian cells. For reasons that are not understood, hybrids formed between particular combinations of species will preferentially eliminate chromosomes from just one of the parental lines. In hybrids formed between mouse cells and either hamster or human cells, mouse chromosomes will be eliminated in a relatively random manner. This process has allowed the derivation and characterization of a number of somatic cell hybrid lines that stably maintain only one or a few mouse chromosomes.

The field of somatic cell genetics had its heyday in the 1970s and early 1980s when it provided the predominant methodology for mapping loci, albeit often only to the resolution of whole chromosomes. The major tools for gene detection in this era (before the recombinant DNA revolution was in full gear) were species-specific assays for various housekeeping enzymes. Somatic cell geneticists could type each member of a panel of hybrid cells for the presence of a particular enzyme and then use karyotypic analysis to demonstrate concordance with a particular chromosome. In a strictly formal sense, this type of analysis is analogous to classical two locus linkage studies, with one marker being the enzymatic activity and the other marker being the particular chromosome that contains the gene encoding the enzyme.

The somatic cell hybrid approach has always been more important to human geneticists than to mouse geneticists. This is because well-established somatic cell hybrid lines with one or a few mouse chromosomes are relatively rare compared to the large number of well-characterized hybrid lines with individual human chromosomes. There are several reasons for this state of affairs. First, the power of mouse linkage mapping has always been so great that somatic cell hybrid lines were never considered to be essential tools. Second, most mouse/hamster hybrid lines are chromosomally unstable and must be re-characterized each time they are grown in culture. With the difficulty of performing karyotypic analysis on mouse chromosomes, most investigators have shied away from this approach in the past. However, with alternative PCR-based methods for characterizing the chromosomal content of hybrids (Abbott, 1992), this problem may have been overcome so that the derivation of new hybrids for special situations may no longer be as formidable as it once was.

The use of somatic cell hybrid panels as a general approach to gene mapping has now been superseded by in situ hybridization (which resolves map positions to chromosome bands rather than whole chromosomes) and, of course, classical linkage analysis. However, there are two special cases where somatic cell hybrid lines can provide unique tools for mouse geneticists. First, their DNA can be used as a source of material for the rapid derivation of panels of DNA markers to saturate particular chromosomes or subchromosomal regions as described in section 8.4.4 (Herman et al., 1991; Simmler et al., 1991). Second, their DNA can also be used to rapidly screen new clones obtained from other sources for their presence in a particular interval of interest as described in section 7.3.3. This can be accomplished with the use of duplicate blots containing just three lanes of restriction digested and fractionated DNA from (1) the somatic cell hybrid line containing the chromosome of interest, (2) mouse tissue (a positive control), and (3) the host cell line without mouse chromosomes (a negative control). Each blot can be subjected to repeated probing with different potential markers. A negative result allows one to discard a particular probe immediately; a positive result can be followed up by higher resolution linkage analysis.

10.2.3.2 Radiation hybrid analysis

In 1990, Cox, Myers, and their colleagues described a novel technique for determining gene order and distance which is as highly resolving as traditional linkage analysis but does not depend upon breeding. The approach used has similarities to, as well as differences from, both recombinational mapping and physical mapping. Radiation hybrid mapping was originally developed for use with the human genome, but with appropriate starting material and a sufficient number of chromosome-specific DNA markers, it can be used in the analysis of any species (Cox et al., 1990).

The starting material is a somatic cell hybrid line that contains only the chromosome of interest within a host background derived from another species. As indicated above, a common host species used for mouse chromosomes is the hamster. A well-established, stable hamster cell hybrid line containing a single mouse chromosome can be subjected to irradiation with X-rays that shatter each chromosome into multiple fragments. The irradiated cells are then placed together with pure hamster cells under conditions that promote fusion. Approximately 100 new hybrid clones are recovered that contain fragments of the mouse chromosome present in the original hybrid line. Finally, each of these lines is analyzed for the presence of various DNA markers that had been mapped previously into the chromosomal region of interest.

The order and distance of loci from each other can be determined according to the premise that X-rays will break the chromosome at random locations. Thus, the closer two loci are together, the less likely it is that a break will occur between them. If two loci are side-by-side, they will either both be present or both be absent from all 100 cells with 100% concordance. If two loci are at opposite ends of the chromosome, there will still be cells that have neither or both, but there will also be a large number that have only one or the other. (A cell can carry both loci even if the frequency of breakage between the two is 100% since it is possible for a hybrid cell to pick up more than one chromosomal fragment.) As the probability of chromosome breakage varies between 0% and 100% for various pairs of loci under analysis, the fraction of hybrid cells that carry both loci will vary from 100% down to a control value obtained for unlinked loci. Thus, by typing each of the "radiation hybrid cells" in the set of 100 for a series of DNA markers, it becomes possible to construct a linkage map which is highly analogous to traditional recombinational maps.
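The two-point calculation can be sketched as below. The estimator shown is one commonly used form, which scales the observed number of discordant clones by the number expected if a break always separated the two markers; it should be read as an illustrative sketch under stated assumptions rather than as the exact procedure of the studies cited here. The function names and the retention data are hypothetical.

```python
import math

def breakage_frequency(typing_a, typing_b):
    """Two-point estimate of the breakage frequency (theta) between markers.

    typing_a, typing_b: equal-length lists of booleans, one entry per
    radiation hybrid clone, True if the marker was retained.
    """
    total = len(typing_a)
    discordant = sum(a != b for a, b in zip(typing_a, typing_b))
    ra = sum(typing_a) / total  # retention frequency of marker A
    rb = sum(typing_b) / total  # retention frequency of marker B
    # Fraction of discordant clones expected if a break always separates A and B.
    expected = ra + rb - 2 * ra * rb
    if expected == 0:
        raise ValueError("markers retained in all clones or in none")
    return min(discordant / (total * expected), 1.0)

def map_distance(theta):
    """Convert theta to an additive distance, -ln(1 - theta)."""
    return math.inf if theta >= 1.0 else -math.log(1.0 - theta)

# Hypothetical panel of 100 hybrids: 30 retain both markers, 10 retain only A,
# 10 retain only B, and 50 retain neither.
typing_a = [True] * 30 + [True] * 10 + [False] * 10 + [False] * 50
typing_b = [True] * 30 + [False] * 10 + [True] * 10 + [False] * 50

theta = breakage_frequency(typing_a, typing_b)
print(f"theta = {theta:.3f}, distance = {map_distance(theta):.3f}")
```

For these made-up data, theta comes out at roughly 0.42. Two markers that are always retained or lost together give a theta of zero, and unlinked markers give a theta near one.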

It is possible to obtain linkage maps at different levels of resolution through the use of different intensities of radiation to break chromosomes. For example, with high levels of radiation that break chromosomes once every 100 kb, on average, one could map loci from 10 kb to 500 kb; with lower levels of radiation, mapping could be performed over a window from 500 kb to 5 mb.

The analogy to classical recombination mapping is striking in that a determination of linkage distance in both cases is based on the probability with which chromosomes will break followed either by recombination (in the classical case) or by segregation upon cell fusion (in the radiation hybrid case). In both cases, linkage distances are determined by counting the ratio of offspring (pups or cells) that do or do not carry particular sets of DNA markers (alleles or genes). However, linkage distances obtained through radiation hybrid analysis are much more likely to be indicative of actual physical distances.

Although radiation hybrid analysis has provided a crucial tool for genetic analysis in humans, once again, it has not been as widely used by the mouse community because classical linkage analysis is so much more powerful. Nevertheless, the resolution of this protocol has been validated in a study of the region of mouse chromosome 2 surrounding the agouti locus (Ollmann et al., 1992). In this study, the radiation hybrid map that was obtained corresponded exactly with that predicted from linkage analysis, with a level of resolution that was approximately 40-fold higher. Thus, radiation hybrid mapping could serve to fill in the gap between linkage maps and physical maps, especially in "cold" regions between hotspots where distantly spaced markers cannot be separated by recombination (section 7.2.3).

10.3 Physical maps and positional cloning

There are two stages in the process of positional cloning. The first stage is the focus of a major portion of this book: to use formal linkage analysis and other genetic approaches as tools to find flanking DNA markers that lie very close to the locus of interest. With these markers in hand, one can move to the second stage of this pathway: obtaining clones that cover the critical region, then identifying the gene of interest apart from all other genes and non-genic sequences within this region.

This second stage will be the focus of the remaining section of this book. In what follows, I will move away from the realm of the formal geneticist to that of the molecular biologist. However, for several reasons, my intention is only to provide an overview of the conceptual framework that underlies the various approaches being used at the current time. First, the topics of physical mapping and positional cloning have filled entire books and many excellent review articles. Second, these linked topics are driven by technology, and new improved protocols are constantly moving old ones onto the shelves. Consequently, any detailed discussion of actual molecular techniques will quickly become outdated.

10.3.1 Prerequisites to positional cloning

The absolute first step in the process of positional cloning is the high resolution mapping of the locus of interest relative to closely linked DNA markers. This process (described at length in chapter 9) provides an investigator with two sets of complementary tools that are essential prerequisites to the actual generation of a physical map around the locus of interest. The first set of "tools" will be represented by the small number of animal samples with crossover sites in the vicinity of the locus. The second set of tools is the small group of closely linked DNA markers.

Once the phenotypically-defined gene has been closely linked to one or more DNA markers, it becomes possible to consider the complete cloning of the region that must contain the gene. There are no absolute cutoffs for determining what level of linkage is necessary before one can pursue this path, but in general, linkage should be tighter than one centimorgan. Ideally, it is best to start a cloning project with one, or preferably more, DNA markers that show absolute linkage to the gene of interest upon analysis of at least 300 meiotic events or 77 recombinant inbred lines. From the equations used to derive figures 9.8 and 9.17 (from appendix D), one can determine that complete concordance in either of these cases provides a mean estimate for linkage distance of 0.23 cM which translates into a mean physical distance of 460 kb between marker and locus. These data also provide a 95% confidence upper limit of one centimorgan which translates into a distance of 2,000 kb.
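The figures quoted above can be approximated with a simple calculation. The sketch below is an assumption on my part and is not taken from appendix D, whose equations may differ: it asks what recombination fraction would still give zero recombinants with a stated probability, and for recombinant inbred lines it converts through the Haldane-Waddington relation R = 4r / (1 + 6r) for sib-mated lines. Under these assumptions, the 0.23 cM figure corresponds to the 50% point rather than to a mean in the strict sense. The function names are hypothetical.

```python
def upper_limit_backcross(n_meioses, confidence):
    """Largest recombination fraction r still consistent, at the given
    confidence, with zero recombinants in n_meioses.
    Solves (1 - r) ** n_meioses = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_meioses)

def upper_limit_ri(n_lines, confidence):
    """Same idea for recombinant inbred lines with zero discordant lines.
    Uses R = 4r / (1 + 6r) for sib-mated lines, inverted as r = R / (4 - 6R)."""
    big_r = 1.0 - (1.0 - confidence) ** (1.0 / n_lines)
    return big_r / (4.0 - 6.0 * big_r)

KB_PER_CM = 2000  # average implied by the text (1 cM taken as 2,000 kb)

for label, r in [
    ("300 meioses, 50% point", upper_limit_backcross(300, 0.50)),
    ("300 meioses, 95% limit", upper_limit_backcross(300, 0.95)),
    ("77 RI lines, 50% point", upper_limit_ri(77, 0.50)),
    ("77 RI lines, 95% limit", upper_limit_ri(77, 0.95)),
]:
    cm = 100 * r
    print(f"{label}: {cm:.2f} cM, about {cm * KB_PER_CM:.0f} kb")
```

Both experimental designs give a 50% point close to 0.23 cM and a 95% limit close to one centimorgan, which is why 300 meiotic events and 77 recombinant inbred lines are treated as roughly equivalent here.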

10.3.2 PFGE and long range genomic restriction maps

It is possible to derive long range restriction maps spanning genomic regions that have yet to be cloned (Barlow and Lehrach, 1987). The main utility of such restriction maps is to place lower and upper limits on the physical distance that separates two or more DNA markers known to be linked from breeding studies or other methods discussed previously. With this information in hand, one can make a more informed decision as to whether it is best to proceed directly with cloning and walking between marker loci or better to derive additional DNA markers that lie between those available.

Long range restriction mapping requires two tools: the first is a method for separating very large DNA fragments based on size differences, and the second is a set of reagents for cutting DNA at relatively rare restriction sites. The required methodology was invented by Schwartz and Cantor (Schwartz and Cantor, 1984) and is known as Pulsed Field Gel Electrophoresis (PFGE). This technique permits the physical separation of DNA molecules that vary in size up to nine megabases in practice, with no upper limit in theory. The actual "window" of separation achieved is determined by the conditions of electrophoresis: at the lower end, one can obtain separation in the range of 20 kb to 200 kb, just beyond that possible with classical electrophoresis; at the upper end, one can obtain separation in the range of 1.4 to nine megabases (Barlow and Lehrach, 1987).

The PFGE protocol would not be very useful for mapping mammalian chromosomes, which typically vary in size from 100 to 250 megabases, without a means for cutting these chromosomes at specific sites that are scattered from hundreds of kilobases up to a few megabases apart from each other. The means for doing just this appeared with the discovery of a special class of "rare-cutting" restriction enzymes. Restriction enzymes may cut rarely within mammalian DNA for two reasons. The first is a recognition site of eight bases rather than the usual four or six. In a genome with truly random sequence, an eight base recognition site would appear only once in every 4^8 (65,536) bp, or roughly 65 kilobases. However, mammalian DNA is not truly random. In fact, one particular dinucleotide, CpG, is severely under-represented by a factor of five (see section 8.2.2). This fact provides the second reason why certain enzymes will cut genomic mouse DNA only rarely: they contain one or more CpG dinucleotides in their recognition site. One enzyme in particular, Not I, has an eight base recognition site as well as two CpG dinucleotides; the average distance between Not I sites is estimated at over one megabase. Other enzymes have either an eight base recognition site (Sfi I) or a six base recognition site with two CpG dinucleotides (Nru I, Mlu I, BssHII, Eag I, Sac II, etc.), and finally, there are enzymes with a six base recognition site and only one CpG (Sal I, Cla I, Nar I, Xho I, etc.). Taken together, experiments with these various enzymes can be used to provide a distribution of restriction fragments that vary from 50 kb to multiple megabases in length.
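The arithmetic behind these two effects can be sketched as follows. The model is deliberately crude and is an assumption for illustration: it treats all four bases as equally frequent and independent, except that every CpG in the recognition site is taken to be five-fold rarer than chance, and it ignores actual base composition and CpG methylation. The function name is hypothetical; the recognition sequences are those of Not I (GCGGCCGC), Mlu I (ACGCGT), and Sal I (GTCGAC).

```python
def expected_site_spacing(site, cpg_depletion=5.0):
    """Expected distance in bp between occurrences of a recognition site.

    Assumes equal base frequencies and independence between positions,
    except that each CpG dinucleotide in the site is cpg_depletion times
    rarer than chance.
    """
    spacing = 4.0 ** len(site)
    spacing *= cpg_depletion ** site.upper().count("CG")
    return spacing

for enzyme, site in [("Not I", "GCGGCCGC"),
                     ("Mlu I", "ACGCGT"),
                     ("Sal I", "GTCGAC")]:
    kb = expected_site_spacing(site) / 1000
    print(f"{enzyme} ({site}): one site per ~{kb:,.0f} kb")
```

Under this model Not I sites fall about 1.6 megabases apart, in line with the estimate of over one megabase given above. The six-base figures are rough guides only, since the model leaves out the factors just listed.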

Long range restriction maps are best generated by a combination of two approaches (Herrmann et al., 1987; Barlow et al., 1991). First, single or double digests can be performed on very high molecular weight genomic DNA with a panel of rare-cutting enzymes. Second, the same DNA sample can be treated with individual rare-cutting enzymes under conditions where partial digestion will occur (Barlow and Lehrach, 1990). All of these samples are loaded into adjacent lanes on the same gel which is run according to the PFGE protocol, blotted, and then probed sequentially with various markers from the region of interest. The basic strategy for building up restriction maps is similar to that encountered with isolated small clones like plasmids (Sambrook et al., 1989). The physical distance between two markers can be determined by identifying and sizing those restriction fragments, or partially digested fragments, that hybridize to both markers, or only one marker or the other.
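One part of this reasoning, the upper limit on the distance between two markers, can be sketched in code. The sketch assumes that a band of the same size detected by both probes in the same digest is a single fragment carrying both markers, which in practice must be confirmed because unrelated fragments can co-migrate. The function names, the 5% size tolerance, and all fragment sizes are hypothetical.

```python
def shared_fragments(sizes_a, sizes_b, tolerance=0.05):
    """Return fragment sizes (kb) detected by both probes.

    Two bands are treated as the same fragment when their sizes agree
    within the given fractional tolerance (an assumed gel resolution).
    """
    shared = []
    for a in sizes_a:
        for b in sizes_b:
            if abs(a - b) <= tolerance * max(a, b):
                shared.append((a + b) / 2)
    return sorted(shared)

def max_separation(digests_a, digests_b):
    """Upper limit (kb) on the distance between two markers: the smallest
    fragment, over all digests, that hybridizes to both probes."""
    candidates = []
    for enzyme in digests_a.keys() & digests_b.keys():
        candidates += shared_fragments(digests_a[enzyme], digests_b[enzyme])
    return min(candidates) if candidates else None

# Hypothetical band sizes (kb) seen with each probe on the same blot.
marker_a = {"Not I": [1200], "Mlu I": [450], "Nru I": [300]}
marker_b = {"Not I": [1200], "Mlu I": [460], "Nru I": [700]}

limit = max_separation(marker_a, marker_b)
print(f"markers lie within about {limit:.0f} kb of each other")
```

Here the two probes share a Not I fragment and an Mlu I fragment, and the smaller of the two sets the upper limit. The Nru I digest, in which the probes detect different fragments, shows that at least one Nru I site lies between the markers.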

Prior to the development, and easy availability, of large insert genomic libraries, the rare cutting enzyme/PFGE approach provided the most feasible means for estimating physical distances between linked loci that are separated by hundreds of kilobases or more. However, it is now often the case that physical mapping is more readily accomplished in the context of clones. Nevertheless, there are still many situations where a region of interest is flanked by two markers that are too distant from each other to allow rapid cloning between them. Genomic restriction mapping can play a unique role in these situations.

10.3.3 Large insert genomic libraries

10.3.3.1 YACs and other large insert cloning systems

With the availability of one or more closely linked DNA markers from a genomic region of interest, one can begin to develop a contig of overlapping clones that spans the region. A cloned contig not only provides information on physical distances but can also be used as the raw material from which positional cloning of a phenotypically-defined locus can proceed. The generation of a contig is pursued most efficiently by screening and walking through a large insert genomic library. Although a number of systems for generating large insert libraries have been described, to date, the Yeast Artificial Chromosome (YAC) cloning system remains the most important for mouse geneticists.

The YAC cloning system was first developed by David Burke and Maynard Olson at Washington University in St. Louis (Burke et al., 1987). It is based on the formation of "artificial" yeast chromosomes with the ligation of random, large fragments of genomic DNA between two arms that contain, in one case, a telomere and a centromere, and in the other case, a telomere alone, with selectable drug-resistance markers on both arms. These YAC constructs are transfected back into yeast where they will move alongside host chromosomes into both daughter cells at each mitotic division.

The construction of a YAC library proceeds in a manner that is very different from that of most other types of genomic libraries. Every clone in the library must be picked individually and placed into a separate compartment (of a microtiter dish, for example). This process is extremely time-consuming and labor-intensive, but once a library has been formed with individual clones in individual wells, it is essentially immortal. For this reason and others, it makes good sense to screen established libraries for a gene of interest rather than to create a new library. The first mouse YAC library to be described had a 2.2-fold genomic coverage and an average insert size of ~265 kb and was distributed freely to the entire scientific community (Burke et al., 1991; Rossi et al., 1992). Several other mouse YAC libraries have since been described with greater insert size and genomic coverage (Larin et al., 1991; Chartier et al., 1992; Kusumi et al., 1993). The most comprehensive, well-characterized mouse YAC library described to date contains 19,421 clones with an average insert size of 650 kb for a 4.3-fold coverage of the genome (Kusumi et al., 1993). This library is available for screening commercially through Research Genetics Inc. (Huntsville, Alabama USA). Screening of this library, and most others, is based on PCR analysis of a hierarchy of clone pools (Green and Olson, 1990). Detailed protocols for library preparation, screening, and analysis have been described (Larin et al., 1993; Nelson and Brownstein, 1994).
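Coverage figures of this kind follow from simple arithmetic, and they translate into the chance that any one locus is actually present in a library. The sketch below assumes a haploid mouse genome of 3,000 megabases and random, unbiased cloning (a Poisson approximation); both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

MOUSE_GENOME_KB = 3_000_000  # assumed haploid genome size (3,000 Mb)


def fold_coverage(n_clones, mean_insert_kb, genome_kb=MOUSE_GENOME_KB):
    """Total cloned DNA divided by the genome size."""
    return n_clones * mean_insert_kb / genome_kb


def prob_locus_present(coverage):
    """Chance that a given locus lies in at least one clone (Poisson)."""
    return 1 - math.exp(-coverage)


c = fold_coverage(19_421, 650)
print(f"coverage ~{c:.1f}x, P(locus present) ~{prob_locus_present(c):.3f}")
print(f"2.2x library: P(locus present) ~{prob_locus_present(2.2):.3f}")
```

With the assumed genome size, the Kusumi et al. library works out to about 4.2-fold coverage, close to the reported 4.3-fold. The same approximation suggests that a 2.2-fold library would miss roughly one locus in nine, which is one reason higher-coverage libraries are preferred for walking.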

It should be mentioned that the YAC cloning system is not perfect. At the time of this writing, it is still the case that a very high percentage of clones from all of the largest insert YAC libraries are chimeric; that is, their inserts are composed of two or more unrelated genomic fragments that have become co-ligated in an undefined manner. The pre-identification of chimeric clones is essential before one can begin to generate a physical map.

Two other systems for cloning large genomic inserts have been described more recently. One is based on the use of the bacteriophage P1 as a cloning vector (Pierce et al., 1992; Pierce and Sternberg, 1992). This system has been used to obtain a mouse genomic library with average inserts in the range of 75-95 kb with a maximum cloning capacity of 100 kb. The P1 cloning system has two advantages over YACs: first, it has much more efficient cloning rates, and second, like other bacterial cloning systems, it allows the efficient purification of large amounts of clone DNA away from the rest of the bacterial genome. The utility of this cloning system in the analysis of genomic organization within the H2 region has been demonstrated (Gasser et al., 1994).

Another more recent system is derived from the well-studied E. coli F factor which is essentially a naturally-occurring single copy plasmid (Shizuya et al., 1992). This plasmid has been converted into a vector that allows the cloning of inserts with more than 300 kb of DNA, and with a reported average size range of 200-300 kb. The authors have called this vector/insert system a Bacterial Artificial Chromosome or BAC. The BAC system has the same advantages as P1 and the added advantage of a larger potential insert size. The BAC system has not been analyzed extensively enough to know whether chimerism will be a problem and whether the whole mouse genome will be fairly represented within this library.

10.3.3.2 Walking and building contigs

All positive clones from a YAC, or other large insert, library can be sized by PFGE, and fragments at both ends of each insert can be isolated rapidly by several standard protocols (Riley et al., 1990; Cox et al., 1993; Zoghbi and Chinault, 1994). End fragments from each clone should be used as probes to perform an initial test of the possibility of chimerism. This can be accomplished by probing appropriate somatic cell hybrid lines to determine whether both ends map to the same chromosome as the original DNA marker used to isolate the clone; if appropriate somatic cell hybrid lines are not available, one can also test the segregation of the end fragments on a panel of 20 interspecific (or intersubspecific) backcross samples. If the two end-fragments show complete concordance in transmission, this can be taken as strong evidence for non-chimerism; in contrast, two or more recombination events would be highly suggestive of a chimeric clone. Chimeric clones need not be discarded; it is just necessary to be aware of their nature in any interpretation of the data that they generate.
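The backcross test for chimerism amounts to counting discordant animals between the two end fragments. A minimal sketch follows, with each locus typed as a string of single-letter alleles across the same 20 animals. The typing data and function names are hypothetical, and the "inconclusive" call for a single recombinant is an assumption made here, since only the zero and two-or-more cases are addressed above.

```python
def recombinants(typing_a: str, typing_b: str) -> int:
    """Count backcross animals whose alleles differ between two loci."""
    if len(typing_a) != len(typing_b):
        raise ValueError("both loci must be typed on the same animals")
    return sum(a != b for a, b in zip(typing_a, typing_b))


def chimerism_call(left_end: str, right_end: str) -> str:
    n = recombinants(left_end, right_end)
    if n == 0:
        return "concordant: consistent with a non-chimeric clone"
    if n == 1:
        return "one recombinant: inconclusive"
    return f"{n} recombinants: suggestive of a chimeric clone"


# Hypothetical typings of 20 backcross animals (S and B are the two alleles)
left = "SBSSBBSBSBBSSBSBBSSB"
right_same = "SBSSBBSBSBBSSBSBBSSB"
right_other = "SSBSBBBBSSBSSBBBSSSB"
print(chimerism_call(left, right_same))
print(chimerism_call(left, right_other))
print(f"chance of full concordance if unlinked: {0.5 ** 20:.1e}")
```

The last line shows why complete concordance is persuasive: two unlinked loci would agree in all 20 animals by chance less than once in a million trials.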

If multiple clones have been obtained from a screen with a DNA marker, end fragments from each should be used in cross-hybridization experiments to identify the particular clones that extend furthest in each direction along the chromosome. Often this approach will reduce the number of clones worth pursuing to just two. "Walking" through the library can proceed by using the farthest end-fragments for re-screening, and then analyzing the resulting clones in the same manner described above. In this manner, a "contig" will be built over the genomic region that contains the locus of interest (Zoghbi and Chinault, 1994).

The process of deriving YAC clones from a library can be brought to a halt when the clones that have already been obtained must include the locus being sought. It is only possible to reach this conclusion when the derived contig extends over markers that map apart from the locus on both of its sides. In other words, the contig must extend across the two closest recombination breakpoints that define the outer limits of localization. If cloning is begun with a very dense map of markers placed onto a high resolution cross, this endpoint is likely to be reached more quickly. With real luck, it might even be reached with the first set of YACs obtained in the initial screening of the library.

10.3.3.3 Physical mapping: a comprehensive example

The overall strategy that one follows to move from a phenotype to a cloned contig is best explained within the context of a hypothetical example that is illustrated in figure 10.1. Suppose you are interested in cloning a newly identified locus that has mutated to cause a phenotype of green eyes. First you search through the literature and the various genetic databases to see if any similar mutant phenotype has been uncovered previously. When this search fails to uncover previous examples of green-eyed mice, you decide to set up an intersubspecific backcross to follow the segregation of the mutant locus relative to DNA markers spread throughout the genome as detailed in section 9.4. An analysis of 50 backcross offspring with two to three markers taken from each chromosome demonstrates linkage to the distal region of Chr 3 between two markers that are forty centimorgans apart from each other. With this information in-hand, you re-type the same offspring with ten additional Chr 3 markers, spaced at approximately two centimorgan intervals over the derived map position, to further localize the green eyed mutation. This step yields a map position between two limiting markers that are spaced four centimorgans apart. Now you increase the number of backcross offspring in your typing set to 400 and you analyze each for the segregation of another thirty markers that were previously mapped with or between the limiting markers. This third step yields four markers that are most tightly linked to the green eyed locus with the hypothetical haplotype data shown in panel A of figure 10.1. As illustrated in the figure, the data demonstrate (1) complete concordance between green eyed and the marker D3Xy55, (2) one recombinant in 400 with the proximal marker D3Ab34, and (3) two recombinants in 400 with two completely concordant distal loci D3Xy12 and D3Ab29. Panel B of figure 10.1 shows the linkage map that is generated from these data.
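The recombinant counts in this example convert directly into map distances. The sketch below does that conversion, and also asks how far away a completely concordant marker could still be. The 95% confidence level and the function names are choices made here for illustration.

```python
def map_distance_cm(recombinants: int, n: int) -> float:
    """Recombination fraction as a percentage (cM for small distances)."""
    return 100 * recombinants / n


def upper_limit_cm(n: int, confidence: float = 0.95) -> float:
    """Upper confidence limit on distance when 0 recombinants are seen in n."""
    return 100 * (1 - (1 - confidence) ** (1 / n))


N = 400
for marker, rec in [("D3Ab34", 1), ("D3Xy55", 0), ("D3Xy12/D3Ab29", 2)]:
    print(f"{marker:14s} {rec} recombinants  {map_distance_cm(rec, N):.2f} cM")
print(f"95% upper limit for D3Xy55: {upper_limit_cm(N):.2f} cM")
```

The flanking markers lie 0.25 cM and 0.5 cM from the green eyed locus. Complete concordance in 400 animals does not place D3Xy55 at the locus itself; with these assumptions the marker could still be as much as about 0.75 cM away.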

With one concordant marker and closely flanking markers on either side of the locus of interest, one can begin to develop a physical map. All four of the nearby markers are used to screen a YAC library. As shown in panel C of figure 10.1, D3Ab34 identifies two clones (1 and 5), D3Xy55 identifies another two clones (2 and 6), D3Xy12 identifies another two clones (3 and 8), and D3Ab29 identifies another two clones (4 and 7). End fragments are derived from all eight clones and are used first to search for overlaps by hybridization to the complete set of YACs. This search demonstrates an overlap between clone 8 and clones 4 and 7. Thus, in the first round of screening, two independent markers — D3Xy12 and D3Ab29 — have been physically linked into a single contig.

End fragments from clones 1, 6, 2, and 3 are used to re-screen the YAC library. This screen yields clones 9, 12, 10, and 11. Once again, end fragments are derived from these new clones and used first to search for overlaps. This search demonstrates an overlap between clones 9 and 12 which provides a physical linkup between the markers D3Ab34 and D3Xy55. Thus, after two rounds of screening, two contigs have been formed with each containing two of the four markers. Finally, in an attempt to fill-in the gap between YACs 10 and 11, a third round of screening is performed with an end fragment from each of these clones. Both end fragments immediately identify the same clone (13) and thus, without further analysis, it is possible to state that a single contig has been generated across the entire region of interest. Most importantly, the contig crosses recombination breakpoints both proximal and distal to the green eyed locus. Thus, the green eyed locus must lie within the contig.
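The bookkeeping behind this walk can be treated as a small graph problem: clones are nodes, detected overlaps are edges, and each connected group of clones is a contig. The sketch below replays the three rounds of screening. It assumes that clones sharing a marker overlap at that marker, and that the re-screening probes and new clones pair up in the order listed (1 with 9, 6 with 12, 2 with 10, 3 with 11); that pairing is an inference from the final clone order, not a stated fact. The function name is hypothetical.

```python
def count_contigs(clones, overlaps):
    """Number of connected groups of clones, given pairwise overlaps."""
    parent = {c: c for c in clones}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in overlaps:
        parent[find(a)] = find(b)
    return len({find(c) for c in clones})


# Round 1: clones sharing a marker, plus the overlap of clone 8 with 4 and 7
round1 = [(1, 5), (2, 6), (3, 8), (4, 7), (8, 4), (8, 7)]
# Round 2: assumed probe-to-clone pairing, plus the 9/12 overlap
round2 = round1 + [(1, 9), (6, 12), (2, 10), (3, 11), (9, 12)]
# Round 3: clone 13 is found with end fragments from both 10 and 11
round3 = round2 + [(10, 13), (11, 13)]

print(count_contigs(range(1, 9), round1))
print(count_contigs(range(1, 13), round2))
print(count_contigs(range(1, 14), round3))
```

Under these assumptions the number of contigs falls from three to two to one over the three rounds, which matches the narrative: two contigs after the second round and a single contig once clone 13 closes the gap.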

The contig is minimally defined by 10 overlapping clones; from most proximal to most distal, they are 5, 9, 12, 2, 10, 13, 11, 3, 8 and 7. Each of these clones must be sized and restriction mapped to construct the complete physical map shown in panel D of figure 10.1. At this stage of the analysis, one can say only that the green eyed locus must reside within the 1,360 kb cloned region between D3Ab34 and D3Ab29. A region of this size is still quite large for undertaking gene identification studies, and thus it makes sense to try to narrow it down further. Toward this goal, one can return to the three backcross animals with recombination breakpoints located nearest to the green eyed gene (156, 078, and 332 from panel A). The three corresponding samples of genomic DNA can be typed at the new loci defined by the six end fragments characterized in the YAC walking protocol just completed (1R, 6L, 2R, 10R, 11L, and 3L, where R and L signify right and left ends respectively). The results of this last genetic analysis are shown in panel E of figure 10.1 (where haplotypes are rotated 90° to match them up to the physical map). The data allow further localization of the proximal breakpoint between 1R and 6L and further localization of a closest distal breakpoint (in sample no. 332) between markers 2R and 10R. These results reduce by more than two-fold the size of the genomic region that must contain the green eyed locus, from 1,360 kb down to 560 kb. This region is contained within just four YACs (9, 12, 2, and 10) that can now be analyzed for potential gene sequences as described in the next section.
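Locating a breakpoint among the new end-fragment loci is a matter of finding where a recombinant chromosome switches from one parental allele to the other. In the sketch below, the marker order is inferred from the clone order given above, and the two haplotype strings are invented to reproduce the breakpoints described in the text; they are not the panel E data. "M" stands for the allele inherited with the mutation and "w" for the other allele.

```python
# Assumed proximal-to-distal order, inferred from the contig description
MARKERS = ["D3Ab34", "1R", "6L", "D3Xy55", "2R", "10R", "11L", "3L", "D3Xy12"]


def breakpoint_interval(haplotype: str):
    """Return the adjacent marker pair between which the alleles switch."""
    if len(haplotype) != len(MARKERS):
        raise ValueError("one allele is required per marker")
    for i in range(len(haplotype) - 1):
        if haplotype[i] != haplotype[i + 1]:
            return MARKERS[i], MARKERS[i + 1]
    return None  # no crossover within the typed interval


# Invented haplotypes for illustration only
proximal_recombinant = "wwMMMMMMM"
sample_332 = "MMMMMwwww"
print(breakpoint_interval(proximal_recombinant))
print(breakpoint_interval(sample_332))
```

The first call places the proximal breakpoint between 1R and 6L and the second places the distal breakpoint between 2R and 10R, so the critical region runs from 1R to 10R.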

10.3.4 Protocols for gene identification

Even before the entire region that encompasses the locus of interest has been cloned, it is possible to begin the search for candidate genes within the large-insert clones that first become available. It is a good idea to pursue this search simultaneously with genomic walking for two reasons. First, you could get lucky and find your gene in the initial cloned region. Second, the search for candidate genes can be daunting (it took ten years to identify the human Huntington Disease gene; Huntington’s Disease Collaborative Research Group, 1993), so it makes good sense to start as soon as possible.

Many different protocols have been devised over the past several years to carry out this task (Parrish and Nelson, 1993). Generally-speaking, these protocols can be placed into three groups according to the underlying principle that they incorporate. First are protocols that rely upon the identification of transcribed sequences by cross-hybridization. Second are protocols that do not depend on gene activity but rather special characteristics in the DNA itself that are unique to mammalian genes. Third are computational protocols that can be used to distinguish coding regions and regulatory regions from non-functional regions within long stretches of DNA sequence.

10.3.4.1 Candidate gene identification based on expression

Traditional approaches to identifying expressed sequences within genomic clones rely on using these clones, or subfragments from them, as hybridization probes to screen cDNA libraries constructed from a tissue in which the locus of interest is thought to be expressed. In theory, the simplest strategy would be to use YAC clones directly as probes (Marchuk and Collins, 1994). In practice, this simple strategy has shown only limited success for a number of reasons including not only the high complexity of the clone — which results in a reduced signal strength for each individual transcript region within it — but also the difficulty of purifying high quality YAC DNA in sufficient quantity. These problems are circumvented by subcloning the YAC into cosmids or phage which can each be used individually to probe the cDNA library. However, this increases the workload by at least an order of magnitude.

Another expression-based strategy that is not dependent on cDNA libraries is to use subclones from YACs as probes of Northern blots containing RNA from tissues thought to express the gene along with RNA from tissues that should not express the gene based on the mutant phenotype. Positively-hybridizing subclones can be further subcloned and individual fragments can be re-tested to narrow down the location of the transcript-containing sequences. Although this process provides some additional expression information that can be useful for sorting out candidates, it is quite tedious, requires large amounts of tissue RNA, and is no longer the method of choice.

New approaches to detecting expressed cDNA sequences that circumvent many of the disadvantages of the methods just described are all based on the use of PCR. With one such approach, the YAC DNA, rather than the RNA or cDNA, is immobilized on filters. These filters are probed with specially engineered cDNA libraries in which all inserts are flanked by unique targets for PCR amplification. After filter hybridization and washing, those cDNAs that remain specifically attached can be eluted, amplified, and cloned (Lovett et al., 1991; Parimoo et al., 1991). Many variations upon this general theme have been described.

10.3.4.2 Expression-independent gene identification

There are two serious problems inherent in all attempts to locate genes based on hybridization to RNA transcripts or amplified products from these transcripts. The first problem is that, from the phenotype alone, it may not be possible to determine the tissue specificity of gene expression. The second problem occurs even when the specific expressing-tissue can be reasonably well-identified in that the majority of transcript classes will be present at relatively low levels and will be difficult to retrieve. As a consequence, whole classes of genes will go undetected including, for example, those that are expressed only during brief periods of embryonic development, or only in a small subset of cells from complex tissues like the brain. Even genes expressed more broadly can go undetected if their corresponding transcripts are present in one or a few copies per cell.

Three general approaches to gene-identification have been developed that are not dependent on gene expression. Broadly-speaking, these approaches are based on three corresponding characteristics of mammalian genes: (1) the occurrence of introns in nearly all mammalian genes, (2) the presence of "CpG" islands at the 5’-ends of most mammalian genes, and (3) the evolutionary conservation of nearly all mammalian genes from mice to humans and sometimes beyond.

10.3.4.3 Exon Trapping

The first approach is referred to as Exon Trapping (Buckler et al., 1991). It is based on the empirical finding that the vast majority of splice recognition sites are not cell type-specific. Instead, a general splicing machinery present in all cells can act with precision upon endogenous as well as foreign transcripts. This machinery can be exploited in tissue culture to identify YAC-derived genomic fragments that contain exons. Essentially, one subjects the YAC DNA to restriction digestion with a standard 6-base recognition site enzyme followed by shotgun cloning into a special eukaryotic expression vector that contains flanking splice donor and acceptor sites. The clones are then transfected into mammalian cells that allow their high level expression into RNA containing each cloned insert. If an insert does not contain an exon, splicing will proceed directly from the splice donor site on one side of the transcribed insert to the splice acceptor site on the other side to produce a final transcript of a pre-defined size. However, if an entire exon is contained within a particular fragment, it will be spliced into the mature transcript. The set of transcripts produced in a particular transient cell culture can be amplified by RT-PCR and analyzed by gel electrophoresis. All PCR products that are larger than the background splicing product should contain insert-derived exons that can be readily cloned directly from the gel. Once exons have been identified and cloned by this protocol, they can be used directly as probes to study tissue and stage specificity of expression, which is a prerequisite to the recovery of full-length cDNA clones.

In theory, a protocol of this type should allow the isolation of all of the exons present in a particular YAC clone. This collection of exons would be representative of all of the genes present on the YAC. But, in practice, sophisticated protocols of this type often fail to live up to expectations. To test the validity of Exon Trapping as a generalized approach to gene identification, Lehrach and his colleagues (North et al., 1993) used this strategy to search for exons on eight cosmid clones that covered a region of 185 kb from the MHC class II region. Of the eight genes that are known to be present within this contig, seven were accounted for within the exon clones that were recovered. This result would imply a success rate for gene identification of ~90%.

10.3.4.4 CpG islands

As discussed previously, the dinucleotide CpG is severely underrepresented in mammalian genomes. This under-representation results from the methylation of the cytosines on both strands of the two-basepair sequence; methylated cytosines are highly susceptible to spontaneous deamination which can cause a transitional mutation to thymidine (Barker et al., 1984). Thus, when CpG sequences are present in the genome, they will mutate frequently to TpG or CpA, and, in fact, these dinucleotide sequences are present at a frequency significantly higher than expected throughout the genome (McClelland and Ivarie, 1982). In contrast, the CpG dinucleotides present at the 5’ ends of many vertebrate genes remain devoid of methylation and thus resistant to mutation. As a consequence, the distribution of CpG dinucleotides is highly non-random with high density "islands" of multiple CpGs that mark the 5’-ends of genes in the midst of large genomic seas that contain only scattered CpGs as isolated entities (Bird, 1986; Bird et al., 1987; Bird, 1987).
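When sequence is available, the contrast between islands and the surrounding "sea" can be expressed as the ratio of observed to expected CpG dinucleotides. The sketch below uses the commonly quoted form of this ratio, in which the expected count is derived from the numbers of C and G residues in the same stretch; the function name and the example sequences are hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG x length) / (#C x #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Hypothetical sequences for illustration only
island_like = "GCGCGGCGCCGCGGGCGCGCCG"
bulk_like = "ATGCATTGCAAGTCTGACCTGAATTCAGTG"
print(f"island-like: {cpg_obs_exp(island_like):.2f}")
print(f"bulk-like:   {cpg_obs_exp(bulk_like):.2f}")
```

A five-fold under-representation corresponds to a ratio of about 0.2 in bulk genomic DNA, whereas unmethylated islands would be expected to sit much closer to 1. In practice the ratio is computed over windows of a few hundred bases rather than over sequences as short as these.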

Gene searchers can exploit this situation by using restriction enzymes that contain two CpG dinucleotides in their recognition sites to identify the 5’ ends of genes. Lindsay and Bird (1987) have calculated that 89% of all NotI sites (GCGGCCGC) are located in CpG islands, as is the case for 74% of all EagI (CGGCCG), SacII (CCGCGG), and BssHII sites (GCGCGC). Thus, the approach to identifying CpG islands within YAC clones becomes relatively straightforward. Partial digestion of the clone is performed with each of the double CpG enzymes just described and the resulting DNA is separated by PFGE, blotted and probed sequentially with fragments from each of the YAC arms. The appearance of bands of the same size in digests obtained with two or more enzymes is highly suggestive of a CpG island. If NotI and one of these other enzymes both recognize sites within one or two kilobases of each other (below the resolution of PFGE), the presence of a CpG island can be assumed with a probability of 97%. If two of the 6-cutter double CpG enzymes both recognize nearby sites, the likelihood of a CpG island is 93%. Once a putative CpG island is identified, various PCR-based methods can then be used to clone the DNA adjacent to the island (Parrish and Nelson, 1993), and these sequences can be examined thoroughly to characterize the associated transcription unit.
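The 97% and 93% figures can be reproduced by treating each nearby site as an independent piece of evidence, so that an island is absent only if every site happens to be one of the minority that lie outside islands. That independence is an inference made here to explain the quoted numbers, not a stated part of the Lindsay and Bird calculation, and the function name is hypothetical.

```python
# Fraction of sites for each enzyme that lie in CpG islands (from the text)
P_ISLAND = {"NotI": 0.89, "EagI": 0.74, "SacII": 0.74, "BssHII": 0.74}


def prob_island(*enzymes: str) -> float:
    """P(island) if nearby sites fall outside islands independently."""
    p_not_island = 1.0
    for enzyme in enzymes:
        p_not_island *= 1 - P_ISLAND[enzyme]
    return 1 - p_not_island


print(f"NotI + EagI:  {prob_island('NotI', 'EagI'):.0%}")
print(f"EagI + SacII: {prob_island('EagI', 'SacII'):.0%}")
```

The same function can be called with three or more enzymes, in which case the probability of an island climbs above 99% under the same assumption.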

There are two main advantages to this approach to gene identification. The first is its simplicity: it is based entirely on restriction digests, gel running, and cloning. The second is that it can enable the identification of genes that may not be detectable by other approaches. The only real disadvantage is that CpG islands are only found in association with 50 to 70% of all genes.

10.3.4.5 Conservation from mice to humans and beyond

Less than five percent of the sequences within the mammalian genome actually contain information that is used to encode gene products. An even smaller fraction (~0.1%) of the genome accounts for all of the regulatory elements like promoters and enhancers that control the stage and tissue-specific expression of this genetic information. Another five to ten percent of the genome consists of elements required for the construction of centromeres, telomeres and other chromosomal structures. The remaining 85 to 90% of the genome has no apparent sequence-specific function.

Nucleotide changes that occur within a non-functional DNA sequence are considered to be neutral. That is to say that such changes provide neither benefit nor harm to the organism within which they reside and, as such, they will not be subjected to selective forces. Instead, over a period of many generations, they will decrease or increase in allele frequency by a process of random drift which will lead either to their extinction (in most cases) or to their fixation within the species. Since spontaneous mutations occur at a constant rate within a population and since each neutral change will have the same (very low) probability of fixation, a non-functional sequence of DNA will slowly change at a constant rate. In mammals, this rate of change has been determined empirically to be on the order of 0.5% (five changes in 1000 nucleotides) per million years.

The constant rate of change in non-functional sequences can be used as a "molecular clock" to gauge the evolutionary distances that separate different species from each other [see, for example, Nei (1987)]. And when the distance between two species is already known, the molecular clock can be used to predict the expected homology between sequences in each that are descended from a common ancestor. Consider the consequences of genetic drift on a non-functional sequence present in the common ancestor to mice and humans some 65 million years ago. During the evolution from this common ancestor to the modern house mouse, it would have undergone changes in 65 × 0.5% or 32.5% of its nucleotides. During the separate evolution of this sequence along the line from the common ancestor to modern humans, changes would also have occurred in 32.5% of its nucleotides. With so many random changes occurring, there is a certain probability that the same nucleotide will be hit two or more times. Taking this fact into consideration yields a corrected divergence of ~27% along each evolutionary line and a comparison between the sequences currently present in mice and humans would show a divergence of ~48%. With changes at approximately half of the nucleotides present in the derived sequences, it will often be hard to even recognize the fact that they have a common ancestor. Most importantly for physical mappers, at this level of divergence, specific cross-hybridization will not take place.
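The correction for multiple hits can be reproduced with a simple Poisson argument: if changes fall at random, the fraction of sites hit at least once is 1 - e^(-m), where m is the mean number of hits per site. This model is an assumption adopted here because it comes close to the quoted figures; the exact method behind those figures is not stated. The constant and function names are hypothetical.

```python
import math

RATE_PER_MYR = 0.005  # 0.5% of neutral sites change per million years
DIVERGENCE_MYR = 65   # time since the mouse-human common ancestor


def fraction_changed(hits_per_site: float) -> float:
    """Fraction of sites hit at least once, with Poisson-distributed hits."""
    return 1 - math.exp(-hits_per_site)


per_lineage = RATE_PER_MYR * DIVERGENCE_MYR  # 0.325 hits per site
print(f"one lineage:    {fraction_changed(per_lineage):.1%}")
print(f"mouse vs human: {fraction_changed(2 * per_lineage):.1%}")
```

The results are close to the ~27% and ~48% given above. The sketch does not account for a second hit that restores the original base, which would lower the observable divergence slightly further.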

In contrast to the situation encountered with non-functional sequences, most nucleotide changes that occur within coding regions will, in fact, be subjected to selective forces. The vast majority of changes that alter protein sequence are detrimental and will not survive within a population. Thus, coding regions will evolve much more slowly than non-functional regions. Although the actual rate of evolution can vary greatly for different genes, the vast majority of mammalian genes characterized to date show specific cross-hybridization between homologous sequences in the mouse and human genomes. In addition, a subset of mammalian genes are so conserved that specific cross-species hybridization can be detected with homologs in Drosophila and C. elegans, and in a smaller subset still, cross-hybridization is detected with homologs in yeast.

With all of this information in-hand, it becomes clear that cross-species Southern blot hybridization studies that use subcloned YAC fragments as probes, under low stringency conditions, can provide a tool for distinguishing between the five percent of sequences that encode proteins and the remaining 95% without coding information. Since investigators often run samples from several different species in adjacent lanes, this approach is often referred to as "zoo blotting".

There are several advantages to this approach. First, it allows detection of the vast majority of coding sequences and, thus, is more universal than the CpG island approach. Second, it is not dependent on actual gene expression and, thus, avoids all of the problems inherent in low level or restricted transcript distributions. The major problem with this approach is that the YAC clone must be subdivided into much smaller pieces that need to be tested individually for hybridization, and once a positive result is obtained, further subcloning and analysis is required. Consequently, the examination of a several hundred kilobase contig by this approach can be extremely tedious and time consuming.

10.3.4.6 Gene identification based on sequence

As sequencing becomes more highly automated and more accurate, the feasibility of stepping nucleotide by nucleotide across an entire YAC clone becomes more and more realistic. The basic approach would be to begin sequencing across the insert from both sides with initial primers facing in from the two YAC arms. The maximum amount of sequence possible would be obtained by moving away from these two primers and then a short segment at the end of each would be used to design a new primer for the next step in sequencing. This process would be repeated over and over until overlap was reached in the middle of the clone. If, with improvements in technology, it becomes possible to read one thousand bases of sequence in any single run, then each step in this procedure would provide two kilobases of information. In total, one hundred and fifty steps would be required for a single pass over the complete sequence of a 300 kb clone. This is certainly not a feasible approach in 1993, but the pace of technology is such that it may well be possible within the next five years.
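The step count follows directly from the insert size and the read length, and it is easy to see how sensitive the estimate is to read length. The sketch below is a simple illustration with a hypothetical function name; it ignores the small overlap that each new primer requires, so it slightly underestimates the true number of steps.

```python
import math


def walking_steps(insert_kb: float, read_kb: float = 1.0, ends: int = 2) -> int:
    """Rounds of primer walking needed for a single pass over an insert."""
    return math.ceil(insert_kb / (read_kb * ends))


print(walking_steps(300))       # 1 kb reads from both ends
print(walking_steps(300, 0.5))  # hypothetical 500-base reads
```

With one-kilobase reads from both ends, a 300 kb clone needs 150 steps; halving the read length doubles the number of steps, and each step must wait for the previous one to finish before its primer can be designed.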

With a long range sequence, it becomes possible to use computational methods alone to ferret out coding regions. Sophisticated neural net-based computer programs have been developed that can identify 90% of all exons with a twenty percent false positive rate as of 1992 (Forrest, 1993; Jan and Jan, 1993; Little, 1993; Martin et al., 1993). Once a putative exon has been identified, it can be used as a probe to search for the tissue in which its expression takes place, and with further studies, it becomes possible to identify the remaining portions of the transcription unit with which it is associated.

10.4 The human genome project and the ultimate map

The goal of determining the complete DNA sequence of the human genome has captured the imagination of many geneticists as well as other biomedical scientists (Kevles and Hood, 1992; Wills, 1992). It was this goal that originally catapulted the "Human Genome Project" into the headlines of newspapers and the talk of politicians in the U.S. and elsewhere during the latter part of the 1980s. In 1990, the National Institutes of Health and the Department of Energy began a coordinated effort to reach this goal within a period of fifteen years by stepping through a series of intermediate goals. In 1993, Francis Collins, the new director of the National Center for Human Genome Research at NIH, outlined the newest version of a five year plan for this project (Collins and Galas, 1993). The five-year goals incorporated into this plan include: (1) a high resolution linkage map at a resolution of 2-5 cM with an established set of easy-to-use DNA markers; (2) a high resolution physical map with Sequence Tagged Site (STS) markers located every 100 kb; (3) improvements in DNA sequencing technology to allow the annual accumulation of 50 megabases of sequence; (4) the development of efficient methods for gene identification within cloned sequences, and for the mapping of genes identified by other means; and (5) further development of technology for increased automation, increased use of robotics, and more sophisticated computer-based tools for information management and analysis.

Although the Human Genome Project is focused, of course, on the human, there is a uniform consensus among researchers that it is only in comparison with the genomes of model organisms such as the mouse that the human genome will reveal all of its secrets. Thus, another primary goal of the human genome project is to sequence selected segments of the mouse genome side-by-side with homologous regions of the human genome. By evaluating the relative level of conservation across a genomic region and combining this information with other computational assessments of sequence content, it will become possible to uncover essentially all genes, promoters, enhancers, and other regulatory elements. Unknown, and unexpected, non-coding sequences that show conservation indicative of biological function are sure to be identified as well, and further studies will be required to understand the functions of these new elements.

Even if the Human Genome Project meets its target of a complete human sequence by the year 2005, in the minds of geneticists, the ultimate map of the genome will extend far beyond the sequence alone and will require many more years to attain. This ultimate map will show all genomic regions that are subject to parental imprinting, all regions that show methylation, and the locations of all DNase I hypersensitivity sites that appear in the context of the chromatin structure which forms around the DNA molecule. Finally, the complete sequence will be used as a jumping board to map the entire network of genetic and gene product interactions that occur during the development of a human being, with a determination of the complete pattern of expression of every single gene through both time and space.

Will such an ultimate map ever be attained? It’s impossible to predict in 1994 as these words are being written. But even if it is attained, will it really tell us something deeply profound about the state of being human? This will only happen if it becomes possible to visualize the manner in which the whole functional network becomes more than the sum of the parts. Will any human mind have the capacity for such a visualization? Is it our destiny to understand the molecular biology of the human soul or, rather, to search forever in vain? These are questions that may be answered in the twenty first century, even later, or perhaps never.