What is DNA sequencing?
DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st Century scientists to explore human biology and other complex phenomena.
Meeting Human Genome Project sequencing goals by 2003 required continual improvements in sequencing speed, reliability, and costs. Previously, standard methods were based on separating DNA fragments by gel electrophoresis, which was extremely labor intensive and expensive. Total sequencing output in the community was about 200 million base pairs for 1998. In January 2003, the DOE Joint Genome Institute alone sequenced 1.5 billion bases for the month.
Gel-based sequencers now use multiple tiny (capillary) tubes to run standard electrophoretic separations. These separations are much faster than the earlier methods because the tubes dissipate heat well and allow the use of much higher electric fields to complete sequencing in shorter times.
See a figure depicting this technology.
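The throughput figures quoted above can be put side by side with a little arithmetic. The sketch below uses only the two numbers given in the text; spreading the 1998 total evenly over 12 months is an assumption made here for comparison, not something the source states.

```python
# Figures quoted in the text above.
community_output_1998 = 200_000_000   # base pairs sequenced by the whole community in 1998
jgi_output_jan_2003 = 1_500_000_000   # bases sequenced by the DOE Joint Genome Institute in January 2003

# Assumption for illustration: the 1998 output was spread evenly over 12 months.
monthly_average_1998 = community_output_1998 / 12

# How many times larger one institute's single month in 2003 was.
ratio = jgi_output_jan_2003 / monthly_average_1998

print(f"1998 community average: {monthly_average_1998:,.0f} bases per month")
print(f"JGI in January 2003 was about {ratio:.0f} times that monthly average")
```

Under that assumption, a single institute in one month of 2003 produced roughly 90 times the monthly output of the entire community five years earlier.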
Whose genome was sequenced in the public (HGP) and private projects?
The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting point for broad comparisons across humanity. The knowledge obtained from the sequences applies to everyone because all humans share the same basic set of genes and genomic regulatory regions that control the development and maintenance of their biological structures and processes.
In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few samples were processed as DNA resources. Thus donors' identities were protected so neither they nor scientists could know whose DNA was sequenced. DNA clones from many libraries were used in the overall project.
Technically, it is much easier to prepare DNA cleanly from sperm than from other cell types because of the much higher ratio of DNA to protein in sperm and the much smaller volume in which purifications can be done. Sperm contain all chromosomes necessary for study, including equal numbers of cells with the X (female) or Y (male) sex chromosomes. However, HGP scientists also used white cells from female donors' blood to include samples originating from women.
In the Celera Genomics private-sector project, DNA from a few different genomes was mixed and processed for sequencing. DNA for these studies came from anonymous donors of European, African, American (North, Central, South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged that his DNA was among those sequenced.
Many polymorphisms—small regions of DNA that vary among individuals—also were identified during the HGP, mostly single nucleotide polymorphisms (SNPs). Most SNPs have no physiological effect, although a minority contribute to the beneficial diversity of humanity. A much smaller minority of polymorphisms affect an individual’s susceptibility to disease and response to medical treatments.
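As a rough illustration of what a single nucleotide polymorphism looks like in sequence data, the sketch below compares two short, already aligned sequences position by position. The sequences and the function name `find_snps` are invented for this example; real SNP discovery works on far longer sequences and has to deal with alignment and sequencing errors.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for every single-base difference.

    Assumes the two sequences are already aligned and have the same length.
    Positions are counted from 0.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences for illustration only.
reference = "ATGCCGTAAGCT"
sample = "ATGCTGTAAGCC"

print(find_snps(reference, sample))  # [(4, 'C', 'T'), (11, 'T', 'C')]
```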
Although the HGP has been completed, SNP studies continue in the International HapMap Project, whose goal is to identify patterns of SNP groups (called haplotypes, or “haps”). The DNA samples for the HapMap Project came from 270 individuals, including Yoruba people in Ibadan, Nigeria; Japanese in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource.
[Answer supplied by Dr. Marvin Stodolsky, U.S. DOE Office of Biological and Environmental Research, Office of Science]
Who sequenced the human genome?
Human Genome Project research was funded at many laboratories across the U.S. by the Department of Energy (DOE), the National Institutes of Health (NIH), or both. A list of the major U.S. Human Genome Project research sites can be found here.
Other researchers at numerous colleges, universities, and laboratories throughout the United States also have received DOE and NIH funding for human genome research. At any given time, the DOE Human Genome Project has funded about 100 principal investigators. For DOE-funded projects, see Research. To see a list of NIH-funded projects, visit the agency's grants database.
In addition, many large and small private U.S. companies are conducting genome research. For more on the genomics research partnership between the public and private sectors, see the Human Genome Project and the Private Sector Fact Sheet. At least 18 other countries have participated in the Human Genome Project. See the list.
How is DNA sequencing done?
Download a PDF illustration courtesy of the Department of Energy's Joint Genome Institute.
- Chromosomes, which range in size from 50 million to 250 million bases, must first be broken into much shorter pieces (subcloning step).
- Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps). See a figure depicting the sequencing reaction.
- The fragments in a set are separated by gel electrophoresis (separation step). New fluorescent dyes allow separation of all four fragments in a single lane on the gel.
See an example of an electropherogram using fluorescent dyes.
- The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step. Automated sequencers analyze the resulting electropherograms, and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases.
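The logic of the steps above (fragments that differ in length by one base, separation by size, then reading the final base of each) can be sketched in a few lines. This is a deliberately simplified model, not a simulation of the chemistry: it ignores the fact that the fragments are copies made from a template strand, and the sequence and function names are invented for illustration.

```python
import random


def sequencing_fragments(piece):
    """Model the sequencing reaction: one fragment ending at every position."""
    return [piece[:length] for length in range(1, len(piece) + 1)]


def call_bases(fragments):
    """Model separation and base calling.

    Sorting by length stands in for electrophoresis (shortest fragments
    travel fastest); reading the last base of each fragment stands in for
    detecting the fluorescent dye on its final base.
    """
    separated = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in separated)


# Made-up short piece for illustration only.
piece = "GATTACAGGCT"

fragments = sequencing_fragments(piece)
random.shuffle(fragments)  # the fragments start out mixed together

print(call_bases(fragments))  # GATTACAGGCT
```

Because every fragment length occurs exactly once, sorting by length and reading the final bases in order recovers the original piece regardless of how the fragments were mixed.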
After the bases are "read," computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics.
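One simple way to picture the assembly step is a greedy merge of reads that share overlapping ends. The sketch below is a toy under stated assumptions: the reads are error-free, only a few bases long rather than about 500, and all come from the same strand. The source does not say which assembly algorithm was used, and the function names are hypothetical.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_length` bases.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; the rest stay separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Made-up reads for illustration only.
reads = ["ATGGCGTGCA", "GTGCAATTCC", "ATTCCGGATA"]

print(greedy_assemble(reads))  # ['ATGGCGTGCAATTCCGGATA']
```

Reads that never overlap anything remain as separate contigs, which is one way to picture the gaps that finishing work later has to close.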
To read about the work researchers do to "finish" the raw sequence from automated sequencers, click here (and scroll to the section at the bottom that begins "Here are our definitions of . . . ").
Finished sequences are submitted to major public sequence databases, such as GenBank. Human Genome Project sequence data are thus freely available to anyone around the world.
In May 2006, Human Genome Project (HGP) researchers announced the completion of the DNA sequence for the last of the 24 human chromosomes. How does this differ from the finished human genome announced by HGP researchers in 2003?
The DNA sequences announced in 2003 were only rough drafts for each human chromosome. While this draft already has advanced medical research, more detail was needed. The draft genomic sequences can be compared broadly to a cross-country road excavated by a bulldozer that leaves behind many gaps across difficult terrain that will require bridges and other refinements.
So, too, with charting the landscape of the human genome. Researchers have now filled in the gaps and provided far more detail for each chromosome. Much of this was accomplished by comparing particular DNA sequences across populations in genomic areas that may have contained anomalies in the initial samples. For example, some DNA segments have proven unstable during the process of copying them (cloning) for use in sequencing machines. (See an example.) Correcting minor errors (estimated at 1 error in every 10,000 DNA subunits) and cataloging of mutations will continue for some time to come.
The entire collection of human chromosome DNA sequences is freely available to the worldwide research community.
For more details, see the Nature HG Collection.
What is the difference between draft sequence and finished sequence?
In generating the draft sequence (released in June 2000), scientists determined the order of base pairs in each chromosomal area at least 4 to 5 times (4x to 5x) to ensure data accuracy and to help with reassembling DNA fragments in their original order. This repeated sequencing is known as genome "depth of coverage." Draft sequence data are mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known.
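Depth of coverage is commonly summarized as the total number of bases read divided by the length of the region being sequenced. The sketch below applies that ratio using the 3-billion-base genome size and the read length of about 500 bases given elsewhere in this document; treating the genome as one uniformly covered region is a simplifying assumption, and the function names are invented for illustration.

```python
import math


def depth_of_coverage(num_reads, read_length, genome_length):
    """Average number of times each base is read."""
    return num_reads * read_length / genome_length


def reads_needed(target_depth, read_length, genome_length):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_length / read_length)


GENOME_LENGTH = 3_000_000_000  # bases, as quoted in this document
READ_LENGTH = 500              # bases, the approximate read length quoted in this document

for depth in (4, 5):
    count = reads_needed(depth, READ_LENGTH, GENOME_LENGTH)
    print(f"{depth}x coverage needs about {count:,} reads of {READ_LENGTH} bases")
```

Under these assumptions, draft-level coverage of 4x to 5x corresponds to roughly 24 to 30 million reads.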
To generate a high-quality reference sequence, completed in April 2003, additional sequencing was done to close gaps, reduce ambiguities, and allow for only a single error every 10,000 bases, the agreed-upon standard for the HGP. Investigators believe a high-quality sequence is critical for recognizing gene-regulatory components important in understanding human biology and disorders such as heart disease, cancer, and diabetes. The finished version provides an estimated 8x to 9x coverage of each chromosome.
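To give a sense of scale for the standard of one error per 10,000 bases, the sketch below converts it into the largest number of errors that standard would allow over a given stretch of sequence. The lengths are the genome and chromosome sizes quoted in this document; the function name is made up, and the result is an upper bound implied by the standard, not a measured error count.

```python
BASES_PER_ERROR = 10_000  # the HGP standard quoted above: at most 1 error per 10,000 bases


def max_errors_allowed(sequence_length, bases_per_error=BASES_PER_ERROR):
    """Largest whole number of errors the standard permits over `sequence_length` bases."""
    return sequence_length // bases_per_error


# Sizes quoted in this document.
examples = {
    "smallest chromosome (50 million bases)": 50_000_000,
    "largest chromosome (250 million bases)": 250_000_000,
    "whole genome (3 billion bases)": 3_000_000_000,
}

for label, length in examples.items():
    print(f"{label}: up to {max_errors_allowed(length):,} errors")
```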
What genomes have been sequenced completely?
The small genomes of several viruses and bacteria and the much larger genomes of three higher organisms have been completely sequenced; they are bakers' or brewers' yeast (Saccharomyces cerevisiae), the roundworm (Caenorhabditis elegans), and the fruit fly (Drosophila melanogaster). In October 2001, the draft sequence of the pufferfish Fugu rubripes, the first vertebrate after the human, was completed; and scientists finished the first genetic sequence of a plant, that of the weed Arabidopsis thaliana, in December 2000. Many more genome sequences have been completed since then.
For information on published and unpublished genomes, see Genomes Online Database (GOLD).
What nonhuman genome sequencing projects are supported by the U.S. Department of Energy?
A list of microbial genome sequencing projects supported by the U.S. Department of Energy Microbial Genome Program is available here.
What happens now that the human genome sequence is completed?
The working-draft DNA sequence and the more polished 2003 version represent an enormous achievement, akin in scientific importance, some say, to developing the periodic table of elements. And, as in most major scientific advances, much work remains to realize the full potential of the accomplishment.
Early explorations of the human genome, now joined by projects on the genomes of several other organisms, are generating data whose volume and complex analyses are unprecedented in biology. Genomic-scale technologies will be needed to study and compare entire genomes, sets of expressed RNAs or proteins, gene families from a large number of species, variation among individuals, and the classes of gene regulatory elements.
Deriving meaningful knowledge from DNA sequences will define biological research through the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists, among others. A sampling follows of some research challenges in genetics--what we still don't know, even with the full human DNA sequence in hand.
- Gene number, exact locations, and functions
- Gene regulation
- DNA sequence organization
- Chromosomal structure and organization
- Noncoding DNA types, amount, distribution, information content, and functions
- Coordination of gene expression, protein synthesis, and post-translational events
- Interaction of proteins in complex molecular machines
- Predicted vs experimentally determined gene function
- Evolutionary conservation among organisms
- Protein conservation (structure and function)
- Proteomes (total protein content and function) in organisms
- Correlation of SNPs (single-base DNA variations among individuals) with health and disease
- Disease-susceptibility prediction based on gene sequence variation
- Genes involved in complex traits and multigene diseases
- Complex systems biology, including microbial consortia useful for environmental restoration
- Developmental genetics, genomics