DNA sequence
A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology.
The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand (adenine, cytosine, guanine, and thymine) covalently linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. Short sequences of nucleotides are referred to as oligonucleotides and are used in a range of laboratory applications in molecular biology. With regard to biological function, a DNA sequence may be considered sense or antisense, and either coding or noncoding. DNA sequences can also contain "junk DNA."
Sequences can be derived from the biological raw material through a process called DNA sequencing.
In some special cases, letters besides A, T, C, and G are present in a sequence. These letters represent ambiguity: among all the molecules sampled, more than one kind of nucleotide occurs at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are as follows:[1]
- A = adenine
- C = cytosine
- G = guanine
- T = thymine
- R = G A (purine)
- Y = T C (pyrimidine)
- K = G T (keto)
- M = A C (amino)
- S = G C (strong bonds)
- W = A T (weak bonds)
- B = G T C (all but A)
- D = G A T (all but C)
- H = A C T (all but G)
- V = G C A (all but T)
- N = A G C T (any)
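The code table above translates directly into a lookup structure. The following is a minimal Python sketch, not a reference implementation; the names `IUPAC`, `matches`, and `expand` are chosen here for illustration only.

```python
from itertools import product

# IUPAC nucleotide codes from the list above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(pattern: str, sequence: str) -> bool:
    """Return True if an unambiguous sequence is compatible with an IUPAC pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), sequence.upper()))

def expand(pattern: str) -> list[str]:
    """List every unambiguous sequence that an IUPAC pattern can stand for."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern.upper()))]
```

For instance, `matches("RNY", "ATC")` is true because A is a purine and C is a pyrimidine, and `expand("WS")` enumerates the four sequences AC, AG, TC, and TG. Note that `expand` grows exponentially with the number of ambiguous positions, so it is only practical for short patterns.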
Gene identification:
4.1. Identification of Genes in a Genomic DNA Sequence
4.1.1. Prediction of protein-coding genes
Archaeal and bacterial genes typically comprise uninterrupted stretches of DNA between a start codon (usually ATG, but in a minority of genes, GTG, TTG, or CTG) and a stop codon (TAA, TGA, or TAG; alternative genetic codes of certain bacteria, such as mycoplasmas, have only two stop codons). Exceptions to this rule are rare and involve important mechanisms such as programmed frameshifts. There seem to be no strict limits on the length of the genes. Indeed, the gene rpmJ encoding the ribosomal protein L36 (Figure 2.1) is only 111 bp long in most bacteria, whereas the gene for B. subtilis polyketide synthase PksK is 13,343 bp long. In practice, mRNAs shorter than 30 codons are poorly translated, so protein-coding genes in prokaryotes are usually at least 100 bases in length. In prokaryotic genome-sequencing projects, open reading frames (ORFs) shorter than 100 bases are rarely taken into consideration, which does not seem to result in substantial underprediction. In contrast, in multicellular eukaryotes, most genes are interrupted by introns. The mean length of an exon is ~50 codons, but some exons are much shorter; many of the introns are extremely long, resulting in genes occupying up to several megabases of genomic DNA. This makes prediction of eukaryotic genes a far more complex (and still unsolved) problem than prediction of prokaryotic genes.
4.1.1.1. Prokaryotes
For most common purposes, a prokaryotic gene can be defined simply as the longest ORF for a given region of DNA. Translation of a DNA sequence in all six reading frames is a straightforward task, which can be performed online using, for example, the Translate tool on the ExPASy server (http://www.expasy.org/tools/dna.html) or the ORF Finder at NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html).
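The six-frame scan can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the algorithm used by the Translate tool or ORF Finder: an ORF is taken to run from the first start codon after the previous stop up to and including the next in-frame stop, the start and stop codon sets are those listed in section 4.1.1, and all function names are hypothetical. Coordinates reported for the "-" strand are relative to the reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STARTS = {"ATG", "GTG", "TTG", "CTG"}   # start codons named in the text
STOPS = {"TAA", "TGA", "TAG"}           # standard stop codons

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq: str, min_len: int = 100):
    """Yield (frame, start, end) for each ORF of at least min_len bases."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # first start after the last stop
            elif codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    yield frame, start, i + 3  # end is exclusive, stop included
                start = None

def find_orfs(seq: str, min_len: int = 100):
    """Scan all six reading frames; return (strand, frame, start, end) tuples."""
    seq = seq.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_len)]
    rc = reverse_complement(seq)
    found += [("-", f, s, e) for f, s, e in orfs_in_strand(rc, min_len)]
    return found

def longest_orf(seq: str, min_len: int = 100):
    """Return the longest ORF found, or None if there is none."""
    orfs = find_orfs(seq, min_len)
    return max(orfs, key=lambda o: o[3] - o[2]) if orfs else None
```

The default `min_len=100` mirrors the 100-base cutoff mentioned above; lowering it is useful only for toy sequences, for example `find_orfs("CCATGAAAAAAAAATAAG", min_len=12)`.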
Of course, this approach is oversimplified and may result in a certain number of incorrect gene predictions, although the error rate is rather low. Firstly, DNA sequencing errors may result in incorrectly assigned or missed start and/or stop codons, because of which a gene might be truncated, overextended, or missed altogether. Secondly, on rare occasions, among two overlapping ORFs (on the same or the opposite DNA strand), the shorter one might be the real gene. The existence of a long “shadow” ORF opposite a protein-coding sequence is more likely than in a random sequence because of the statistical properties of the coding regions. Indeed, consider the simple case where the first base in a codon is a purine and the third base is a pyrimidine (the RNY codon pattern). The mirror frame in the complementary strand would follow the same pattern, resulting in a deficit of stop codons [235].
Figure 4.1 shows the ORFs of at least 100 bp located in a 10-kb fragment of the E. coli genome (from 3435250 to 3445250) that encodes potassium transport protein TrkA, mechanosensitive channel MscL, transcriptional regulator YhdM, RNA polymerase alpha subunit RpoA, preprotein translocase subunit SecY, and ribosomal proteins RplQ (L17), RpsD (S4), RpsK (S11), RpsM (S13), RpmJ (L36), RplO (L15), RpmD (L30), RpsE (S5), RplR (L18), RplF (L6), RpsH (S8), RpsN (S14), RplE (L5), and RplX (L24). Although the two ORFs in frame +1 (top line, on the right) are longer (207 aa and 185 aa) than the ORFs in frame −3 (bottom line, 117 aa, 177 aa, 130 aa, and 101 aa), it is the latter that encode real proteins, namely the ribosomal proteins RplR, RplF, RpsH, and RpsN.
Because of these complications, it is always desirable to have some additional evidence that a particular ORF actually encodes a protein. Such evidence comes from several independent lines and can be obtained using various methods, for example:
- The ORF in question encodes a protein that is similar to previously described ones (search the protein database for homologs of the given sequence).
- The ORF has a typical GC content, codon frequency, or oligonucleotide composition (calculate the codon bias and/or other statistical features of the sequence, compare to those for known protein-coding genes from the same organism).
- The ORF is preceded by a typical ribosome-binding site (search for a Shine-Dalgarno sequence in front of the predicted coding sequence).
- The ORF is preceded by a typical promoter (if consensus promoter sequences for the given organism are known, check for the presence of a similar upstream region).
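Two of the checks in this list, GC content and the presence of a ribosome-binding site, are simple enough to sketch in Python. The sketch below rests on assumptions that are not taken from the text: it uses AGGAGG as the Shine-Dalgarno consensus, searches a 20-base upstream window, and accepts any four consecutive consensus bases as a hit. Real gene finders use organism-specific statistical models instead; the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def has_shine_dalgarno(genome: str, orf_start: int, window: int = 20,
                       motif: str = "AGGAGG", min_match: int = 4) -> bool:
    """Check the region upstream of an ORF for part of an assumed SD consensus.

    A hit is any run of min_match consecutive bases of motif found within
    window bases upstream of orf_start (0-based index of the start codon).
    """
    upstream = genome[max(0, orf_start - window):orf_start].upper()
    cores = {motif[i:i + min_match] for i in range(len(motif) - min_match + 1)}
    return any(core in upstream for core in cores)
```

The GC content of a candidate ORF would then be compared with the value computed for known protein-coding genes of the same organism; the sketch leaves that comparison, and the choice of a threshold, to the user.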
The most reliable of these approaches is a database search for homologs. In several useful tools, DNA translation is seamlessly bound to the database searches. In the ORF finder, for example, the user can submit the translated sequence for a BLASTP or TBLASTN (see 4.4) search against the NCBI sequence databases. In addition, there is an opportunity to compare the translated sequence to the COG database (see 3.4). A largely similar Analysis and Annotation Tool (http://genome.cs.mtu.edu/aat.html), developed by Xiaoqiu Huang at Michigan Tech [361], also compares the translated protein sequences to nr and SWISS-PROT; in addition, it checks them against two cDNA databases, the dbEST at the NCBI and Human Gene Index at TIGR.
Other methods take advantage of the statistical properties of the coding sequences. For organisms with highly biased GC content, for example, the third position in each codon has a highly biased (very high or very low) frequency of G and C. FramePlot, a program that exploits this skew for gene recognition [380], is available at the Japanese Institute of Infectious Diseases (http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl) and at the TIGR web site (http://tigrblast.tigr.org/cmr-blast/GC_Skew.cgi). The most useful and popular gene prediction programs, such as GeneMark and Glimmer (see 3.1.2), build Markov models of the known coding regions for the given organism and then employ them to estimate the coding potential of uncharacterized ORFs.
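The third-position GC signal can be illustrated with a few lines of Python. This is a sketch of the underlying statistic only and is not FramePlot's implementation: it computes one value per frame for the whole input, whereas a practical tool would presumably evaluate the statistic in windows along the genome. The function name is hypothetical.

```python
def gc3_by_frame(seq: str) -> list[float]:
    """GC frequency at the third codon position for each of the three frames.

    Element k of the result is the fraction of G or C among the bases at
    positions k+2, k+5, k+8, ... (0-based), i.e. the third codon positions
    of the reading frame that starts at offset k.
    """
    seq = seq.upper()
    result = []
    for frame in range(3):
        third = seq[frame + 2::3]
        gc = sum(base in "GC" for base in third)
        result.append(gc / len(third) if third else 0.0)
    return result
```

In a genome with highly biased GC content, the frame whose value is most extreme (highest in a GC-rich genome, lowest in an AT-rich one) is the likeliest coding frame for that stretch of DNA.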
Inferring genes based on the coding potential and on the similarity of the encoded protein sequences to those of other proteins represent the intrinsic and extrinsic approaches to gene prediction [110], which ideally should be combined. Two programs that implement such a combination, developed specifically for analysis of prokaryotic genomes, are ORPHEUS (http://pedant.gsf.de/orpheus [249]) and CRITICA ([67], source code at http://www.math.uwaterloo.ca/~jhbadger/). Several other algorithms that incorporate both these approaches are aimed primarily at eukaryotic genomes and are discussed further in this section.
4.1.1.2. Unicellular eukaryotes
Genomes of unicellular eukaryotes are extremely diverse in size, in the proportion of the genome that is occupied by protein-encoding genes, and in the frequency of introns. Clearly, the smaller the intergenic regions and the fewer introns there are, the easier it is to identify genes. Fortunately, genomes of at least some simple eukaryotes are quite compact and contain very few introns. Thus, in the yeast S. cerevisiae, at least 67% of the genome is protein-coding, and only 233 genes (less than 4% of the total) appear to have introns [660]. Although these include some biologically important and extensively studied genes, e.g. those for aminopeptidase APE2, ubiquitin-protein ligase UBC8, subunit 1 of the mitochondrial cytochrome oxidase COX1, and many ribosomal proteins, introns comprise less than 1% of the yeast genome. The tiny genome of the intracellular eukaryotic parasite Encephalitozoon cuniculi appears to contain introns in only 12 genes and is practically prokaryote-like in terms of the “wall-to-wall” gene arrangement [425]. The malaria parasite Plasmodium falciparum is a more complex case, with ~43% of the genes located on chromosome 2 containing one or more introns [272]. Protists with larger genomes often have fairly high intron density. In the slime mold Physarum polycephalum, for example, the average gene has 3.7 introns [851]. Given that the average exon size in this organism (165 ± 85 bp) is comparable to the length of an average intron (138 ± 103 bp), homology-based prediction of genes becomes increasingly complicated.
Because of this genome diversity, there is no single way to efficiently predict protein-coding genes in different unicellular eukaryotes. For some of them, such as yeast, gene prediction can be done by using more or less the same approaches that are routinely employed in prokaryotic genome analysis. For those with intron-rich genomes, the gene model has to include information on the intron splice sites, which can be gained from a comparison of the genomic sequence against a set of ESTs from the same organism. This necessitates creating a comprehensive library of ESTs that have to be sequenced in a separate project. Such dual EST/genomic sequencing projects are currently under way for several unicellular eukaryotes (see Appendix 2).
4.1.1.3. Multicellular eukaryotes
In most multicellular eukaryotes, gene organization is so complex that gene identification poses a major problem. Indeed, eukaryotic genes are often separated by large intergenic regions, and the genes themselves contain numerous introns, many of them long. Figure 4.2 shows a typical distribution of exons and introns in a human gene, the X chromosome-located gene encoding iduronate 2-sulfatase (IDS_HUMAN), a lysosomal enzyme responsible for removing sulfate groups from heparan sulfate and dermatan sulfate. Mutations causing iduronate sulfatase deficiency result in the lysosomal accumulation of these glycosaminoglycans, clinically known as Hunter's syndrome or type II mucopolysaccharidosis (OMIM entry 309900) [896]. A number of clinical cases have been shown to result from aberrant alternative splicing of this gene's mRNA, which emphasizes the importance of reliable prediction of gene structure [631].
Obviously, the coding regions compose only a minor portion of the gene. In this case, positions of the exons could be unequivocally determined by mapping the cDNA sequence (i.e. iduronate sulfatase mRNA) back to the chromosomal DNA. Because of the clinical phenotype of the mutations in the iduronate sulfatase gene, we already know the “correct” mRNA sequence and can identify various alternatively spliced variants as mutations. However, for many, perhaps the majority of the human genes, multiple alternative forms are part of the regular expression pattern [118,576], and correct gene prediction ideally should identify all of these forms, which immensely complicates the task.
Ideally, gene prediction should identify all exons and introns, including those in the 5′-untranslated region (5′-UTR) and the 3′-UTR of the mRNA, in order to precisely reconstruct the predominant mRNA species. For practical purposes, however, it is useful to assemble at least the coding exons correctly because this allows one to deduce the protein sequence.
Correct identification of the exon boundaries relies on the recognition of the splice sites, which is facilitated by the fact that the great majority of splice sites conform to consensus sequences that include two nearly invariant dinucleotides at the ends of each intron, a GT at the 5′ end and an AG at the 3′ end. Non-canonical splice signals are rare and come in several variants [329,582]. In the 5′ splice sites, the GC dinucleotide is sometimes found instead of GT. The second class of exceptions to the splice site consensus includes so-called “AT-AC” introns that have the highly conserved /(A,G)TATCCT(C,T) sequence at their 5′ sites. There are additional variants of non-canonical splice signals, which further complicate prediction of the gene structure.
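The dinucleotide rules described above can be expressed as a small classifier. The Python sketch below checks only the two terminal dinucleotides of a candidate intron; it does not model the longer consensus sequences or the /(A,G)TATCCT(C,T) signal, and the function name and category labels are chosen here for illustration.

```python
def classify_intron(intron: str) -> str:
    """Classify a candidate intron by its first and last two bases."""
    intron = intron.upper()
    if len(intron) < 4:
        return "too short"
    donor, acceptor = intron[:2], intron[-2:]   # 5' end and 3' end of the intron
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "non-canonical GC-AG"
    if (donor, acceptor) == ("AT", "AC"):
        return "non-canonical AT-AC"
    return "other"
```

Because GT and AG occur by chance roughly once every 16 bases each, a match says little on its own; the check is useful mainly as a filter on exon boundaries proposed by other evidence, such as an EST or cDNA alignment.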
The available assessments of the quality of eukaryotic gene prediction achieved by different programs show a rather gloomy picture of numerous errors in exon/intron recognition. Even the best tools correctly predict only ~40% of the genes [697]. The most serious errors come from genes with long introns, which may be predicted as intragenic sequences, resulting in erroneous gene fission, and pairs of genes with short intergenic regions, which may be predicted as introns, resulting in false gene fusion. Nevertheless, most of the popular gene prediction programs discussed in the next section show reasonable performance in predicting the coding regions in the sense that, even if a small exon is missed or overpredicted, the majority of exons are identified correctly.
Another important parameter that can affect ORF prediction is the fraction of sequencing errors in the analyzed sequence. Indeed, including frameshift corrections was found to substantially improve the overall quality of gene prediction [133]. Several algorithms were described that could detect frameshift errors based on the statistical properties of coding sequences [224]. On the other hand, error correction techniques should be used with caution because eukaryotic genomes contain numerous pseudogenes, and uncritical frameshift correction runs the risk of wrongly “rescuing” pseudogenes. The problem of discriminating between pseudogenes and frameshift errors is actually quite complex and will likely be solved only through whole-genome alignments of different species or, in certain cases, by direct experimentation, e.g., expression of the gene(s) in question.
SNPs and applications
In molecular biology and bioinformatics, a SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome. For example, there are around 10 million SNPs that have been identified in the human genome[1]. As SNPs are highly conserved throughout evolution and within a population, the map of SNPs serves as an excellent genotypic marker for research.
Principles
The basic principles of the SNP array are the same as those of the DNA microarray: the convergence of DNA hybridization, fluorescence microscopy, and solid-surface DNA capture. The three mandatory components of SNP arrays are:
- The array that contains immobilized nucleic acid sequences or target;
- One or more labeled allele-specific oligonucleotide (ASO) probes;
- A detection system that records and interprets the hybridization signal.
To achieve relative concentration independence and minimal cross-hybridization, raw sequences and SNPs from multiple databases are scanned to design the probes. Each SNP on the array is interrogated with different probes. The number of SNPs placed on an array is chosen according to the purpose of the experiment.
Applications
An SNP array is a useful tool to study the whole genome. The most important application of the SNP array is in determining disease susceptibility and, consequently, in pharmacogenomics, by measuring the efficacy of drug therapies specifically for the individual. As each individual has many single nucleotide polymorphisms that together create a unique DNA sequence, SNP-based genetic linkage analysis can be performed to map disease loci and hence determine disease susceptibility genes for an individual. The combination of SNP maps and high-density SNP arrays allows SNPs to be used efficiently as markers for Mendelian diseases with complex traits. For example, whole-genome genetic linkage analysis shows significant linkage for many diseases such as rheumatoid arthritis, prostate cancer, and neonatal diabetes. As a result, drugs can be designed to act efficiently on a group of individuals who share a common allele, or even on a single individual. An SNP array can also be used to generate a virtual karyotype, using specialized software to determine the copy number of each SNP on the array and then align the SNPs in chromosomal order.
In addition, SNP arrays can be used for studying loss of heterozygosity (LOH). LOH is a form of allelic imbalance that can result from the complete loss of an allele or from an increase in copy number of one allele relative to the other. While other chip-based methods (e.g. comparative genomic hybridization) can detect only genomic gains or deletions, the SNP array has the additional advantage of detecting copy-number-neutral LOH due to uniparental disomy (UPD). In UPD, one allele or whole chromosome from one parent is missing, leading to reduplication of the other parental allele (uni-parental = from one parent, disomy = duplicated). In a disease setting this occurrence may be pathologic when the wild-type allele (e.g. from the mother) is missing and two copies of the mutant allele (e.g. from the father) are present instead. Using high-density SNP arrays to detect LOH allows identification of patterns of allelic imbalance with potential prognostic and diagnostic utility. This use of SNP arrays has huge potential in cancer diagnostics, as LOH is a prominent characteristic of most human cancers. Recent studies based on SNP array technology have shown that not only solid tumors (e.g. gastric cancer, liver cancer, etc.) but also hematologic malignancies (ALL, MDS, CML, etc.) have a high rate of LOH due to genomic deletions or UPD and genomic gains. The results of these studies may help to gain insights into the mechanisms of these diseases and to create targeted drugs.
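The distinction between LOH caused by deletion and copy-number-neutral LOH can be made concrete with a toy Python sketch. Everything about the data layout is an assumption made for illustration: genotype calls are strings of allele letters such as "AB" or "AA", a paired normal sample is available for each tumor sample, and the tumor copy number at the SNP has already been estimated. Actual array software works from probe intensities and statistical models rather than from clean calls like these.

```python
def classify_loh(normal_gt: str, tumor_gt: str, tumor_copy_number: int) -> str:
    """Classify one SNP from paired normal/tumor genotype calls.

    normal_gt, tumor_gt: allele calls such as "AB", "AA", or "A" (hypothetical format).
    tumor_copy_number: estimated number of copies of the locus in the tumor.
    """
    if len(set(normal_gt)) < 2:
        return "uninformative"            # homozygous in normal: LOH cannot be seen
    if len(set(tumor_gt)) > 1:
        return "heterozygosity retained"
    if tumor_copy_number < 2:
        return "LOH (deletion)"
    if tumor_copy_number == 2:
        return "copy-neutral LOH"         # consistent with UPD
    return "LOH with copy-number gain"
```

For example, `classify_loh("AB", "AA", 2)` reports copy-neutral LOH, the case that methods sensitive only to gains and deletions would miss, while `classify_loh("AB", "A", 1)` reports LOH by deletion.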