From EST to SSR Marker

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.

Bioinformatics and Biological Databases

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a central bioinformatics concern was the creation and maintenance of databases to store biological information, such as nucleotide and amino acid sequences. Developing this type of database involved not only design issues but also the development of complex interfaces through which researchers could both access existing data and submit new or revised data. Bioinformatics also includes the software required for the detailed analysis of genes and proteins, for example analysis of gene sequences for restriction sites, regulatory elements, and open reading frames; comparison with genes from other sources; design of primers for PCR and hybridization studies; construction of three-dimensional models of the encoded proteins; and delineation of functional domains. Bioinformatics is the essential means of analysing and interpreting genomes, and it provides the analysis machinery for deriving results from genomics data. As genomics data and approaches grow in importance and size, bioinformatics is playing an increasingly important and central role in biology research.

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and often, literature citations associated with the sequence.
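To make the idea of "many records, each with the same set of information" concrete, here is a minimal sketch in Python. The class name SequenceRecord, its field names, the helper find_by_organism, and the two example records are all hypothetical and chosen only for illustration; real databases such as GenBank use much richer formats.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """One entry in a toy nucleotide sequence database (hypothetical layout)."""
    contact_name: str
    sequence: str
    molecule_type: str          # e.g. "mRNA" or "genomic DNA"
    organism: str               # scientific name of the source organism
    citations: List[str] = field(default_factory=list)


def find_by_organism(records, organism):
    """Return every record whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Made-up example records, not real database entries.
records = [
    SequenceRecord("A. Researcher", "ATGGCGTACGTTAGC", "mRNA", "Oryza sativa"),
    SequenceRecord("B. Researcher", "TTGACCGGTAAC", "genomic DNA", "Zea mays",
                   ["Hypothetical citation"]),
]
rice_records = find_by_organism(records, "oryza sativa")
```

Because every record carries the same fields, a query is just a filter over the list; a real database adds indexing and persistent storage on top of the same idea.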

EST database and SSR

ESTs: Gene Discovery Made Easier

Expressed Sequence Tags (ESTs) are short cDNA sequences that serve to "tag" the gene from which the messenger RNA (mRNA) originated and that can serve multiple important uses. Typically, anonymous ESTs are single-pass sequenced to yield a 200–700 bp sequence that can be used to search DNA and protein databases for similar genes (Adams et al. 1991). The Institute for Genomic Research (TIGR) is one of the main producers of new sequence data, along with the other major human genome sequencing centers and commercial enterprises such as Celera. TIGR's main sequencing projects have been in microbial and crop genomes, and human chromosome 16. TIGR also maintains many genome-specific databases focused on expressed sequence tags rather than complete genomic data.

Investigators are working diligently to sequence and assemble the genomes of various organisms, including the mouse and human, for a number of important reasons. Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression. Once we begin to understand where and how a gene is expressed under normal circumstances, we can then study what happens in an altered state, such as in disease. To accomplish the latter goal, however, researchers must identify and study the protein, or proteins, coded for by a gene. As one can imagine, finding a gene that codes for a protein, or proteins, is not easy. An Expressed Sequence Tag is a tiny portion of an entire gene that can be used to help identify unknown genes and to map their positions within a genome. ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps.

The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene. Working with cDNA sidesteps the intron problem. cDNA is a form of DNA prepared in the laboratory using an enzyme called reverse transcriptase. cDNA production is the reverse of the usual process of transcription in cells because the procedure uses mRNA as a template rather than DNA. cDNA is much more stable than mRNA and, importantly, because it is generated from an mRNA in which the introns have been removed, it contains only expressed DNA sequences, or exons. The mRNAs in a cell are copies of the genes that are being expressed, so ESTs, which are generated by sequencing cDNA, sample exactly those expressed genes.

Once cDNA representing an expressed gene has been isolated, scientists can sequence a few hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces what is called a 5' EST. A 5' EST is obtained from the portion of a transcript that usually codes for a protein. These regions tend to be conserved across species and do not change much within a gene family. Sequencing the ending portion of the cDNA molecule produces what is called a 3' EST. Because these ESTs are generated from the 3' end of a transcript, they are likely to fall within non-coding, or untranslated, regions (UTRs), and therefore tend to exhibit less cross-species conservation than do coding sequences. A UTR is that part of a gene that is not translated into protein.

Tools for Gene Mapping and Discovery

ESTs as Genome Landmarks

Just as a person driving a car may need a map to find a destination, scientists searching for genes also need genome maps to help them navigate through the billions of nucleotides that make up the human genome. For a map to make navigational sense, it must include reliable landmarks or "markers". Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on Sequence Tagged Site (STS) mapping. An STS is a short DNA sequence that is easily recognizable and occurs only once in a genome (or chromosome). The 3' ESTs serve as a common source of STSs because of their likelihood of being unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene.

ESTs as Gene Discovery Resources

ESTs are powerful tools in the hunt for known genes because they greatly reduce the time required to locate a gene. Because ESTs represent a copy of just the interesting part of a genome, that which is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have a number of practical advantages: their sequences can be generated rapidly and inexpensively, only one sequencing experiment is needed for each cDNA generated, and they do not have to be checked for sequencing errors because mistakes do not prevent identification of the gene from which the EST was derived.

To find a disease gene using this approach, scientists first use observable biological clues to identify ESTs that may correspond to disease gene candidates. Scientists then examine the DNA of disease patients for mutations in one or more of these candidate genes to confirm gene identity. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases. It is easy to see why ESTs will pave the way to new horizons in genetic research.

ESTs and NCBI

Because of their utility, the speed with which they may be generated, and the low cost associated with this technology, many individual scientists as well as large genome sequencing centers have been generating hundreds of thousands of ESTs for public use. Once an EST was generated, scientists submitted the tag to GenBank, the NIH sequence database operated by NCBI. With the rapid submission of so many ESTs, it became difficult to identify a sequence that had already been deposited in the database. It became increasingly apparent to NCBI investigators that if ESTs were to be easily accessed and useful as gene discovery tools, they needed to be organized in a searchable database that also provided access to other genome data. Therefore, in 1992, scientists at NCBI developed a new database designed to serve as a collection point for ESTs. Once an EST submitted to GenBank had been screened and annotated, it was deposited in this new database, called dbEST.

dbEST: A Descriptive Catalog of ESTs

Scientists at NCBI created dbEST to organize, store, and provide access to the great mass of public EST data that has already accumulated and that continues to grow daily. As of release 120106 (December 1, 2006), NCBI's dbEST database contained 39,826,554 ESTs. Using dbEST, a scientist can access not only data on human ESTs but information on ESTs from over 300 other organisms as well.

Scientists at NCBI annotate EST records with text information regarding DNA and mRNA homologies. Whenever possible, they annotate the EST record with any known information. For example, if an EST matches a DNA sequence that codes for a known gene with a known function, that gene's name and function are placed on the EST record.

The availability of large EST databases has led to a revolution in the way new genes are identified. Mining these databases using known protein sequences as queries is a powerful technique for discovering orthologous and paralogous genes. The problem of high redundancy in EST databases is now well understood. A powerful way to manage this redundancy is to assemble clusters of ESTs representing the same message into longer virtual cDNA sequences. There are several advantages to working with assemblies rather than individual ESTs: first, there are fewer sequences to analyse; second, the assembled sequences are longer and potentially contain more interpretable coding sequence than their individual component ESTs; third, sequencing errors present in individual ESTs may be corrected during the assembly process; fourth, the virtual cDNA sequences may extend to the 5' end of the mRNA, greatly facilitating cloning of the gene in the laboratory.
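The core idea behind assembling overlapping ESTs into a longer virtual cDNA can be sketched in a few lines of Python. This is only a toy illustration that assumes exact, error-free overlaps between two reads on the same strand; the function names, the min_overlap parameter, and the example sequences are made up here, and real assemblers tolerate sequencing errors and do far more.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left, right, min_overlap=3):
    """Merge two reads into one longer sequence, or return None if they
    do not overlap (exact matching only, an assumption of this sketch)."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two made-up EST fragments that share the four bases "TACG".
contig = merge_pair("ATGGCGTACG", "TACGTTAGC")
```

Here contig is longer than either input read, which is the first advantage listed above; correcting sequencing errors would additionally require aligning several reads over the same region and taking a consensus.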

Important DNA assembly programs are:

Phrap (Phil Green 1999)

CAP3 (Huang and Madan 1999)

Simple Sequence Repeat (SSR)

In the past few years, expressed sequence tag (EST) projects on plant species have generated a vast amount of publicly available sequence data that can be mined for simple sequence repeats (SSRs). However, these EST projects have largely focused on crop or otherwise economically important plants, although EST data from other plants are emerging, and so far only a few studies have been published on the use of intragenic SSRs in natural plant populations. EST-SSRs are useful as molecular markers in plant genetic and evolutionary studies because (i) they represent transcribed genes, (ii) a putative function can often be deduced by a homology search, and (iii) since they are derived from transcripts, they are useful for assaying functional diversity in natural populations. Another important feature of EST-SSR markers is their expected higher level of transferability to related species compared with genomic SSR markers. Several studies have now demonstrated not only high rates of transferability within a genus but also transferability to other closely related genera, which is very promising for comparative mapping and genomic investigations of natural populations.

Important SSR detecting tools are:

Tandem Repeats Finder (Benson 1999)

Spectral Repeat Finder (Sharma et al. 2004)

Simple Sequence Repeat Finder (Sreenu et al. 2003)

SSRIT (Ramesh et al. 2001)

TROLL (Martins et al. 2006)

SSR Primer (Robinson et al. 2004)

MISA (Thiel et al. 2003)

TRA (Bilgen et al. 2004)

E-TRA (Karaca et al. 2005)

SSRScaner (Anwer et al. 2006)

REPuter (Kurtz et al. 2001)

STRING (Parisi et al. 2003)
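To show what mining an EST for SSRs means in practice, here is a minimal Python sketch that finds perfect di-, tri-, and tetranucleotide repeats with a regular expression. It is not the algorithm of any tool listed above; the function names, the minimum repeat counts (6, 5, and 4), and the test sequence are assumptions chosen only for illustration.

```python
import re


def is_compound_unit(unit):
    """True if `unit` is itself a repeat of a shorter unit (AA, ATAT, ...)."""
    for size in range(1, len(unit)):
        if len(unit) % size == 0 and unit[:size] * (len(unit) // size) == unit:
            return True
    return False


def find_ssrs(sequence, min_repeats=None):
    """Return sorted (start, unit, repeat_count) tuples for perfect SSRs.

    `min_repeats` maps unit length to the minimum number of repeats;
    the default thresholds are assumed values, not a published standard.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    sequence = sequence.upper()
    hits = []
    for unit_len, min_count in sorted(min_repeats.items()):
        # One captured unit followed by at least (min_count - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_count - 1))
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            if is_compound_unit(unit):
                continue  # avoid reporting (AT)n again as (ATAT)n
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Made-up sequence containing (AT)7 and (CAG)5.
example = "GGC" + "AT" * 7 + "CCT" + "CAG" * 5 + "TT"
ssrs = find_ssrs(example)
```

The sketch reports only perfect repeats; detecting imperfect or compound SSRs, and designing primers in the flanking sequence, is where the dedicated tools differ from one another.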


Thursday, September 20, 2007

Zinc Finger Binding Protein

What is a zinc finger?

A zinc finger protein is a protein that contains at least one zinc finger.

A zinc finger is a small, functional, independently folded domain that requires coordination of one or more zinc ions to stabilize its structure. These proteins use zinc ions to fold properly into "Zn fingers". Using a series of these fingers, a transcription factor can recognize a specific DNA sequence, so correct folding of the zinc fingers is important.

In eukaryotes, complex sets of regulatory elements often control the initiation of transcription of structural genes. Upstream of the RNA polymerase II initiation site there are different combinations of specific DNA sequences, each of which is recognized by a corresponding site-specific DNA-binding protein. These proteins are called transcription factors.

Transcription factors have two functionally different domains, one that binds to specific DNA sequences and another that activates transcription. NMR methods have recently been used to determine the 3D structures of several DNA-binding motifs: zinc fingers, leucine zippers, and helix-turn-helix motifs.


The zinc finger is one of the most abundant DNA-binding motifs. Proteins may contain more than one finger in a single chain; each motif consists of 2 anti-parallel beta-strands followed by an alpha-helix. A single zinc ion is tetrahedrally coordinated by conserved histidine and cysteine residues, stabilising the motif.

Zinc-finger-containing proteins constitute the most abundant protein superfamily in the mammalian genome, and are best known as transcriptional regulators. They are involved in a variety of cellular activities such as development, differentiation, and tumor suppression. The first zinc finger domain to be identified, in the Xenopus laevis basal transcription factor TFIIIA (Miller et al. 1985), is the archetype for the most common form of zinc finger domain, the C2H2 domain. The three-dimensional structure of the basic C2H2 zinc finger is a small domain composed of a β-hairpin followed by an α-helix held in place by a zinc ion. Zinc fingers generally occur as tandem arrays, and in DNA-binding modules the number of sequential fingers determines specific binding to different DNA regions. One zinc finger binds the major groove of the double helix and interacts with 3 bp, and the minimal number of fingers required for specific DNA binding is two. One of the best characterized families of DNA-binding zinc fingers is the Sp/Krüppel-like factor family. Members of this family share three highly conserved C2H2-type fingers in their C-terminal ends, combined with transcriptional activator or repressor domains in the N terminus. Other families of DNA-binding zinc fingers differ from the C2H2-type basic module in the spacing and nature of their zinc-chelating residues (cysteine–histidine or cysteine–cysteine). Additional families of zinc finger domains have been implicated in protein–protein interactions and lipid binding.

Zinc fingers are among the most common structural motifs in the proteome predicted from the genome sequences of Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans (Rubin et al. 2000) as well as the draft human genomic sequences.

Zinc finger domains are not only among the most abundant domains in eukaryotic genomes but are also one of the best examples of protein structure modularity. The abundance of zinc finger proteins in eukaryotic transcriptomes is believed to be a consequence of the high structural stability of the zinc-binding domains and the redox stability of the zinc ion under the ambient reducing conditions in a cell. These features make this domain a perfect structure for the formation of protein–protein and protein–nucleic acid complexes.

ZnF domains are often found in clusters, where fingers can have different binding specificities. There are many superfamilies of ZnF motifs, varying in both sequence and structure. They display considerable versatility in binding modes, even between members of the same class (e.g. some bind DNA, others protein), suggesting that ZnF motifs are stable scaffolds that have evolved specialised functions. For example, ZnF-containing proteins function in gene transcription, translation, mRNA trafficking, cytoskeleton organisation, epithelial development, cell adhesion, protein folding, chromatin remodelling and zinc sensing. Zinc-binding motifs are stable structures, and they rarely undergo conformational changes upon binding their target.

● The Cys2His2 zinc finger is one of the most common DNA-binding motifs in Eukaryota. A simple mode of DNA recognition by the Cys2His2 zinc finger domain provides an ideal scaffold for designing proteins with novel sequence specificities.

● Zinc-finger-containing proteins can be classified into evolutionary and functionally divergent protein families that share one or more domains in which a zinc ion is tetrahedrally coordinated by cysteines and histidines.

● The zinc finger domain defines one of the largest protein superfamilies in mammalian genomes; 46 different conserved zinc finger domains are listed in InterPro. Zinc finger proteins can bind to DNA, RNA, other proteins, or lipids as a modular domain in combination with other conserved structures.

The zinc finger is a sequence motif involved in binding of DNA:

The C2H2 class

Xaa - nonspecific amino acid.

This motif was first discovered in TFIIIA (an RNA polymerase III associated transcription factor) isolated from Xenopus laevis (African clawed toad). In TFIIIA, this sequence is repeated nine times in the protein. Each repeat can coordinate a zinc ion with the two cysteines and two histidines:

       =      =
      =        =
       =      =
      CYS    HIS
       = \  / =
       =  Zn   =
      CYS/  \ =
       =     HIS
       =      =

The twelve residues between the cysteine and histidine loop out to form a DNA binding interface.
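A sequence motif of this kind can be searched for with a regular expression. The sketch below assumes the commonly cited C2H2 consensus Cys-X(2-4)-Cys-X(12)-His-X(3-5)-His, where X is any amino acid; only the twelve-residue spacing between the cysteine pair and the histidine pair comes from the text above, and the other spacings, the function name, and the test peptide are assumptions made for illustration.

```python
import re

# Assumed consensus: C-X(2-4)-C-X(12)-H-X(3-5)-H, with "." standing for Xaa.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_fingers(protein):
    """Return (start, matched_text) for each non-overlapping C2H2-like match."""
    return [(m.start(), m.group(0))
            for m in C2H2_PATTERN.finditer(protein.upper())]


# Made-up peptide built to fit the pattern; not a real protein sequence.
peptide = "MK" + "C" + "PE" + "C" + "GKAFSRSDELTR" + "H" + "QRT" + "H" + "TG"
fingers = find_c2h2_fingers(peptide)
```

A pattern match only says that the spacing of cysteines and histidines is compatible with a C2H2 finger; it does not show that the region actually folds around a zinc ion or binds DNA.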

The Cx class

The Cx class of zinc fingers has a variable number of cysteines that can chelate a Zn ion. These fingers are also involved in DNA binding, as in the GAL4 protein (a yeast transcription factor involved in galactose metabolism). The cysteines are closely spaced and can vary from 4 to 6 in number.

       =      =
      =        =
       =      =
      CYS    CYS
       = \  / =
       =  Zn   =
      CYS/  \ =
       =     CYS
       =      =

Several different ZnF motifs have been characterised, and they vary in structure as well as in binding modes and affinities. ZnF motifs can coordinate one or more zinc atoms. Most ZnF proteins contain multiple finger-like protrusions that make tandem contacts with their target molecule, often recognising extended substrates. A few of the most common structurally defined ZnF motifs are described below.

Classical (C2H2) ZnF motifs

These motifs contain a short beta hairpin and an alpha helix (beta/beta/alpha structure), where a single zinc atom is held in place by Cys(2)His(2) (C2H2), Cys(2)HisCys (C2HC), or Cys(3)His (CCCH) residues. These are the most common DNA-binding motifs found in eukaryotic transcription factors. Transcription factors usually contain several zinc fingers (each with a conserved beta/beta/alpha structure) capable of making multiple contacts along the DNA.

GATA-type ZnF motifs

These motifs constitute type IV ZnFs with the general sequence C-X(2)-C-X(17-20)-C-X(2)-C, followed by a highly basic region. They can be subdivided into subgroups depending upon the length of the internal loop: type IVa have a 17-residue loop (CX2CX17CX2C), while type IVb have an 18-residue loop (CX2CX18CX2C). ZnF motifs with 19 or 20-residue loops are rare and found mainly in fungi. GATA factors play essential roles in development, differentiation and control of cell growth in eukaryotes. GATA proteins often contain more than one ZnF domain, where one domain binds DNA and the other modulates DNA binding, often by binding other factors.
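The general sequence C-X(2)-C-X(17-20)-C-X(2)-C translates directly into a regular expression, and the captured loop length gives the subtype. The following Python sketch is a simple illustration under that reading; the function name and the synthetic test sequences are made up, and the check for the highly basic region that follows the motif is deliberately left out.

```python
import re

# C-X(2)-C-X(17-20)-C-X(2)-C; the internal loop is captured to measure it.
GATA_PATTERN = re.compile(r"C.{2}C(.{17,20})C.{2}C")


def classify_gata_fingers(protein):
    """Return (start, loop_length, subtype) for each GATA-like match."""
    results = []
    for match in GATA_PATTERN.finditer(protein.upper()):
        loop_length = len(match.group(1))
        if loop_length == 17:
            subtype = "IVa"
        elif loop_length == 18:
            subtype = "IVb"
        else:
            subtype = "rare (19-20 residue loop)"
        results.append((match.start(), loop_length, subtype))
    return results


# Synthetic sequences with 17- and 18-residue loops; not real proteins.
type_a = "MK" + "CAAC" + "A" * 17 + "CAAC" + "GG"
type_b = "MK" + "CAAC" + "A" * 18 + "CAAC" + "GG"
result_a = classify_gata_fingers(type_a)
result_b = classify_gata_fingers(type_b)
```

Because the loop quantifier is greedy, a sequence with several cysteines near the end of the loop could be matched with a longer loop than intended, so real hits should be checked by eye or against a curated domain database.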

RanBP-type ZnF motifs

These motifs consist of two short beta hairpins that sandwich a single zinc atom, and are similar in structure to the zinc-ribbon fold. These domains were first identified in the nuclear export protein RanBP2. RanBP ZnF domains are known to interact with ubiquitin.

A20-type ZnF motifs

These motifs bind a single zinc atom and were first identified in protein A20. These motifs are known to bind to ubiquitin, but contact a different region of ubiquitin from RanBP ZnF motifs.

LIM-type ZnF motifs

LIM domains coordinate one or more zinc atoms, and are named after the three proteins (LIN-11, Isl1 and MEC-3) in which they were first found. They consist of two zinc-binding motifs that resemble GATA-like ZnFs; however, the residues holding the zinc atom(s) are variable, involving Cys, His, Asp or Glu residues. LIM domains are found in proteins with differing functions, including gene expression, cytoskeleton organisation, and development. Proteins containing LIM ZnF domains include the adaptor protein PINCH.

MYND-type ZnF motifs

MYND domains coordinate two zinc atoms, and are named after the three proteins (Myeloid translocation protein 8, Nervy, and DEAF-1) in which they were first found. They consist of two zinc-binding motifs, the first containing a short beta-hairpin, while the second consists of two short alpha-helices. Proteins containing MYND ZnF domains include the transcriptional co-repressor protein BS69.

RING-type ZnF motifs

RING (really interesting new gene) domains coordinate two zinc atoms. Proteins containing RING ZnF domains include KAP-1, PML, and several E3 ubiquitin ligases (which catalyse the final step of the protein ubiquitination pathway).

PHD-type ZnF motifs

PHD domains coordinate two zinc atoms, and are named after the class of proteins (plant homeodomain) in which they were first found. PHD ZnF domains differ from RING-type domains in containing a highly conserved Trp residue involved in the hydrophobic core; this residue is exposed to solvent in RING-type ZnF domains. Proteins containing PHD ZnF domains include Ing2 (inhibitor of growth protein 2), BPTF, Pygopus (Wnt signalling pathway), WSTF transcription factor, and Datf1 (Death-associated transcription factor 1).

TAZ-type ZnF motifs

TAZ (transcriptional adaptor zinc-binding) domains consist of two ZnF motifs that form a distinct fold unrelated to other ZnFs. Proteins containing TAZ ZnF domains include CBP acetyltransferase.

Zinc fingers vary widely in structure, as well as in function, which ranges from DNA or RNA binding to protein-protein interactions and membrane association.


Miller, J., McLachlan, A.D., and Klug, A. 1985. Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes. EMBO J. 4:1609-1614.

Rubin, G.M., Yandell, M.D., Wortman, J.R., Gabor Miklos, G.L., Nelson, C.R., Hariharan, I.K., Fortini, M.E., Li, P.W., Apweiler, R., Fleischmann, W., et al. 2000. Comparative genomics of the eukaryotes. Science 287:2204-2215.