S Kushwaha, M Shakya
cellular and subcellular, comparative genomics approach, confidence level, genome sequence
S Kushwaha, M Shakya. Protocol of Rice Genome Annotation through Comparative Functional Genomics Approach. The Internet Journal of Genomics and Proteomics. 2008 Volume 4 Number 1.
Identification and characterization of genes and proteins are very important tasks, but they are slow compared with genome sequencing because of the lack of an annotation protocol. In this paper, efforts have been made to characterize
Genome sequencing of animals, plants and microbes is proceeding very fast, and genomic data are accumulating at a rapid rate, so storing these data and transforming them into information are critically needed. Genomic analysis of cereal crops such as rice, wheat and maize will contribute greatly to improving their productivity. The rice genome is particularly important among the cereal crops because of its small size (430 Mb) and its high degree of chromosomal co-linearity with other cereal crops such as maize, wheat, barley and sorghum. Rice is a major food source for more than half of the world's population. In regions such as Asia, Africa and Latin America, where the demand for rice is highest, the population is continuously increasing, so there is a need to develop novel techniques to breed new varieties of rice. Following the successful completion of the Human Genome Project, a new era of whole-genome science has emerged, ranging from humans to plants and yeast. Comparisons between distantly related genomes provide insight into the universality of biological mechanisms and identify experimental models for studying complex processes. The International Rice Genome Sequencing Project (IRGSP), a consortium of publicly funded laboratories, has generated a finished-quality sequence of the entire genome using the clone-by-clone sequencing strategy [13,21] and made it available in the public domain. With sequencing complete, annotation is a dynamic process that is essential to add value to the genome [2,4,5].
The major task in genome annotation is to identify the genes, termed structural annotation, which relies on computational methods. Given the importance of comparative genetics at the forefront of new knowledge on plant genomes and genes, comparative bioinformatics remains an essential strategy for gaining new insights into the needs and expectations of rice genomics. Information about genes, their proteins and their specificity is obtained from the cellular and subcellular locations of proteins [1,8,3]. Bioinformatics approaches help to expedite the determination of protein cellular and subcellular locations [10,11,12,14]. To explore this problem, proteins were classified, according to their specific characterization and subcellular locations [16,17,19], into the following 12 groups: (1) Chloroplast, (2) Cytoplasm, (3) Cytoskeleton, (4) Endoplasmic Reticulum, (5) Extracellular, (6) Golgi Apparatus, (7) Lysosome, (8) Mitochondria, (9) Nucleus, (10) Peroxisome, (11) Plasma Membrane, (12) Vacuole. ESTs are cDNA clones that have been arbitrarily chosen and subjected to single-pass sequencing in both directions, which gives a rough survey of the transcriptional content of a tissue or organism. They provide a highly cost- and time-effective method of accessing the desired feature. Cellular location (tissue) identification using ESTs (japonica variety11) from root is reported in the NCBI EST database. The EST similarity search was carried out with BLAST. It is anticipated that the classification scheme, concept and prediction protocol can expedite the determination of the properties of new genes and their proteins. It may also be used to prioritize genes for the identification of potential molecular targets [20,21]. Here we have made a comparative functional genomics analysis of the results obtained through various tools. A statistical approach was then used, i.e. we assigned a confidence level to the functionally annotated sequences.
Material & Methods
When the work was started, chromosome 1 contained 500 hypothetical sequences. The work remains relevant because, although genome sequencing is complete, the details of the genes are still to be explored in the gene bank. Each individual sequence was run through functional annotation tools. The tools used were Interpro, SVMProt (support vector machine based), Pfam (conserved domain based), GFSelector, MIPS BLAST, TAIR Tools and PROTFun. For subcellular localization prediction, the tools used were SubLoc, ESL Predicts, TargetP, Cello, Psort, Predator, Mito-II, Chloro-I and LOC-tree [15,17]. These tools are used for the identification and characterization of gene products, i.e. proteins. For calculation of the confidence score (CS), the following formula is used
Results & Discussions
The results from all the tools (Interpro, Pfam, GFSelector, MIPS BLAST, TAIR Tools, PROTFun and SVMProt) were stored in an Excel sheet. The results were then compared by calculating a confidence level for each function, i.e. a statistical measure of how many tools give a common function for one sequence. This indicates how much confidence there is in any function assigned to a sequence. Of the 500 hypothetical protein sequences, 25 were assigned a protein function with a 75% confidence level, 10 with 62.5%, 25 with 50%, 35 with 37.5%, 60 with 25% and 285 with 12.5%; the remaining 60 sequences have a very low confidence level [Table-1]. Table-1 is shown graphically in Graph-1.
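The comparison described above can be sketched as the share of tools that agree on the most common function for a sequence. The formula itself is not reproduced in this text, so the following Python is only an illustration under that assumption: the function name `confidence_level`, the function labels and the example predictions are hypothetical. The reported levels are multiples of 12.5%, which would be consistent with a denominator of eight, so the denominator is left as a parameter rather than fixed.

```python
from collections import Counter

def confidence_level(predictions, n_tools=None):
    """Return (consensus_function, percent of tools agreeing on it).

    predictions: dict mapping tool name -> predicted function label,
                 or None when the tool gave no result.
    n_tools:     denominator; defaults to the number of tools queried.
    Assumption: confidence = agreeing tools / total tools * 100.
    """
    total = n_tools if n_tools is not None else len(predictions)
    votes = Counter(label for label in predictions.values() if label is not None)
    if not votes or total == 0:
        return None, 0.0
    # Ties are resolved in favour of the label seen first.
    function, count = votes.most_common(1)[0]
    return function, 100.0 * count / total

# Hypothetical predictions for one sequence (labels are made up).
example = {
    "Interpro": "kinase",
    "Pfam": "kinase",
    "SVMProt": "kinase",
    "GFSelector": None,
    "MIPS BLAST": "transferase",
    "TAIR Tools": None,
    "PROTFun": None,
}
# n_tools=8 is an assumption chosen to match the 12.5% steps in Table-1.
function, level = confidence_level(example, n_tools=8)
print(function, level)
```

Three agreeing tools out of an assumed eight give 37.5%, one of the levels reported in Table-1. With a different denominator the same function yields different steps, which is why the count of tools should be stated alongside any confidence level.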
Proteins whose characterization reached the 12.5% cutoff score were selected for subcellular localization prediction, and the confidence level of these predictions was then calculated statistically. For subcellular localization prediction, the tools used were SubLoc, ESL Predicts, TargetP, Cello, Psort, Predator, Mito-II, Chloro-I and LOC-tree [14,15,16]. The results from all subcellular localization prediction tools were stored in an Excel sheet and compared by calculating a confidence level, i.e. a statistical measure of how many tools give a common prediction for one sequence. This leads to the conclusion that chromosome 1 is characterized by nuclear (Nu.) (50%), mitochondrial (Mito) (21%), plastid (Chlo.) (12%), secretory (SP) (2%), plasma membrane (PM) (6%), cytoplasmic (Cyt) (5%) and extracellular (EC) (4%) proteins [Table-2].
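How per-protein predictions might be turned into compartment percentages of this kind can be illustrated with a short Python sketch. It assumes a simple majority vote among the localization tools for each protein, which the text does not spell out; the sequence identifiers, the function names and the example calls are hypothetical and are not taken from Table-2.

```python
from collections import Counter

def consensus_location(tool_calls):
    """Most frequently predicted compartment among the tools, or None.

    Ties are resolved in favour of the compartment seen first.
    """
    votes = Counter(call for call in tool_calls if call)
    return votes.most_common(1)[0][0] if votes else None

def compartment_shares(proteins):
    """Map compartment -> percent of proteins with that consensus location."""
    consensus = [consensus_location(calls) for calls in proteins.values()]
    counts = Counter(c for c in consensus if c is not None)
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in counts.items()} if total else {}

# Hypothetical calls from three tools for four sequences.
proteins = {
    "seq1": ["Nu", "Nu", "Mito"],
    "seq2": ["Nu", "Cyt", "Nu"],
    "seq3": ["Mito", "Mito", None],
    "seq4": ["Chlo", "Chlo", "Nu"],
}
shares = compartment_shares(proteins)
print(shares)
```

Proteins for which no tool returns a call are left out of the denominator here; counting them instead would lower every percentage, so that choice needs to be made explicit in a real analysis.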
Here, cellular location (tissue) identification is done using ESTs. The NCBI EST database has 44 entries for the
Different tools, based on different logic and algorithms, were used to analyse these sequences, and their results indicate common functions for the sequences but with different levels of confidence. Sequences with higher confidence levels can be given higher priority in research and development aimed at improving rice. Low confidence levels imply that the sequences have no significant homology with the data sets already present in the databases. In the characterization process, 60 sequences gave no results, i.e. 12% of the chromosome 1 sequences show no functional significance (non-coding sequences) [fig-1], and 88% show biological activity in Oryza [molecular regulatory: nuclear (50%), mitochondrial (21%), plastid (12%), secretory protein (2%), plasma membrane (6%), cytoplasmic (5%), extracellular proteins (4%)] [fig-2]. All these predictions were made with bioinformatics tools and techniques, followed by statistical analysis.
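The 12% and 88% figures follow directly from the counts reported for the 500 sequences, as the following short Python check shows. The counts and localization percentages are the ones stated in the text; the variable names are illustrative only.

```python
# Sequences per confidence level, as reported for Table-1.
reported_counts = {
    "75%": 25,
    "62.5%": 10,
    "50%": 25,
    "37.5%": 35,
    "25%": 60,
    "12.5%": 285,
    "very low": 60,
}

total = sum(reported_counts.values())  # should equal the 500 sequences studied
shares = {level: 100.0 * n / total for level, n in reported_counts.items()}

without_function = shares["very low"]       # sequences giving no usable result
with_function = 100.0 - without_function    # sequences with an assigned function

# Localization percentages, as reported for Table-2.
localization = {
    "nuclear": 50, "mitochondrial": 21, "plastid": 12, "secretory": 2,
    "plasma membrane": 6, "cytoplasmic": 5, "extracellular": 4,
}

print(total, without_function, with_function, sum(localization.values()))
```

The counts add up to 500 and the localization categories to 100%, so the two tables are internally consistent with the 12% and 88% split.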
The present model can be further extended, with some modifications if necessary, for the analysis of other varieties of
We are grateful to the Department of Bioinformatics, MANIT, Bhopal, India, for support and cooperation.