Clustering and Classification of Anopheline Spacer Sequences using Self Organizing Maps
A Banerjee, N Arora, U Murty
Keywords
classification, clustering, its2, mosquito, secondary structure, self organizing map som
Citation
A Banerjee, N Arora, U Murty. Clustering and Classification of Anopheline Spacer Sequences using Self Organizing Maps. The Internet Journal of Genomics and Proteomics. 2008 Volume 4 Number 1.
Abstract
ITS2, a well-known phylogenetic marker, is widely used in taxonomic studies. This study exploits a novel approach to classify and cluster the
Introduction
Malaria is the most devastating parasitic disease of humans, exacting an estimated toll of 300–500 million new infections and 1.5–3.0 million deaths annually (World Malaria Report 2005). Some 41% of the population lives in endemic regions in 107 countries, under constant threat of malaria (World Malaria Report 2005). The completion of the triad of mosquito, parasite and human genomes fuelled the effort to see malaria in a new light and provided a much-needed impetus to studies at the molecular level (Aultman
Nuclear ribosomal RNA genes (rDNA) are organized in clusters containing the 18S, 5.8S and 28S subunits in eukaryotic organisms. Two internal transcribed spacers (ITS) are known to occur: ITS1, separating the 18S and 5.8S genes, and ITS2, lying between the 5.8S and 28S genes (Fedoroff 1979). These spacer sequences are used extensively as reliable markers for taxonomic classification across taxa and are exploited for phylogenetic reconstruction by virtue of their fast evolution (Coleman 2003, Alvarez and Wendel 2003).
Studies focusing on ITS2 are commonplace in taxonomy, particularly for mosquito genera. The ITS2 region has been exploited extensively for differentiating among closely related mosquito species (Crabtree
Spacer sequences, often projected as the most efficient weapon in the arsenal for resolving phylogenetic relationships at different divergence levels (Hillis and Dixon 1991), do suffer from certain shortcomings. Although ITS2 sequences are used to resolve phylogenies at the intra-individual, population and interspecies levels, their high variability often restricts their use to closely related species, and they find little use in phylogeny beyond that level. The divergence in ITS sequences that stems from recombinant and pseudogenic variants often leads to misleading results; hence, reliance on the ITS sequence alone can prove costly, making this marker a double-edged sword. The role of secondary conformations of the ITS regions in defining the cleavage sites that release the ribosomal genes during maturation is well known. The secondary structure of an RNA is at least as important as its sequence for function. ITS2 secondary structure prediction furnishes additional information for phylogenetic inference and for differentiating between functional and pseudogenic ITSs (Wesson
Although the tertiary structure of a functional RNA molecule is a crucial determinant of its function, predicting the three-dimensional structure from the sequence is difficult and cumbersome. However, the secondary structure is known to be conserved in functional RNAs and is important to their function. Secondary structure models can be used to improve alignments at higher systematic levels, even for strongly divergent regions such as the ITS, and the framework dictated by the secondary structure is considered a tool for expanding preliminary molecular phylogenies. Hence, the secondary structure is usually considered a sufficient approximation of the tertiary structure, and several methods for predicting secondary structures have been developed and implemented.
Since ITS2 secondary structure of numerous eukaryotes has been elucidated in the recent past (Joseph
Materials And Methods
Data collection and data set preparation
ITS2 sequences of
Among the parameters known to contribute to RNA secondary structure stability, those considered for this study are listed in Table 1:
Secondary structure prediction
RNA secondary structure consists of stems and loops. Five main types of loop occur in RNA secondary structure: interior, hairpin, exterior, multi and bulge. In-depth analysis requires calculating the secondary structure and determining its structural conservation.
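The five loop types can be read directly off a predicted structure. The following is a minimal sketch, assuming the structure is available in dot-bracket notation (an assumption; the output format of the prediction tool used in this study is not stated here). The function names `pair_table` and `count_loops` are hypothetical, and the code counts loops per type rather than reproducing the loop measures used in the study.

```python
from collections import Counter


def pair_table(structure):
    """Map each opening bracket position to its closing partner."""
    stack, pairs = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs[stack.pop()] = pos
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def count_loops(structure):
    """Count loops by type; 'stack' is a stacked base pair, not a loop."""
    pairs = pair_table(structure)
    counts = Counter()
    # Every non-empty structure has exactly one exterior loop.
    counts["exterior"] = 1 if structure else 0
    for i, j in pairs.items():
        # Collect the base pairs directly enclosed by the pair (i, j).
        branches = []
        k = i + 1
        while k < j:
            if k in pairs:
                branches.append((k, pairs[k]))
                k = pairs[k] + 1
            else:
                k += 1
        if not branches:
            counts["hairpin"] += 1
        elif len(branches) > 1:
            counts["multi"] += 1
        else:
            p, q = branches[0]
            unpaired_left, unpaired_right = p - i - 1, j - q - 1
            if unpaired_left == 0 and unpaired_right == 0:
                counts["stack"] += 1
            elif unpaired_left == 0 or unpaired_right == 0:
                counts["bulge"] += 1
            else:
                counts["interior"] += 1
    return counts
```

Classifying each closing base pair by the number of pairs it directly encloses keeps the logic to a single pass over the pair table; pseudoknots cannot be expressed in this notation and are therefore not handled.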
RNA folds analysis
Probable target accessibility (loops) was determined using
Structural energy calculation
Structural energy appears to be the most important factor influencing structural stability. The secondary structure with the lowest possible free energy value, the minimum free energy (MFE) structure, is predicted to be the most stable secondary structure for the strand. Among the sub-optimal structures calculated by the Sribo program, the stable structures with the lowest energy were considered and used in the data mining analysis to interpret the influence of different factors on secondary structure stabilization.
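Selecting the most stable structure from a set of sub-optimal candidates amounts to taking the minimum over their free energies. A minimal sketch follows; the function name `most_stable`, the candidate structures and the energy values (assumed here to be in kcal/mol) are all made up for illustration and are not taken from the study.

```python
def most_stable(candidates):
    """Return the (structure, energy) pair with the lowest free energy."""
    if not candidates:
        raise ValueError("no candidate structures supplied")
    return min(candidates, key=lambda pair: pair[1])


# Made-up candidates and energies (assumed kcal/mol), for illustration only.
suboptimal = [
    ("((...))..", -1.2),
    ("(((...)))", -3.4),
    (".((...)).", -2.1),
]
best_structure, best_energy = most_stable(suboptimal)
```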
GC content calculation
GC content is known to influence structural energy. GC percentage was determined using the GC calculator (http://www.genomicsplace.com/gc_calc.html). All non-DNA characters except N were stripped before computing.
Besides the above-mentioned parameters, other features such as total bases were calculated manually.
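The GC percentage and total base count can be reproduced with a few lines of code. This is a sketch under stated assumptions, not the calculator's actual implementation: it assumes that only A, C, G, T and N are kept after stripping, and that N counts toward the sequence length but not toward GC. The function names are hypothetical.

```python
def clean_sequence(raw):
    """Keep only A, C, G, T and N (case-insensitive); drop everything else."""
    return "".join(ch for ch in raw.upper() if ch in "ACGTN")


def total_bases(raw):
    """Number of bases left after stripping non-DNA characters."""
    return len(clean_sequence(raw))


def gc_percent(raw):
    """Percentage of G and C among the retained bases (N is in the denominator)."""
    seq = clean_sequence(raw)
    if not seq:
        raise ValueError("sequence has no usable bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)
```

For example, `gc_percent("ATGC")` gives 50.0, since two of the four retained bases are G or C.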
Data mining analysis
We live in a data-rich, information-poor world in which the magnitude of data generated by high-throughput methods is overwhelming; data mining opens a new window of opportunity in this arena. In the present study, a data mining approach was used to uncover the concealed information inherent in the sequence that ultimately affects structural stabilization.
Self Organizing Maps (SOM): An artificial neural network (ANN) is an abstract simulation of a real nervous system, comprising a collection of neuron units that communicate with each other via axon connections.
In a SOM, neurons compete with each other to earn the right to represent the input data (Kohonen 2001). As a result, data in the multidimensional attribute space can be abstracted to a much smaller number of latent dimensions, organized on the basis of a predefined geometry in a space of lower dimensionality, usually a regular two-dimensional array of neurons. In this way the structures embedded in the input data can be revealed: the network is placed in the input space and spans the distribution of the inputs. Using a SOM network, it is possible to obtain a map of the input space in which closeness between units or clusters on the map represents closeness of the input data. Each processing unit in the SOM lattice is associated with a weight vector of the same dimension as the input data. Using the weights of each processing unit as a set of coordinates, the lattice can be positioned in the input space. During the learning stage the weights of the units change their position and “move” towards the input points. This “movement” becomes progressively slower and, at the end of the learning stage, the network is “frozen” in the input space.
After the learning stage, each input can be associated with its nearest network unit. When the map is visualized, the inputs can be associated with each cell on the map. One or more cells that clearly contain similar objects can be considered a cluster on the map. These clusters are generated during the learning phase without any other information; it is not necessary to supply the network with cluster prototypes or examples. SOMs cluster the data in a manner similar to cluster analysis, but have the additional benefit of ordering the clusters and enabling the visualization of large numbers of clusters. These clusters are arranged in a low-dimensional topology, usually a grid structure, that preserves the neighborhood relations in the high-dimensional data (Kohonen T 1982, Nurnberger A. and Detyniecki 2002, Cuadros-Vargas
Parameters identified for SOM:
Structural parameters (hairpin loop, internal loop, bulge loop, multi loop, external loop and energy) and inherent sequence parameters (total bases, G/C ratio and GC content %) were considered for this study.
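The competitive learning scheme described above can be sketched in a few lines. This is a minimal illustration and not the software used in the study: the linear decay of the learning rate and neighborhood width, the Gaussian neighborhood function, the random initialization and all default values (`iterations`, `lr0`, `sigma0`, `seed`) are assumptions, as are the function names. Inputs are assumed to be numeric vectors already scaled to roughly the range 0 to 1, one value per parameter.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weights are closest to x."""
    return min(
        weights,
        key=lambda unit: sum((w - v) ** 2 for w, v in zip(weights[unit], x)),
    )


def train_som(data, rows=2, cols=2, iterations=2000, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid unit, keyed by (row, column) starting at 1.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(1, rows + 1)
        for c in range(1, cols + 1)
    }
    for step in range(iterations):
        remaining = 1.0 - step / iterations
        lr = lr0 * remaining                # learning rate shrinks to zero
        sigma = sigma0 * remaining + 1e-3   # neighborhood width shrinks too
        x = rng.choice(data)
        winner = best_matching_unit(weights, x)
        for unit, w in weights.items():
            grid_dist2 = (unit[0] - winner[0]) ** 2 + (unit[1] - winner[1]) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Move the unit towards the input, scaled by its grid distance
            # from the winning unit.
            for k in range(dim):
                w[k] += lr * influence * (x[k] - w[k])
    return weights


def assign_clusters(weights, data):
    """Associate each input with its nearest unit after training."""
    return [best_matching_unit(weights, x) for x in data]
```

With nine-parameter vectors as input, `assign_clusters(train_som(vectors), vectors)` would return one grid position such as `(1, 2)` per sequence, matching the cluster labelling convention of a 2×2 map.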
Data Normalization:
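The normalization scheme is not given in the text available here. Because the parameters differ widely in scale (loop values, energies, base counts and percentages), some rescaling is needed before distances between input vectors are meaningful. One common choice, shown below purely as an assumption, is column-wise min-max scaling to the range 0 to 1; the function name is hypothetical.

```python
def min_max_normalize(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled
```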
Results And Discussion
In short:
Total number of sequences selected for study = 123
Total number of input parameters = 9
Total iterations per sequence to form a neuron = 100,000
Total iterations to form the 4-neuron (2×2) grid structure = 12,300,000
Successful or winning neurons = 4
Unsuccessful neurons = 0
Figure 5
Cluster (1, 1): This cluster contains 4 sequences and is characterized by moderate values for all parameters. The external loop shows the least variation, while the maximum variation was observed in the bulge loop.
Cluster (1, 2): This cluster comprises a total of 69 sequences. Maximum variation was observed in the internal loop, followed by the hairpin, bulge, multi and external loops. Structural energy is high in all the sequences except for
Cluster (2, 1): The sequences falling in this cluster show uniformly high energies and a similarly high G/C ratio, while their GC content % is quite low, contrary to the popular belief that GC content is the most important parameter in determining structural energy. The highest variation is observed in the internal loop, followed by the hairpin, multi, bulge and external loops.
Cluster (2, 2): A total of 28 sequences fall in this cluster. It is characterized by variation in structural energies, which is reflected in the gradient. Unlike the other clusters, it contains sequences that differ in base number. The external loop showed the lowest variation, followed by the multi loop. Variation in structural energy for different
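Per-cluster statements of the kind given above (which parameter varies most, which least) can be derived mechanically once each sequence has been assigned to a map unit. The sketch below is an illustration only: it assumes that "variation" is measured as the population standard deviation of each normalized parameter within a cluster, which the text does not specify, and the function name and arguments are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def rank_variation(rows, assignments, param_names):
    """For each cluster, list parameter names from most to least variable."""
    clusters = defaultdict(list)
    for row, unit in zip(rows, assignments):
        clusters[unit].append(row)
    ranking = {}
    for unit, members in clusters.items():
        # Spread of each parameter across the members of this cluster.
        spread = {
            name: pstdev(column)
            for name, column in zip(param_names, zip(*members))
        }
        ranking[unit] = sorted(spread, key=spread.get, reverse=True)
    return ranking
```

Here `rows` holds one parameter vector per sequence, `assignments` holds the corresponding map unit such as `(1, 1)`, and `param_names` gives the parameter order used in the vectors.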
Discussion
Molecular taxonomists are generally overwhelmed by complexity of smothering sequence information owing to their number and sibling status of
Since its inception by McCulloch and Pitts in 1943, the ANN has come a long way and now encompasses a wide range of fields. Applications of neural networks within the medical domain for clinical diagnosis, image analysis and interpretation, and drug development have been reviewed in the past. SOM is an approach that belongs to the class of unsupervised neural networks with a competitive learning algorithm. The SOM approach is useful for extracting implicit, valuable and interesting data from vast quantities of information. In this approach, neurons compete with each other to earn the right to represent the input data (Oja and Kaski 1999, Kohonen 2001). As a result, data in the multidimensional attribute space can be abstracted to a much smaller number of latent dimensions, organized on the basis of a predefined geometry in a space of lower dimensionality, usually a regular two-dimensional array of neurons. Using this approach, the patterns embedded in the input data can be revealed. SOMs cluster the data in a manner similar to cluster analysis, but have the additional benefit of ordering the clusters and enabling the visualization of large numbers of clusters (Bock 2004). This technique is particularly useful for the analysis of large datasets in which similarity matching plays a very important role. SOM compresses information while preserving the most important topological and metric relationships of the primary data items (Kirk and Zurada 1999). SOMs have been applied successfully to the classification of DNA sequences based on codon usage (Kanaya
In the data set considered, GC content ranges from 44.6 % to 70.8% where
Clustering and visualization of sequence data using SOM according to inherent features enables efficient interpretation and analysis. The relationship of structural energy with sequence composition features and structural parameters can be explained using this technique. SOM reduces the complexity of multidimensional data and hence can be used effectively for making such relationships explicit.
Concluding Remarks
RNA secondary structure is crucial to the three-dimensional structure, but determining the correct structure and folding pattern of ITS2 is cumbersome. It is practically infeasible to calculate the effect of the parameters influencing the structural energy of an RNA structure by conventional experimental approaches. With the exponential increase in sequences, deriving interpretations and inferences from the accumulated data will pose an ever greater challenge. Data mining approaches can streamline the elucidation of the hidden information inherent in such data and will help in determining not-so-obvious interrelationships. RNA folding algorithms also take structural energy as the major determinant in furnishing RNA secondary structure models and conformations. Clustering and visualization of such data will add meaningful dimensions to our understanding of the relationships among the sequence features and structural parameters that come into play in determining the structural energy. This approach can be further fine-tuned to resolve ambiguities, using differences at the RNA structural level, in the identification of sibling species complexes.
Acknowledgement
The authors are grateful to the Director, Indian Institute of Chemical Technology, Hyderabad for his continuous support and encouragement.
Correspondence to
Dr. Upadhyayula Suryanarayana Murty Scientist “F”/ Deputy Director Head, Biology Division, Indian Institute of Chemical Technology, Hyderabad-500007, India. E-mail: murty_usn@yahoo.com Phone: +91 40 27193134; Fax: +91 40 27193227