A Computational Approach To Classify HIV Secondary Structure Of Enzymes
A Dubey, U Chouhan
Citation
A Dubey, U Chouhan. A Computational Approach To Classify HIV Secondary Structure Of Enzymes. The Internet Journal of Medical Informatics. 2009 Volume 5 Number 2.
Abstract
The structure of a protein can reveal its function and its evolutionary history. Extracting this information requires knowledge of the structure and of its relationship to other proteins. Protein secondary structures are compact arrangements of helices and strands. Hence there is a need to develop computational techniques for the prediction and classification of HIV-1 and HIV-2 protein (enzyme) structures. In this paper a machine learning model has been developed to classify the alpha, beta and residues of four HIV enzymes (ribonuclease, reverse transcriptase, protease and integrase), all of which are present in the HIV-1 and HIV-2 cycle. Machine learning algorithms such as J48, Rotation Forest and Random Forest have been used for this classification, and the resulting model gives fair accuracy. The information generated from these models can be of great use in clinical applications.
Introduction
Protein secondary structures are compact arrangements of helices and strands. Hence there is a need to develop computational techniques for the prediction and classification of HIV-1 and HIV-2 protein (enzyme) structures. In this paper a machine learning model has been developed to classify the alpha, beta and residues of four HIV enzymes (ribonuclease, reverse transcriptase, protease and integrase), all of which are present in the HIV-1 and HIV-2 cycle [19,20,21,22], as shown in Figure 1. Machine learning algorithms such as J48, Rotation Forest and Random Forest have been used for this classification, and the resulting model gives fair accuracy. The information generated from these models can be of great use in clinical applications and in understanding HIV structure better, since these enzymes are among the better drug targets.
Random Forest is a class of ensemble methods specially designed for decision tree classifiers. It combines the predictions made by multiple decision trees, where each tree is generated based on the values of an independent set of random vectors. The random vectors are generated from a fixed probability distribution. Bagging using decision trees is a special case of random forests, where randomness is injected into the model-building process by randomly choosing N samples, with replacement, from the original training set. It has been theoretically proved that the upper bound for the generalization error of random forests converges to the following expression when the number of trees is sufficiently large:
$$\text{Generalization error} \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}$$
where $\bar{\rho}$ is the average correlation among the trees and $s$ is a quantity that measures the strength of the tree classifiers. The strength of a set of classifiers refers to the average performance of the classifiers, where performance is measured probabilistically in terms of the classifier margin:
$$M(X, Y) = P(\hat{Y}_\theta = Y) - \max_{Z \ne Y} P(\hat{Y}_\theta = Z)$$
where $\hat{Y}_\theta$ is the predicted class of X according to a classifier built from some random vector θ. The higher the margin, the more likely it is that the classifier correctly predicts a given example X [17].
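As a rough illustration of the quantities described above, the following Python sketch draws a bootstrap sample (N items chosen with replacement), estimates the margin of one example from the votes of the individual trees, and evaluates the error bound. It assumes the commonly stated form of the bound, ρ(1 − s²)/s². The function names, the toy labels and the values of ρ and s are hypothetical and are not taken from the experiments reported in this paper.

```python
import random
from collections import Counter


def bootstrap_sample(training_set, rng):
    """Draw N items with replacement from a training set of size N (bagging)."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


def margin(votes, true_label):
    """Empirical margin of one example: share of trees voting for the true
    class minus the largest share received by any other class."""
    counts = Counter(votes)
    n = len(votes)
    p_true = counts.get(true_label, 0) / n
    p_best_other = max(
        (c / n for label, c in counts.items() if label != true_label),
        default=0.0,
    )
    return p_true - p_best_other


def error_bound(rho, s):
    """Assumed bound rho * (1 - s^2) / s^2; only meaningful for s > 0."""
    if s <= 0:
        raise ValueError("strength s must be positive")
    return rho * (1 - s ** 2) / s ** 2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    toy_training_set = ["a", "b", "c", "d", "e"]  # hypothetical items
    print(bootstrap_sample(toy_training_set, rng))
    toy_votes = ["alpha", "alpha", "beta", "alpha"]  # hypothetical tree votes
    print(margin(toy_votes, "alpha"))
    print(error_bound(0.2, 0.5))  # hypothetical rho and s
```

A positive margin means that the majority of trees vote for the true class. Under the assumed bound, lower correlation among the trees or higher strength gives a tighter limit on the generalization error.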
Result & Discussion
To achieve our goal and develop our methodology, we obtained the dataset from the Protein Data Bank (PDB) for both HIV-1 and HIV-2. The following six cases arise for the classification of HIV-1 and HIV-2 enzymes. Classification of the PDB data for HIV reverse transcriptase, HIV protease and HIV ribonuclease by J48, Random Forest and Rotation Forest gives the following results.
The confusion matrix for alpha+beta of HIV-1 and HIV-2 generated from the above is given below:
The detailed accuracy by class is shown below:
For a selection of different cost ratios, the instances are weighted, the scheme is trained on each weighted set, the true positives and false positives in the test set are counted, and the resulting point is plotted on the ROC axes. The ROC curves for the different classes have been plotted as shown in
The accuracy of the results obtained by the different algorithms is presented in Table 1.
Thus we observe that, of the 346 instances of HIV reverse transcriptase, protease, integrase and ribonuclease taken for cross-validation, 322 were classified correctly and 24 were classified incorrectly by the Rotation Forest classifier. This amounts to an accuracy of 93.0636%, the highest among the three classifiers used here. The classifier is thus able to classify HIV-1 and HIV-2 enzymes, a task for which no algorithm has been reported in the literature so far. The number of instances can be increased by adding secondary structure data from other organisms such as mouse, rat and pig, but this does not produce any significant change, which implies that the human instances alone are sufficient to develop the classifier. The reason is that enzyme structures are 75-85% similar between human and other organisms. Hence including secondary structure data from other organisms would increase not only the number of instances but also the redundancy. The same model can be applied to organisms such as mouse and rat, for which secondary structure information is available in the Protein Data Bank, the structure database of proteins.
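The accuracy figure quoted above, and the per-class true positive and false positive rates that make up a point on the ROC axes, can all be derived from a confusion matrix. The Python sketch below shows one way to do this. Only the totals (346 instances, 322 correct, 24 incorrect) come from the text; the two class labels and the individual cell values are hypothetical and were chosen purely so that the totals match.

```python
def summarize(confusion, labels):
    """Return (accuracy, per-class rates) for a square confusion matrix.

    confusion[i][j] is the number of instances whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    k = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    per_class = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # missed members of the class
        fp = sum(confusion[r][i] for r in range(k)) - tp  # others predicted as the class
        tn = total - tp - fn - fp
        per_class[label] = {
            "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
            "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
        }
    return correct / total, per_class


# Hypothetical cell values; only the totals 346 and 322 are from the text.
labels = ["HIV-1", "HIV-2"]
confusion = [
    [250, 14],
    [10, 72],
]

accuracy, per_class = summarize(confusion, labels)
print(f"accuracy = {100 * accuracy:.4f} %")
for label, rates in per_class.items():
    print(label, rates)
```

Each (fp_rate, tp_rate) pair corresponds to a single point on the ROC axes for that class; tracing a full curve would require repeating the evaluation under different cost ratios or decision thresholds.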
Conclusion
The above classifier takes into account the secondary structure of all 346 known HIV enzymes. As the Rotation Forest classifier performs best among the three classifiers, it qualifies as the most suitable choice for classification and prediction. The authors wish to incorporate more information as soon as it becomes available. The model is useful for generating information that can be of great use in predicting the structure and function of all the enzyme structures present, since they are key drug targets. A protein structure belonging to a particular class will have the functional domains and the alpha and beta structure corresponding to that class, which will ease locating the active site(s) and binding site(s) in the classified protein, and hence the potential active site or binding site for a drug. As more structures of HIV enzymes are discovered, the classifier can be retrained to improve the accuracy of the results.
Acknowledgement
The authors are highly thankful to the Department of Biotechnology, New Delhi, for providing the Bioinformatics Infrastructure Facility at MANIT, Bhopal, for carrying out this work.