Application of Correlation & Regression Tree (CART) for management of Malaria in Arunachal Pradesh, India
U Murty, N Arora
U Murty, N Arora. Application of Correlation & Regression Tree (CART) for management of Malaria in Arunachal Pradesh, India. The Internet Journal of Tropical Medicine. 2007 Volume 5 Number 1.
Malaria is a focal disease whose epidemiological pattern varies widely with topographical features. The present paper demonstrates the application of CART (Classification & Regression Trees) for control of malaria in Arunachal Pradesh, India. Baseline epidemiological data from 12 districts of Arunachal Pradesh were employed for deriving prediction rules. The data were categorized into 2 aspects, namely (1) epidemiological and (2) meteorological. The complex interactions that exist between the diverse input data sets, as they relate to the target feature, are learned and modeled through exhaustive analysis. Predictor variables (maximum temperature, minimum temperature, rainfall, relative humidity, number of rainy days and month) were ranked by CART according to their influence on the target variable (MPI). Applying these easily conceptualized rules, rather than more abstract epidemiological principles, enables even non-specialists to understand the malaria problem and to forecast malaria transmission dynamics, so that intervention strategies can be formulated to combat malaria effectively.
Malaria, the third leading cause of death attributable to an infectious disease worldwide, has plagued mankind for countless generations. The problem of malaria is deeply entrenched in more than 90 countries of the world (WHO, 1998) and results in approximately 300 million acute illnesses and at least one million deaths annually (WHO, 1999). India, being a tropical country, is a malarial paradise, with an annual burden estimated at nearly 2 to 2.5 million cases. The North-Eastern region of India lies in the Indo-Chinese hill zone of Macdonald's classification of stable malaria (MacDonald, 1957) and contributes nearly 9% of total malaria cases in India (Shiv Lal et al, 2000). In this region, perennial transmission of malaria slashes potential economic growth and is thus a major impediment to the overall development and progress of these areas. Despite several anti-malaria programmes, this region has seen little tangible progress in alleviating the burden of malaria (Mohapatra et al, 1998; Sen et al, 1994). Apparently, there are definite inadequacies that have continued to dampen the spirit of public health specialists ever since the halcyon days of malaria eradication. On closer scrutiny, operational difficulties stemming from financial constraints and a lack of definite knowledge about malaria transmission trends are hampering effective malaria control in the North-Eastern region (Mohapatra et al, 2003). Areas made inaccessible by floods bear the maximum brunt of malaria. Main factors leading to failures in combating malaria in such regions are predominance of
Materials and Methods
Arunachal Pradesh is the largest state by area in the North-East region of India, sharing a long international border with Bhutan, China and Myanmar. The state is situated between latitude 26°30' N and 29°30' N and longitude 91°30' E and 97°30' E. The climate of the state is dominated by the Himalayan system and variations in altitude. The climate is very hot and humid at the lower altitudes and in the valleys covered by swampy dense forest, particularly in the eastern section, while it becomes exceedingly cold at the higher altitudes. Average temperature ranges from 15°C to 21°C during the winter months and from 22°C to 30°C during the monsoon. Forested terrain and perennial streams are congenial for rapid multiplication and longevity of malaria vectors. The population of the state was 1,091,117 according to the 2001 census. The state has a major population of 20 scheduled tribes and numerous sub-tribes. Agriculture is the primary driver of the economy. Nearly 80% of the population is engaged in.
Twelve districts of Arunachal Pradesh were randomly selected for the study. Epidemiological and meteorological data for 1999-2004 were collected from the Directorate of Health, State Government of Arunachal Pradesh.
A dataset consisting of meteorological and epidemiological parameters from 12 districts of Arunachal Pradesh with 12 attributes was used.
Monthly parasite incidence (MPI), expressed as (positive blood smears for malaria / total population) × 1000, was considered as the malariometric index in this study.
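The MPI definition above can be written as a small function. This is a minimal sketch: the function name and the example counts are hypothetical and are not taken from the study data.

```python
def monthly_parasite_incidence(positive_smears: int, population: int) -> float:
    """MPI = positive blood smears for malaria / total population * 1000."""
    if population <= 0:
        raise ValueError("population must be positive")
    return positive_smears / population * 1000


# Hypothetical example values, not taken from the study data.
example_mpi = monthly_parasite_incidence(positive_smears=25, population=10_000)
print(example_mpi)
```

An MPI of 1 therefore corresponds to one positive blood smear per 1000 people in the month.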
Data mining tool
CART version 5.0 from Salford Systems, California, USA, was used for the current analysis (http://www.salford-systems.com). CART automatically searches for important patterns and relationships, uncovering hidden structure even in highly complex data, which can then be used to generate highly accurate and reliable predictive models for various applications (Breiman
The steps used in the analyses are summarized as follows:
1. Preprocessing of Data: Conversion of *.xls to *.csv format
2. Variable selection: The data consist of several fields describing each attribute. The attributes include (1) name of Primary Health Centre (PHC), (2) locality, (3) district, (4) state, (5) country, (6) month, (7) maximum temperature, (8) minimum temperature, (9) total rainfall, (10) relative humidity, (11) number of rainy days and (12) Monthly Parasite Incidence (MPI). Seven of the twelve attributes were used for developing the decision rules: six predictors, namely (1) maximum temperature, (2) minimum temperature, (3) total rainfall, (4) relative humidity, (5) number of rainy days and (6) month, together with the target, MPI. The six predictors form the independent variables; the dependent (target) variable is MPI. All variables except month are continuous; because the target is continuous, a regression model was selected for this analysis.
3. Specification of the tree type: The two tree types available in CART are classification and regression. The regression tree type was applied because the target variable “MPI” is continuous in this study.
4. Splitting method selection: In choosing the best splitter, the program seeks to maximize the average “purity” of the two child nodes. A number of different measures of purity can be selected, loosely called “splitting criteria” or “splitting functions.” Eight splitting methods are available in the CART interface, i.e. GINI (default), Symmetric GINI, Entropy, Class Probability, Twoing, Ordered Twoing, Least Squares (LS) and Least Absolute Deviation (LAD). Since Least Squares is the preferred method for regression trees, it was selected for the generation of trees in this study. This approach resulted in the generation of 22 trees with different relative error and complexity (Table 1). Out of the 22 trees generated, the LS splitting model gave an optimal tree with 23 nodes with the minimum resubstitution relative error and minimum complexity (Table 1).
5. Selection of testing criteria: Because the target variable had more distinct values than the number of folds specified, V-fold cross-validation was selected with V = 10 for testing the data. Default parameters, i.e. search intensity and the threshold level for enabling intelligent search, were set at 200 and 15 respectively.
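The least-squares splitting criterion chosen in step 4 can be illustrated with a short sketch. It is not the Salford Systems implementation: the function names, the use of midpoints between distinct values as candidate thresholds, and the toy rainfall/MPI values are all assumptions made for illustration.

```python
from statistics import fmean


def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_least_squares_split(x, y):
    """Return (threshold, sse) for the split 'x <= threshold' with the lowest
    combined squared error of the two child nodes.

    Returns (None, parent SSE) when no split improves on the parent node.
    """
    best_threshold, best_sse = None, sum_squared_error(y)
    distinct = sorted(set(x))
    # Assumption: candidate thresholds are midpoints between distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        sse = sum_squared_error(left) + sum_squared_error(right)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse


# Hypothetical toy data (rainfall in mm, MPI); not taken from the study.
rainfall = [50.0, 100.0, 150.0, 400.0, 500.0, 600.0]
mpi = [0.2, 0.3, 0.25, 5.0, 6.0, 5.5]
threshold, sse = best_least_squares_split(rainfall, mpi)
print(threshold, sse)
```

A regression tree repeats this search over every predictor at every node and keeps the split that lowers the squared error most; cases with values at or below the threshold go to the left child.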
CART generated 22 trees having different number of terminal nodes with different relative error (Table 1).
The optimal tree obtained using the LS method possessed 13 terminal nodes with a cross-validated error of 0.56743 ± 0.05412. A node is partitioned in such a way that the left child node gets all cases with lower values of the splitting variable. Each decision rule is represented as a terminal node in the tree. The tree was further grown one level at a time for comparison of rules and relative cost. The maximal grown tree showed 23 nodes with a relative cost of 0.567.
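The routing of cases to terminal nodes and the notion of relative error can be sketched as follows. The tree, its thresholds, the variable names and the observations below are hypothetical, and the error shown is computed on the same cases used for prediction (resubstitution); a cross-validated estimate would apply the same ratio to held-out folds.

```python
from statistics import fmean

# Hypothetical two-split tree; thresholds and leaf values are illustrative only.
TOY_TREE = {
    "variable": "rainfall", "threshold": 275.0,
    "left": {"prediction": 0.25},
    "right": {
        "variable": "max_temp", "threshold": 30.0,
        "left": {"prediction": 4.0},
        "right": {"prediction": 6.0},
    },
}


def predict(node, case):
    """Route a case down the tree: go left when value <= threshold."""
    while "prediction" not in node:
        branch = "left" if case[node["variable"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


def relative_error(actual, predicted):
    """Squared error of the tree divided by that of always predicting the mean."""
    mean = fmean(actual)
    baseline = sum((a - mean) ** 2 for a in actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return residual / baseline


# Hypothetical cases and observed MPI values; not taken from the study.
cases = [
    {"rainfall": 100.0, "max_temp": 20.0},
    {"rainfall": 400.0, "max_temp": 28.0},
    {"rainfall": 600.0, "max_temp": 33.0},
]
observed = [0.5, 4.0, 7.0]
fitted = [predict(TOY_TREE, c) for c in cases]
print(fitted, relative_error(observed, fitted))
```

A relative error of 1 means the tree does no better than predicting the overall mean MPI; values closer to 0 indicate a better fit.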
The decision rules (IF – THEN) used in this analysis are given in Table 2.
Variable importance of the different predictors is summarized in Table 3.
Disease transmission dynamics are modeled using parameters such as vector (pathogen-transmitting agent) surveillance, parasitic load in the human community and sudden environmental changes. We used data mining tools in CART to find relationships between epidemiological data and the Monthly Parasite Incidence (MPI). These relations are generally hidden in a large dataset. The
These results could be used as a predictive system and also as a 'rule-of-thumb' guide for controlling the transmission of malaria more effectively. The interpretation of the rules is as follows:
Rule # 1. If RAINFALL <= 147.565, RELATIVE HUMIDITY <= 89.305, MINIMUM TEMPERATURE <= 3.5°C & MONTH = APRIL, AUGUST, FEBRUARY, JANUARY, JULY, JUNE, MARCH, MAY, THEN MPI = 0.257817.
Rule # 22. If RAINFALL > 533.55, RELATIVE HUMIDITY > 87.75 and MAXIMUM TEMPERATURE > 37°C & MONTH = JUNE, JULY, AUGUST, THEN MPI = 24.9901.
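The two rules quoted above can be encoded as simple IF-THEN checks. This is a sketch under stated assumptions: the function and constant names are hypothetical, the rainfall condition of Rule #1 is read as an upper bound, and the temperature thresholds are read as 3.5°C and 37°C. The remaining rules of Table 2 are not reproduced.

```python
RULE_1_MONTHS = {"JANUARY", "FEBRUARY", "MARCH", "APRIL",
                 "MAY", "JUNE", "JULY", "AUGUST"}
RULE_22_MONTHS = {"JUNE", "JULY", "AUGUST"}


def apply_rules(rainfall, humidity, min_temp, max_temp, month):
    """Return the predicted MPI from Rule #1 or Rule #22, or None if neither fires."""
    month = month.upper()
    # Rule #1 (assumption: rainfall is an upper bound, minimum temperature 3.5 C).
    if (rainfall <= 147.565 and humidity <= 89.305
            and min_temp <= 3.5 and month in RULE_1_MONTHS):
        return 0.257817
    # Rule #22 (assumption: maximum temperature threshold is 37 C).
    if (rainfall > 533.55 and humidity > 87.75
            and max_temp > 37 and month in RULE_22_MONTHS):
        return 24.9901
    return None


# Hypothetical monsoon-month conditions; not taken from the study data.
print(apply_rules(rainfall=600.0, humidity=90.0, min_temp=25.0,
                  max_temp=38.0, month="July"))  # prints 24.9901
```

Conditions that satisfy neither rule return None here; in the full tree every case falls into exactly one terminal node.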
This is in accordance with the general trend that malaria follows, with very high peaks of incidence in the monsoon months. The lowest MPI is predicted when low rainfall limits the availability of surface water required for mosquito breeding and cold temperature limits vector survival, resulting in negligible malaria incidence. A very high MPI, as reflected in rule #22, is expected when optimum temperature is accompanied by significant rainfall in the monsoon months.
Results indicate that the 6 predictors influenced the target variable in the following descending order: (1) rainfall, (2) relative humidity, (3) month, (4) maximum temperature, (5) rainy days and (6) minimum temperature. This is helpful in ranking the predictor variables. Thus, decision trees can play an important role in the management of vector-borne diseases. The above decision rules will be helpful in assessing the parasite incidence in a particular month in any locality. Hence, by following these decision rules, public health officials can implement appropriate control measures at the appropriate time to reduce the parasite load.
The decision rules obtained by employing CART 5.0 can be used as a prediction tool for any malaria-endemic area in India as well as abroad. The predictive model will have a vital role in estimating the parasite load in the ensuing seasons in the study area. Hence, necessary precautionary measures can be undertaken for successful implementation of control strategies. Further, CART 5.0 could rank the predictor variables according to their level of influence on the target variable. In the present study, RAINFALL, RELATIVE HUMIDITY, MONTH, MAXIMUM TEMPERATURE, RAINY DAYS and MINIMUM TEMPERATURE were found to influence the target variable in descending order. Therefore, it was concluded from this study that data mining tools like CART could be successfully employed for predicting the course of vector-borne diseases.
Authors are grateful to Dr. J.S. Yadav, Director, IICT for his constant support and encouragement.