Evolution Of Epidemiologic Methodology From The Statistical Perspective
M Nurminen
Citation
M Nurminen. Evolution Of Epidemiologic Methodology From The Statistical Perspective. The Internet Journal of Epidemiology. 2002 Volume 1 Number 1.
Abstract
This paper reviews notable developments in theoretical epidemiology from the statistical perspective. [The article draws mainly from the author's experience of the evolution of epidemiologic methodology while working as a statistical researcher at the Finnish Institute of Occupational Health (FIOH), 1972-2002.] It starts with introductory remarks on some basic methods in epidemiologic data analysis and proceeds by emphasizing the importance of the Mantel-Haenszel era in epidemiology. The paper then outlines a shift towards more modern epidemiologic study designs and mentions likelihood-based methods useful for epidemiologic regression analysis. The review further discusses multilevel modeling, which allows different levels of information to be combined in a hierarchical regression analysis. Finally, the paper takes up the statistical 'frequentist' versus Bayesian approaches to causal inference in epidemiologic research.
Basic methods for epidemiologic research
My first professional assignment at the FIOH was a mortality study of workers in an anthophyllite asbestos quarry and mine in Finland.1 In the data analysis, I used the standardized mortality ratio (SMR) as a summary measure of the fatalities. This statistic summarizes the observed survival experience of a cohort relative to that expected from the vital statistics of a 'standard' population. The statistical inference using the SMR is not based solely on empirical observations. Rather, it is founded on the combination of the data with the underlying statistical model, which, by the researcher's choice, is adopted to imitate the stochastic process that generated the data. In the case of the SMR, a viable model assumes that the observed number of deaths follows a Poisson distribution with the force of mortality (the intensity of risk of death) 2 as the single parameter of the model. A test of the hypothesis that there is no excess mortality can be derived as a score statistic from the likelihood function 3 for the intensity of the Poisson probability model. Likelihood-based intervals for the SMR are obtainable as a function of the specified intensity, given that the observed data are regarded as fixed.
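As an illustration of this Poisson model, the following is a minimal sketch (with hypothetical counts, not the original study data) that computes an SMR, the score test of no excess mortality, and a likelihood-ratio-based interval for the SMR:

```python
import numpy as np
from scipy import stats, optimize

observed = 27      # hypothetical observed deaths in the cohort
expected = 15.4    # deaths expected from 'standard' population rates

smr = observed / expected

# Score test of H0: SMR = 1. Under H0 the observed count is
# Poisson with mean equal to the expected count.
z_score = (observed - expected) / np.sqrt(expected)
p_value = 2 * stats.norm.sf(abs(z_score))

# Likelihood-based limits: values of the SMR whose log-likelihood
# lies within chi-squared(1)/2 of the maximum.
def log_lik(theta):
    mu = theta * expected
    return observed * np.log(mu) - mu

def deviance(theta):
    return 2 * (log_lik(smr) - log_lik(theta)) - stats.chi2.ppf(0.95, 1)

lower = optimize.brentq(deviance, 1e-6, smr)
upper = optimize.brentq(deviance, smr, 10 * smr)

print(f"SMR = {smr:.2f}, score z = {z_score:.2f}, p = {p_value:.3f}")
print(f"95% likelihood interval: ({lower:.2f}, {upper:.2f})")
```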
Unfortunately, the simple analysis described above does not extract fully the information contained in a study of the mortality of the asbestos worker cohort. The cohort life-table technique 2 offers a more advanced approach to the description and analysis of the survival experience. This method is a stochastic representation of the process, usefully modeled as a Markov process, in which the health states are realized with death as the final (absorbing) state. The life expectancy at a given age is an easily calculated measure that has a direct probabilistic interpretation. This measure can be communicated to the decision-makers of health policies, for example, in terms of years of life lost due to exposure to asbestos or, alternatively, the years of life gained by the reduction of asbestos exposure following preventive measures and legislation. The results of the analysis can be depicted graphically by comparing the age-specific survival trend of the exposed cohort with that of the general population. One should keep in mind, however, the flattening of the true effect resulting from the so-called 'healthy worker effect':4 the people who keep on working and do not fall into the state of work disability are the surviving fitter ones. We know today that this problem can be handled analytically via the structural equation modeling approach 5 that is rather ingenious, but difficult to grasp.
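A minimal sketch of the cohort life-table calculation (with hypothetical age-specific death probabilities, not the asbestos-cohort data): the remaining life expectancy is obtained by accumulating the survival proportions across age intervals.

```python
import numpy as np

# Hypothetical annual death probabilities q_x for ages 50..89
ages = np.arange(50, 90)
qx = 0.005 * np.exp(0.08 * (ages - 50))   # toy Gompertz-like schedule

# Survivorship l_x with l_50 = 1, and person-years L_x lived in each
# interval (deaths assumed to occur mid-interval on average).
lx = np.concatenate([[1.0], np.cumprod(1 - qx)])
Lx = (lx[:-1] + lx[1:]) / 2

# Remaining life expectancy at age 50 (truncated at age 90)
e50 = Lx.sum() / lx[0]
print(f"e(50) = {e50:.1f} years")
```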
“Prevalence odds = Incidence density × Average duration (of illness)” is a basic intuitive demographic identity that is taken to hold in stationary populations.6 The methodologic folklore of epidemiology has contained inaccurate statements of this relation. Keiding 7 interpreted the term incidence as an intensity (hazard) and prevalence as a probability, and provided a rigorous proof of the relation under the assumption that the average incidence is age-independent. Alho 8 derived a more general version of the relation that permits both the incidence and the discounted disease duration to be age-dependent. The treatment of epidemiologic concepts in terms of mathematical and probabilistic models has strengthened the theoretical basis of the field.
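As a numeric illustration (with hypothetical values): with incidence density I = 0.01 cases per person-year and mean illness duration D = 5 years, the identity gives

```latex
\text{prevalence odds} = I \, \bar{D} = 0.01 \times 5 = 0.05,
\qquad
\text{prevalence} = \frac{0.05}{1 + 0.05} \approx 0.048 .
```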
Developments during the Mantel-Haenszel era in epidemiology
Cornfield 9 was the forerunner of modern epidemiology in the application of biostatistics. His key contribution to the development of the case-control study was to point out two observations. First, the relative risk of developing the illness in exposed persons as compared to non-exposed ones can be approximated, for illnesses with a low incidence, by the ratio of the odds of having been exposed among the illness cases to that among the noncases. The second observation was that this exposure-odds ratio (OR) can be estimated in a case-control study. In order for the OR and other statistics estimated from the data to be unbiased, Cornfield 9 assumed that the case and control groups are representative samples from the case and control study domains in the 'general' population. His solution to the statistical problem of the interval estimation of the OR arising from 'retrospective' studies was based on the likelihood function of two independent binomials for the cases and noncases.10 The same likelihood-based result was derived independently by Fisher.11 This elegant one-page paper, published 40 years after Fisher proposed the idea of likelihood,12 exemplifies how concise a publication can be when it cuts to the core of the problem.
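Cornfield's first observation is easy to verify numerically. A minimal sketch with hypothetical risks, showing that the odds ratio approximates the risk ratio when the illness is rare but diverges when it is common:

```python
def odds(p):
    return p / (1 - p)

for p0 in (0.001, 0.01, 0.20):        # baseline risk in the non-exposed
    p1 = 2.0 * p0                     # exposed risk; true risk ratio = 2
    rr = p1 / p0
    or_ = odds(p1) / odds(p0)
    print(f"baseline risk {p0:5.3f}: RR = {rr:.2f}, OR = {or_:.2f}")

# With p0 = 0.001 the OR is 2.00; with p0 = 0.20 it rises to 2.67,
# illustrating why the 'rarity' assumption was needed.
```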
In a landmark paper, Mantel and Haenszel 13 clarified the relation between case-control (retrospective) and cohort (prospective) studies by observing that the only conceptual difference between these two approaches was that the former involved sampling from the cohort rather than conducting a census of its population.
For the analysis of epidemiologic data in the form of multiple fourfold contingency tables, Mantel and Haenszel 13 derived a Chi-squared statistic with 1 degree of freedom by using an argument that involved conditioning (unnecessarily) on the margins of the tables; Cochran 14 had earlier derived this efficient test using an unconditional formulation. The test is used in a stratified analysis to control for confounding by an extraneous determinant of the disease outcome. Moreover, Mantel and Haenszel 13 gave an estimator of the summary OR parameter across the strata of the confounder. The estimator proved useful for epidemiologists, and it was applied to two different types of data layout: a small number of tables with large frequencies, and a large number of tables with small frequencies (e.g. matched series). However, it took 25 years to develop a simple and robust formula for the interval estimation of the Mantel-Haenszel OR.15 There is also a summary estimator of the common risk ratio (RR) for cohort studies that is completely analogous to the Mantel-Haenszel OR estimator and almost as efficient as the corresponding iterative maximum likelihood estimator.16
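A minimal sketch of the Mantel-Haenszel summary OR for stratified fourfold tables, together with a robust variance formula of the kind referred to above (here the Robins-Breslow-Greenland estimator; the tables are hypothetical):

```python
import numpy as np

# Each stratum: (a, b, c, d) = exposed cases, unexposed cases,
# exposed noncases, unexposed noncases (hypothetical data).
strata = [(12, 5, 50, 60), (8, 3, 40, 45), (4, 2, 22, 30)]

R = S = PR = PS_QR = QS = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    Ri, Si = a * d / n, b * c / n            # stratum contributions
    Pi, Qi = (a + d) / n, (b + c) / n
    R, S = R + Ri, S + Si
    PR += Pi * Ri
    PS_QR += Pi * Si + Qi * Ri
    QS += Qi * Si

or_mh = R / S
# Robins-Breslow-Greenland variance of log(OR_MH)
var_log = PR / (2 * R**2) + PS_QR / (2 * R * S) + QS / (2 * S**2)
se = np.sqrt(var_log)
lo, hi = or_mh * np.exp(-1.96 * se), or_mh * np.exp(1.96 * se)
print(f"OR_MH = {or_mh:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```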
The Mantel-Haenszel procedure is simple and free of restrictive assumptions, and it yields a consistent estimate, that is, one that converges in probability to the true risk parameter as the sample size increases. The paper 13 had a huge impact and remains widely cited: from 1974 to 2002 it received over 5,700 citations, and it continues to be cited at a rate of about 160 per year (source: Institute for Scientific Information, Web of Science).
A shift towards modern epidemiology
In the cohort sampling scheme, according to traditional statistical theory,17 independent representative samples are drawn from the exposed and non-exposed populations (in statistical lingo: 'infinite super-populations'). As described above, the classic case-control study inverted the cohort design by drawing independent random samples from the sub-populations of cases and noncases. In a remarkable paper, Miettinen 6 demonstrated that the estimation of the OR in case-referent studies of incidence rates can be done without any assumption about the 'rarity' of the illness. For the derivation, he abandoned the classic sampling model for case-referent studies. Instead, a modern epidemiologist designs the study base by choosing from the source population the relevant experience that (s)he desires to study. The study population is either a cohort (closed) population or a dynamic one (open, with population turnover), and the researcher's task is to record the cases of illness that arise in the base population and to draw a reference sample from the study base.18 The cases and the referents are then classified by the categories of the etiologic determinant. The case series provides the numerators of the compared rates, whereas the referent series provides the denominators. Since then, this (case-base) design option in epidemiologic research has become the model that underlies many modern variants of the case-referent study. Examples of designs with efficient sampling of the referents include the nested case-referent design,19 the case-cohort design,18,20 the two-stage sampling design,21,22 and different case-pseudocontrol sampling designs.23 Rothman 24 has concluded, “
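The possibility of estimating the rate ratio without a rarity assumption can be illustrated by simulation. In this minimal sketch (all parameters hypothetical), referents are sampled from the risk sets at the case diagnosis times (incidence-density sampling), and the exposure-odds ratio then recovers the incidence-rate ratio even though the illness is common:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 20_000, 5.0                   # cohort size, years of follow-up
exposed = rng.random(n) < 0.5
rate = np.where(exposed, 0.10, 0.05)   # true rate ratio = 2, common illness
t = rng.exponential(1 / rate)          # event times

case = t <= tau                        # cases arising in the study base
case_times = t[case]

# Incidence-density sampling: one referent per case, drawn from
# those still at risk (event-free) at the case's diagnosis time.
order = np.argsort(t)
t_sorted, exp_sorted = t[order], exposed[order]
ref_exposed = []
for tc in case_times:
    k = np.searchsorted(t_sorted, tc, side='right')
    ref_exposed.append(exp_sorted[rng.integers(k, n)])
ref_exposed = np.array(ref_exposed)

a, b = case[exposed].sum(), case[~exposed].sum()   # exposed/unexposed cases
c, d = ref_exposed.sum(), (~ref_exposed).sum()     # exposed/unexposed referents
print(f"exposure-odds ratio = {(a * d) / (b * c):.2f}  (true rate ratio 2.0)")
```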
Likelihood-based inference on epidemiologic parameters
Modern approaches to the analysis of epidemiologic data originate from the development of likelihood inference based on explicit probability models.12 Fisher 26 introduced likelihood inference for the OR parameter in a fourfold table, for which he assumed an extended hypergeometric distribution. Likelihoods for the risk difference (RD) and the RR can be modeled in a similar manner.27 The extended definitions of likelihood assume multiple formulations: conditional likelihood,28,29 partial likelihood,30 marginal likelihood,28,31 quasi-likelihood,32 and profile likelihood.27 (Appendix C)
In the early 1980s, Prof. Olli S. Miettinen, among others, developed statistical methodology for epidemiology. His work culminated in the publication of the textbook Theoretical Epidemiology.33 In this book he considers the comparative analysis of epidemiologic rates, in terms of the RD, RR, and OR parameters, both for stratified data and under a regression model. The likelihood-based inference on the comparative parameters provided a unified approach to significance testing and parameter estimation. The relative theoretical merits of the Fieller-type,34 likelihood score, and likelihood ratio statistics were examined.35,36 Simulation studies 38,39,40,41,42 have shown that the proposed (asymmetric) interval estimation method, with a constrained maximum likelihood estimate of the variance, performs better in small samples than the usual asymptotic intervals in terms of the actual confidence level.
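A minimal sketch of such an asymmetric score interval for the risk difference, in the spirit of the constrained-variance method just described (hypothetical data, with a brute-force grid and numerical constrained maximization in place of the original closed-form solution):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x1, n1, x2, n2 = 18, 50, 9, 50           # hypothetical cases/totals
p1_hat, p2_hat = x1 / n1, x2 / n2
z = norm.ppf(0.975)

def score(delta):
    """Score statistic for H0: p1 - p2 = delta, with the variance
    evaluated at maximum likelihood estimates constrained to H0.
    (The original method also rescales the variance by N/(N-1).)"""
    eps = 1e-9
    def neg_loglik(q):
        p1 = np.clip(q + delta, eps, 1 - eps)
        p2 = np.clip(q, eps, 1 - eps)
        return -(x1 * np.log(p1) + (n1 - x1) * np.log(1 - p1)
                 + x2 * np.log(p2) + (n2 - x2) * np.log(1 - p2))
    lo, hi = max(eps, -delta), min(1 - eps, 1 - delta)
    p2_t = minimize_scalar(neg_loglik, bounds=(lo, hi), method='bounded').x
    p1_t = p2_t + delta
    var = p1_t * (1 - p1_t) / n1 + p2_t * (1 - p2_t) / n2
    return (p1_hat - p2_hat - delta) / np.sqrt(var)

grid = np.linspace(-0.999, 0.999, 4001)
accept = [d for d in grid if abs(score(d)) <= z]
print(f"RD = {p1_hat - p2_hat:.3f}, "
      f"95% score interval ({min(accept):.3f}, {max(accept):.3f})")
```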
Approaches to epidemiologic regression analysis
Regression analysis encompasses a vast array of techniques.43 , 44 A large variety of extensions to the linear regression model are available today for epidemiologists. In what follows only the basic methods used for modeling in epidemiology will be mentioned.
A logistic regression model can be applied to investigate the simultaneous effects of variables on disease risk. The response can be binary or ordinal-scaled. Several exposure variables, effect-modifiers and confounding factors may be accommodated. The methodology was developed in the 1960s for the needs of large cohort studies on cardiovascular disease, particularly the Framingham study in the USA. Statistically, the methods were derived using the discriminant function 45 and maximum likelihood 46 approaches. The logistic method has been applied in many other fields. In the occupational health field, Alho 47 developed a conditional logistic estimation procedure to solve a dual registration problem of the occupational disease registry at the FIOH.
The logistic model can be used to analyze case-referent data even if no external information is available to allow estimation of risks in the source population. Prentice 48 used Cornfield's 10 classic sampling model when he presented a binary logistic regression for case-referent data. The outcome parameter was the probability of having been exposed to a risk factor, and the illness status was entered as an explanatory variable in the regression equation. Although the causal relation was inverted in this model, it allowed the estimation of the OR as an exponential function of the model coefficients.
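A minimal sketch of logistic modeling of case-referent data (simulated, hypothetical data; assuming the statsmodels package), in which the exponentiated coefficient of the exposure term estimates the OR:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 20_000
exposure = rng.integers(0, 2, n).astype(float)
# Hypothetical source population: true exposure OR = exp(0.7) ~ 2
logit = -2.5 + 0.7 * exposure
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Case-referent sample: all cases plus an equal number of noncases
cases = np.flatnonzero(y)
refs = rng.choice(np.flatnonzero(~y), size=cases.size, replace=False)
idx = np.concatenate([cases, refs])

X = sm.add_constant(exposure[idx])
fit = sm.Logit(y[idx].astype(float), X).fit(disp=0)
# Outcome-dependent sampling distorts only the intercept; the
# exposure coefficient still estimates the log OR.
print(f"estimated OR = {np.exp(fit.params[1]):.2f} (truth = 2.0)")
```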
Epidemiologists have somewhat neglected the sensitivity of the maximum likelihood parameter estimates to model misspecification. If one posits a logistic model for the disease rates in the population that depends linearly on the determinants, but the true model form is quadratic, the regression coefficients estimated from the case-referent sample may differ markedly from the coefficients that one would estimate from a cohort study of the same population.49 For small or unbalanced data sets, and for highly stratified data, the asymptotic maximum likelihood methods are unreliable for parameter estimation. In these situations, the software package LogXact 50 can be used to compute exact logistic regression.
Cox's 51 regression model, or the proportional hazards model, is based on the notion of partial likelihood,30 and it is applicable to the analysis of survival data or event-history data.52 The model is semi-parametric in that the hazard, or momentary risk, depends on time non-parametrically, whereas the hazard ratio is a parametric function of the covariates. Owing to computational difficulties, the method was seldom used in the 1970s, but today it is applied routinely. In Finland, Hakulinen 53,54 has developed analytic methods and computing procedures of survival analysis for studies in cancer epidemiology.
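A minimal sketch of fitting the proportional hazards model (hypothetical data; assuming the lifelines package):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 1000
exposure = rng.integers(0, 2, n)
age = rng.uniform(40, 65, n)
# Hypothetical hazards: exposure doubles the baseline hazard
hazard = 0.02 * np.exp(np.log(2) * exposure + 0.03 * (age - 50))
time = rng.exponential(1 / hazard)
event = time <= 10.0                    # administrative censoring at 10 years
df = pd.DataFrame({"time": np.minimum(time, 10.0),
                   "event": event.astype(int),
                   "exposure": exposure, "age": age})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.summary[["coef", "exp(coef)"]])   # exp(coef) = hazard ratios
```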
The log-linear (or exponential) risk model 56 is one of the most effective approaches to the analysis of count (or aggregate) data, and especially to the study of interdependencies. For example, the model was fitted to the 15-year follow-up data of a cohort of Finnish workers exposed to carbon disulfide.58 This intervention study was designed and conducted by Hernberg.56 For the analysis, the follow-up period was divided into five subperiods, within which the deaths from ischemic heart disease were assumed to occur according to time-homogeneous Poisson processes. A piecewise exponential model was fitted to the data; it indicated that the declining trend in mortality reflected the reduced levels of carbon disulfide exposure.58
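A minimal sketch of such a piecewise exponential fit, expressed as a Poisson regression with a person-years offset (hypothetical aggregate data; assuming the statsmodels package):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical deaths and person-years in five follow-up subperiods
deaths = np.array([14, 12, 9, 7, 5])
pyears = np.array([4800, 4600, 4400, 4100, 3900])
period = np.arange(5)                   # 0..4, a linear trend term

X = sm.add_constant(period.astype(float))
fit = sm.GLM(deaths, X, family=sm.families.Poisson(),
             offset=np.log(pyears)).fit()

# exp(slope) = rate ratio per subperiod; < 1 indicates declining mortality
print(f"rate ratio per period: {np.exp(fit.params[1]):.2f}")
```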
Modern ('smooth') regression methods,59 such as additive models and scatterplot smoothers as well as projection pursuit regression, are powerful tools, for example, for detecting nonlinearities in the data. However, they are computer-intensive, and their distribution theory offers little support for formal inference. One should be cautious in applying these new methods, because it is very easy to over-fit models and over-interpret features of the data.
Multilevel modeling and hierarchical regression
Many statistical problems involve multiple parameters. There is a need to reflect the complexity of the observed data and the different patterns of heterogeneity, dependence, mismeasurement, and so forth. In epidemiology, multiple parameters are involved in analyses of: the 'subject effect' in growth curves; 'frailty' in correlated or familial survival data; the 'center effect' in multi-center studies; risk ratios for a disease outcome in different areas or time periods; and risk ratios for different tumor sites in toxicological studies. In occupational and clinical epidemiology, the analysis of longitudinal data or repeated measurements 60 involves multiparametric modeling. Environmental epidemiology uses methods such as ecologic analysis, time-series analysis, and quantitative risk assessment for linking data on the environment and health.61 The relations are often complex and fraught with uncertainties. When a model has many parameters, we may consider them a sample from some distribution. In this way, we model the parameters with another set of ('hyper'-)parameters and build a model with different levels of hierarchy.
Greenland 62 argues that regression models with random coefficients offer a more scientifically defensible framework for epidemiologic analysis than the fixed-effects models now prevalent in epidemiology. The data often consist of multiple levels that have effects on the results. For example, in the study of disease outcomes, there are patients involved (level 1) who are treated by physicians (level 2) who, in turn, are working in different hospitals (level 3). The characteristics of each may influence health outcomes, such as the patient's level of education, the physician's practice style, and the hospital's level of technical equipment. Sometimes characteristics from different levels influence each other to produce a certain outcome.
Conventional statistical methods assume that the observations are independent of each other. In a hierarchy, the observations within the same subpopulation are usually alike in some respects, that is, the data are correlated. The so-called multilevel models, or hierarchical regressions, offer a more realistic and flexible description of the factors that create uncertainty than do fixed-effects models. An advantage of regression with random coefficients is that it can be used to solve the often-encountered problem of under-identification of causal effects in epidemiologic data. The approach is to stochastically constrain the analysis by imposing a distribution on some parameters. The analysis of the data can be done on the individual level or on a higher, aggregate level, depending on the objective of the study. Theoretically, multilevel modeling is well suited to analyzing the influence of macrolevel contexts on microlevel behavior. Statistically, hierarchical analysis solves the problems that occur when we either aggregate the data to a single, higher level (loss of information) or disaggregate the data to the lower level (overestimated precision).
Multilevel modeling can greatly increase the statistical precision and robustness of a data analysis. A hierarchical regression is modeled in two stages. First, an ordinary (e.g. logistic) regression model is written for the effects of the fixed parameters. In the second stage, a random distribution is defined for some of the parameters of the first-stage model, for example, to describe the presence of error in the exposure measurement. Combining the stage 1 and stage 2 models yields a mixed model with coefficients both for fixed effects and for random effects.
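A minimal sketch of a mixed (hierarchical) regression with a random intercept for the grouping level (hypothetical patient-within-hospital data; assuming the statsmodels package):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
hospitals, per_hosp = 20, 50
hosp = np.repeat(np.arange(hospitals), per_hosp)
u = rng.normal(0, 0.5, hospitals)           # random hospital effects
x = rng.normal(size=hosp.size)              # patient-level covariate
y = 1.0 + 0.3 * x + u[hosp] + rng.normal(0, 1.0, hosp.size)
df = pd.DataFrame({"y": y, "x": x, "hospital": hosp})

# Stage 1: linear model for the fixed effects; stage 2: a normal
# distribution imposed on the hospital-specific intercepts.
fit = smf.mixedlm("y ~ x", df, groups=df["hospital"]).fit()
print(fit.summary())
```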
Multilevel models can also be estimated using a Bayesian analysis.63 The Bayesian approach provides a natural framework to handle models of almost arbitrary complexity. There are many applied situations in which multilevel models and Bayesian estimation methods allow better analyses than more traditional methods. In a way, the hierarchical approach unifies the traditional and Bayesian methods.64
Statistical inference in epidemiologic research
Conventional ('frequentist') statisticians think of probabilities as frequencies observed in the long run of repeated experiments. Epidemiologic studies generally concentrate on nonexperimental research into causality in the health field. In these types of studies there is little or no need for random sampling in the selection of the study base. However, randomization is needed for causal inferences based on conventional statistics, for example, in the study of the intended effects of medical interventions in clinical trials.65 In the context of nonrandomized studies, Greenland 66 has questioned the interpretation of probabilistic measures such as the p-value 67 and the confidence interval as summaries of the variability of results stemming from unidentified confounders.68 An unknown distribution of confounders cannot safely be assumed to be equivalent to what randomization would produce. According to this view, these statistics are merely rough descriptors of data variability. Causal inference should then concern: (i) the search for explanations of the patterns recognized in the data by statistical methods, and (ii) the criticism of proposed theories about the physical mechanisms that generated the data.69
There are many alternative methods available for the description of variation in the data. In influence analysis, for example, some observations can be left out of the analysis to see how much the results change. In sensitivity analysis, the model assumptions can be altered systematically to see whether the results are prone to change or whether they remain fairly similar. The uncertainty in the model specification can be reduced by the use of robust procedures. In random-effects modeling, one can enter variables into the model to stochastically limit, for example, the effects of measurement error. Semiparametric methods such as the generalized additive model 70 allow epidemiologists to visualize their data in novel ways, especially in the presence of nonlinear associations, leading to new insight and new hypotheses.
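A minimal sketch of such a leave-one-out influence analysis, here for a sample mean (hypothetical data; the same loop applies to any estimator):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.lognormal(mean=0.0, sigma=1.0, size=30)   # skewed toy data

full = data.mean()
# Re-estimate with each observation deleted in turn
loo = np.array([np.delete(data, i).mean() for i in range(data.size)])
influence = loo - full

print(f"full estimate {full:.3f}; largest single-observation shift "
      f"{influence[np.abs(influence).argmax()]:+.3f}")
```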
In Bayesian statistics, probability is used as the fundamental measure of uncertainty. Probabilities are interpreted as subjective beliefs, which are modified (according to Bayes' rule) as new information accumulates. Technically, prior information is combined with the data at hand, and the result is presented in the form of a posterior distribution. Epidemiologists previously seldom used Bayesian methods because they felt that the specification of the prior distribution was difficult, and even today few epidemiologists apply them. Finding Bayesian solutions presents a challenge even for simple problems; the exact Bayesian analysis of the comparative epidemiologic parameters RD, RR, and OR in a two-by-two table furnishes an example.71
Empirical Bayesian analysis is an alternative approach in which the prior distribution is most easily specified as conjugate to the distribution of the data, and the parameters of the conjugate prior are estimated from the data. The Bayesian framework offers a possibility for the hierarchical modeling of case-referent studies that can be extended to deal with any number of categorical or discretized continuous exposure variables, and to identify suitable prior distributions.72 One can also perform a semi-Bayesian analysis by specifying some features of the prior distribution from existing knowledge and estimating the remaining parameters from the data. In an epidemiologic application, for example, one can insert background information on relative risks into conjugate prior distributions.73
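A minimal sketch of an exact, simulation-based Bayesian analysis of a two-by-two table with conjugate beta priors (hypothetical counts and priors):

```python
import numpy as np

rng = np.random.default_rng(6)
x1, n1, x2, n2 = 18, 50, 9, 50          # hypothetical exposed/unexposed data

# Conjugate Beta(1, 1) priors; the posteriors are Beta(x + 1, n - x + 1)
p1 = rng.beta(x1 + 1, n1 - x1 + 1, size=100_000)
p2 = rng.beta(x2 + 1, n2 - x2 + 1, size=100_000)

rr = p1 / p2
or_ = (p1 / (1 - p1)) / (p2 / (1 - p2))
for name, draws in [("RR", rr), ("OR", or_)]:
    lo, med, hi = np.percentile(draws, [2.5, 50, 97.5])
    print(f"{name}: median {med:.2f}, 95% credible interval ({lo:.2f}, {hi:.2f})")
```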
In a Bayesian analysis, to produce exact results from the posterior distribution it is often necessary to evaluate integrals over high-dimensional parameter spaces, and this can be computationally intractable. However, newer computer programs such as AD Model Builder 74 provide feasible approximations to these integrals in the form of a profile likelihood. The profile likelihood can then be used to estimate extreme values such as the tails of Bayesian credible intervals. The program also supports Markov chain Monte Carlo (MCMC) simulation for an 'exact' Bayesian analysis. The development of powerful MCMC methods has meant that computational issues are no longer a major obstacle to Bayesian inference. Nevertheless, model convergence must be checked carefully, for example, when using the BUGS (Bayesian Inference Using Gibbs Sampling) program (http://www.mrc-bsu.cam.ac.uk/bugs).
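A minimal sketch of the MCMC idea: a random-walk Metropolis sampler for the log-odds of a binomial risk under a normal prior (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
x, n = 18, 50                      # hypothetical cases out of n subjects

def log_post(theta):
    """Log posterior of the log-odds: binomial likelihood + N(0, 2^2) prior."""
    p = 1 / (1 + np.exp(-theta))
    return x * np.log(p) + (n - x) * np.log(1 - p) - theta**2 / (2 * 2**2)

theta, draws = 0.0, []
for _ in range(20_000):
    prop = theta + rng.normal(0, 0.5)          # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop                           # accept; otherwise keep theta
    draws.append(theta)

draws = np.array(draws[5_000:])                # discard burn-in
p_draws = 1 / (1 + np.exp(-draws))
print(f"posterior median risk {np.median(p_draws):.2f}, "
      f"95% credible interval {np.percentile(p_draws, [2.5, 97.5]).round(2)}")
```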
Epidemiology has its limitations: there is often not enough variation, for example, in many lifestyle factors within the studied population to observe RRs of a magnitude sufficient to overcome measurement error and confounding bias.75 Effective solutions may be sought in randomized intervention programs, but these can be prohibitively costly and difficult to design in nonexperimental settings. In purely observational studies, one can make better inferences by thinking about the causal relations among the variables and by integrating causal structures into the data analysis. For example, if one wants to estimate the probability of causation for individuals in cases of liability, it is important to explicitly specify the underlying biologic model that has been assumed.76 Such methods include instrumental-variable analysis as used in econometrics,77 Rubin's 78 causal model, Robins's 5 G-computation algorithm for longitudinal data, and Pearl's procedures for causal reasoning based on directed acyclic graphs.79,80 A sketch of the instrumental-variable idea follows.
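In this minimal two-stage least squares sketch (simulated, hypothetical data), an instrument that affects the exposure but not the outcome directly recovers the causal effect despite an unmeasured confounder:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
u = rng.normal(size=n)                 # unmeasured confounder
z = rng.normal(size=n)                 # instrument (affects exposure only)
x = 0.8 * z + u + rng.normal(size=n)   # exposure, confounded by u
y = 0.5 * x + u + rng.normal(size=n)   # outcome; true causal effect = 0.5

# Naive regression of y on x is biased by the confounder u
naive = np.polyfit(x, y, 1)[0]

# Stage 1: predict x from z; stage 2: regress y on the prediction.
# Equivalently, the IV (Wald) estimator cov(z, y) / cov(z, x).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"naive estimate {naive:.2f}, IV estimate {iv:.2f} (truth 0.5)")
```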
Future challenges of biostatistics in epidemiology
Biostatisticians have long contributed to the conceptualization, development, and successful use of epidemiologic methods for the study of disease causation and prevention. The International Biometric Society was established as early as 1947.81 The Finnish Biostatistical Society was founded 40 years later, in 1987. The activity of the Biostatistical Society has fostered the application of statistical and mathematical methods in epidemiology, medicine, and biology in Finland. Especially noteworthy is the work carried out at the Research Division of Biometry of the Rolf Nevanlinna Institute for Mathematics of the University of Helsinki, under the leadership of Professor Elja Arjas, on the application of Bayesian statistical inference and MCMC methods.82
An extensive coverage of the statistical aspects of most areas of established epidemiologic methods, including more recent developments, is contained in the Encyclopedia of Epidemiologic Methods.83 There are, nevertheless, two current sources of concern.84 The first is the seemingly irreversible over-mathematization of biostatistics. This trend is reflected in journals such as Biometrika and Biometrics, which initially set out to be comprehensible to less academic practitioners; newer journals such as Statistics in Medicine and Biostatistics are more application-oriented. The second concern is that the evolution of biostatistics, which relies increasingly on important contributions from computing, can lead to an over-emphasis of theory at the expense of practice in the teaching of epidemiologic methods to researchers. Although theory may be the best guide to practice, the stress in the application of biostatistics should be on the prefix 'bio'.
Acknowledgments
I wish to thank Tuula Nurminen for her methodologic support in the preparation of this article and Terttu Kaustia for the English language revision.
Correspondence to
Dr Markku Nurminen, Department of Epidemiology and Biostatistics, Finnish Institute of Occupational Health, Topeliuksenkatu 41a A, FIN-00250 Helsinki, Finland. Phone: +358 9 4747 2408 Fax: +358 9 4747 2423 Email: markku.nurminen@ttl.fi