J Reed III
analysis-of-variance, randomized block analysis-of-variance, repeated measures analysis-of-variance, single-factor analysis-of-variance
J Reed III. Analysis of Variance (ANOVA) Models in Emergency Medicine. The Internet Journal of Emergency and Intensive Care Medicine. 2003 Volume 7 Number 2.
Four basic analysis of variance (ANOVA) designs are reviewed in this paper. Hypothetical examples drawn from the literature illustrate a single-factor ANOVA design, two repeated measures ANOVA designs, and a randomized block ANOVA design. ANOVA designs provide "life beyond t-tests". These ANOVA designs are not by themselves a cure-all for all research problems. By carefully considering treatment combination options, possible confounding factors, and subject availability, researchers can build an efficient clinical trial around one of these ANOVA designs.
Consider a study in which two new treatments (A and B) are being compared with a control group (C). One way to compare outcomes would simply be to compare A with C, B with C, and A with B using three t-tests. If we were to compare four groups, we would need six t-tests. The difficulty with using multiple t-tests is that as the number of groups increases, so does the likelihood of finding a difference between some pair of groups simply by chance when no real difference exists. If we perform a single test, the experiment-wise error rate (EWER) equals the Type I error rate of that test. This significance level, set by the researcher (typically at 0.05), is a measure of the risk the researcher is willing to accept of making a Type I error.
When multiple t-tests are performed, the risk of committing at least one Type I error (the EWER) rises rapidly as the number of tests increases: $\mathrm{EWER} = 1 - (1 - \alpha)^{t}$, where $t$ is the number of statistical tests performed and the tests are assumed to be independent. For instance, if we were to perform 3 separate t-tests, each at α = 0.05, the EWER is 0.14. The probability of making a Type I error (saying that there is a difference when there is none) has increased to 14 in 100. As the number of multiple t-tests increases, the EWER increases dramatically, as shown in Table 1.
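The EWER calculation can be sketched in a few lines of Python. This is a minimal illustration that assumes the tests are independent; the function names `pairwise_tests` and `ewer` are ours and do not come from any statistical package.

```python
from math import comb


def pairwise_tests(k: int) -> int:
    """Number of pairwise t-tests needed to compare k groups."""
    return comb(k, 2)


def ewer(t: int, alpha: float = 0.05) -> float:
    """Experiment-wise error rate for t independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** t


if __name__ == "__main__":
    # Tabulate the EWER as the number of groups (and hence tests) grows.
    for k in range(2, 7):
        t = pairwise_tests(k)
        print(f"{k} groups -> {t:2d} t-tests -> EWER = {ewer(t):.3f}")
```

With three groups the sketch needs 3 tests and returns an EWER of about 0.14, matching the figure quoted above; with four groups it needs 6 tests.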
Obviously, multiple t-tests are not a good analytical strategy. One solution to the EWER problem is to use analysis-of-variance (ANOVA) methods. ANOVA methods maintain the EWER at a preselected α. When the null hypothesis is rejected, the conclusion is that there is a statistical difference somewhere among the treatment groups. However, the researcher does not know exactly where the difference is. Post-hoc or follow-up tests allow detection of the differences while maintaining the researcher's predetermined α level.
The objective of this paper is to review four basic ANOVA designs: a simple randomized design ANOVA (SRD), a single-factor repeated measures design ANOVA (RMD), a two-factor ANOVA with repeated measures on one factor, and a randomized block design ANOVA (RBD). The computations are efficiently handled by statistical algorithms that are readily available to the researcher. It is more important to be aware of the design choices and to ensure that the analytical method matches the study design.
Simple Randomized Design (SRD ANOVA)
To test a null hypothesis of equal means among k (k ≥ 3) experimental groups, or samples, a SRD should be used. In its simplest form, several groups (the levels of a single factor) are compared on the same dependent variable. For example, treatment is a factor, with the various types of treatment comprising its levels. The levels should be mutually exclusive; that is, an experimental unit (for example, a patient) should appear under only one level. The terms
An acute traumatic aortic injury (ATAI) typically results in several characteristic chest radiographic findings, most notably mediastinal widening. A study was designed to test the hypothesis that blood or fluid in the widened mediastinum might track up into the neck and be detected on lateral cervical radiographs (Plewa, 1997). Thirteen consecutive adult cases of ATAI were identified and compared with 19 cases of negative aortography (NAO) and 18 multiple trauma victims (MT) without aortography. Measurements included the cervical soft-tissue (ST) width at the third (C3) and sixth (C6) cervical vertebrae and the mediastinal width. The C3 ST averages were 9.1 ± 2.8 mm, 8.5 ± 2.7 mm, and 6.9 ± 2.2 mm for the ATAI, NAO, and MT groups (p = 0.188). The C6 ST averages were 19.2 ± 4.5 mm, 18.6 ± 3.9 mm, and 16.5 ± 3.5 mm for the ATAI, NAO, and MT groups (p = 0.148). The cervical measurements did not differ significantly between groups. The authors concluded that cervical ST swelling is not a useful marker for ATAI.
McCullough and colleagues evaluated the performance of a rapid point-of-care assessment of multiple cardiac markers of myocyte injury in patients with a broad spectrum of renal dysfunction (McCullough, 2002). Baseline renal function in 808 patients was stratified by corrected creatinine clearance into quartiles, with those on dialysis forming a separate group. A multi-marker panel of biomarkers, myoglobin (MYO), cardiac troponin I (cTnI), and creatine kinase myocardial band (CK-MB), was evaluated at 0, 1.5, 3, and 9 hours. Mean MYO levels were elevated in the presence of renal dysfunction in those with and without acute myocardial infarction (AMI). There was a linear trend for MYO to become elevated as renal dysfunction worsened, both in those who ruled in and in those who ruled out for AMI, at timepoints of 90 minutes and beyond. Mean MYO values for those in quartile 4 and those with end-stage renal disease (ESRD) fell above the optimal cutoff point of 200 μg/L at all timepoints in those without AMI. At all timepoints of 90 minutes and beyond, mean cTnI was below the detection limit for those who ruled out for AMI, including those with ESRD. The values for CK-MB were markedly elevated at all timepoints for all subgroups with AMI and were below the detection limit of 0.6 ng/mL for quartiles 1-4 without AMI.
Janda et al. (Janda, 2001) reported on sliding injuries commonly associated with softball and baseball. This study quantified the biomechanical stress responses of a test dummy in terms of the force experienced by the ankle (Fx), foot (Fy), and tibia/fibula (Fz), and the moment in the motion plane of inversion/eversion (Mx), on eight different types of bases (Standard Base, Rogers Pro, Rogers Adult, Rogers Teen, Rogers Youth, Meg-Net Large, Meg-Net Small, and Stay Down). The breakaway bases (Youth, Teen, Adult, and Pro) were the only bases that led to a statistically significant reduction in Fx, Fy, and Mx. Fy was larger for the Stay Down and Meg-Net Large bases than for the stationary base, while the remainder of the breakaway bases generated forces similar to the stationary base.
SRD: Hypothesis (equality of means)
We borrowed elements of the Janda study (Janda, 2001) to generate a hypothetical data set for three base types (Standard Base, Rogers Teen, and Rogers Youth) that might be encountered in Little League baseball. The question is whether these three bases differ in ankle stress (Fx) when a Little Leaguer slides into second base. Our hypothetical data and basic descriptive statistics, including Fx means ± standard deviations for each base type, are presented in Table 2B.
From Table 2B, the mean Fx values of the three bases appear to be very different. A box-plot of the data (Figure 1) shows that the Standard Base has the largest ankle stress measurements and that the variability in ankle stress within each base is not large, so the mean values are a reasonable representation of the ankle stress for each base. It also appears that the mean ankle stress values of the Rogers Teen and Rogers Youth bases are nearly equal. A SRD procedure for testing the equality of means quantifies these observations. A SRD table for these hypothetical data is:
The test statistic (F) is 313.3 and has an associated p-value of 0.0001. This indicates that at least two bases differ in ankle force (Fx). The next step in the analysis plan is to find those differences using a post-hoc testing procedure that preserves the EWER at α = 0.05.
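As a rough sketch of how the F statistic in a SRD table is obtained, the following Python function partitions the variability into between-group and within-group sums of squares. The function name and the small data set are invented for illustration; they are not the values in Table 2B. The p-value requires the F distribution and is left to a statistical package.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Illustrative numbers only (three groups of three observations).
    demo = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    f_stat, df_b, df_w = one_way_anova(demo)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 13.00
```

A large F means the group means are spread out far more than the scatter within groups would explain, which is the pattern the box-plot suggests for the three bases.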
There are a number of post-hoc tests for comparing all possible pairs of means in a SRD ANOVA that preserve the EWER (for example, Scheffé, Student-Newman-Keuls, and Tukey). These procedures are generally conservative and are provided as options by most statistical packages. For this example, all three methods produced the same set of post-hoc results. Those results are typically given in table form as follows:
The post-hoc analysis indicates that there are two distinct subsets of bases: the two Rogers bases and the Standard Base. Further, there is no difference between the Rogers Youth and Rogers Teen bases with regard to the stress exerted on the ankle of a Little Leaguer sliding into second base.
Repeated Measures Design (RMD) - Single Factor
Since no two patients are alike in all respects, individual responses to treatment may demonstrate relatively large variability. If these differences between subjects can be separated from treatment effects and experimental error, then the sensitivity of the experiment may be increased. If this source of variability cannot be estimated, it remains part of the uncontrolled variability and is considered part of the experimental error, as in the SRD. The primary purpose of experiments in which the same subject is observed under each of the treatments (factors) is to provide a control on differences between subjects. In this type of experiment, treatment effects for subject
Experimental designs in which the same subjects (elements) are used under all
RMDs are flexible and have many different labels. For example, a one-way repeated measures ANOVA is also known as a one-factor within-subjects ANOVA or a treatments-by-subjects ANOVA. Two RMD studies are briefly described in the following paragraphs.
The primary difference between the SRD and the RMD has to do with individual differences. The RMD is able to remove variance due to individual differences from the denominator of the F-ratio. In the first stage, the total variability is partitioned into two components: variance between treatments and variance within treatments. This stage is identical to the SRD. The second stage partitions the within-treatments variance into a between-subjects variance and an error variance. The general notation and data structure for a RMD ANOVA is shown in Table 3A.
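The two-stage partition can be sketched in Python as follows. This is a minimal illustration for complete data (every subject measured under every treatment); the function name, the dictionary keys, and the tiny data set are our own assumptions and are not the values in Table 3B.

```python
def repeated_measures_anova(data):
    """data[s][t] is the score of subject s under treatment t (no missing cells)."""
    n = len(data)       # number of subjects
    k = len(data[0])    # number of treatments
    grand = sum(sum(row) for row in data) / (n * k)
    treat_means = [sum(row[t] for row in data) / n for t in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Stage 1: between treatments versus within treatments (as in the SRD).
    ss_treat = n * sum((m - grand) ** 2 for m in treat_means)
    ss_within = ss_total - ss_treat
    # Stage 2: within treatments = between subjects + error.
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_within - ss_subj

    df_treat = k - 1
    df_error = (n - 1) * (k - 1)
    ms_treat = ss_treat / df_treat
    return {
        "ss_treatments": ss_treat,
        "ss_subjects": ss_subj,
        "ss_error": ss_error,
        "F": ms_treat / (ss_error / df_error),
        # For comparison: the F-ratio if subject differences stayed in the error.
        "F_ignoring_subjects": ms_treat / (ss_within / (k * (n - 1))),
    }


if __name__ == "__main__":
    # Illustrative numbers only: 3 subjects, each measured under 3 treatments.
    print(repeated_measures_anova([[1, 2, 3], [2, 3, 7], [3, 4, 5]]))
```

For these illustrative numbers the F-ratio is 7.0 once the between-subjects variability is removed, compared with 3.5 when it is left in the error term.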
Sufficient diagnostic peritoneal lavage (DPL) fluid must be recovered from abdominal trauma patients to avoid falsely low red blood cell (RBC) counts. A convenience sample of eleven adult abdominal trauma patients in a Level 1 university trauma center who were undergoing DPL with 1 L of crystalloid was enrolled in a study designed to define the amount of lavage fluid that must be collected before a sample can reliably be submitted for RBC analysis (Sullivan, 1997). RBC counts collected at 200, 400, 600, and 800 mL of returned fluid were analyzed for differences. The mean RBC counts collected at 200 and 400 mL were lower than the final mean RBC count measured at 800 mL. Mean RBC counts at 600 mL were less than those obtained at 800 mL, but the difference was not statistically significant. The authors concluded that at least 600 mL of effluent should be collected to avoid misleadingly low RBC counts.
Occupational stress affects hemodynamic variables. Adams et al. measured ambulatory blood pressure (BP), heart rate (HR), and heart rate variability of twelve emergency physicians during a 24-hour period before, during, and after an 8-hour night shift (Adams et al., 1998). An elevation in systolic blood pressure was seen when the night-shift awake period was compared with the nonshift awake period, but it was not significant. However, diastolic blood pressure was elevated by an average of 5.5 mm Hg during the night shift compared with the nonshift awake period. Both prework and midshift HRs were significantly higher than postwork rates.
RMD: Hypotheses (Equality of Means)
RMD - Example
We constructed hypothetical data based on the Adams et al. study (Adams et al., 1998) for 8 emergency physicians. Our hypothetical data with means and standard deviations are given in Table 3B. Given these data, are there any differences in the systolic blood pressures of the emergency physicians during the ED Awake, Non-ED Awake, and Non-ED Sleep periods?
The test statistic has an associated p-value of 0.721, which indicates that there are no significant differences in mean systolic blood pressure over the three time periods studied.
Two-factor with Repeated Measures on one Factor Design (TRMD)
In repeated measures designs, two terms differentiate repeated from non-repeated factors. A "between" variable is a non-repeated or grouping factor, such as a treatment group. For a "between" variable, each subject appears in only one level. A "within" variable is a repeated factor for which subjects participate in each level of that factor. The most important benefit of a repeated measures design is greater statistical power relative to the total sample size. Repeated measures designs use the same subjects across the different treatments, thereby requiring fewer subjects. Since the subjects are constant, the variance due to subjects may be partitioned out of the error variance term, which in turn increases the power of the statistical tests.
The general notation and data structure for a two-factor (factors A and B) design with repeated measures on one factor (TRMD) is illustrated in Table 4A. A TRMD in which there are repeated measures on factor B may be represented by the following.
G1 represents a group of n subjects, and G2 represents a second group of n subjects. The subjects in G1 are observed under treatment conditions ab11, ab12, ab13, and ab14. The subjects in G2 are observed under treatment conditions ab21, ab22, ab23, and ab24. The subject factor is crossed with factor B but nested under factor A. In this design, comparisons between treatment combinations at different levels of factor A involve differences between groups as well as differences associated with factor A. Comparisons between different levels of factor B at the same level of A do not involve differences between groups. The net result is that differences between levels of factor A are tested with a "between subjects" statistical test, whereas differences between levels of factor B are tested with a "within subjects" statistical test.
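The distinction between the "between subjects" and "within subjects" tests can be made concrete with a short Python sketch. It assumes a balanced design (the same number of subjects in each group and no missing observations); the function name, the result keys, and the small data set are hypothetical and are not taken from Table 4A or Table 4B.

```python
def mixed_design_anova(groups):
    """groups[a][s][b] is the score of subject s, nested in level a of
    factor A, measured at level b of the repeated factor B.
    Assumes a balanced design: equal subjects per group, no missing cells."""
    p = len(groups)            # levels of A (groups)
    n = len(groups[0])         # subjects per group
    q = len(groups[0][0])      # levels of B (repeated measures)
    scores = [x for g in groups for s in g for x in s]
    grand = sum(scores) / len(scores)

    a_means = [sum(x for s in g for x in s) / (n * q) for g in groups]
    b_means = [sum(s[b] for g in groups for s in g) / (p * n) for b in range(q)]
    cell_means = [[sum(s[b] for s in g) / n for b in range(q)] for g in groups]
    subj_means = [[sum(s) / q for s in g] for g in groups]

    ss_total = sum((x - grand) ** 2 for x in scores)
    # Between-subjects part: factor A and subjects within groups.
    ss_a = n * q * sum((m - grand) ** 2 for m in a_means)
    ss_subj = q * sum((subj_means[a][s] - a_means[a]) ** 2
                      for a in range(p) for s in range(n))
    # Within-subjects part: factor B, the A x B interaction, and error.
    ss_b = p * n * sum((m - grand) ** 2 for m in b_means)
    ss_ab = n * sum((cell_means[a][b] - a_means[a] - b_means[b] + grand) ** 2
                    for a in range(p) for b in range(q))
    ss_error = ss_total - ss_a - ss_subj - ss_b - ss_ab

    ms_subj = ss_subj / (p * (n - 1))               # error term for A
    ms_error = ss_error / (p * (n - 1) * (q - 1))   # error term for B and A x B
    return {
        "F_A": (ss_a / (p - 1)) / ms_subj,
        "F_B": (ss_b / (q - 1)) / ms_error,
        "F_AB": (ss_ab / ((p - 1) * (q - 1))) / ms_error,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 2 groups, 2 subjects each, 2 repeated measures.
    print(mixed_design_anova([[[1, 3], [3, 5]], [[2, 6], [4, 10]]]))
```

Note that factor A is tested against the variability of subjects within groups, while factor B and the interaction are tested against the smaller within-subject error term.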
To test the hypotheses of main effects or interactions, three assumptions must be met: 1) the observations for each subject are drawn from a multivariate normal distribution, 2) subjects are independently sampled, and 3) the sampling variances for all pairwise differences among means are equal. This third assumption is known as the sphericity or circularity assumption. The F test is robust to violations of the multivariate normality assumption but not to violations of the sphericity assumption. There are tests that researchers may use to check the sphericity assumption. However, these tests are sensitive to departures from multivariate normality. Consequently, violations of sphericity can go undetected, and the actual Type I error rate is then underestimated by the nominal α.
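A simple descriptive look at the sphericity assumption is to compute the variance of the difference scores for every pair of repeated levels and see whether they are roughly equal. The Python sketch below does only that; it is not a formal sphericity test, and the function name and data are illustrative assumptions.

```python
from itertools import combinations
from statistics import variance


def difference_variances(data):
    """data[s][t] is the score of subject s at repeated level t.
    Return {(i, j): sample variance of (level i - level j) across subjects}."""
    k = len(data[0])
    return {
        (i, j): variance([row[i] - row[j] for row in data])
        for i, j in combinations(range(k), 2)
    }


if __name__ == "__main__":
    # Illustrative numbers only: 3 subjects measured at 3 repeated levels.
    for pair, var in difference_variances([[1, 2, 3], [2, 3, 7], [3, 4, 5]]).items():
        print(pair, var)
```

Markedly unequal variances across the pairs are a warning sign that the sphericity assumption may not hold for the data at hand.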
The main advantage of the TRMD is that it controls for subject heterogeneity. Since one group of subjects serves in all levels of one factor, we reduce, but do not eliminate, the error component of the model. Subjects are still likely to respond differently over repeated measures, but the intrasubject fluctuations are likely to be smaller than the intersubject variations found in a SRD. This reduction in variance in the TRMD represents an increase in economy and power. Examples of a TRMD follow.
Turturro, Paris, and Larkin (Turturro, 1998) conducted a randomized, double-blind clinical trial comparing oral tramadol with hydrocodone-acetaminophen in acute musculoskeletal pain. Pain was evaluated by a 100-mm visual analog scale (VAS) at baseline and at 30, 60, 90, 120, and 180 minutes after dosing. At baseline, the mean pain scores did not differ between groups, but from 30 minutes through 180 minutes they were significantly lower in the hydrocodone-acetaminophen group.
Luhmann and colleagues conducted a prospective randomized clinical trial of children who required laceration repair, using 4 study groups: (1) children who received standard care alone (SC); (2) children who received standard care and oral midazolam (M); (3) children who received standard care and nitrous oxide (N); and (4) children who received standard care, oral midazolam, and nitrous oxide (MN). The Observational Scale of Behavioral Distress-Revised (OSBD-R) was used to assess distress during baseline, wound cleaning, lidocaine injection, suturing, and recovery. Mean OSBD-R scores were lower for the groups that received nitrous oxide during wound cleaning, lidocaine injection, and suturing. The authors concluded that continuous-flow nitrous oxide was more effective in reducing distress and had fewer adverse effects and shorter recovery times than midazolam.
TRMD Hypothesis (Equality of Means)
We constructed hypothetical data based on the Turturro study (Table 4B). We are interested in answering three questions: 1) is there a difference in VAS scores between oral tramadol and hydrocodone-acetaminophen, 2) are there any differences in VAS scores across the time intervals, and 3) does the difference in VAS scores between treatments depend on the time interval (a treatment-by-time interaction)?
From the TRMD ANOVA table we conclude that VAS scores differ between the two treatments, oral tramadol and hydrocodone-acetaminophen (hypothesis 1). There are also differences among the VAS scores over time (hypothesis 2). Likewise, the difference in VAS scores between treatments changes over time (hypothesis 3). We need to determine just where those differences are.
The first is relatively easy. By examining the mean VAS scores, we see that the hydrocodone-acetaminophen mean is consistently lower than the tramadol mean (hypothesis 1). A plot of the overall mean VAS scores over time gives a visual confirmation of the successive reduction in mean VAS scores from baseline to 30 minutes, 30 minutes to 60 minutes, and so on (hypothesis 2). Another plot of the treatment mean VAS scores over time reveals that hydrocodone-acetaminophen is superior in reducing the mean VAS scores over time (Figure 2).
Randomized Block Design (RBD)
A widely used design is the randomized complete block design (RBD). As in a SRD, there is one factor or variable of primary interest. There may also be nuisance (noise) factors, which may affect the outcome variable but are not of major interest. A RBD is constructed to reduce noise in the data by separating the error variance found in a SRD into two components: a variance due to the nuisance or blocking factor and the error variance. It is worthwhile to spend some time deciding which nuisance factors are important enough to control during a clinical trial. By controlling nuisance factors, we can reduce or even eliminate their contribution to experimental error. The basic idea is to create homogeneous blocks in which the blocking factors are held constant and the factor of interest is allowed to vary.
The key idea in a RBD is that the variability within each block is less than the variability of the entire sample. Each estimate of the treatment effect within a block is more efficient than estimates across the entire sample. When these more efficient estimates are pooled across blocks, we should get an overall more efficient estimate than we would without blocking. Blocking is a strategy for grouping subjects in the data analysis in order to reduce noise; that is, it is an analysis strategy. We will benefit from a blocking design only if the blocks are more homogeneous than the entire sample. If the blocks are not relatively homogeneous with respect to the outcome measure, blocking could adversely affect the analysis by reducing the overall power of the study.
An important constraint in a RBD is that every level of the treatment factor must occur in every block, and the same number of times in each block. The important consideration is to block on a few of the most important nuisance factors, which removes their effects from the error variance. Randomization is then used to reduce the contaminating effects of the remaining nuisance variables. RBD examples follow.
Croteau assessed the effectiveness of commercially available local exhaust ventilation systems for controlling inhalable dust and crystalline silica exposures during concrete cutting and grinding work (Croteau, ). Three ventilation rates (0, 30, and 75 cfm) were tested for each of four tools (tuck-point grinding, surface grinding, paver and brick cutting, and concrete block cutting). Mean exposure levels for the 75 cfm treatments were lower than those for the 30 cfm treatments. However, these differences were significant only for paver block cutting.
Storm and colleagues (Storm, ) compared the effects of three diets (stearic acid-rich, palmitic acid-rich, and carbohydrate-rich) on lipids, glycemic control, and diurnal blood pressure in non-insulin-dependent diabetes mellitus (NIDDM) patients. After a 2-week run-in period, 15 NIDDM patients were randomly assigned to three 3-week dietary treatments. Of interest was that total cholesterol decreased 11-16% during the stearic acid-rich and carbohydrate-rich diets, while no noticeable change in cholesterol level was seen with the palmitic acid-rich diet. If we were to recruit insulin-dependent diabetic patients and add a blocking factor (LP insulin versus regular insulin), we would have defined a RBD ANOVA.
For simplicity, assume that there is only one experimental unit (subject) for each treatment-block combination. Let $Y_{ij}$ be the observation made on the experimental unit corresponding to block $j$ under treatment $i$; then an assumed model for a RB design is: $Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$, with
RBD: Test of Homogeneity (Equality of Means).
We borrowed essentials of the Storm study to generate a hypothetical data set for the three diet types (stearic acid-rich, palmitic acid-rich, and carbohydrate-rich) and for two blocks (patients taking LP insulin and patients taking regular insulin). Our hypothetical outcome variable was the end total cholesterol/HDL ratio. Means ± standard deviations are given in Table 4B.
Since the F test statistic for the treatment (F = 7.46) has an associated p-value of 0.003, there are differences among the three diet types. A post-hoc analysis of the treatment effects is warranted. Note that we require the blocking factors to be homogeneous. This analysis confirms the blocking factor homogeneity (F = 0.11, p = 0.746).
A Scheffé post-hoc analysis showed that the stearic acid and carbohydrate-rich diets did not differ from each other and that both differed from the palmitic acid diet.
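For the simplest RBD, with one observation per treatment-block combination and an additive model (overall mean plus treatment effect plus block effect plus error), the sums of squares can be sketched in Python as follows. The function name, the result keys, and the data are illustrative assumptions; the numbers are not those of the hypothetical diet example, which has several patients per block.

```python
def randomized_block_anova(y):
    """y[i][j] is the single observation for treatment i in block j."""
    a = len(y)        # number of treatments
    b = len(y[0])     # number of blocks
    grand = sum(sum(row) for row in y) / (a * b)
    treat_means = [sum(row) / b for row in y]
    block_means = [sum(y[i][j] for i in range(a)) / a for j in range(b)]

    ss_total = sum((x - grand) ** 2 for row in y for x in row)
    ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
    ss_block = a * sum((m - grand) ** 2 for m in block_means)
    # Whatever is left after removing treatment and block effects is error.
    ss_error = ss_total - ss_treat - ss_block

    df_treat, df_block = a - 1, b - 1
    ms_error = ss_error / (df_treat * df_block)
    return {
        "ss_treatments": ss_treat,
        "ss_blocks": ss_block,
        "ss_error": ss_error,
        "F_treatments": (ss_treat / df_treat) / ms_error,
        "F_blocks": (ss_block / df_block) / ms_error,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 3 treatments (rows) by 3 blocks (columns).
    print(randomized_block_anova([[1, 2, 3], [2, 3, 4], [3, 7, 5]]))
```

Because the block sum of squares is taken out of the error term, the treatment F-ratio is computed against a smaller error variance than it would be in a SRD of the same data.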
ANOVA models provide "life beyond t-tests". The primary objective of a clinical trial is to compare the relative effectiveness of two or more treatments on a common criterion. The most efficient clinical trial design from a purely mathematical point of view may be too costly in terms of time, financial support, and effort to be workable. Generally, the smaller the variation due to experimental error, the more efficient the trial design. In turn, the experimental error may be reduced by introducing various kinds of controls or by increasing the sample size. Both methods reduce experimental error, but which one provides the greater reduction per unit of cost depends on features unique to the trial design.
These ANOVA designs are not by themselves a cure-all for all research problems. By carefully considering treatment combination options, possible confounding factors, and subject availability, researchers can build an efficient clinical trial around one of these ANOVA designs. All good clinical trials benefit from a pre-study determination of sample sizes. Clinically significant effect sizes are determined by the researcher and guide the determination of appropriate sample sizes in order to maximize the efficiency of any clinical trial. Most statistics textbooks and internet sources provide nomograms or interactive programs to predetermine required sample sizes for varying clinical effect sizes. As a standard analytic strategy, we strongly encourage researchers to take time to examine visual plots of their data (box-plots and/or error plots). Often, the plots will point to data problems and indicate what to expect from the data analysis. Efficient and effective clinical trials should be appropriate for the experimental setting, provide maximum information per minimum amount of experimental effort, and be feasible within the working conditions that exist.