# Bias-adjusted exposure odds ratio for misclassified data

T Lee

###### Keywords

case-control study, exposure misclassification, odds ratio, sensitivity, specificity

###### Citation

T Lee. *Bias-adjusted exposure odds ratio for misclassified data*. The Internet Journal of Epidemiology. 2008 Volume 6 Number 2.

###### Abstract

If a dichotomous exposure variable is misclassified in a case-control study, a bias-adjusted exposure odds ratio with its asymptotic variance is presented to account for the misclassification bias. A simple, yet powerful, method is given to calculate the true sensitivity and specificity based only on the data available in the main study, regardless of whether a validation sample is available or not. Two practical examples without or with validation data are given to illustrate how to calculate first the true sensitivity and specificity for cases and controls and then the bias-adjusted exposure odds ratio with its 95% confidence interval.

### Abbreviations

BAOR Bias-adjusted [exposure] odds ratio

CI Confidence interval

COR Crude [exposure] odds ratio

### Introduction

In epidemiology, the problem of misclassification has been thoroughly studied. In practical applications, exposure misclassification mainly occurs when proxy respondents are used in the survey interview to classify the subject’s exposure status. For example, in a study to identify possible etiologic factors for Alzheimer’s disease, information was uniformly obtained only from close family members, usually the spouse, because of the patient’s mental impairment (_{31}_{32}).

Historically, this problem was first studied in _{3}; related issues were later investigated by others (_{1}_{4}_{5}_{10}_{12}_{13}_{14}_{15}_{16}_{17}_{18}_{19}_{20}_{22}_{23}_{24}_{26}_{27}_{29}_{30}_{33}). Epidemiologic examples of the effect of misclassification bias have also been widely studied; see, for example, _{6}_{7}_{8}_{21}_{35}_{37}_{39}_{40}_{41}.

So far, all proposed methods for correcting the misclassification bias either require a second validation sample to estimate the sensitivity and specificity of the classification procedure or conduct a conventional/probabilistic sensitivity analysis. No method available in the literature is able to calculate the true sensitivity and specificity. The aim of this paper is to present a method to calculate the true sensitivity and specificity from the data in the main study only, regardless of whether a validation sample is available.

### Background

Consider a case-control study in which there is no disease misclassification, but misclassification has occurred in determining the subject’s exposure status. First, three random variables, E, E^{*} and D, are defined as follows:

E = 1 if a subject is truly exposed, 0 otherwise;
E^{*} = 1 if a subject is classified as exposed, 0 otherwise;
D = 1 if a subject belongs to the case group, 0 otherwise.

Note that E^{*} is a surrogate classification variable for the exposure variable E and D is a disease variable. Let p_{0} and p_{1} denote, respectively, the true proportions of subjects in the control and the case population, who are exposed to a certain risk factor under study. The probability distributions for cases and controls are given, respectively, by the first and second column of Table 1.

Table 1: The true distributions of the cell probabilities for the cases and controls

As a measure of the relative risk in case-control studies, the [exposure] odds ratio of exposed versus unexposed is given by (_{9})

$$R = \frac{p_1 / q_1}{p_0 / q_0} = \frac{p_1 q_0}{p_0 q_1}, \tag{1}$$

where 0 < p_{0} < 1 (q_{0} = 1 − p_{0}) and 0 < p_{1} < 1 (q_{1} = 1 − p_{1}) are defined in Table 1.

Suppose that n_{0} controls and n_{1} cases are sampled with the positive count frequencies n_{ij}, i, j = 0, 1. By the method of moments, we obtain from Table 2

and

where p̂_{0} and p̂_{1} are estimators of p_{0} and p_{1} in Table 1, provided that there is no misclassification on the exposure variable E.

Table 2: The observed cell counts for a case-control study.

However, p̂_{i} is a biased estimator of p_{i} of Table 1 whenever a surrogate variable E^{*} of the exposure variable E for the study subjects is misclassified (_{13}). Indeed, once the exposure misclassification has occurred, it is easily shown that

$$E(\hat{p}_i) = \varphi_i\, p_i + (1 - \psi_i)\, q_i, \quad i = 0, 1, \tag{4}$$

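where φ_{i} and ψ_{i}, i = 0, 1, called bias parameters (_{17}), denote the sensitivity and specificity for controls (i = 0) and cases (i = 1), respectively, and are defined by (_{9})

$$\varphi_i = \Pr(E^{*} = 1 \mid E = 1, D = i) \tag{5}$$

and

$$\psi_i = \Pr(E^{*} = 0 \mid E = 0, D = i). \tag{6}$$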

Moreover, if q̂_{i} = 1 − p̂_{i}, then E(q̂_{i}) = −[ p_{i} · ( φ_{i} + ψ_{i} − 1) − ψ_{i} ], i = 0, 1.

From equation 4, it is easily seen that p̂_{i} is a biased estimator of p_{i} unless there is no exposure misclassification for both cases and controls, that is, φ_{i} = ψ_{i} = 1, i = 0, 1. As a result, the bias unavoidably appears in the crude [exposure] odds ratio given by

$$\hat{R} = \frac{\hat{p}_1\,(1 - \hat{p}_0)}{\hat{p}_0\,(1 - \hat{p}_1)},$$

since it does not account for the misclassification bias. This motivates epidemiologists and statisticians to search for the corrected [exposure] odds ratio which is able to account for the misclassification bias (_{6}). In this paper, an estimator, called bias-adjusted [exposure] odds ratio, is proposed which is able to account for the misclassification bias in the estimation of the true R of equation 1.
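As a rough numerical illustration of how misclassification distorts the crude odds ratio, the sketch below evaluates the expected classified-as-exposed proportion, φ_{i} · p_{i} + (1 − ψ_{i}) · (1 − p_{i}), which follows from the definitions of sensitivity and specificity, and compares the resulting odds ratio with the true one. The function names and all numerical values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def expected_observed_proportion(p, sens, spec):
    """Expected proportion classified as exposed when the true proportion is p."""
    return sens * p + (1.0 - spec) * (1.0 - p)


def odds_ratio(p1, p0):
    """Exposure odds ratio for cases (p1) versus controls (p0)."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Hypothetical true exposure proportions for cases and controls.
p1_true, p0_true = 0.40, 0.20
# Hypothetical non-differential misclassification (same rates in both groups).
sens, spec = 0.90, 0.90

p1_obs = expected_observed_proportion(p1_true, sens, spec)
p0_obs = expected_observed_proportion(p0_true, sens, spec)

true_or = odds_ratio(p1_true, p0_true)
crude_or = odds_ratio(p1_obs, p0_obs)
print(f"true OR = {true_or:.3f}, OR from misclassified proportions = {crude_or:.3f}")
```

In this particular hypothetical setting the odds ratio computed from the misclassified proportions is pulled toward 1; with differential misclassification the direction can differ.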

### Method

In epidemiology, an exposure misclassification is said to be non-differential if sensitivity and specificity are the same for cases and controls, that is, classification rates are independent of the disease; otherwise, the exposure misclassification is called differential. Because non-differential misclassification is a special case of differential misclassification, I only consider differential misclassification in my derivation.

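By using equations 2-4 with the approximation E(p̂_{i}) ≈ p̂_{i}, it can be shown that

$$\hat{p}_i^{*} = \frac{\hat{p}_i + \psi_i - 1}{\Delta_i} \tag{9}$$

and

$$\hat{q}_i^{*} = \frac{\varphi_i - \hat{p}_i}{\Delta_i} \tag{10}$$

are unbiased estimators, respectively, for p_{i} and q_{i}, conditional on both φ_{i} and ψ_{i} being known, where Δ_{i} is given by

$$\Delta_i = \varphi_i + \psi_i - 1. \tag{11}$$

Clearly, equation 11 must not equal zero; otherwise, equations 9-10 are undefined.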
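The correction amounts to inverting the relation E(p̂) = φ · p + (1 − ψ) · (1 − p) with E(p̂) replaced by the observed proportion. A minimal sketch, assuming the sensitivity and specificity are known; the function name and the numerical values are hypothetical:

```python
def corrected_proportions(p_hat, sens, spec):
    """Return (p_star, q_star), the misclassification-corrected proportions.

    Assumes sens + spec != 1; otherwise the correction is undefined.
    """
    delta = sens + spec - 1.0
    if abs(delta) < 1e-12:
        raise ValueError("sens + spec must differ from 1")
    p_star = (p_hat + spec - 1.0) / delta
    q_star = (sens - p_hat) / delta
    return p_star, q_star


# Hypothetical check: start from a known true proportion, misclassify it
# in expectation, and confirm that the correction recovers it.
p_true = 0.40
sens, spec = 0.90, 0.85
p_hat = sens * p_true + (1.0 - spec) * (1.0 - p_true)

p_star, q_star = corrected_proportions(p_hat, sens, spec)
print(f"observed = {p_hat:.3f}, corrected = {p_star:.3f}, complement = {q_star:.3f}")
```

The two corrected values always sum to one, so only one of them needs to be computed in practice.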

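Now, p̂^{*}_{i} and q̂^{*}_{i}, as estimators of p_{i} and q_{i}, must lie between 0 and 1. Hence φ_{1}, ψ_{1}, φ_{0} and ψ_{0} have to satisfy the following feasibility constraints:

$$\varphi_i > \hat{p}_i, \tag{12}$$

$$\psi_i > 1 - \hat{p}_i, \tag{13}$$

and

$$\varphi_i + \psi_i > 1, \quad i = 0, 1. \tag{14}$$

It can be easily shown that, when φ_{i} and ψ_{i} satisfy equations 12-14, p̂^{*}_{i} and q̂^{*}_{i} lie strictly between 0 and 1 and sum to 1, so that they are admissible estimators of p_{i} and q_{i}. By simply reversing the direction of the inequalities in equations 12-14, we could obtain another set of feasibility constraints. Mathematically, these two sets of feasibility constraints are equivalent because they are just mirror images of one another with respect to the straight line φ_{i} + ψ_{i} = 1 in the (φ_{i}, ψ_{i}) plane (_{38}). However, equation 14 is preferable because it has a practical implication: a good classification procedure should perform better than random (_{16}). In addition, the variance of equation 9 is readily given by

$$\operatorname{Var}(\hat{p}_i^{*}) = \frac{\operatorname{Var}(\hat{p}_i)}{\Delta_i^{2}}, \tag{15}$$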

where Var(p̂_{i}) denotes the variance of p̂_{i}.
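The feasibility check can be sketched as a small predicate. This assumes the constraints read φ > p̂, ψ > 1 − p̂ and φ + ψ > 1, which is what positivity of the corrected proportions requires when the classification performs better than random. The function name is hypothetical; the observed proportion 37/39 is the one for cases in the first example below.

```python
def is_feasible(p_hat, sens, spec):
    """True if (sens, spec) keeps both corrected proportions strictly positive."""
    q_hat = 1.0 - p_hat
    return sens > p_hat and spec > q_hat and sens + spec > 1.0


p_hat_cases = 37 / 39  # observed proportion classified as exposed among cases

# Pair implied by a true table with 38 exposed and 1 unexposed case.
print(is_feasible(p_hat_cases, 1 - 1 / 75, 1 - 1 / 3))
# Pair implied by a true table with 33 exposed and 6 unexposed cases.
print(is_feasible(p_hat_cases, 1 - 4 / 70, 1 - 4 / 8))
```

The second pair fails because its sensitivity falls below the observed proportion, which would make the corrected unexposed proportion negative.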

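By replacing the true unknown parameters p_{i} and q_{i}, i = 0, 1, in equation 1 with equations 9-10, the bias-adjusted [exposure] odds ratio (BAOR) R̂^{*} is given by

$$\hat{R}^{*} = \frac{\hat{p}_1^{*}\, \hat{q}_0^{*}}{\hat{p}_0^{*}\, \hat{q}_1^{*}}, \tag{16}$$

where p̂^{*}_{i} and q̂^{*}_{i}, i = 0, 1, are given by equations 9-10 for known φ_{i} and ψ_{i}.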
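A minimal sketch of the bias-adjusted odds ratio, assuming the corrected proportions take the form (p̂ + ψ − 1)/(φ + ψ − 1) and (φ − p̂)/(φ + ψ − 1). The function names are hypothetical. The observed proportions are those of the first example below; the second scenario, in which controls are taken to be perfectly classified, is purely illustrative and does not appear in the paper's tables.

```python
def corrected_odds(p_hat, sens, spec):
    """Odds p_star / q_star after correcting p_hat for misclassification."""
    delta = sens + spec - 1.0
    p_star = (p_hat + spec - 1.0) / delta
    q_star = (sens - p_hat) / delta
    return p_star / q_star


def baor(p1_hat, p0_hat, sens1, spec1, sens0, spec0):
    """Bias-adjusted odds ratio: corrected odds for cases over controls."""
    return corrected_odds(p1_hat, sens1, spec1) / corrected_odds(p0_hat, sens0, spec0)


p1_hat, p0_hat = 37 / 39, 27 / 52  # observed proportions, cases and controls

# With perfect classification the adjustment changes nothing,
# so the result equals the crude odds ratio.
no_error = baor(p1_hat, p0_hat, 1.0, 1.0, 1.0, 1.0)

# Illustrative scenario: one case under-misclassified, controls error-free.
one_under = baor(p1_hat, p0_hat, 1 - 1 / 75, 1 - 1 / 3, 1.0, 1.0)

print(f"no misclassification: {no_error:.2f}; one case under-misclassified: {one_under:.2f}")
```

The function assumes the supplied pairs are feasible; an infeasible pair produces a negative or undefined odds.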

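By using the delta method, the asymptotic variance of ln(R̂^{*}) is given by (_{11})

$$\operatorname{Var}\bigl[\ln(\hat{R}^{*})\bigr] \approx \sum_{i=0}^{1} \frac{\operatorname{Var}(\hat{p}_i)}{(\Delta_i\, p_i\, q_i)^{2}}, \tag{17}$$

where p_{i} and q_{i}, i = 0, 1, are given by Table 1, Δ_{i} by equation 11, and Var(p̂_{i}) is the variance of p̂_{i}, provided that n_{0} and n_{1}, the sample sizes for controls and cases, are sufficiently large. When equation 17 is used in practical applications, the unknown parameters p_{i} and q_{i}, i = 0, 1, are replaced, respectively, by p̂^{*}_{i} and q̂^{*}_{i}.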

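According to the asymptotic theory of large sample distributions (_{2}), the sampling distribution of the following test statistic

$$Z = \frac{\ln(\hat{R}^{*})}{\operatorname{SE}\bigl(\ln(\hat{R}^{*})\bigr)} \tag{18}$$

can be shown to follow a standard normal distribution under the null hypothesis R = 1, where SE(ln(R̂^{*})) is the square root of equation 17. Accordingly, a 100(1 − α)% confidence interval for R is given by

$$\exp\Bigl[\ln(\hat{R}^{*}) \pm z_{1-(\alpha/2)}\, \operatorname{SE}\bigl(\ln(\hat{R}^{*})\bigr)\Bigr], \tag{19}$$

where z_{1-(α/2)} is the 100(1 − α/2)th percentile of the standard normal distribution.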
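The interval can be sketched as follows. Two assumptions are made here that the text does not confirm: the variance of ln(R̂^{*}) is taken as the first-order delta-method sum of Var(p̂_{i}) / (Δ_{i} · p_{i} · q_{i})² over the two groups, and Var(p̂_{i}) is estimated by the binomial plug-in p̂_{i}(1 − p̂_{i})/n_{i}. The function name is hypothetical. The counts are the interview data of the second example below, used here with perfect classification as a sanity check.

```python
import math
from statistics import NormalDist


def baor_with_ci(x1, n1, x0, n0, sens1, spec1, sens0, spec0, alpha=0.05):
    """Bias-adjusted odds ratio with a Wald-type CI on the log scale.

    x1 of n1 cases and x0 of n0 controls are classified as exposed.
    Assumes every (sens, spec) pair is feasible for its group.
    """
    log_or = 0.0
    variance = 0.0
    groups = ((x1, n1, sens1, spec1, +1.0), (x0, n0, sens0, spec0, -1.0))
    for x, n, sens, spec, sign in groups:
        p_hat = x / n
        delta = sens + spec - 1.0
        p_star = (p_hat + spec - 1.0) / delta
        q_star = (sens - p_hat) / delta
        log_or += sign * math.log(p_star / q_star)
        # ASSUMPTION: binomial plug-in estimate of Var(p_hat).
        var_p_hat = p_hat * (1.0 - p_hat) / n
        variance += var_p_hat / (delta * p_star * q_star) ** 2
    se = math.sqrt(variance)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Sanity check with no misclassification: the result should reduce to the
# crude odds ratio and its usual Wald interval.
est, lower, upper = baor_with_ci(122, 564, 101, 580, 1.0, 1.0, 1.0, 1.0)
print(f"{est:.2f} ({lower:.2f}, {upper:.2f})")
```

With sensitivity and specificity both equal to one, the variance reduces to the familiar sum of reciprocal cell counts, and the printed values agree with the crude odds ratio 1.31 (95% CI: 0.98–1.76) reported for that example.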

To use equations 16-19 we need to know the true sensitivity and specificity for cases and controls. I will show below, by using two practical examples, how to calculate the true sensitivity and specificity from the data of the main study. Basically, we need to know what the truly classified table is. This information is contained in the observed data of the main study. Even though we do not know exactly what the truly classified table is, reasoning backwards suggests that it must be one of the tables that can be obtained by reclassifying the observed one in the main study. Hence, we can obtain the truly classified table by assuming hypothetically that it is simply a table that differs from the observed one by 1 subject, or 2 subjects, or … (either under- or over-misclassified) in the exposed category. Once we obtain the [hypothetically] true table, we are able to calculate the sensitivity (or specificity) from the observed and this true table according to the following formula, that is,

$$\text{sensitivity (or specificity)} = 1 - \frac{|\,n_{\text{true}} - n_{\text{obs}}\,|}{n_{\text{true}} + n_{\text{obs}}}, \tag{20}$$

where n_{true} and n_{obs} denote the numbers of subjects in the exposed (or unexposed) category of the true and the observed table, respectively.

### Results

In the first example no validation data are available, while a validation sample is available in the second example.

Example 1. The data are taken from a case-control study of deaths caused by landslides (_{34}). For illustrative purposes, only one table from that study is used, regarding whether a person saw natural warning signs (Table 3a). Here the exposure variable E is defined by the event that a person did not see any natural warning signs. By inspection of Table 3a, 95% (= 37/39) of the cases did not see any natural warning signs, compared with only 52% (= 27/52) of the controls. Indeed, the crude odds ratio (COR) is 17.1 with p < 5.3×10^{-6} by Fisher’s exact test (_{36}). Hence, based on this COR value, a tentative inference is drawn that not seeing natural warning signs is a significant risk factor for death caused by landslides.

However, since the data were collected through an interview survey of proxies or survivors, misclassification might have occurred. Suppose that Table 3a is misclassified. To account for the misclassification bias, we have to use equation 16 to calculate the bias-adjusted exposure odds ratio. In this study no validation data were collected at all. Nevertheless, I’ll show first how to calculate the true sensitivity and specificity for cases. Evidently, this depends on our knowledge of what the truly classified table is. Even though we do not know exactly what the truly classified table is, we’re confident that it must be one of the 37 tables obtained by under-misclassifying 1 subject or over-misclassifying 1, or 2, …, or 36 subjects in the category of “No signs”, as shown in the first two columns and continued in the fifth and sixth columns of Table 3bi. Since I do not know which one of these 37 possible scenarios is the truly classified table, I simply assume in turn that each of them is the correctly classified table and calculate the corresponding values of sensitivity and specificity one by one. By using equation 20, the sensitivity and specificity pairs for cases and controls are given, respectively, in Tables 3b(i-ii). Taking the first entry in column 3 of Table 3bi as an example, we obtain φ_{1} = 1 − |38 − 37|/(38 + 37) = 1 − 0.0133 = 0.9867. However, after checking whether the feasibility constraints (equations 12-14) were satisfied, only the first four pairs of sensitivity and specificity were found to be feasible (highlighted in Table 3bi). Similarly, although there were altogether 50 possible truly classified tables for controls, only 34 pairs of sensitivity and specificity were found to be feasible for controls (Table 3bii).
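The enumeration described above can be sketched in a few lines. The sketch assumes that the sensitivity (or specificity) of a hypothesized true table is 1 − |true − observed| / (true + observed) in the exposed (or unexposed) category, as in the worked value 0.9867, and that feasibility means φ > p̂, ψ > 1 − p̂ and φ + ψ > 1. The function names are hypothetical.

```python
def sens_spec(obs_exposed, true_exposed, n):
    """Sensitivity and specificity implied by a hypothesized true table."""
    obs_unexposed = n - obs_exposed
    true_unexposed = n - true_exposed
    sens = 1 - abs(true_exposed - obs_exposed) / (true_exposed + obs_exposed)
    spec = 1 - abs(true_unexposed - obs_unexposed) / (true_unexposed + obs_unexposed)
    return sens, spec


def feasible_tables(obs_exposed, n):
    """List (true_exposed, sens, spec) for every feasible hypothesized table."""
    p_hat = obs_exposed / n
    feasible = []
    # Keep both cells of the hypothesized true table positive.
    for true_exposed in range(1, n):
        if true_exposed == obs_exposed:
            continue  # this is the observed table itself
        sens, spec = sens_spec(obs_exposed, true_exposed, n)
        if sens > p_hat and spec > 1 - p_hat and sens + spec > 1:
            feasible.append((true_exposed, sens, spec))
    return feasible


cases = feasible_tables(37, 39)      # 37 of 39 cases observed in "No signs"
controls = feasible_tables(27, 52)   # 27 of 52 controls observed in "No signs"

print(len(cases), [t for t, _, _ in cases])
print(len(controls))
```

Under these assumptions the sketch finds 4 feasible tables for cases (true exposed counts of 34, 35, 36 and 38) and 34 for controls, in agreement with the counts stated above.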

To see in what direction the misclassification might bias the BAOR relative to the null value, I used all four feasible pairs of sensitivity and specificity for cases and only ten pairs for controls, those under- or over-misclassified by up to five subjects in the same category of “No signs”, to compute 40 (= 4×10) BAORs. These 40 BAORs with their 95% confidence intervals (CI) are given in Tables 3c(i-iv). By inspection of the 95% CIs in Tables 3c(i-ii), the BAORs were overall significant and biased further away from the null value (R = 1) than the COR if just one subject was over-misclassified in the category of “No signs” for cases and up to five persons were either under- or over-misclassified for controls (Table 3cii), while they were significant yet biased a little toward the null value relative to the COR if just one subject was under-misclassified in the category of “No signs” for cases and up to five persons were either under- or over-misclassified for controls (Table 3ci). However, Tables 3c(iii-iv) painted a totally different picture: the BAOR was overall biased away from the null value, yet not significant, when more than one person was over-misclassified in the category of “No signs” for cases and up to five persons were under- or over-misclassified in the same category for controls.

Table 3(a): The survey data whether a person saw natural warning signs for cases and controls.

Table 3b(i): All possible pairs of sensitivity and specificity for cases.

Table 3b(ii): All feasible pairs of sensitivity and specificity for controls.

^{a} The “+” (or “-”) sign inside the parenthesis denotes the number of persons to be over- (or under-) misclassified.

Table 3c: Bias-adjusted exposure odds ratios (95% CI) for all 40 feasible pairs of sensitivity and specificity for cases and controls when:

Example 2. The data are taken from a case-control study of sudden infant death syndrome (SIDS) (_{15}). Among women for whom only interview data were examined, 122 out of 564 case mothers and 101 out of 580 control mothers reported antibiotic use during pregnancy. By using equation 1, the obtained crude odds ratio of 1.31 with a 95% CI of 0.98–1.76 (p-value = 0.07) is not statistically significant. In this study a second external validation sample based on the medical record (a gold standard) was available. From the data of this validation sample, the sensitivity and specificity were estimated for cases (φ_{1}, ψ_{1}) and controls (φ_{0}, ψ_{0}).

However, suppose that the validation sample is reliable. Then, according to the above calculation, 42 and 50 women are over-misclassified as to antibiotic use during pregnancy for cases and controls, respectively. These numbers of misclassified subjects seem quite large. I therefore conducted a sensitivity analysis by assuming arbitrarily that the number of (under- or over-) misclassified subjects in the category of antibiotic use ranges from 4 to 40. First, I calculated the sensitivity and specificity for cases and controls, respectively (Table 4b). Next, I calculated the BAORs by using the 14 pairs of sensitivity and specificity given in Table 4b. From the results in Table 4c, all but the last two of the 14 BAORs were significant; the last two used the last two pairs of sensitivity and specificity of Table 4b. These last two pairs correspond to 34 and 40 women being over-misclassified in the category of “use” for both cases and controls. This implies that the BAORs become significant and biased away from the null value if fewer than 34 women are over-misclassified in the category of “use” for both cases and controls in the main study. If this degree of misclassification sounds more reasonable, then the inference drawn from using the validation data is clearly misleading.

Table 4a: The data of SIDS study of the exposure variable of interview response between cases and controls.

Table 4b: Pairs of the [true] sensitivity and specificity for cases and controls.

^{a} The “+” (or “-”) sign inside the parenthesis denotes the number of persons to be over- (or under-) misclassified.

Table 4c: Bias-adjusted exposure odds ratios with its p-value and 95% confidence interval.

### Discussion

Several observations are worth discussing:

A unique strength of this paper is that a simple, yet powerful, method is presented to calculate the true sensitivity and specificity based only on the data available in the main study, regardless of whether validation data are available. Hence, this new method can free researchers in epidemiology from the no-validation-data scare (Example 1). Of course, we do not know exactly in Example 1 what the truly classified table is. Nevertheless, we can get a general feeling for the effect of misclassification on the estimation of the true exposure odds ratio. When validation data are available, our method can not only identify the truly classified table, but also help to assess the reliability of the validation data, as I did in Example 2.

For both cases and controls, the sensitivity and specificity have to be calculated in pairs. This results from the fact that the marginal totals are required to be fixed in case-control studies. As a consequence, we cannot arbitrarily assign values to sensitivity and specificity separately, as was done in the probabilistic/traditional sensitivity analysis (_{10}_{16}) or in the simulation study (_{5}).

The significance of the BAOR depends not only on its sheer magnitude, but also on its standard error (Tables 3c(iii-iv)). Note that equation 17 is a nonlinear function of the true sensitivity and specificity parameters; hence its value can become very large too.

Even when just one subject is misclassified, the direction of the misclassification effect can be very different between the under- and over-misclassified scenarios (Example 1).

In both of the above examples I could have conducted an exhaustive sensitivity analysis, but chose not to do so, because my intention was only to illustrate the method.

### Appendix

Let γ_{i} denote the estimation errors between n_{i} · p̂_{i} and n_{i} · E(p̂_{i}) (equation A.1), where p_{i}’s are defined in Table 1, and p̂_{i}, φ_{i} and ψ_{i}, and Δ_{i} are given, respectively, by equations 2-3, 5-6, and 11. By using equations 2-3, equation A.1 can also be expressed in terms of p̂_{i}:

{image:31}

It is easily shown that the first two moments of γ_{i} are given, respectively, by

{image:32}

and

{image:33}

Let ε_{i} be the estimation errors on estimating the “logit” transformation of the odds that are defined by (_{11})

{image:34}

where _{i}
_{i}

By using equations 9-10 with 2-3 and A.1-A.2, and applying a Taylor series expansion of the natural logarithmic function (_{25}) to equation A.5, we obtain

{image:35}

By dropping all terms in γ_{i} of power two or higher, we have from equation A.6

{image:36}

Equation 17 follows immediately from taking the variance of equation A.7 with the use of equations A.3-A.4 and independence property of γ_{i}, i = 0, 1.
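The first-order argument can be sketched as follows. This is a reconstruction under stated assumptions rather than the original equations A.5-A.7: it writes d_{i} for the deviation p̂_{i} − E(p̂_{i}) (a symbol introduced here only for illustration), takes E(p̂_{i}) = Δ_{i} p_{i} + 1 − ψ_{i} with Δ_{i} = φ_{i} + ψ_{i} − 1, and uses the corrected estimators p̂^{*}_{i} = (p̂_{i} + ψ_{i} − 1)/Δ_{i} and q̂^{*}_{i} = 1 − p̂^{*}_{i}.

```latex
\begin{aligned}
\hat{p}_i^{*} &= p_i + \frac{d_i}{\Delta_i}, \qquad
\hat{q}_i^{*} = q_i - \frac{d_i}{\Delta_i}, \\[4pt]
\ln\frac{\hat{p}_i^{*}}{\hat{q}_i^{*}} - \ln\frac{p_i}{q_i}
  &= \ln\!\Bigl(1 + \frac{d_i}{\Delta_i\, p_i}\Bigr)
   - \ln\!\Bigl(1 - \frac{d_i}{\Delta_i\, q_i}\Bigr)
   \approx \frac{d_i}{\Delta_i}\Bigl(\frac{1}{p_i} + \frac{1}{q_i}\Bigr)
   = \frac{d_i}{\Delta_i\, p_i\, q_i}, \\[4pt]
\operatorname{Var}\bigl[\ln \hat{R}^{*}\bigr]
  &\approx \sum_{i=0}^{1} \frac{\operatorname{Var}(\hat{p}_i)}{(\Delta_i\, p_i\, q_i)^{2}} .
\end{aligned}
```

The last line uses p_{i} + q_{i} = 1, the fact that d_{i} has mean zero and variance Var(p̂_{i}), and the independence of the case and control samples.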

### Acknowledgements

This research was motivated by the author’s participation in the project of the Chuuk Landslide Study. The author would like to thank Dr. J. Malilay for her invitation to take part in that study.

Part of the results in this paper was presented by the author at the 2004 ASA Joint Statistical Meetings held in Toronto, Canada, and subsequently published in the ASA Proceedings (_{28}).

### Correspondence to

Tze-San Lee, Centers for Disease Control and Prevention, Mail Stop F-58, 4771 Buford Highway, Chamblee, GA 30341-3717, USA Phone: +1-770-488-3729; Fax: +1-770-488-1540 E-mail: tjl3@cdc.gov