# Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision

Z Wang

###### Keywords

confounding, logistic regression, propensity score method, simulation

###### Citation

Z Wang. *Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision*. The Internet Journal of Epidemiology. 2008 Volume 7 Number 2.

###### Abstract

There is increasing interest in the use of propensity score (PS) methods for confounding control. In pharmacoepidemiological studies, adjusted treatment effects are generally estimated in one of three ways: 1) stratification on PS, 2) matching on PS and 3) using PS as a covariate. To assess the bias and precision of these methods, we conducted simulations in three scenarios: 1) treatment was protective but the crude estimate was more extreme; 2) treatment had no effect but the crude estimate showed a protective effect; and 3) treatment increased the risk but the crude estimate showed a protective effect. Adjusting for confounders shifted the effect estimates toward the true values in all methods. Adjusted odds ratios from PS stratification and from using PS as a covariate were biased due to either residual confounding or over-adjustment. Matching on PS produced less biased average estimates than the other methods, but the precision of its effect estimates was lower.

Sponsor: The National Health and Medical Research Council (NH&MRC) of Australia (511013).

### Introduction

The propensity score, introduced by Rosenbaum and Rubin, ^{1} is the conditional probability of a subject’s receiving the treatment of interest given a set of covariates. The use of propensity scores for confounding control is increasing, especially for evaluating treatment effects using observational data. ^{2} However, as suggested by Sturmer et al, there is little evidence that propensity score methods yield substantially different estimates compared with conventional regression methods. ^{2} Several simulation studies have been conducted to evaluate the performance of propensity score methods. ^{3-6} In a Monte Carlo simulation study, Austin et al showed that conditioning on the propensity score produces biased estimates of the true conditional odds ratio and the true conditional hazard ratio. ^{5} In another Monte Carlo simulation study, Brookhart et al suggested that standard model building tools designed to create good predictive models of the exposure will not always lead to optimal propensity score models. ^{6} On the other hand, Cepeda et al found that propensity score estimates were less biased than the logistic regression estimates when there were six or fewer events per confounder. ^{3}

Generally, there are three ways to apply propensity scores: 1) stratification on the propensity score, 2) matching on the propensity score, and 3) using the propensity score as a covariate. ^{2} Little is known about the effect of different ways of using propensity scores on the bias and precision of treatment effect estimates. Simulation studies use computer intensive procedures to assess the performance of statistical methods in relation to a known truth. ^{7} In this study, we used simulations to examine different propensity score methods and logistic regression methods in assessing the treatment effects. We mainly focused on comparing biases and precisions of those methods under different scenarios with various sample sizes.

### Methods

### Data simulation procedures

As in typical epidemiological studies assessing treatment effects, we started with two groups: a treatment group and a non-treatment group. The variable X was coded 1 for treatment and 0 for non-treatment. The random number generator in Stata ^{8} was used to generate five confounding variables and the outcome variable. Among the five confounding variables, two were continuous and three were dichotomous (coded 1 and 0). First, we generated three uniform variables U1, U2 and U3 with values between 0 and 1. For the non-treatment group, we set the dichotomous variable W1 to 1 if U1 < 0.20 and 0 otherwise, W2 = 1 if U2 < 0.10 and 0 otherwise, and W3 = 1 if U3 < 0.40 and 0 otherwise. For the treatment group, we set W1 = 1 if U1 < 0.60, W2 = 1 if U2 < 0.50 and W3 = 1 if U3 < 0.20. The two continuous variables, W4 and W5, were generated with expected means of 0.25 and 0.20, respectively, in the treatment group and -0.25 and -0.20 in the non-treatment group. The standard deviations of both variables in both groups were 1. These procedures generated five variables (W1-W5) associated with the treatment (X).
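The confounder-generation step can be sketched in Python as follows. This is only an illustration and not the Stata code used in the study: the function name `simulate_confounders`, the seed handling and the choice of a normal distribution for W4 and W5 are assumptions (the text specifies only their means and standard deviations).

```python
import random

def simulate_confounders(n_per_group, seed=None):
    """Return a list of (x, w1, w2, w3, w4, w5) tuples for both groups.

    Hypothetical helper: cutoffs and means come from the text above;
    normality of W4 and W5 is an assumption.
    """
    rng = random.Random(seed)
    # P(W = 1) for W1..W3 and means of W4, W5, by treatment status X.
    settings = {
        0: {"p": (0.20, 0.10, 0.40), "mu": (-0.25, -0.20)},
        1: {"p": (0.60, 0.50, 0.20), "mu": (0.25, 0.20)},
    }
    rows = []
    for x in (0, 1):
        for _ in range(n_per_group):
            # Dichotomous confounders: 1 if a uniform draw falls below the cutoff.
            w1, w2, w3 = (int(rng.random() < p) for p in settings[x]["p"])
            # Continuous confounders: group-specific mean, standard deviation 1.
            w4, w5 = (rng.gauss(mu, 1.0) for mu in settings[x]["mu"])
            rows.append((x, w1, w2, w3, w4, w5))
    return rows
```

Calling `simulate_confounders(1000, seed=1)` would give one dataset of the largest size used in the study, with 1000 untreated subjects followed by 1000 treated subjects.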

The outcome variable (Y) was modeled using logistic regression as a function of the confounding variables (W1-W5) and the treatment variable (X) in three scenarios:

Scenario 1. The odds ratio as a measure of treatment effect was set to be 0.70. The odds ratios for confounders W1, W2, W3, W4 and W5 were 0.3, 0.5, 3.0, 0.4 and 0.5. The baseline probability of having the outcome (Y=1) was 0.30 when all Ws and X were 0. The probability of the outcome for a subject with a specific combination of Ws and X was calculated as:

logit(Y) = ln(0.3/0.7) + ln(0.3)*W1+ln(0.5)*W2+ln(3.0)*W3 + ln(0.4)*W4 + ln(0.5)*W5 + ln(true OR)*X

where the true OR = 0.70 in scenario 1.

Pr(Y|X, Wi) = exp(logit(Y))/(1 + exp(logit(Y)))

The outcome variable (Y) was set to be 1 if the randomly generated uniform number was less than Pr(Y|X, Wi), and to be 0 if otherwise.

Scenario 2. The associations between confounders (Wi) and the outcome (Y) were the same as those in Scenario 1 but there was no treatment effect (the true OR = 1).

Scenario 3. The associations between confounders (Wi) and the outcome (Y) were the same as those in Scenarios 1 and 2 but the true OR = 1.6.
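The outcome model shared by the three scenarios can be written as a short Python sketch. The coefficients are those given above; the function names `outcome_probability` and `simulate_outcome` are illustrative and are not taken from the study's Stata code.

```python
import math
import random

BASELINE_ODDS = 0.3 / 0.7                  # baseline probability 0.30
CONFOUNDER_ORS = (0.3, 0.5, 3.0, 0.4, 0.5)  # odds ratios for W1..W5

def outcome_probability(x, w, true_or):
    """Pr(Y = 1 | X, W1..W5) under the logistic model in the text."""
    logit = math.log(BASELINE_ODDS) + math.log(true_or) * x
    logit += sum(math.log(o) * wi for o, wi in zip(CONFOUNDER_ORS, w))
    return 1.0 / (1.0 + math.exp(-logit))

def simulate_outcome(x, w, true_or, rng):
    """Y = 1 if a uniform draw is below Pr(Y = 1 | X, W), else 0."""
    return int(rng.random() < outcome_probability(x, w, true_or))

# Worked check: a treated subject with all confounders at 0 in scenario 1
# has odds (0.3/0.7) * 0.7 = 0.3, so the probability is 0.3 / 1.3.
p_treated = outcome_probability(1, (0, 0, 0, 0, 0), 0.70)
```

Passing `true_or=1.0` or `true_or=1.6` gives scenarios 2 and 3 with the confounder coefficients unchanged.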

### Sample sizes and numbers of simulations

We performed four sets of simulations with 50, 100, 500 and 1000 subjects, respectively, in the treatment group, and the same numbers in the control group. We generated 36000 datasets, with 3000 datasets for each combination of scenario and sample size.

### Adjustment for confounding

In each of the 36000 simulated studies, we estimated the crude and adjusted odds ratios using conventional logistic regression and three propensity score methods.
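As an illustration of the crude estimate, the sketch below computes the odds ratio and the standard error of its logarithm from the 2×2 table of treatment by outcome. A logistic regression with X as the only covariate gives the same odds ratio; the function name and the refusal to handle empty cells are choices made for this example only.

```python
import math

def crude_odds_ratio(x, y):
    """Crude OR and SE of log(OR) from paired 0/1 lists x (treatment), y (outcome)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # treated cases
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # treated non-cases
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # untreated cases
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # untreated non-cases
    if 0 in (a, b, c, d):
        raise ValueError("odds ratio undefined when a cell of the 2x2 table is empty")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, se_log_or
```

With small samples, such as 50 subjects per group, empty cells can occur, which is why the sketch raises an error instead of returning an estimate.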

Logistic regression method: To estimate the effect of the treatment on the outcome, we applied logistic regression with the outcome (Y) as dependent variable and all confounding factors (Wi) and treatment variable (X) as independent variables.

Propensity score stratification: We obtained the propensity score of the treatment (X), the probability of being treated, using logistic regression with the treatment (X) as the dependent variable and all confounders (Wi) as independent variables. The propensity scores were divided into five strata with the 20th, 40th, 60th and 80th percentiles as the cutoffs. We then fitted logistic regressions with the outcome variable (Y) as the dependent variable and the treatment (X) and the categories of the propensity score as independent variables.
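The stratification step can be sketched as below, assuming the propensity scores have already been estimated. The helper name `quintile_strata` is hypothetical, and the percentile convention of Python's `statistics.quantiles` may differ slightly from the one used by Stata.

```python
import bisect
import statistics

def quintile_strata(scores):
    """Assign each propensity score to one of five strata (0 = lowest).

    Returns (strata, cutoffs), where cutoffs are the 20th, 40th, 60th
    and 80th percentiles of the scores.
    """
    cutoffs = statistics.quantiles(scores, n=5)
    # A score's stratum is the number of cutoffs lying strictly below it.
    strata = [bisect.bisect_left(cutoffs, s) for s in scores]
    return strata, cutoffs
```

The resulting stratum index would then enter the outcome model as a categorical covariate alongside the treatment.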

Propensity score matching: Propensity score matching refers to the pairing of treated and untreated subjects with similar propensity scores and the discarding of unmatched subjects. As proposed by Rubin, all propensity scores were transformed to the logit scale, which is referred to as the linear propensity score. ^{9, 10} We matched each treated subject to the untreated subject with the closest propensity score (1:1 matching) within a range of ±0.25 on the linear propensity score. If there was no untreated subject within this range for a treated subject, that subject was not included in the conditional logistic regression. A unique identification number was assigned to each matched pair, and this variable was used as the identifier for the matched groups in the conditional logistic regression, in which the dependent variable was the outcome (Y) and the independent variable was the treatment (X).
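A minimal sketch of 1:1 caliper matching on the linear propensity score is given below. It is not the CMATCH program, whose algorithm is not described here: greedy matching without replacement, processing treated subjects in input order, is an assumption made for this example. The second function uses the standard result that, for 1:1 matched pairs with a single binary exposure, the conditional logistic regression odds ratio equals the ratio of the two kinds of discordant pairs.

```python
import math

def logit(p):
    """Linear propensity score: the propensity score on the logit scale."""
    return math.log(p / (1.0 - p))

def greedy_match(ps_treated, ps_untreated, caliper=0.25):
    """Return (treated_index, untreated_index) pairs within the caliper.

    Assumed algorithm: greedy nearest neighbour, each untreated subject
    used at most once.
    """
    lin_u = [logit(p) for p in ps_untreated]
    available = set(range(len(lin_u)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        lin_t = logit(p)
        # Closest remaining untreated subject; ties broken by lowest index.
        j = min(available, key=lambda k: (abs(lin_u[k] - lin_t), k))
        if abs(lin_u[j] - lin_t) <= caliper:
            pairs.append((i, j))
            available.remove(j)
        # Otherwise the treated subject stays unmatched and is discarded.
    return pairs

def matched_pair_odds_ratio(pairs, y_treated, y_untreated):
    """Conditional OR and SE of log(OR) from discordant matched pairs."""
    n10 = sum(1 for i, j in pairs if y_treated[i] == 1 and y_untreated[j] == 0)
    n01 = sum(1 for i, j in pairs if y_treated[i] == 0 and y_untreated[j] == 1)
    if n10 == 0 or n01 == 0:
        raise ValueError("odds ratio undefined without both kinds of discordant pair")
    return n10 / n01, math.sqrt(1 / n10 + 1 / n01)
```

Only discordant pairs contribute to the estimate, so the effective sample size is smaller still than the number of matched pairs.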

Propensity modeling: We took the linear propensity score, a continuous variable, as a covariate in the logistic regression. The dependent variable was the outcome (Y) and the independent variables were the treatment (X) and the linear propensity score. In this study, we only assessed a linear relationship between the propensity score and the outcome.

### Measures of interest

Bias: Odds ratios by four different methods were calculated and compared with the true values, which were 0.7 in scenario 1, 1.0 in scenario 2 and 1.6 in scenario 3. The differences between the true and estimated odds ratios indicated the bias of the effect estimates. Average differences of log odds ratios were presented according to the methods, sample sizes and scenarios.

Precision: We calculated standard errors of log odds ratios as a measure for precision. Since the same data sources were used for all four methods, the average standard errors among different methods were compared.
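The two summary measures can be expressed as a short Python sketch. The function name `summarize` and its argument layout are illustrative; the inputs are the odds ratios and standard errors of the log odds ratios from the simulated datasets for one method, scenario and sample size.

```python
import math
import statistics

def summarize(estimated_ors, standard_errors, true_or):
    """Return (bias, mean_se) over a set of simulated datasets.

    bias    : average of log(estimated OR) - log(true OR)
    mean_se : average standard error of the log odds ratio
    """
    bias = statistics.fmean(math.log(o) - math.log(true_or) for o in estimated_ors)
    mean_se = statistics.fmean(standard_errors)
    return bias, mean_se
```

A positive bias means the method overestimates the odds ratio on average, and a larger mean standard error means lower precision.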

We conducted all simulations and analyses using Stata 10. ^{8} Because matching was a tedious and time-consuming procedure, we developed a Stata program (CMATCH) to perform this task. The change-in-estimate approach has been recommended for selecting confounders for control. ^{11} All confounding variables in this study were true confounders and all were included in the analyses, so confounder selection was not the focus of this study. However, we demonstrate the distortion of the odds ratio by these confounders using a Stata program. ^{12}

### Results

### Characteristics of dataset simulations

Table 1 shows the characteristics of the treated and untreated groups. The confounding variables differed substantially between the two groups. Table 2 shows the numbers of cases in the three scenarios according to sample size. The numbers of cases were very small, ranging from 3 to 10 per confounder, when the sample size was 50 in the treatment group. When the sample size was 1000, there were over 100 cases per confounder.

##### Figure 1

##### Figure 2

One example dataset (one of the 3000 datasets with a sample size of 1000) was randomly selected to demonstrate the presence of confounding effects from the five variables W1-W5, using the change-in-effect estimate method. ^{12} Figure 1 shows the effect estimates after adjusting for each of the confounders in a stepwise fashion, ordered by the magnitude of the change-in-effect estimate. All five confounding variables contributed to the distortion of the odds ratio estimates (the change-in-estimate) in all three scenarios. Adjusting for the confounding variables moved the effect estimates from strongly protective effects (crude odds ratios) to the true effect values.

### Effect estimates

Table 3 shows the average odds ratios according to sample size and scenario combinations. The crude odds ratios in all three scenarios showed a strong protective effect of the treatment. The odds ratios adjusted for confounding factors using logistic regression and the three propensity score methods were closer to the true values, indicating that the treatment had 1) a protective effect in scenario 1 but to a much lesser extent than the crude estimate, 2) no effect in scenario 2, and 3) a harmful effect in scenario 3, opposite to the crude estimate. However, the methods differed in the bias and precision of their effect estimates.

### Bias

Figure 2 shows the differences between the estimated and true log odds ratios. The propensity score stratification method consistently produced an odds ratio that deviated from the true value in the direction of the crude odds ratio, regardless of the sample size and the magnitude of the true value. In two of the three scenarios (1 and 2), propensity score linear modeling yielded average odds ratios that were higher than the true values, in the direction away from the crude effect estimate. The bias from propensity score matching tended to be less extreme than that from propensity score stratification and propensity score modeling.

### Precision

Mean standard errors are shown in Figure 3. Propensity score stratification and propensity score modeling had lower mean standard errors than conventional logistic regression. Propensity score matching had the highest mean standard errors.

##### Figure 6

We calculated the numbers of pairs used in the propensity score matching method. On average, 42 (min: 23, max: 61), 220 (175, 264) and 443 (378, 513) treated subjects were matched to untreated subjects, which were 42%, 44% and 44% of the total subjects, for simulations with 100, 500 and 1000 treated subjects, respectively.

To explore possible explanations for the higher bias of the stratification method and the higher mean standard errors of the propensity score matching method, we used the example dataset for Figure 1 to generate Figure 4, which demonstrates the striking difference in the distribution of propensity scores between the treatment and non-treatment groups. Even within each propensity score stratum, the two groups still had different propensity scores. The figure also shows that only a small proportion (shaded area) of treated and untreated subjects could be matched.
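The within-stratum imbalance described here can be checked numerically as well as graphically. The sketch below is a hypothetical helper, not part of the original analysis: it reports the mean propensity score of untreated and treated subjects in each stratum, so that a gap between the two means within a stratum indicates remaining imbalance.

```python
from collections import defaultdict
from statistics import fmean

def mean_score_by_stratum(scores, treated, strata):
    """Map stratum -> (mean score of untreated, mean score of treated).

    A mean is None when a stratum contains no subjects from that group.
    """
    groups = defaultdict(lambda: {0: [], 1: []})
    for s, x, k in zip(scores, treated, strata):
        groups[k][x].append(s)
    return {
        k: tuple(fmean(groups[k][x]) if groups[k][x] else None for x in (0, 1))
        for k in sorted(groups)
    }
```

If treated subjects have systematically higher mean scores than untreated subjects in every stratum, adjusting for the stratum alone leaves part of the confounding uncontrolled.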

### Discussion

In this study, we found that the propensity score methods provided biased effect estimates. Residual confounding persisted in the propensity score stratification method regardless of sample size and the strength and direction of the true treatment effects. Using the propensity score as a linear predictor also produced biased effect estimates, but the direction of this bias can differ from that of residual confounding. Matching by propensity scores excluded a large proportion of subjects and resulted in effect estimates with lower precision.

Several systematic reviews have been conducted on this topic. ^{2, 13, 14} Sturmer et al found that 13% of studies using a propensity score method had an effect estimate that differed by more than 20% from that obtained with a conventional regression model. ^{2} Shah et al found that the statistical significance of the association differed between the two methods in 10% of the effect estimates, in which the association was statistically significant using conventional regression but not significant using propensity score methods. ^{13} Most observational studies had similar results whether using conventional regression or propensity scores to adjust for confounding. ^{2, 13} Drake showed that omitting a confounder in the propensity score method produces biases comparable to those in a conventional regression model. ^{15} Using an example dataset from Hosmer and Lemeshow, ^{16} Drake and Fisher reported that the propensity score method leads to a different conclusion with regard to the effects of smoking on birthweight. ^{17} However, it is difficult to assess the performance of different methods using real datasets because the true values of the treatment effects are unknown.

Simulation studies provide an opportunity to assess the performance of different statistical methods in relation to a known truth using computer intensive procedures. ^{7} Several simulation studies have been conducted on the propensity score methods. ^{3, 5, 6, 18, 19} Brookhart et al revealed that the model that best predicted exposure did not yield the optimal propensity score model in terms of efficiency when a non-confounder was included in the propensity score model. ^{6} They suggested that variables that are unrelated to the exposure but related to the outcome should always be included in a propensity score model. ^{6, 9} Austin et al found that failure to include an important confounding variable in the propensity score model can result in variable imbalance between exposed and unexposed subjects and in biased estimation of the effect. ^{5} In this study, all variables were true confounding factors and the performance of the different propensity score methods was assessed using the same datasets with the same confounding variables.

Cepeda et al found that the propensity score estimates were less biased than the logistic regression estimates when there were six or fewer events per confounder. Overall, the propensity score was more robust, more precise and had more power than logistic regression. ^{3} The purpose of this study was not to compare the logistic regression estimates with those of different propensity score methods. Since we generated the data according to known logistic regression models, the logistic regression models were theoretically correct. However, we demonstrated some potential problems of different propensity score methods. In this study, even when the number of cases was about six per confounder (when treatment group n = 50), the propensity score methods produced biased estimates. The statistical power with such a small sample size is too low to provide reasonable effect estimates regardless of the method. Even if there were no confounding in this study, the sample sizes required to detect odds ratios of 0.7 and 1.6 with 80% power would be 638 and 314, respectively.

Among the three ways of applying propensity scores, the propensity score stratification method produced effect estimates biased toward the crude estimate, indicating the presence of residual confounding. We did not check whether the distributions of the confounders in the treated and untreated groups were similar within each stratum. However, residual confounding is likely to be a common phenomenon because, within each stratum, treated subjects can still have higher propensity scores than their untreated counterparts, as illustrated in Figure 4.

Matching by propensity scores can efficiently balance the propensity scores between the two groups at the expense of losing a large proportion of the subjects. In our simulated data, only 42% to 44% of subjects were matched, and the precision of the estimates from the matching method was lower than that of the other methods.

The linear modeling of propensity scores as a continuous variable can also produce biased estimates. In two of the three scenarios, the linear modeling produced estimates biased to the opposite side of the crude estimate, indicating an over-adjustment. We did not explore whether fitting a non-linear relationship in the model would change the magnitude or alter the direction of the bias. Only one set of confounding variables was used in all three scenarios and four methods. In the real world, the relationships among confounders, treatment and outcome can be more complicated. Therefore, the magnitude and the direction of the bias of propensity score methods are likely to vary accordingly.

### Conclusion

Propensity score methods potentially produce biased effect estimates. Residual confounding is common when using the propensity score stratification method. Propensity score matching results in lower precision of effect estimates. Linear modeling of propensity scores may not be appropriate for all data. A better understanding of the benefits, limitations and appropriate use of the propensity score methods is needed before they are widely used.

### Acknowledgement

Zhiqiang Wang was supported by the National Health and Medical Research Council (NHMRC) of Australia (511013).