Do faculty show the ‘halo effect’ in rating students compared with standardized patients during a clinical examination?
D Lie, J Encinas, F Stephens, M Prislin
Keywords
clinical skills, faculty evaluation, standardized examination, student assessment
Citation
D Lie, J Encinas, F Stephens, M Prislin. Do faculty show the ‘halo effect’ in rating students compared with standardized patients during a clinical examination? The Internet Journal of Family Practice. 2009; Volume 8, Number 2.
Abstract
Introduction
Faculty feedback to medical students regarding their performance in clinical encounters contributes to learning and improvement of clinical skills1-2. Observation of student performance is an integral part of student evaluation in clerkships and can be achieved in different ways, such as observed histories and physicals in the inpatient or outpatient setting3-4, or in a standardized clinical examination5-7. However, limited faculty time and busy practices often restrict opportunities for observation with feedback in actual or standardized encounters. Among the myriad of assessment tools available8, the mini-clinical evaluation exercise (mini-CEX) was developed and validated9-10 with medical students and residents3 to address some of these faculty time constraints. The mini-CEX allows observation, evaluation and feedback by faculty that focus separately on the skills of history taking, physical examination, clinical reasoning, humanistic qualities and overall clinical competence. The mini-CEX has been adapted for use in medical student11-13 and residency14 settings for summative as well as formative evaluation by faculty. Many preclinical Doctoring courses and clerkships now incorporate the Objective Structured Clinical Examination (OSCE) or other standardized patient scenarios5,7,15 to ensure that students are not only observed and rated16 but also given feedback by standardized patients (SPs) and faculty17. In such settings, faculty or course directors also use SP scores to provide feedback to learners.
A comprehensive review of assessment tools used for feedback in clinical teaching18 suggested that characteristics of teacher evaluations vary across educational settings and that future studies should focus on narrowly defined study populations. It is unclear how well measures of performance by trained SPs correlate with faculty ratings of similar skills for students, since the two types of observations are often independent of one another and occur in separate settings (standardized vs. actual) using different scales with different construct validity. It has been suggested that faculty may exhibit the ‘halo effect’ and tend to overestimate performance when asked to assess the skills of trainees well known to them, applying their global impressions rather than objective ratings of specific behaviors19-21, whereas SPs, trained to use checklists to observe specific verbal and nonverbal behaviors in standardized encounters, are more likely to be “objective” in their ratings, relying less on global impressions22. Providing consistent feedback based on faculty and SP ratings becomes a challenge when the ratings or scores from SPs and faculty diverge, especially when faculty ratings are consistently higher than SP ratings. Despite this potential for discordance, there is a paucity of studies examining the correlation between ratings and rating patterns produced by faculty and SPs in the same clinical encounters18,20.
We therefore conducted a study to test the hypothesis that faculty overestimate their students’ clinical performance, resulting in discordance with SP ratings. Our purpose was first to establish the validity and reliability of two commonly used validated assessment tools, the faculty-completed mini-CEX10 and the patient-completed Patient Physician Interaction Scale (PPI)15, in a four-station OSCE for second year medical students. Our second objective was to examine the distribution pattern of ratings by SPs and by faculty. In addition, we examined the correlation between faculty and SP raters for four individual skill domains within the modified mini-CEX. Rather than asking faculty to use the exact same scales as trained SPs, we used different rating measures for SPs and faculty to reflect real-world clinical teaching, where faculty do not generally use case-specific SP checklists to rate student skills. The mini-CEX is not case-specific and was constructed and validated for use by faculty in most clinical encounters, not for use by SPs. In addition, SP measures take considerably longer to complete, require time-consuming training, and are impractical for faculty to complete without case-specific training. The institutional review board of the University approved the study.
Methods
Setting and Curriculum
The study was conducted at one US medical school with class sizes of 92 to 104. Three consecutive classes of students were tested on their clinical skills using the same four-station OSCE at the end of 18 months of a required longitudinal Doctoring course. The primary goal of the 180-hour Doctoring course was to introduce basic interviewing, physical examination, communication, and clinical reasoning skills to prepare students for third year clerkships. The course started in the fall of the first year, continued to the spring of the second year, and took place concurrently with other required basic science courses including Anatomy, Physiology, Pharmacology, and Pathology. The course comprised eight consecutive organ-system-based clinical cases (matched to teaching in the basic science courses) with opportunities for structured observed interview and physical examination of standardized and real patients in weekly two-hour problem-based learning (PBL) sessions. Students were taught in groups of four to six by a pair of faculty, one a physician and the other a related health professional (e.g. a psychologist, nurse, or social worker). Small group teaching was standardized using a structured schedule with learning objectives, SP cases, and homework assignments, supported by intensive monthly faculty development. PBL sessions were videotaped, and feedback on interviewing, physical examination skills, communication, and group participation was given to students. Small group teaching was supplemented by teaching of content themes relevant to the cases, weekly half-day preceptorships with practicing community physicians during the second year, and required readings from textbooks. Student evaluations consisted of individual assessments by PBL instructors twice a year, assessments by community preceptors, a written test, and the final OSCE given in the second year. Physician PBL instructors had at least 32 hours of face-to-face teaching contact with their small groups, evaluated students weekly on their homework and class performance, and met individually with each of their students once every 2 to 3 months.
The OSCE was administered during the Winter Quarter of the second year to the entire class over a three-week period on eight half days at the medical school’s Clinical Skills Training Center with a level of test security equivalent to other high stakes clinical examinations offered at the school.
Study Participants
Student participants were second year medical students. Faculty raters were instructors in the PBL groups for the students. SPs were trained actors with prior experience as raters in high stakes clinical practice examinations.
OSCE Cases
The goal of the OSCE was to determine that students had the basic interviewing, physical examination and clinical reasoning skills appropriate to the course goals and content. Four cases were developed and pilot-tested in the year prior to the study to address learning in the skill domains and organ systems taught. Each station was 25 minutes long, with a 10-minute inter-station exercise for writing up the case or answering case-specific questions. The cases were: a 45-year-old businessman presenting with atypical chest pain with a differential diagnosis of ischemic heart disease/angina, gastroesophageal reflux and pleurisy; a 60-year-old woman with transient neurological symptoms and vascular disease; a 30-year-old man or woman with acute back pain secondary to muscle strain from work; and a 23-year-old male college student with acute right lower quadrant abdominal pain with a differential diagnosis of appendicitis, urinary infection, kidney stone or testicular torsion.
During the OSCE, faculty observed the encounters live on a television monitor outside the room. They were provided with written instructions for the modified mini-CEX to rate each skill domain based on their impression of the skill level appropriate for a second year medical student, and to provide written narratives to support their ratings. All faculty were introduced to the modified mini-CEX during the Doctoring course and were trained in the use of anchors for the rating scales within the mini-CEX. We chose to use the mini-CEX for the OSCE because of faculty familiarity with the tool and for consistency with clerkships, which also use the mini-CEX for student assessment.
Rating of Student Clinical Skills
Statistical Analysis
Statistical analysis was performed using the software package JMP version 7.0.1 (SAS Institute Inc., Cary, NC). Because analysis by student class showed no differences in means and SDs among the 3 classes, the data were analyzed in aggregate. Means and standard deviations for the modified mini-CEX and PPI were calculated. Multivariate analysis was used to calculate Cronbach’s α to establish internal consistency reliability for individual items and combined items of the modified mini-CEX (faculty raters) and the PPI (SP raters).
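For readers who wish to reproduce the internal consistency calculation outside of JMP, the following is a minimal Python sketch, not the study's actual analysis script; the file name and column names are hypothetical placeholders for item-level ratings (rows are student-station observations, columns are items).

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of item totals)."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical layout: one row per student-station observation,
# columns CEX1-CEX4 for the four faculty mini-CEX skill domains.
ratings = pd.read_csv("mini_cex_ratings.csv")  # illustrative file name
print(round(cronbach_alpha(ratings[["CEX1", "CEX2", "CEX3", "CEX4"]]), 2))
```

The same function applied to the PPI item columns would yield the scale's α reported in the Results.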
Pearson correlation coefficients were calculated to examine concordance between faculty ratings and SP ratings for each of the four corresponding skill domains of the modified mini-CEX.
Subgroup analysis was performed to examine the correlations using Spearman ρ, by student gender, student class, faculty gender, and clinical case.
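As an illustration of the concordance and subgroup analyses, the sketch below assumes a hypothetical paired data file with one row per observed encounter containing the faculty rating, the matched SP score, and grouping variables; the column names are illustrative, not the study's actual variable names.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired file: one row per encounter, with the faculty mini-CEX
# rating, the matched SP score, and grouping variables for subgroup analysis.
df = pd.read_csv("paired_ratings.csv").dropna(subset=["faculty_hx", "sp_hx"])

# Overall concordance for one skill domain (history taking shown here)
r, p = pearsonr(df["faculty_hx"], df["sp_hx"])
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

# Rank-based subgroup analysis, e.g. by student gender
for gender, sub in df.groupby("student_gender"):
    rho, _ = spearmanr(sub["faculty_hx"], sub["sp_hx"])
    print(f"{gender}: Spearman rho = {rho:.2f}")
```

Repeating the loop over student class, faculty gender, and clinical case would reproduce the remaining subgroup comparisons reported in Table 2.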
Results
Study Participants
The OSCE was conducted in the winter of 2005, 2006 and 2007. Each of the three classes (N=303 total) was similar in demographics. Mean age was 23 years, 50% were male, and distribution of ethnicity was as follows: 40-50% white, 40% Asian, 10% Hispanic, 1% black, and 10% did not self-categorize. The majority of students were science majors and residents of California.
There were 34 physician faculty raters (16 male) from the primary care settings of Family Medicine (n=16), Internal Medicine (n=15), Pediatrics (n=2), and Psychiatry (n=1). Faculty ranks ranged from Assistant to Full Professor. Individual faculty experience teaching in the Doctoring course ranged from 3 to 10 years. Each faculty member rated 3 to 17 students. The majority of faculty (25 of 34) taught students during all 3 years of the study.
Twenty-four SPs performed in the three OSCEs over the 3 years, and each portrayed the same case throughout. There were two SPs for each clinical case, who were trained concurrently for 4 hours each using videotapes of student performance as “gold standards” for rating. Returning SPs were retrained each year for 2 hours with review of the case and rating standards.
Overall Student Performance
Mean (SD) scores out of 5 for the modified mini-CEX clinical skill domains assessed by faculty ranged from a low of 3.04 (± 0.76) for physical examination to a high of 3.52 (± 0.58) for humanistic qualities (Table 1), equivalent to percentage scores of 60.80% and 70.40%, respectively. Mean (SD) SP percentage scores ranged from a low of 52.90% (± 16.99%) for physical examination to a high of 74.85% (± 17.72%) for history taking. The mean score for the PPI summed total was 25.45 (± 3.02) out of a maximum of 35 points, and the mean overall score (OS) from SPs was 3.69 (± 0.58) out of a maximum of 5 points (Figure 1).
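The percentage equivalents quoted above are simple linear conversions of the 5-point mini-CEX means; a brief check using the values reported above is:

```python
# Converting the reported mini-CEX means (scale maximum of 5) to percentage scores
for domain, mean_score in [("physical examination", 3.04), ("humanistic qualities", 3.52)]:
    print(f"{domain}: {mean_score / 5 * 100:.2f}%")  # 60.80% and 70.40%
```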
Mean scores for each case and by skill domain assessed by SPs did not improve across the three years of the study (data not shown), suggesting good test security across the years of OSCE administration.
Internal consistency reliability of the PPI and the modified mini-CEX
The overall Cronbach’s α for the PPI was 0.86. The overall Cronbach’s α for the entire mini-CEX was 0.89 (Table 1).
Faculty-SP Ratings – Score Distribution and Concordance
Both faculty and SPs used the full range of the individual scales (see Figure 1 for the distribution of scores for each measure), with scores demonstrating a normal distribution. The Pearson correlation coefficients between faculty and SP ratings for the four corresponding skill domains ranged from 0.214 to 0.288, indicating modest concordance.
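The score distributions are reported descriptively (Figure 1); one hedged way to inspect them, again assuming the hypothetical paired data file used in the earlier sketches, would be a histogram plus a formal normality test:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import shapiro

# Illustrative only: the study presents the distributions graphically (Figure 1).
df = pd.read_csv("paired_ratings.csv")  # hypothetical file, as above
scores = df["faculty_overall"].dropna()

plt.hist(scores, bins=9, edgecolor="black")
plt.xlabel("Faculty overall rating (1-5)")
plt.ylabel("Number of students")
plt.show()

w_stat, p_value = shapiro(scores)  # formal check of normality
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")
```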
Subgroup analysis (Table 2) showed that the Spearman ρ was lower in all four modified mini-CEX domains for female compared with male students. The range was 0.10 to 0.18 for females and 0.23 to 0.33 for males. There were otherwise no consistent patterns of correlation by case, student class, or faculty gender (Table 2). However, one of the four cases (abdominal pain, case 3) showed a higher range of variability in concordance (Spearman ρ) across skill domains than the other three cases.
Figure 1
Figure 2
Figure 3
Note: The number of observations for each measure is lower than 303 due to some missing data.
Discussion And Conclusion
We examined the internal consistency reliability of two measures (one for use by SPs and one for faculty) used for clinical skills assessment of second year medical students performing in an OSCE, across 3 consecutive classes of students. We confirmed prior reports of the reliability and validity of both measures, with Cronbach’s α of 0.86 for the PPI and 0.89 for the mini-CEX. We also tested the concordance of faculty and SP measures of equivalent clinical skills, for history taking, physical examination, communication/humanistic qualities and overall skills across four stations, using faculty raters who taught the students and had experience and training in using the modified mini-CEX. As expected24, student performance as rated by both faculty and SPs was lowest in the area of physical examination, consistent with our previous observations of second year student performance. Faculty and SPs diverged in the area of highest mean scores: faculty scores were highest for humanistic qualities (CEX4) while SP scores were highest for history taking (Hx).
Although the concordance between SP and faculty raters was modest, with a range of 0.214 to 0.288 for the 4 different skills examined, both sets of raters in aggregate showed a consistent normal distribution pattern of ratings across the entire scales of their respective assessment tools. There was no suggestion of a consistent “halo effect” in either group of raters. The concordance between faculty and SPs was somewhat lower for female compared with male students, but no other pattern related to class, clinical case or faculty gender was observed.
Strengths of our study include the large number of students assessed, with almost 100% of data available for analysis. The same curriculum was delivered to each of the 3 groups of students. Student demographics were similar across the 3 years, allowing aggregate analysis. The total number of faculty and SPs was small and consistent across the study years. The majority of faculty were experienced in PBL instruction and clinical skills rating, and all had similar contact hours with their students. The cases and SPs used were the same across the three years of the study. The number of students was sufficiently high to permit meaningful subgroup analysis.
There are some limitations to this study. First, we used aggregate data with means representing SP and faculty scores instead of individual paired comparisons. This was because each student-SP encounter was observed and rated by a single unique faculty member and a unique SP. However, we had sufficient observations for each faculty member and still found no “outlier” faculty with tendencies to score narrowly at the upper or lower end of the modified mini-CEX scale25. Second, we converted SP scores to match the skill domains of the modified mini-CEX, but the skills measured using this conversion are not equivalent in construct. For example, the sum of the PPI items does not necessarily equate to “humanistic qualities” as measured by the modified mini-CEX (CEX4). The reason we chose this strategy, instead of having faculty use the same rating scales as the SPs, was to reflect the “real world” conditions of faculty assessment of clinical skills. In clinical precepting by faculty, case-specific checklists for encounters are not generally adopted because of the unpredictability of each encounter and each case type. This mismatch in construct likely accounts for the low to modest concordance we found. Lastly, this study was conducted at one medical school and results may not be generalizable to other schools with different faculty and SP characteristics and experience.
The finding that this diverse (by gender, specialty and teaching experience) group of faculty showed no overall tendency to overestimate the skills of students known to them is reassuring and may be explained by the faculty development provided to this particular faculty group. Our study suggests that agreement between faculty and SPs about student skills in an observed encounter is not high when different rating scales are used, which supports prior literature differentiating global ratings from checklist ratings26-27. It is likely that faculty in our study adopted a global approach for each skill, while SPs relied on itemized behavior-specific checklists for all the skills except the overall score. Correlation between faculty and SP ratings was modest (range 0.214 to 0.288), consistent with this difference in rating approach.
Acknowledgements
The authors gratefully acknowledge the dedication and commitment of the many volunteer faculty who taught in the Doctoring course, our medical students, and Sue Ahearn, Charlotte Fesko and the Clinical Skills Center for maintaining the highest standards of case development. This project was partially supported by funding from an award to DAL from the National Institutes of Health (NIH), National Heart, Lung and Blood Institute, K07 HL079256-01 (2004-10), and from the Bureau of Health Professions, grant D55HP03362-01-00 (2004-9). The manuscript contents are solely those of the authors and do not necessarily reflect the views of the National Institutes of Health.