Multiple Diseases Discriminated by Quantitation of Blood Transcriptome
S Chao, K Marshall, F El Khettabi, H Tang, C Liew, S Mok
Keywords
blood, diagnostics, gene profiling, microarray, transcriptome
Citation
S Chao, K Marshall, F El Khettabi, H Tang, C Liew, S Mok. Multiple Diseases Discriminated by Quantitation of Blood Transcriptome. The Internet Journal of Genomics and Proteomics. 2009 Volume 6 Number 1.
Abstract
The steady circulation and physiologically interactive nature of blood ensures that this dynamic system encounters, transmits, and responds to a wide range of biological signals. In this context, we hypothesized that quantitative measurement of the blood transcriptome can enable the identification and validation of RNA transcripts that are specifically associated with the presence of a particular disease or clinical condition. In the present study, we have used blood RNA expression profiling of 631 samples in conjunction with microarray technology to generate highly discriminative panels of 10 pairs of probe sets for each of four separate clinical conditions (gender, colorectal cancer, prostate cancer and osteoarthritis). The robust training set performance of each disease- or condition-specific multi-gene panel was corroborated with an independent test set, with areas under the receiver-operating characteristic curves ranging from 0.87 to 0.93 across the four conditions in the test set population. This study demonstrates that quantitative measurement of the blood transcriptome, in conjunction with microarray technology, can be used to generate highly discriminative multi-gene panels for many clinical conditions. This approach has great potential to enable the simultaneous monitoring of multiple disease states or clinical conditions from a single blood sample.
Introduction
In 1932, physiologist Walter Cannon penned his classic
The circulating peripheral blood system is a critical integrative force by virtue of the blood’s ongoing real-time involvement in the regulation, coordination, metabolism and immune maintenance of essentially all cells, tissues and organs. Functions of blood cells include transporting nutrients, oxygen and biomolecules, and removing cellular wastes. Blood is further intimately involved in immune surveillance throughout the body, and in the delivery of immune factors and healing mediators to sites of disease, infection and injury. Thus, the steady circulation and physiologically interactive nature of blood ensures that this dynamic system encounters, transmits, and responds to a wide range of biological signals [2-5].
These dynamic, integrative features of blood, considered in context with the need for maintaining homeostasis, suggest that the presence of a specific disease or clinical condition will be reflected in specific patterns of gene expression in blood, i.e. transcriptomic signatures. The transcriptome is the complete set of RNA transcripts present in a cell or tissue at any one time. Although a particular cell or tissue’s DNA, or genome, is essentially unchanging, its transcriptome will vary according to the current physiological status of the cell or tissue. Thus, we have hypothesized that transcriptomic signatures in blood which are specific to states of health or disease can be identified and used to diagnose such states via transcriptional profiling of blood [2].
Advances of the past decade have made it possible for transcriptomes to be quantitatively profiled and compared on a genome-wide scale using powerful nucleic acid probe microarray technology [reviewed in 6]. Traditional microarray analyses are tissue biopsy-based, which restricts array technology to the small number of clinical situations in which tissue is readily available. By contrast, the use of blood samples extends transcriptional profiling analysis to a wider range of diseases and clinical conditions. Thus, blood is an ideal sample type which overcomes many of the limitations of traditional microarray studies [7].
In a series of studies, we and others have demonstrated that RNA profiles generated from circulating blood can be used to identify patients with a number of conditions [7,8], including: lung cancer [9], bladder cancer [10], colorectal cancer (CRC) [5], osteoarthritis [4], schizophrenia and bipolar disorder [11,12], kidney diseases [13,14], cardiovascular diseases [15-17], Crohn’s disease [18] and diabetes [19]. In the present study, we have extended this approach and used single-subject transcriptional signatures from a single blood sample to simultaneously assay for multiple diseases in a heterogeneous human population.
Materials and Methodology
Patient samples
We recruited more than 1500 patients from multiple institutions between January, 2004 and July, 2008. We selected 631 patients for the current study of three distinct diseases. Informed consent was obtained according to the research protocols approved by the research ethics boards of each institution involved.
Blood collection and RNA isolation.
Samples of peripheral whole blood (10 ml) were collected in EDTA Vacutainer™ tubes (Becton Dickinson, Franklin Lakes, N.J.), and stored at 4°C until processing (within 6 hours). RNA was isolated at six different centers according to a standardized protocol. Plasma was removed after centrifugation and a hypotonic buffer (1.6 mM EDTA, 10 mM KHCO3, 153 mM NH4Cl, pH 7.4) was added at a 3:1 volume ratio to lyse the red blood cells. The mixture was centrifuged to yield a pellet containing predominantly white blood cells, and the pellet was re-suspended into 1.0 mL of TRIzol® Reagent (Invitrogen Corp., Carlsbad, CA) and 0.2 mL of chloroform. RNA quality was assessed on an Agilent 2100 Bioanalyzer RNA 6000 Nano Chip. RNA quantity was determined by absorbance at 260 nm/280 nm in a Beckman-Coulter DU640 Spectrophotometer. The samples were then stored at -80°C at a single center.
Microarray hybridization.
Five-microgram samples of purified total RNA were labelled and analyzed using Affymetrix U133Plus 2.0 GeneChip oligonucleotide arrays (Affymetrix; Santa Clara, CA). Hybridization signals were adjusted in the Affymetrix GCOS software (version 1.1.1), using a scaling factor that adjusted the global trimmed mean signal intensity value to 500 for each array. Expression values were computed from the CEL files using the MAS5 algorithm. Hybridizations were carried out in batches across 30 lots of chips (3005291 to 4033799) between 2004 and 2008. All samples passed the recommended quality checks (background, present call, Raw Q, Scale Factor, and 3’/5’ ratios for Gapdh and ActB).
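As a rough illustration of the global scaling step, the sketch below rescales the signals of one array so that their trimmed mean equals the target value of 500. The function names and the 2% trim fraction are assumptions made for this example; this is not the GCOS implementation.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean of the values after dropping the lowest and highest trim_fraction.

    The 2% default is an assumption for illustration only.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


def scale_array(signals, target=500.0, trim_fraction=0.02):
    """Return (scaled signals, scaling factor) so the trimmed mean equals target."""
    factor = target / trimmed_mean(signals, trim_fraction)
    return [s * factor for s in signals], factor


# Hypothetical raw signals for one array (not data from the study).
raw = [float(v) for v in range(1, 101)]
scaled, factor = scale_array(raw)
print(round(factor, 4), round(trimmed_mean(scaled), 4))
```

Because every signal on an array is multiplied by the same factor, the ranking of probe sets within the array is unchanged; only the overall intensity level is made comparable between arrays.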
Data analysis
The expression levels from probe sets labelled “present” were log-transformed (base 2). Only the data from probe sets labelled as “present” in all samples across all studies were used (7,226 probe sets). For each study, samples with the condition of interest were labelled as “with condition of interest”, while all other samples were labelled as “without condition of interest”.
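The filtering and transformation described above can be sketched as follows. The data layout (a dictionary of samples, each mapping a probe set identifier to a signal and a detection call) and the call code "P" for "present" are assumptions made for this example.

```python
import math


def common_present_log2(samples):
    """Keep probe sets called present ("P") in every sample and log2-transform them.

    samples: {sample_id: {probe_id: (signal, call)}} -- a hypothetical layout.
    Returns {sample_id: {probe_id: log2(signal)}} restricted to the shared probe sets.
    """
    shared = None
    for calls in samples.values():
        present = {p for p, (_, call) in calls.items() if call == "P"}
        shared = present if shared is None else shared & present
    probe_ids = sorted(shared or [])
    return {
        sid: {p: math.log2(calls[p][0]) for p in probe_ids}
        for sid, calls in samples.items()
    }


# Hypothetical example: probe "b" is absent ("A") in sample s2, so it is dropped.
example = {
    "s1": {"a": (512.0, "P"), "b": (64.0, "P")},
    "s2": {"a": (256.0, "P"), "b": (8.0, "A")},
}
print(common_present_log2(example))
```

Restricting the analysis to probe sets present in all samples trades coverage for robustness: the 7,226 retained probe sets are a subset of the array, but each has a measurable signal in every sample.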
The probe set data for each condition of interest were organized into combinations of 10 pairs of genes. Each combination was evaluated for its discriminative power on the training set by calculating the receiver-operating characteristic (ROC) area under the curve (AUC). The combination that achieved the best ROC AUC was selected as the panel for the condition of interest. The process was repeated for each condition of interest.
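The text does not specify how the 10 pairs of probe sets are combined into a single score, so the sketch below makes a loud assumption: the panel score is taken to be the sum of the log2 differences within each pair. Only the ROC AUC calculation, computed here as the fraction of (positive, negative) sample pairs ranked correctly, is a standard definition; `pair_score` and all identifiers are hypothetical.

```python
def pair_score(log2_expr, pairs):
    """Hypothetical panel score: sum of log2(a) - log2(b) over the probe set pairs.

    This scoring rule is an assumption for illustration, not the study's formula.
    """
    return sum(log2_expr[a] - log2_expr[b] for a, b in pairs)


def roc_auc(scores_with, scores_without):
    """ROC AUC as the probability that a sample with the condition outscores
    a sample without it (ties count as one half)."""
    wins = 0.0
    for s_pos in scores_with:
        for s_neg in scores_without:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(scores_with) * len(scores_without))


# Hypothetical scores, not data from the study.
print(roc_auc([3.0, 2.0], [1.0, 2.0]))  # 0.875
```

Under this sketch, selecting a panel amounts to computing `roc_auc` on the training set for each candidate combination of pairs and keeping the combination with the highest value.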
Unique discriminative panels were determined for each condition of interest, namely: gender, colorectal cancer, prostate cancer and osteoarthritis. After the training set panels were determined for each condition of interest, a second set of studies was performed using an independent test set to further assess the discriminative power of the panels.
Analysis of the final results and generation of charts was performed using Microsoft Excel and MedCalc (www.medcalc.be).
Samples with Condition of Interest
1) Gender
First, we evaluated the reliability of the measurements and data analysis by searching for gene panels that can discriminate between genders. This was conducted in two phases.
In the first phase, we searched for X-linked probe sets that exhibit consistent differential expression based on copy number difference.
In the second phase, as a model of general disease, we searched for probe sets for autosomal genes that were effective at discriminating gender.
The training set was composed of 352 subjects (121F:231M). The test set had 279 subjects (96F:183M).
Training set = 80; test set = 68.
Training set = 80; test set = 63.
Training set = 103; test set = 93.
Samples without Condition of Interest
Training set = 30; test set = 18.
Training set = 29; test set = 7.
Training set = 30; test set = 12.
Training set = 0; test set = 18.
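As a consistency check on the group sizes listed above, the short sketch below sums the training and test counts and compares them with the gender training and test set totals (352 and 279) and with the 631 selected samples. The variable names are illustrative; the counts are taken from the text in the order given.

```python
# (training, test) counts for the groups with a condition of interest.
with_condition = [(80, 68), (80, 63), (103, 93)]
# (training, test) counts for the groups without a condition of interest.
without_condition = [(30, 18), (29, 7), (30, 12), (0, 18)]

groups = with_condition + without_condition
train_total = sum(train for train, _ in groups)
test_total = sum(test for _, test in groups)

print(train_total, test_total, train_total + test_total)  # 352 279 631
```

The totals agree with the sizes of the gender training and test sets, which indicates that every sample was used in the gender analysis.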
Results
Gender discrimination
Each row represents one sample. The first column represents the panel using X-linked genes, while the second column uses only autosomal genes for discrimination. Dark grey indicates a “female” prediction. Light grey indicates a “male” prediction. Accuracy for each gender is defined as the percentage of correctly predicted subjects out of the total number of subjects of that gender.
The panels of 10 pairs of probe sets with corresponding genes used for gender and disease predictions
1) X-chromosome located genes
The final panel of 12 genes represented by 10 pairs of probe sets is detailed in Table 1A. Accuracy was greater than 99% for both males and females in the training set and 98% or greater for both genders in the test set. Several of the probe sets were differentially expressed at only 1.4-fold and 1.6-fold, which suggests that detection of differences of less than 2-fold is possible with microarray technology.
2) Autosomal genes
As a general model of disease, we searched for discriminatory probe sets for autosomal genes (Table 1B) and were able to achieve a ROC AUC of 0.96 in the training set (accuracy: 92%F; 87%M) and 0.87 on the test set (accuracy: 78%F; 82%M).
Multi-disease discrimination
The 20-gene probe set panels for each of the three different diseases are detailed in Table 1C-E. The discriminative power of each of these classification panels is detailed in Figure 2 with samples arranged by rows, grouped by disease type.
Each row represents one sample. Each column represents a disease prediction. Dark grey indicates a “positive prediction”; light grey indicates a “negative prediction”. Sensitivity is defined as the percentage of subjects correctly predicted to have the disease of interest out of all subjects who actually have the disease of interest. Specificity is defined as the percentage of subjects correctly predicted as not having the disease of interest out of all subjects who do not have the disease of interest.
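The two definitions can be written out as a small sketch. The function name and the example labels are hypothetical and are not data from the study.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) as percentages.

    actual, predicted: equal-length sequences of booleans, where True means
    "has the disease of interest" / "predicted positive".
    """
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    true_neg = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    n_pos = sum(1 for a in actual if a)
    n_neg = len(actual) - n_pos
    return 100.0 * true_pos / n_pos, 100.0 * true_neg / n_neg


# Hypothetical example: 4 subjects with the disease, 5 without.
actual = [True, True, True, True, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, True]
print(sensitivity_specificity(actual, predicted))  # (75.0, 80.0)
```

Both measures depend on the decision threshold applied to the panel score, whereas the ROC AUC summarizes performance across all thresholds.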
The training set predictions (Figure 2A) achieved ROC AUC values of 0.96, 0.91 and 0.95 for colorectal cancer, prostate cancer and osteoarthritis, respectively. Sensitivity was 90%, 89% and 90%, and specificity was 87%, 80% and 88%, for colorectal cancer, prostate cancer and osteoarthritis, respectively. Each of the three disease-specific panels was able to reject most of the samples from conditions that were not included in the training phase (no disease, ovarian cancer, bladder cancer).
The independent test set results (Figure 2B) confirmed that each of the three disease-specific panels was effective at discriminating the particular disease it had been trained on, but was not discriminatory for either of the other two diseases nor for any of the three other sets of samples (conditions not of interest, bladder and ovarian cancers and Crohn’s disease). The ROC AUC was 0.90 for colorectal cancer, 0.93 for prostate cancer and 0.89 for osteoarthritis; the corresponding sensitivities were 88%, 94% and 82%, and the corresponding specificities were 74%, 79% and 82%.
Discussion
To our knowledge, all studies to date on blood-based disease biomarkers have focused on identification of biomarkers for single diseases, an approach which can, at times, mask poor false-positive performance when the biomarkers are applied to subjects with other diseases. In this study, we have expanded this approach to enable us to detect several diseases at once using one blood sample. The clinical utility of this approach should be immediately apparent. The ability to detect numerous pathologies at one time would simplify population disease screening. As this type of blood-based tool is refined and becomes applicable to general populations, a patient could be screened at one visit for a range of diseases, for example, colorectal cancer and prostate cancer, rather than undergoing several different and invasive tests.
As we show in this report, using integrated multi-disease analysis our laboratory can identify disease-specific gene expression signatures by quantitative measurement of the blood transcriptome. We have succeeded in markedly reducing crosstalk noise from confounding factors and were able to generate formulae for detecting the presence of certain diseases with specificity > 90% for various types of organ-specific diseases including cancers. The inclusion of multiple diseases in this study increases the sample variability which can lead to improved performance [3,20].
It should also be pointed out that this approach inherently allows detection of multiple simultaneous conditions as the discrimination is not an either/or decision, but rather a set of independent parallel decisions.
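The idea of independent parallel decisions can be sketched as follows: each disease panel produces its own score and is compared with its own threshold, so a single sample may be called positive for none, one, or several conditions. The scores, thresholds and function name are hypothetical and chosen only for illustration.

```python
def call_conditions(panel_scores, thresholds):
    """Apply each panel's threshold independently; no call excludes another."""
    return {name: panel_scores[name] >= thresholds[name] for name in thresholds}


# Hypothetical panel scores for one blood sample and hypothetical thresholds.
scores = {"colorectal cancer": 1.2, "prostate cancer": -0.3, "osteoarthritis": 0.8}
thresholds = {"colorectal cancer": 0.0, "prostate cancer": 0.0, "osteoarthritis": 0.0}

calls = call_conditions(scores, thresholds)
print([name for name, positive in calls.items() if positive])
# ['colorectal cancer', 'osteoarthritis']
```

This differs from a single multi-class classifier, which would be forced to assign each sample to exactly one condition.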
This independent parallel approach is partially supported by preliminary results from a small set of patients with Crohn’s disease. It is known that Crohn’s disease is associated with a higher incidence of colorectal cancer [21], and it would therefore be beneficial for a colorectal cancer diagnosis to be able to differentiate patients with Crohn’s disease that will eventually progress to colorectal cancer from those patients who will not.
At the time of this study, there were not a sufficiently large number of samples with Crohn’s disease available to construct a training set. The medical record for these samples was incomplete and the presence of colorectal cancer was unknown. As a result, these samples were only included in the test set to evaluate the performance of the three diagnosis panels.
Six of the 18 samples with Crohn’s disease had a positive call on the colorectal cancer panel. Of these six, one also had a positive call on the prostate cancer panel, as did two other samples, and only one sample had a positive call for osteoarthritis. While these numbers are small, they do show a trend toward more positive calls for colorectal cancer than for prostate cancer or osteoarthritis, which is what would be expected if the panels are recognizing a biological signal rather than merely random noise.
We initially demonstrated that subtle gene expression differences of less than 2-fold can be measured reliably, as evidenced by the discrimination of gender with 98% or greater accuracy using X-linked genes. We also attempted to discriminate gender using only autosomal genes (test set ROC AUC: 0.87; accuracy: 78% to 82%) in order to demonstrate the level of performance that is likely achievable using this method for disease/clinical condition discrimination.
The generation of a unique panel of 10 pairs of probe sets for each of the three disease conditions of interest (colorectal cancer, prostate cancer, osteoarthritis) resulted in promising discriminatory training set panels, with ROC AUCs ranging from 0.91 to 0.96. The robust discriminatory capacity of each disease panel was confirmed by the independent test set (ROC AUC range: 0.89 to 0.93). The close concordance between the training and test set data was reassuring, given that there were a number of potentially confounding variables, including: multiple clinical sites used for sample collection; multiple laboratories used for RNA extraction; multiple different chip and reagent lots; the small expression-fold changes seen in blood RNA profiling as compared with the large changes seen in tissue expression profiling; varying durations of RNA storage; and multiple microarray hybridizations extending across a four-year span.
We have recently reported on the use of blood RNA profiling with a seven-gene panel utilizing quantitative real-time polymerase chain reaction (qRT-PCR) to discriminate subjects with colorectal cancer from those with no cancer [4]. This approach allows an individual’s relative risk of currently having colorectal cancer to be determined, thereby providing clinically actionable information about the need for further investigations such as colonoscopy. Although this method is a powerful tool for improving early detection of disease and provides novel information to enhance clinical decision-making, extending qRT-PCR technology to simultaneously assay for multiple diseases or clinical conditions is impractical because of the limits on the number of genes that can be included in a qRT-PCR panel. The present study suggests that microarray technology, with its ability to simultaneously measure the activity of a large number of RNA transcripts, can facilitate the application of blood transcriptome profiling to generate and assay multiple disease panels.
Conclusion
In this study we used gender data to show, in a straightforward and non-controversial manner, the utility of the integrated multi-disease analysis test. The test clearly differentiates males and females (98-99% accuracy) using X-linked genes, and it continues to discriminate between the sexes, although with lower accuracy (test set ROC AUC of 0.87), when sex chromosomal factors are excluded. That is, the sexes were shown to differ in blood samples using autosomal genes, not only genes located on the sex chromosomes. The test differentiates males and females despite confounding factors such as samples collected at different clinics, over several years, and hybridized on different microarray lots. Similarly, the test methodology is able to indicate the presence or absence of various diseases (colorectal cancer, osteoarthritis, prostate cancer) in the samples. Such a test can be expanded to include other types of cancer and other diseases.
Thus the quantitative transcriptomic approach has significant advantages as a potential tool for personalized medicine. This approach is not a genetic DNA marker test or polymorphism biomarker test; such tests are currently available but have been sharply criticised for failing to agree in disease prediction between laboratories and for failing to capture genetic contributions to disease risk [22, 23]. Rather, our quantitative measurement of the blood transcriptome reflects, in real time, gene expression alterations occurring over the whole transcriptome, which in turn guide phenotypic disease phenomena.
The present study demonstrates that blood transcriptome profiling in conjunction with microarray technology can be used to generate highly discriminative multi-gene panels for many diseases. This approach has great potential to enable the simultaneous monitoring of multiple disease states or clinical conditions from a single blood sample, which could, as the process is refined and developed, hold great promise as a population multiple disease screening tool.
Acknowledgements
We would like to thank Dimitri Stamatiou and Jay Ying for their technical assistance and Ma Jun for helpful comments and criticism of this manuscript. CC Liew, Samuel Chao, Faysal El Khettabi, Hongchang Tang and K Wayne Marshall are all employed by GeneNews Ltd, which sponsored this research.