
Department of Electrical Engineering, Texas A&M University, College Station, Texas 77840 [S. K., E. R. D.]; Departments of Pathology [E. R. D., I. S., S. R. H., G. N. F., W. Z.] and Biostatistics [K. R. H.], The University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030; and Cancer Genetics Branch, National Human Genome Research Institute, NIH, Bethesda, Maryland 20892-4470 [S. K., J. M. T.]
| Abstract |
|---|
| Introduction |
|---|
An important consideration is that the number of genes in such gene feature sets should be sufficiently small so as to be potentially useful for clinical diagnosis/prognosis or as candidates for functional analysis to determine whether they could serve as useful targets for therapy. A number of classification approaches have been used to exploit the class-separating power of expression data; however, the size of the gene sets (sometimes as large as 70) renders the construction of practical immunohistochemical diagnostic/prognostic panels and the experimental design for functional testing problematic (3, 8, 9).
We use a recently proposed algorithm to identify strong gene feature sets that are responsible for distinct patient groups (10). These gene sets are "strong" in the sense that the algorithm builds classifiers from a probability distribution resulting from spreading the mass of the sample points to make the classification more difficult, while maintaining sample geometry. In an effort to identify the strong feature genes among the different histological diagnoses in patients with gliomas, we applied this method, in a proof-of-principle study, to glioma tissue specimens from 25 patients with four different types of glioma: (a) GM;³ (b) AA; (c) AO; and (d) low-grade OL. After finding the sets of genes that are capable of accurately classifying the different types of glioma, we have also identified strong features (genes) that are seemingly responsible for the distinct phenotype of each type of cancer.
Gliomas are the most common malignant primary brain tumors (11, 12). These tumors are derived from neuroepithelial cells and can be divided into two principal lineages: astrocytomas and OLs. Current glioma classification schemes are based on morphological feature assessment and remain highly subjective and problematic for many atypical cases. Diagnoses are often dependent on the relative weighting of specific morphological features by individual pathologists. We reason that by identification of robust signature gene classifiers using typical cases, the atypical cases can be classified based on the signature classifier genes in the future.
| Materials and Methods |
|---|
Isolation of Total RNA and mRNA from Tissues.
The tissues were ground to powder under frozen conditions, and tissue powder (0.3–1.5 g) was lysed in the lysis buffer TRI Reagent (Molecular Research Center, Cincinnati, OH). The RNA isolation was done as described previously (13).
Hybridization to the Human Atlas cDNA Expression Array Blots.
The cDNA microarray containing fragments representing 597 human genes with known functions and known tight transcriptional controls (Clontech Laboratories, Inc., Palo Alto, CA) was used for our experiments, as described previously (13). After a high-stringency wash, the hybridization pattern was analyzed by autoradiography and quantified by phosphorimaging.
Development of an Algorithm for Finding Strong Feature (Gene) Sets.
We desire classifiers that categorize sample tissues based on gene expression values. There are two reasons why we desire classifiers involving small numbers of genes: (a) the limited number of samples often available in clinical studies makes classifier design and error estimation problematic for large feature sets (14); and (b) small gene sets facilitate design of practical immunohistochemical diagnostic panels. Thus, we use a simple classifier and a small number of genes (at most three in this study) to form classifiers (10).
Given a set of features on which to base a classifier, two issues must be addressed: (a) design of a classifier from sample data; and (b) estimation of its error. When selecting features from a large class of potential features, the key issue is whether a particular feature set provides good classification. A key concern is the precision with which the error of the designed classifier estimates the error of the optimal classifier. When data are limited, an error estimator may be unbiased but may have a large variance and therefore may often be low. This can produce many feature sets and classifiers with low error estimates. The algorithm we use mitigates this problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points. The algorithm is parameterized by the variance of the distribution. The error gives a measure of the strength of the feature set as a function of the variance.
When the data are limited, and all of them are used to design the classifier, there are several ways to estimate the classifier error. We comment on two of these. The resubstitution estimate, $\hat{\varepsilon}_n$, for a sample of size $n$ is the fraction of errors made by the designed classifier on the sample. Typically, it is low-biased, meaning $E[\hat{\varepsilon}_n] \le E[\varepsilon_n]$, where $E[\varepsilon_n]$ is the expected value of the actual error. For LOO estimation, $n$ classifiers are designed from sample subsets formed by leaving out one data point at a time. Each is applied to the left-out point, and the estimator $\hat{\varepsilon}_n^{\mathrm{loo}}$ is $1/n$ times the number of errors made by the $n$ classifiers. It is an unbiased estimator of $\varepsilon_{n-1}$, meaning that $E[\hat{\varepsilon}_n^{\mathrm{loo}}] = E[\varepsilon_{n-1}]$. This unbiasedness comes at a cost: the variance of the LOO estimator is greater than that of resubstitution (15).
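The two estimators can be illustrated with a minimal sketch. It assumes a nearest-centroid rule as a stand-in for the linear classifier (the study's own classifier design is different), and the two-gene expression values and the function names are hypothetical, chosen only so that the result can be checked by hand.

```python
from statistics import fmean

def fit_centroids(points, labels):
    """Return the mean expression vector of each class."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == cls]
        centroids[cls] = [fmean(col) for col in zip(*members)]
    return centroids

def predict(centroids, point):
    """Assign a point to the class whose centroid is nearest."""
    def sq_dist(cls):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[cls]))
    return min(centroids, key=sq_dist)

def resubstitution_error(points, labels):
    """Fraction of the design sample misclassified by the designed classifier."""
    centroids = fit_centroids(points, labels)
    wrong = sum(predict(centroids, p) != y for p, y in zip(points, labels))
    return wrong / len(points)

def loo_error(points, labels):
    """Design n classifiers, each tested on the single point it left out."""
    wrong = 0
    for i in range(len(points)):
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        centroids = fit_centroids(train_points, train_labels)
        wrong += predict(centroids, points[i]) != labels[i]
    return wrong / len(points)

# Hypothetical expression values for two genes; not data from the study.
points = [(0, 0), (1, 1), (3, 3), (4, 4), (5, 5), (6, 6)]
labels = ["OL", "OL", "OL", "other", "other", "other"]

print(resubstitution_error(points, labels))  # 0.0: every design point is classified correctly
print(loo_error(points, labels))             # 1 of 6 left-out points is misclassified
```

In this toy sample the point (3, 3) is classified correctly only when it takes part in the design, which is why the resubstitution estimate is lower than the LOO estimate.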
For $\sigma \ge 0$, the algorithm we use constructs from the sample data a linear classifier $\psi_\sigma$, where $\sigma^2$ gives the variance of the distribution used to spread the data. Both $\psi_\sigma$ and its error, $\varepsilon_\sigma$, are computed analytically. For $\sigma = 0$, which means there is no spreading of the sample mass, $\varepsilon_0$ is equal to the resubstitution error estimate for the sample. Thus, the standard theory informs us that the variance of $\varepsilon_0$ is less than that of the LOO estimator. Moreover, model-based studies indicate that the variance of $\varepsilon_\sigma$ decreases as $\sigma$ increases. To standardize the interpretation of the results, $\sigma$ is normalized relative to the variance of the data. Under this normalization, simulation studies with Gaussian distributions show $\varepsilon_\sigma$ to be an unbiased estimator of the error of the optimal linear classifier for $\sigma = 0.4$ and to be increasingly high-biased for increasing $\sigma$. To obtain conservative estimates of the optimal error, we take $\sigma \ge 0.4$. Moreover, for very small feature sets, we normalize by the maximum variance of the features. By being conservative, we reduce the chance that the resulting error estimate is optimistic. When considering a large number of potential feature sets in the presence of a small amount of data, the salient issue is one of data mining. Taking a conservative approach reduces the number of optimistic error estimates while at the same time selecting feature sets that perform well on a distribution that is significantly more dispersed than the actual data.
The concept of forming spread distributions from the data can be appreciated by reference to Fig. 1, which shows sample points from two classes (red and blue) based on measurements of genes g1 (horizontal axis) and g2 (vertical axis). Fig. 1a shows a linear classifier derived solely from the sample points. Fig. 1, b–d, shows samples constructed from the original sample points by deliberately adding artificial random noise of increasing variance to the original points to form larger samples that are spread about the original sample. A linear classifier has been derived for each synthetic sample. Increasing the variance increases the error. A classifier that has a small error for a large variance is desirable because its performance is more likely to be robust relative to new data. Because a Monte Carlo implementation of this approach is slow, the actual algorithm does not use random synthetic data to find the classifier and its error but instead constructs class distributions from the sample data and then finds both the classifier and its error analytically via simple matrix operations (10).
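The Monte Carlo picture behind Fig. 1 can be sketched as follows. This is only an illustration of the spreading idea, not the analytic algorithm of (10): it assumes a nearest-centroid rule as a stand-in linear classifier, Gaussian noise, and a noise scale equal to the spread parameter (written `sigma` here) times the square root of the largest per-gene variance. The data and all function names are hypothetical.

```python
import random
from statistics import fmean, pvariance

def centroids_of(points, labels):
    """Mean vector of each class."""
    out = {}
    for cls in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == cls]
        out[cls] = [fmean(col) for col in zip(*members)]
    return out

def nearest(centroids, point):
    """Class whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def spread_sample(points, labels, sigma, replicates, rng):
    """Replace each sample point with noisy copies scattered around it."""
    # Assumed normalization: scale noise by the largest per-gene standard deviation.
    scale = max(pvariance(col) for col in zip(*points)) ** 0.5
    noisy_points, noisy_labels = [], []
    for p, y in zip(points, labels):
        for _ in range(replicates):
            noisy_points.append([v + rng.gauss(0.0, sigma * scale) for v in p])
            noisy_labels.append(y)
    return noisy_points, noisy_labels

def monte_carlo_error(points, labels, sigma, replicates=200, seed=0):
    """Design a classifier on the spread sample and report its error on that sample."""
    rng = random.Random(seed)
    xs, ys = spread_sample(points, labels, sigma, replicates, rng)
    cents = centroids_of(xs, ys)
    wrong = sum(nearest(cents, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

# Hypothetical expression values for genes g1 and g2; not data from the study.
points = [(0, 0), (1, 1), (3, 3), (4, 4), (5, 5), (6, 6)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

for sigma in (0.0, 0.4, 0.6, 0.8):
    print(sigma, monte_carlo_error(points, labels, sigma))
```

With `sigma = 0.0` no mass is spread, so the value reduces to the resubstitution error of the original sample. Larger values scatter points across the decision boundary and raise the error.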
| Results and Discussions |
|---|

We let $\varepsilon_\sigma$ denote the error of the optimal classifier for the feature set, and we let $\Delta(\varepsilon_\sigma)$ denote the largest decrease in error for the full feature set relative to all of its subsets. The feature sets are first ranked based on the $\sigma$-error, and they are ranked again based on the improvement, $\Delta(\varepsilon_\sigma)$. For multiple-gene classifiers, we will focus on feature sets with high rank in both lists. Indeed, this is our major focus: to find strong feature sets in which all genes contribute to glioma discrimination. To aid in understanding the gene expression characteristics of the selected feature sets, all of the genes in the data set are clustered in such a way as to be close to other genes with similar expression. This is accomplished via hierarchical clustering using the Pearson correlation and average linkage. An added value to the clustering is that genes with known behavior can be used to analyze the results, and genes with unknown behavior can be placed into certain pathways for future functional testing.
Classification Analysis for Glioma Data.
We applied the algorithm (10), which was described briefly in "Materials and Methods," to a set of gene expression profile data derived from 25 human glioma surgical tissue samples. The cDNA microarray experiments were carried out to gain expression information for 597 known cellular genes.
We designed two-class classifiers for the classification of OL from others, AO from others, AA from others, and GM from others. We limited the number of genes for each classifier to at most three, and the dispersion levels (amount of spread) of samples were varied from $\sigma = 0.4$ to $\sigma = 0.8$. We focus on $\sigma = 0.6$ because it provides conservative, but not overly conservative, error estimation (10). Even with analytic classifier design and error estimation, the number of potential feature sets and the various cases considered required that the computations be done on a Beowulf-based supercomputer (16) at the Center for Information Technology at NIH.
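A rough count of the search space helps explain the computing requirement. The sketch below assumes an exhaustive enumeration of all one-, two-, and three-gene subsets of the 597 genes for each of the four two-class problems; the text does not state that every subset was evaluated, so the figures are an upper bound under that assumption.

```python
from math import comb

N_GENES = 597   # genes on the array
MAX_SIZE = 3    # largest feature set considered
N_PROBLEMS = 4  # OL, AO, AA, and GM, each versus the others

# Number of candidate feature sets of size 1, 2, and 3.
per_size = {k: comb(N_GENES, k) for k in range(1, MAX_SIZE + 1)}
per_problem = sum(per_size.values())

print(per_size)                  # {1: 597, 2: 177906, 3: 35284690}
print(per_problem)               # 35463193
print(per_problem * N_PROBLEMS)  # 141852772
```

Each of these candidates would also be evaluated at several dispersion levels, which multiplies the count further.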
Tables 1–4 show the feature sets identified for each classification category. The tables are constructed so that feature sets ranked high in both $\sigma$-error, $\varepsilon_\sigma$, and improvement, $\Delta(\varepsilon_\sigma)$, of $\sigma$-error are listed. This is accomplished according to the following scheme: (a) the top three single-gene classifiers for the category are listed in each table; (b) two-gene classifiers ranked in the top $N_2$ pairs for both $\sigma$-error and improvement of $\sigma$-error are included ($N_2$ table-dependent); and (c) three-gene classifiers included in the top $N_3$ triples for both $\varepsilon_\sigma$ and $\Delta(\varepsilon_\sigma)$ are included ($N_3$ table-dependent). For comparison purposes, the LOO error estimate is also shown in the tables. As expected, overall the $\sigma$-error is more conservative, so that when the $\sigma$-error is very small, usually the LOO error is also very small or zero.
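Steps (b) and (c) of the scheme amount to intersecting the heads of two ranked lists. A minimal sketch, assuming both rankings are already available as lists ordered from best to worst; the function name, the gene names, and the cutoff are hypothetical.

```python
def top_in_both(by_error, by_improvement, n):
    """Keep feature sets found among the first n entries of both ranked lists."""
    keep = set(by_improvement[:n])
    # Preserve the error-based order in the result.
    return [fs for fs in by_error[:n] if fs in keep]

# Hypothetical rankings of two-gene feature sets, best first.
by_error = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
by_improvement = [("g2", "g3"), ("g3", "g4"), ("g1", "g2"), ("g1", "g3")]

print(top_in_both(by_error, by_improvement, 2))  # [('g2', 'g3')]
print(top_in_both(by_error, by_improvement, 3))  # [('g1', 'g2'), ('g2', 'g3')]
```

A small cutoff keeps only sets that are strong on both criteria, while a larger one admits more candidates.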
$\sigma$-error is very low (at least as low as for the gene itself). Because of our desire to avoid this kind of redundancy in the tables, there are gene sets omitted from the two- or three-gene lists that possess smaller $\sigma$-errors than those shown in the table. For instance, in Table 1, the $\sigma$-error for the top-listed two-gene set is substantially greater than for any pair involving transducin β2 subunit 2, simply because adjoining genes to transducin β2 subunit 2 produces a $\sigma$-error less than that of transducin β2 subunit 2 itself. The complete performance lists for both error and improvement in error can be found in the supplementary information.⁴
The advantage of reporting the results in the way we have is that multivariate discriminatory power is revealed. This is clearly demonstrated in Table 1 with regard to cell surface glycoprotein MUC18. The gene does not appear on the single-gene list, indicating that its $\sigma$-error exceeds 0.1115; however, it appears with clusterin (CLU) in the two-gene list and both with and without clusterin (CLU) in the three-gene list. The substantial improvement in each case demonstrates the significant contributions of the genes within each gene set.
There are other instances where the improvement of classification error is sufficient to warrant inclusion in a table. In Table 2, even though IGFBP2 is by itself a decent discriminator, when it is combined with others, such as ephrin type A receptor 1 (EPHA1), the error is significantly improved. The $\sigma$-error decreases by more than 0.05, from 0.1392 (data not shown) to 0.0862. The improvement for the LOO error is more significant, from 0.16 (4 of 25) to 0 (0 of 25). Because of this, feature sets including IGFBP2 are shown in the table. We recently studied IGFBP2 expression in 256 cases of gliomas of different grades using tissue microarray and found that IGFBP2 is overexpressed in 80% of GBMs (Ref. 17; data not shown). Further testing with suitable antibodies will determine whether the combination of IGFBP2 and EPHA1 provides more accurate classifications. Some of these multivariate discriminators are shown in Fig. 2.
Gliomas are very complex cancers involving different growth characteristics and cell lineage features (12). Because the original clone of tumor cells may exist at any stage of cell differentiation and may have different transformation events, the boundaries between tumor grades and tumor lineages can be blurred. This is reflected in the current morphologically based tumor classification schemes that often mix cell lineage features with tumor growth characteristics. The results are frequently subjective, and disagreements among pathologists regarding the identity of the tumor are not uncommon. The gene expression activities yielded by the study of molecular biology and genomic biology may provide a more objective method to classify diseases. This belief is based on the assumption that cell phenotypes have genotypic origins. Recent successes in subclassification of neoplasms within a disease group using gene expression profiles (3–7) provide support for such a belief.
Thus, the issue is how to best identify the strong feature genes that are closely linked to specific phenotypes from among the thousands of genes in gene expression profiles and how to determine whether this information really aids classification of tumors. There are many technical challenges in the path to accomplishing the task of finding the key links.
The first major roadblock is the small sample size issue inherent to microarray-based classification efforts (14). Contributing to this are the limited numbers of human tissues for study and the cost of such gene expression profiling projects. Because classifiers are designed from observed expression vectors that have randomness arising from biological and experimental variability, the design, performance evaluation, and application of classifiers must take this randomness into account, especially when the number of samples (tissue specimens) is small, which is the case in most human tissue-based microarray experiments.
Algorithms are therefore needed to identify robust classifiers from very limited data sets. Three criteria have to be met for an algorithm to be considered strong. First, given a set of variables, a classifier from the sample data should provide good classification over the general population. Second, the algorithm should be able to estimate the error of a designed classifier when data are limited. Third, given a large set of potential variables, the algorithm should be able to select a set of variables as inputs to the classifier.
Taking these issues into consideration, we used a recently developed method to find both strong classifiers and strong features (10). This algorithm considers the inherently variable or "high-noise" nature of microarray measurements. Using this algorithm, we have identified robust classifier gene sets containing one to three genes that distinguish each type of glioma from the other three. This provides guidance for the development of pathological assays using a reasonable number of markers for clinical use.
In a broader context, the approach applied in this study can be used to identify genes that contribute to the major differences between any two groups of samples analyzed, in the process of which some less understood phenotypes might be identified. For example, we might find strong feature gene sets that distinguish cancers with high metastatic potential from cancers with little or no metastatic potential or gene sets that identify cancers that will be sensitive to specific therapies versus those that will be resistant and continue to grow unabated through therapy. Current histology-based classification and grading systems can do neither of these. Identification of such strong feature genes may not only provide markers for diagnosis and disease management but may also provide novel potential targets for drug development. Cancers have complex features, but we cannot target all of these features for treatment. A method that could identify the strong features, both genotypically and phenotypically, would provide an ideal route to the heart of the problem. Future studies will tell whether the currently used algorithm or an improved one will achieve this goal.
| Acknowledgments |
|---|
| Footnotes |
|---|
2 To whom requests for reprints should be addressed, at Cancer Genomics Core Laboratory, Department of Pathology, Box 85, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030. Phone: (713) 745-1103; Fax: (713) 792-5549; E-mail: wzhang{at}mdanderson.org
3 The abbreviations used are: GM, glioblastoma multiforme; OL, oligodendroglioma; AO, anaplastic oligodendroglioma; AA, anaplastic astrocytoma; IGFBP2, insulin-like growth factor-binding protein 2; LOO, leave-one-out.
4 Supplementary data are available at Molecular Cancer Therapeutics Online (http://mct.aacrjournals.org).
Received 2/14/02; revised 8/30/02; accepted 9/30/02.
| References |
|---|