
Science Applications International Corporation [A. W., A. A. R.], National Cancer Institute at Frederick, NIH, Frederick, Maryland 21702 Developmental Therapeutics Program, National Cancer Institute at Frederick, NIH [R. H. S., E. A. S., D. G. C.], Frederick, Maryland 21702
| Abstract |
|---|
6 K gene expressions across the National Cancer Institute's panel of 60 tumor cell lines. Initial assessments of reproducibility for gene expressions within each dataset, as derived from sequence analysis of full-length sequences as well as expressed sequence tags (EST), found statistically significant results for no more than 36% of those cases where at least one replicate of a gene appears on the array. Filtering the data based only on pairwise comparisons among these three datasets creates a list of ~400 significant concordant expression patterns. The expression profiles of these smaller sets of genes were used to locate similar expression profiles of synthetic agents screened against these same 60 tumor cell lines. A correspondence was found between mRNA expression patterns and 50% growth inhibition response patterns of screened agents for 11 cases that were subsequently verifiable from ligand-target crystallographic data. Notable among these cases are genes encoding a variety of kinases, which were also found to be targets of small drug-like molecules within the database of protein structures. These 11 cases lend support to the premise that similarities between expression patterns and chemical responses for the National Cancer Institute's tumor panel can be related to known cases of molecular structure and putative cellular function. The details of the 11 verifiable cases and the concordant gene subsets are provided. Discussions about the prospects of using this approach as a data mining tool are included.

| Introduction |
|---|
In this paper we examine the connection between gene expression data across the NCI's 60 tumor cell lines (abbreviations are listed in footnote 3) and chemical screening experiments conducted on the same cell lines to establish cytotoxicity, as measured by the concentration of agent required to reach GI50 (12-14). Correlations between gene expression patterns and GI50 patterns across this tumor cell panel are postulated, in our analysis, to suggest possible interactions between chemical agents and either gene products or nucleic acids. One should note that no drugs are introduced in this set of gene expression measurements. It is only via their correlated responses that we postulate any connections between putative targets and chemical response.
Previous attempts to identify relationships between molecular targets and chemicals, based on expression patterns observed in the NCI's anticancer drug screen, have been reported with varied degrees of success (15, 16). Much of this difficulty results from the lack of abundant gene-drug relationships that can also be experimentally verified, which also makes it difficult here to fully evaluate our general hypothesis. As an alternative it is possible to question our general hypothesis about linkages between chemical responses and putative targets without explicit administration of these agents (4, 17). However, our strategy does not take into account additional concerns related to drawing conclusions from mRNA microarray data. For example, Tamm et al. (18) critique the use of mRNA as a measure of expression by pointing to the well-known fact that post-transcriptional regulation of expression is most likely of equal importance for the expression of some genes. Others have amplified this viewpoint by suggesting the necessity of measuring actual protein levels, a prevailing feeling among members of the growing proteomics community.
In this analysis we examine three gene expression datasets for coherence between related gene expression patterns. At the outset we acknowledge that much controversy has been raised around the fact that gene expression data can be of highly varying quality and that it is necessary to make repeated measurements to get a clear picture of the true expression profile (19-21). As an alternative, our analysis will treat each of these datasets as a single replicate and, based on coherence of their patterns across the 60 tumor cell lines, extract the "best" set of gene expressions among these experiments. In fact, the quality of these data supports only a few hundred significant gene expressions; however, based on our requirement of concordance, this smaller dataset can be advanced with higher confidence for additional analysis. It is among these few hundred concordant gene expressions that we hunt for evidence of target-drug relationships based on the additional observations of GI50 values obtained from this same set of 60 tumor cell lines.
A limitation of this type of analysis is that verifications of postulated target-drug associations are nontrivial. Facing this criticism, we propose our findings as testable hypotheses. Our analysis infers target-drug relations by seeking similar patterns of cellular response as measured in gene space and in chemical space. A substantial amount of information already exists about this chemical space; a detailed clustering analysis has been completed that has organized the >36 K screened compounds into slightly more than 1 K clusters, which are grouped according to mechanisms of cellular action that include biomolecular synthesis and cell cycle control (22). This analysis already provides a great deal of information about chemical activity and can be additionally surveyed for matches between gene expression profiles and chemical response profiles across the NCI's tumor cell panel. While this treatment offers only clues about these interactions, validations can be found by seeking chemically similar ligands that have been deposited as ligand-target complexes in the PDB (23). Examination of these ligand targets for their biochemical function, followed by establishing the link between these expressed proteins and mRNA expressions via sequence alignment, is used here to make connections between gene expression and chemical activity.
Our analysis finds that in the relatively restricted set of concordant gene expressions, 11 verifiable gene-drug relationships are found. Previous attempts to find similar correlations have yielded, at best, only one or two such relationships (15, 16). While our 11 observations represent only a small number when compared with the total space of possible interactions, these results are encouraging by demonstrating that target-drug interactions can be extracted from diverse measurements of gene expression and GI50 data. The tools necessary to connect these measurements are available on the World Wide Web (footnote 4).
| Data Treatment for Finding Significantly Differentiated Gene Expressions |
|---|
In order to capture similar differential gene expressions among these datasets we look to similarities between the total expression profiles across the 60 cell lines. In Fig. 1 we display the normalized distribution of correlation coefficients of the data vectors for each gene that has two or more members within the same Unigene cluster but restricted to the same microarray. There were 987 such genes in the Millennium dataset, 869 in the Stanford dataset, and 520 in the Whitehead dataset. The correlation coefficient of two expression profiles $x$ and $y$ across the $N = 60$ cell lines was calculated from:

$$
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{1}
$$
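As an illustration of the correlation step, the following minimal Python sketch computes the standard Pearson correlation coefficient between two expression profiles measured over the same cell lines. The function name `pearson_r` and the toy profiles are hypothetical and are not taken from the original analysis; the handling of missing values, which real microarray data would require, is not addressed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return s_xy / math.sqrt(s_xx * s_yy)


# Toy profiles over four hypothetical cell lines (not real data).
replicate_a = [0.2, 1.1, -0.4, 0.9]
replicate_b = [0.4, 2.2, -0.8, 1.8]  # exactly twice replicate_a
r = pearson_r(replicate_a, replicate_b)
```

In practice each profile would have one entry per cell line in the panel, and the coefficient would be computed for every pair of replicate entries of a gene on the same array.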
Our strategy for making this linkage lies in our previous analysis of the NCI's screening data. As noted earlier, we have clustered GI50 values for ~36 K screened compounds (22) to organize these data into ~1 K clusters that represent different types of cellular activity using a self-organizing map (SOM). Using this organization, the gene expression profiles are matched to similar GI50 profiles from screened data. The matching projection is done by calculating the Euclidean distance between the data vector and all of the node vectors of the GI50 map, and selecting the location with the minimal distance. These projections of gene expression data onto chemical response data are a means of relating each measurement according to its activity pattern across the 60 tumor cell lines. However, our projections are not based on the complete set of measured genes for each microarray; rather, the analysis is conducted only on the concordant subset of gene expressions. Thus, the gene expression data are first filtered according to concordance, then projected onto chemical response space.
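The projection step described above amounts to a nearest-node search. The sketch below is a minimal Python illustration under simplifying assumptions: the map is represented as a plain list of node vectors, the function name `project_onto_map` is hypothetical, and the toy vectors have four components rather than one per cell line. It is not the implementation used to build the GI50 map.

```python
import math


def project_onto_map(profile, node_vectors):
    """Return (index, distance) of the node closest to the profile.

    Closeness is the Euclidean distance between the profile and each
    node vector; the node with the minimal distance is selected.
    """
    best_index, best_distance = None, math.inf
    for index, node in enumerate(node_vectors):
        if len(node) != len(profile):
            raise ValueError("profile and node vectors must match in length")
        distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(profile, node)))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Hypothetical three-node map over four cell lines (illustrative values only).
nodes = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 0.5, 0.5, 1.0],
]
gene_profile = [0.9, 1.2, -0.8, -1.1]
node_index, distance = project_onto_map(gene_profile, nodes)
```

Two profiles can then be said to coproject when this search returns the same node index for both.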
An appropriate question about gene projections onto chemical clusters concerns the reliability of placement. The similarity measure for map projections is their Euclidean distance. The data vector from gene expressions measured across the 60 tumor cell lines is placed in the cluster having the smallest Euclidean distance. We estimate the a priori probability of a chance occurrence of having two data vectors coprojecting to the same location on the GI50 map by calculating the ratio of all of the vectors that coproject to those that do not. This yields a P of 3.8 × 10⁻³ for the coprojection procedure for finding significantly differentiated gene expressions in the datasets.
A total of 106 genes are identified based on pairwise filtering of the Millennium and the Stanford datasets, 175 from the Stanford and the Whitehead datasets, and 154 from the Millennium and Whitehead datasets. This gives a total of 376 unique genes that survive this filtering technique. A listing of these genes is available together with their Unigene identifiers (footnote 7). The 376 selected genes represent many types of cellular functions, although this set is dominated by genes involved in signal transduction, with the remainder falling in broadly defined categories: adhesion/extracellular matrix, tumor suppressor, ribosomal/transcription, immune response, melanoma, and proliferation clusters. Only 29 of them have been characterized as housekeeping genes (25).
The recent paper by Staunton et al. (26) attempts to identify drug-gene relationships in ways similar to ours but using a more extensive prefiltering of the datasets into cells of extreme drug (in)sensitivity. While our analysis cannot be directly compared with theirs, we find a 34% overlap between their sets of reported genes and those found by us to convey the most information in our analysis.
As examples we provide here a brief description of genes that appear to have strongly concordant gene expression profiles across these three datasets. We emphasize strongly that although these genes have similar response patterns across the 60 tumor cell lines, conclusions about these observations in regard to cell function are not addressed.
EDNRB (Hs.82002, endothelin receptor type B, cluster 16.8) is expressed in all of the human melanoma cell lines, though metastatic melanoma expresses this receptor relatively less (27, 28). Inspection of the response pattern for the tumor cell panel reflects the high expression of EDNRB within the melanoma panel (data not shown). A similar strong pattern is observed for a smaller set of breast cancer cell lines.
FN1 (Hs.287820, fibronectin 1 or LETS, cluster 7.22) is a fibronectin, a member of an important class of extracellular multiadhesive matrix proteins. As such, fibronectins are ligands of the integrin family of cell adhesion molecules and partake in the regulation of cytoskeletal organization. The strong signal for fibronectin expression has also been corroborated by previous measurements of cancer expression profiles using a variety of alternative methods (1, 29, 30). The strong fibronectin signal within the renal panel lines is quite evident and coincides with the observation that fibronectin may be a critical factor in the regulatory role of extracellular matrix proteins in metastatic invasion of renal cancer cells (31).
LCP1 (Hs.76506, lymphocyte cytosolic protein 1 (L-plastin), cluster 23.9) is an actin-regulating protein. Structural proteins like actin may be involved in the development and progression of cancer (32). Regulation of these genes is accomplished by a number of genes, L-plastin among them. L-plastin is an actin-binding protein that has tissue-specific expression patterns. L-plastin is specifically expressed in hematopoietic cells but has also been found to be highly expressed in cell lines derived from mammary solid tumors. Dysregulation of actin-binding proteins during carcinogenesis may, thus, be the direct link to the observed upregulation of L-plastin in the cancer cell lines, although the exact role of L-plastin in the tumor process remains unknown (33). Upregulation of L-plastin has been linked to testosterone in breast and prostate cancer cells (34). This observation might suggest a corresponding subpanel sensitivity to testosterone. The expression profile of L-plastin is strongest within the leukemia and breast cancer panels, near a region on our anticancer map demonstrated to have sensitivity to selected steroid molecules, NSCs 624018 and 633664.
MCAM (Hs.211579, melanoma adhesion molecule, MUC18, cluster 15.7) is a transmembrane glycoprotein and a member of the immunoglobulin superfamily. The protein is closely related to a number of cell adhesion molecules. Tumor progression and metastasis in human malignant melanoma are associated with MCAM. Consistent with this association, we observe enhanced expression in the melanoma panel.
S100P (Hs.2962, S100 calcium-binding protein P, cluster 8.10) is a low molecular weight calcium-binding protein, which is associated with the regulation of cellular processes such as cell cycle progression and differentiation. Overexpression of S100P has been postulated to play an important role in the immortalization of human epithelial cells in vitro and in tumor progression in vivo (35). Other S100 calcium-binding proteins are also found to be correlated among these three expression datasets. These include the S100A4 gene (r = 0.70, P < 0.01), whereas the S100B gene expression is only weakly correlated (r = 0.27, P < 0.01). S100P is down-regulated after androgen deprivation in an androgen-responsive prostate cancer cell line (36). As in the L-plastin case described above, the gene expression profile of S100P is most similar to a region on our anticancer map that is sensitive to steroid molecules, NSCs 689621 and 652123.
It is important to note that previous analyses of portions of these datasets have also identified L-plastin and S100P as important genes. The methods used in these reports were considerably more complicated than the simple filtering method proposed here.
| Identifying Molecules That Affect Expression Levels |
|---|
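The Tanimoto coefficient between the fingerprint bit vectors of two molecules $A$ and $B$ is given as:

$$
T(A, B) = \frac{N_{AB}}{N_A + N_B - N_{AB}} \tag{2}
$$

where $N_A$ and $N_B$ are the numbers of bits set in each vector and $N_{AB}$ is the number of bits set in both.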
Similarities between molecules are thus measured via the Tanimoto coefficient of a discrete bit-vector of length 431 for each compound. This Tanimoto coefficient identifies common molecular fragments between two compared molecules and ranges from 0 to 1. In this case we have taken a coefficient above a cutoff of 0.75 to indicate significant similarity (41). Thus, if we find a similar ligand in the PDB we query the parent structure for its function, and if its function is similar to that of the original query gene we consider this evidence for verification of a target-drug association. Because the number of ligands in the PDB is rather modest, we cannot expect to verify each individually selected significant gene; instead we use this process to verify the basic premise of similarity between gene expression and drug response.
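The similarity test can be sketched in a few lines of Python. The sketch uses the standard definition of the Tanimoto coefficient for binary fingerprints (bits set in both vectors divided by bits set in either). The function name `tanimoto`, the convention of returning 0 for two empty fingerprints, and the 8-bit toy fingerprints are assumptions made for illustration; the analysis described here uses bit vectors of length 431, whose fragment definitions are not reproduced.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    n_a = sum(1 for a in bits_a if a)
    n_b = sum(1 for b in bits_b if b)
    n_ab = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    union = n_a + n_b - n_ab
    if union == 0:
        return 0.0  # assumed convention when neither fingerprint has a bit set
    return n_ab / union


SIMILARITY_CUTOFF = 0.75  # cutoff quoted in the text

# Hypothetical 8-bit fingerprints (real ones would have 431 bits).
nsc_fingerprint = [1, 1, 1, 1, 0, 0, 1, 0]
pdb_ligand_fingerprint = [1, 1, 1, 1, 0, 0, 0, 0]

score = tanimoto(nsc_fingerprint, pdb_ligand_fingerprint)  # 4 shared / 5 in union
is_similar = score > SIMILARITY_CUTOFF
```

A pair passing this test would then prompt a lookup of the function of the protein in the corresponding PDB entry, which is a manual or database step not shown here.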
In order to estimate the joint occurrence of a coprojection and the chance occurrence that a PDB ligand has a Tanimoto score >0.75 with an NSC compound, we calculate the ratio of all of the PDB ligand-NSC compound pairs that have such a Tanimoto coefficient to those that do not. This yields a P of 8 × 10⁻⁴. The a priori probability of a joint occurrence of these two events is then the product of these two probabilities and yields a final P of 3 × 10⁻⁶ for the procedure.
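As a check on the arithmetic, the joint probability can be written out explicitly. Taking the product assumes that the coprojection event and the ligand-similarity event are independent; the symbols $P_{\text{coproj}}$ and $P_{\text{Tanimoto}}$ are labels introduced here for the two quoted probabilities.

$$
P_{\text{joint}} = P_{\text{coproj}} \times P_{\text{Tanimoto}} = (3.8 \times 10^{-3}) \times (8 \times 10^{-4}) = 3.04 \times 10^{-6} \approx 3 \times 10^{-6}
$$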
We have used a pairwise comparison strategy to extract information from the three gene expression datasets. Our analysis finds evidence for 11 putative chemical-gene relationships. While this number represents a low percentage of the total number of concordant genes, the remaining not-yet-verifiable genes represent the subject of future investigations into their potential chemical-gene relationships.
| Genes and Chemicals Connected via Cellular Profiles |
|---|
or a human cyclin-dependent kinase 2. Our analysis permits only speculations about the potential binding of STO to these other kinase molecules. Examination of the cellular profiles finds the renal panel to be most sensitive to STO. Surveys for PDB proteins homologous to CAMK1 find an α-catalytic subunit of a cAMP-dependent protein kinase (1STC). This observation is significant, because 1STC also shares homology with the MAP2K4 sequence. Mitogen-activated protein kinase pathways are signal transduction cascades with distinct functions in mammals. MAP2K4 kinase is a potent physiologic activator of the stress-activated protein kinases. 1STC is bound by a ligand having structural similarity to NSC compound 645327 shown in Fig. 4A. Although this compound and STO are chemically quite different, both compounds display some structural similarity in their fused ring systems that might suggest a common pharmacophore and cellular activity.
PDGFRA (platelet-derived growth factor receptor, α polypeptide, PDGFR2) is a membrane-spanning growth factor receptor with tyrosine kinase activity. Overexpression of the PDGFRA subcomponent in the PDGF signaling system has been implicated in the development and malignant progression of diffuse gliomas (43). From their similarities in cellular response profiles, we identify NSC compound 672971 as a candidate ligand based on its structural similarity to the PDB ligand ANP (5'-adenylyl-imido-triphosphate) shown in Fig. 4A, which is bound to the crystal structure 2SRC, a human tyrosine-protein kinase c-src. This gene-drug association links the tyrosine kinases together with a ligand binding motif similar to ATP, a natural substrate of kinases.
BDH is a lipid-requiring mitochondrial membrane enzyme with an absolute and specific requirement for phosphatidylcholine, which acts as an allosteric activator of BDH enzymatic activity (44). Its gene expression profile links it to two dehydrogenases in the PDB, 3DHE and 1DHT, via their ligand similarity to the steroid-like NSC compound 92227. The 3DHE deposition contains estrogenic 17β-hydroxysteroid dehydrogenase complexed with the ligand AND, while 1DHT is the same protein but complexed to DHT. The corresponding similarities of these compounds are shown in Fig. 4A. The similarities of their ligands allow us to make a tentative gene-chemical connection for the BDH gene and the steroid compound NSC 92227. Bailly et al. (45) showed that the gene expression of this mitochondrial enzyme is modulated throughout developmental changes in hormonal and metabolic conditions, especially via corticosterone and estradiol.
PRKCB1 plays an important role in B-cell activation and may be functionally linked to a tyrosine kinase in antigen receptor-mediated signal transduction (46). Berns et al. (47) report that PRKCB1 also functions in angiogenesis and cancer growth. Our analysis finds a structural link between protein kinase C and the PDB structure 2HCK, which contains a src family kinase, hck. The similarities in cellular profiles of the gene expression of PRKCB1 and the GI50 response pattern to the compound QUE are consistent with the sequence similarities of its target protein, and structural similarities between NSC compound 169517 and the kinase hck-bound ligand QUE.
DIA4 is part of the detoxification process of quinones derived from the oxidation of benzene metabolites. Diaphorase can also activate bioreductive anticancer drugs. Down-regulation of diaphorase has been shown to induce gastric cancer in certain cell lines (48). Menadione is present in the PDB as a ligand to the human quinone reductase type 2, and two analogous NSC compounds 11897 and 651207 are found to have similar gene expression profiles. Their structural similarity is given in Fig. 4B.
PTPRC is a major high molecular weight leukocyte cell surface molecule, and it functions as a membrane-bound protein tyrosine phosphatase. It is required for efficient lymphocyte signaling and plays an important role in the human immune system. Its gene expression profile is highly correlated with the cellular profile induced by NSC compound 635526. This molecule is analogous to the PDB ligand OBA in PDB deposition 1C85, which contains the structure of a protein tyrosine phosphatase 1B. The similarity between the ligand and the NSC compound shown in Fig. 4B, coupled with the closely matched gene/protein function, clearly establishes their gene/chemical relationship.
MMP1 is a matrix metalloproteinase that helps to break down interstitial collagen. Overexpression of MMP1 in tumor cells is indicative of the invasive nature of cancer. In the PDB there exists a structure of the catalytic domain of the metalloprotease neutrophil collagenase. The inhibitor bound to this enzyme is PLH, which shares common structural elements with the NSC compound 672675 in Fig. 4B. The gene expression profile projects to the nearest neighbor of the cluster containing this compound, providing a tentative link between chemical agent and gene.
APOD is a member of the α(2µ)-microglobulin superfamily of carrier proteins termed lipocalins. It shares a high degree of homology to retinol-binding protein. This homology allows us to assign the PDB structure 1FEN as possessing similarities with the APOD gene. The axerophthene ligand is closely analogous to the NSC compound 122759 and shares its cellular profile with the gene expression for APOD. The strong similarity between the ligand and the NSC compound in Fig. 4B is evidence for a tentative gene/drug relationship between retinol-like molecules and lipocalins.
ADH5 (class III), χ polypeptide is a protein whose specific function in humans is largely unknown. There exists a highly homologous protein model in the 1DDA PDB deposition, which is an ADH complexed with isoursodeoxycholic acid, a steroid. An analogous NSC steroid compound shown in Fig. 4C, 49452, is found to have a strongly similar gene expression profile, indicating a tentative relationship between these two data profiles.
CTSH belongs to a class of cysteine-dependent intracellular proteases. The cathepsins have an important function in regulating intracellular protein degradation. The up-regulation of cathepsin gene transcription appears to be characteristic for invasive tumor cells (49). In the structural deposition of 1BP4, papain has been used as a model to test cathepsin inhibitors. The PDB ligand carbobenzyloxylleucinyl-leucinyl-leucinal shares structural similarity with the NSC compound 679678 as shown in Fig. 4C, providing a link between the protease functions and the activity of structurally similar ligands that may bind cathepsin.
| Conclusion |
|---|
Using a methodology that seeks similarities in cellular response patterns derived from gene expression measurements and chemical screens, connections between gene and chemical space can be made. Our procedure is grounded in the premise that these similarities in cellular response represent associations between gene products and chemical activity. We additionally verify this association by identifying small structurally similar compounds that imply a putative connection to chemotherapeutic cancer pharmacology. These latter relationships are verified here for 11 test cases. Although not emphasized in this work, these measurements also allow us to differentiate gene/chemical responses based on different cell lines and, thus, also on clinically different cancer types. This may aid the identification of drugs that are specific for certain types of cancers and provide a tool for focusing efforts in the drug discovery process.
Different methods for identifying gene-chemical associations have been proposed by Butte et al. (16) and by Scherf et al. (15), who also describe the paucity of verifiable connections possible from this same dataset; the former study revealed one, and the latter another, of the 11 associations reported here. The difference between our approach and theirs is the use of multiple datasets as surrogate replicate measurements of the same data, then filtering these data based on concordant response patterns, and finally verifying our gene-chemical relationships by seeking actual structural cases. We find, with reasonably high confidence, assignments of gene-drug relationships for 11 verifiable cases, comprising drug binding to a variety of targets. Known kinase effector molecules taken from the PDB were positively correlated with their corresponding genes and NSC compounds based on similarities in their cellular response profiles. Likewise the BDH gene was found to be projected to a cluster on our Web-accessible anticancer map with known steroid activity, which could be verified by the corresponding hydroxysteroid dehydrogenase ligand and structure in the PDB archive. None of these 11 associations appears to be spurious, although this cannot be ruled out without additional biochemical investigations of each specific system.
The wealth of data accompanying the post-genomic era offers high promise for understanding cellular processes and deriving strategies to affect these systems. Harvesting this information will not be simple. As our investigation reveals, this data can be quite noisy, but when confronted with data of poor quality, additional computational efforts can be utilized that lead to the extraction of meaningful information. These results are not unanticipated, given that these analyses involve quite large amounts of data that are collected from extremely complex biological systems. Additional complications related to this system are that these measurements are made on somewhat artificial cell lines and not real tumors (1, 8, 50), the GI50 experiments are single valued measurements of a highly complex system, and that only a subset of all the genes in the cell are represented on the microarray chip. Strategies to overcome these criticisms will be devised. Our approach offers one solution by exploring chemical and genetic links that in most cases cannot be easily verified by other means than the route taken here. This strategy does offer hope, by revealing a small set of gene/drug linkages that can be additionally exploited as possible novel data in the search for new chemotherapeutic strategies.
| Footnotes |
|---|
2 To whom requests for reprints should be addressed, at Science Applications International Corporation, Frederick, MD 21702 (to A. W.) or Developmental Therapeutics Program, National Cancer Institute at Frederick, NIH, Frederick, MD 21702 (to D. G. C.).
3 The abbreviations used are: NCI, National Cancer Institute; GI50, 50% growth inhibition; PDB, Protein Data Bank; STO, staurosporine; cAMP, cyclic AMP; PDGF, platelet-derived growth factor; BDH, 3-hydroxybutyrate dehydrogenase; AND, dehydroepiandrosterone; DHT, dihydrotestosterone; QUE, quercetin; DIA4, diaphorase 4 or menadione oxidoreductase; PTPRC, protein tyrosine phosphatase receptor type C; MMP, matrix metalloproteinase; APOD, apolipoprotein D; ADH5, alcohol dehydrogenase 5; CTSH, cathepsin H.
4 Internet address: http://spheroid.ncifcrf.gov.
5 Internet address: http://dtp.nci.nih.gov.
6 Internet address: http://www.genome.wi.mit.edu/MPR.
7 Internet address: http://octagon.ncifcrf.gov/~wallqvis/gene.data.html.
Received 11/26/01; revised 1/29/02; accepted 2/6/02.