Abstract
Although gathered as continuous data, expression measurements from gene microarrays may be quantized before downstream analysis and modeling. This is especially true for modeling gene prediction and genetic regulatory networks. Coarse quantization results in lower computational requirements, lower data requirements for model inference, and easier conceptualization. This paper proposes a mixture model for binarization. For each gene, the model, composed of a sum of two distributions, is fit to the expression data for that gene, and data points are binarized according to the model. The mixture model is based on the assumption of multiplicative upregulation. The proposed method is compared with mean and median binarization by comparing classification performance based on the binary data from the different methods. Classification is performed for simulated data generated from a previously studied microarray model and for cancer data arising from two studies involving hereditary breast cancer and small, round blue-cell tumors of childhood.
Introduction
It is a general principle that a system should be modeled at the lowest level of complexity that permits the accomplishment of the purposes for which the model is being developed. Lower levels of model complexity mean less computation, lower data requirements for model identification, and greater ease of conceptualization. In a sense, this is an engineering form of Occam's razor in which the pragmatics of the problem imply some minimal level of necessary complexity. The issue of complexity reduction is particularly salient for the development of prediction models and genetic regulatory models using microarray data, because the number of genes is very large and the number of samples very small. Modeling involves numerous goals, including prediction of targets based on pathways, characterization of disease states, and the design of optimal time-dependent dosing regimens (1). For gene-expression-based models, various modeling decisions must be made based on one's goals, computational power, and data abundance: the set of genes to be included in the model, the degree of complexity allowed for functional relations between genes within the model, and the quantization of expression levels.
Regarding quantization, the subject of this paper, perhaps the most well-studied genetic regulatory network model is the Boolean network (2–4). As the name implies, this model uses binary quantization in which a gene is either ON (expression level equals 1) or OFF (expression level equals 0). The intent of such a coarse-grained model is not to serve as an architecture for biochemical pathways, but rather to give qualitative insight into multivariate, dynamical gene interaction, to provide a mathematical structure to study this dynamical behavior at the level of logical switching, and to design therapeutic strategies based on the dynamical behavior of the network when the ON-OFF paradigm is sufficient. Application of Occam's razor is implicit in the study of dynamical behavior and the design of therapies in the context of binary quantization; indeed, both are valid to the extent that they can be pragmatically described in a binary framework. In fact, the Boolean model has provided conceptual insights into the behavior of genetic regulatory networks (5–7). The Boolean-network model, which is deterministic, has recently been extended to a stochastic framework that allows for different functional relationships and perturbations between states. These stochastic networks are called "probabilistic Boolean networks" (8). A probabilistic Boolean network maintains the rule-based structure of a Boolean network while allowing for uncertainty. Two studies have demonstrated that intervention can be addressed within the context of probabilistic Boolean networks (9, 10), and a third has considered optimal time-dependent external control based on dynamic programming (11). Even should the necessary technologies for diagnosis, monitoring, and therapy for these kinds of model-based strategies become practically feasible, their successful use would still depend on appropriate binarization of continuous data.
The issue of binarization has been addressed in some detail in a recent paper, and we refer to it for a more in-depth discussion of binarization issues (12).
The approach we take in this paper is to apply a binarization procedure based on a multiplicative model for expression upregulation. The expression level of a gene varies across a set of microarrays, and in the context of a binary quantization is either ON or OFF for the various samples yielding the microarrays. Taken as a collection, the measurements for a particular gene compose a distribution of values, and we shall model that distribution as a mixture of two distributions, one corresponding to lack of upregulation and the other to upregulation. The parameters of these two distributions will be estimated from the expression data for the gene, and then data will be quantized according to the modeled distributions. The mixture model depends on whether one is using ratios or the intensities directly. We focus on ratios, and comment on how the procedure applies to straight intensity data with a simpler model.
Materials and Methods
Mixture Model.
The mixture model used for binarization is based on the assumption of multiplicative upregulation. For a particular gene, g, we assume that its values across the set of microarrays for which it is not upregulated are described by a normal random variable U whose mean is the nominal value of g in these samples. The model has a multiplicative factor K > 1 such that the values of g in the upregulated samples are governed by the random variable KU, which is also normal because U is normal. Assuming a two-channel cDNA microarray, the values of the reference channel are modeled by a random variable B. Taking logs, there are two possibilities. For samples in which g is not upregulated, we get X = log(U/B) = log U − log B. For an upregulated sample, we get X = log(KU/B) = log K + log(U/B). If we make the simplifying assumption that B is not random, then the distribution of log KU/B is simply a shift of the distribution of log U/B, and both possess a distribution that is the log of a normal distribution. Hence, across the samples, the logs of the ratios will appear as a mixture of two log-of-normal (not lognormal) distributions.
On the basis of the preceding considerations, we postulate a log-of-normal mixture model for the observations of a particular gene across a set of microarrays. We assume that the logs follow the mixture model f(x) = Σ_{k=1}^{Q} c_{k} ℒN(x; μ_{k}, ς_{k}^{2}), where X is the log of the ratio, ℒN denotes the log-of-a-normal distribution having density ℒN(x; μ, ς^{2}) = (e^{x}/(ς√(2π))) exp(−(e^{x} − μ)^{2}/(2ς^{2})), and c_{k}, μ_{k}, and ς_{k}^{2} are the weights, means, and variances, respectively, of the mixture model. Q is the number of quantization levels, which in our case is Q=2, but could be different were we to model more levels of multiplicative upregulation. Estimation of the model parameters is explained in "Appendix." Fig. 1 shows a mixture of two log-of-normal distributions.
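As a concreteness check, the log-of-normal density can be evaluated numerically. The following Python sketch is our own illustration, not code from the paper (the function names are ours); it implements the density of X = log Y for Y ∼ N(μ, ς²) and the Q-component mixture:

```python
import numpy as np

def log_of_normal_pdf(x, mu, sigma):
    """Density of X = log Y, where Y ~ N(mu, sigma^2).

    From P(X <= x) = Phi((e^x - mu) / sigma), differentiation gives
    f(x) = e^x * phi((e^x - mu) / sigma) / sigma.  This treats the
    normal mass on Y <= 0 as negligible, which holds for the
    expression intensities modeled here (mu >> sigma).
    """
    y = np.exp(x)
    z = (y - mu) / sigma
    return y * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, c, mu, sigma):
    """Q-component log-of-normal mixture: sum_k c_k * LN(x; mu_k, sigma_k^2)."""
    return sum(ck * log_of_normal_pdf(x, mk, sk)
               for ck, mk, sk in zip(c, mu, sigma))
```

A quick sanity check is that each density integrates to 1 over the log axis when the underlying normal has negligible mass below zero.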
Binarization based on the log-of-normal mixture model is achieved by thresholding. Without loss of generality, assume μ_{1} < μ_{2}. The log of a ratio X is quantized by q(X) = 0 if X ≤ T and q(X) = 1 if X > T, where the threshold T is chosen to minimize the total misclassification mass ε(T) = ε_{1}(T) + ε_{2}(T), where for i=1, 2, ε_{i}(T) is the weighted probability that a value from component i falls on the wrong side of T: ε_{1}(T) = c_{1} ∫_{T}^{∞} ℒN(x; μ_{1}, ς_{1}^{2}) dx and ε_{2}(T) = c_{2} ∫_{−∞}^{T} ℒN(x; μ_{2}, ς_{2}^{2}) dx. Because closed forms do not exist for the preceding integrals, they are evaluated by Monte Carlo methods.
Were we to use direct intensities instead of ratios, then we would not have quotients, and the not-upregulated and upregulated cases would simply involve a normal random variable U and a second normal random variable KU, respectively. Hence, we would apply a Gaussian mixture model.
Testing Binarization Performance Using a Simulation Model.
We test the performance of MMB^{3} using a model-based simulation and compare it with using the mean and median of the samples for each gene, denoted by "Mean" (12) and "Median" (12, 13), respectively. The Mean method uses a threshold T (i.e., playing the role of the threshold T above) that is the mean of the intensities of each gene, and the Median method uses a threshold T that is the median of the intensities of each gene. The intensity of a gene is then quantized to 1 if it exceeds T, and to 0 otherwise. These two methods are simple but effective if there is sufficient separation in the mixture.
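In code, the Mean and Median baselines amount to one comparison per gene. A minimal sketch (the function name is ours):

```python
import numpy as np

def binarize_by_center(expr, method="mean"):
    """Binarize an (n_genes x m_samples) matrix: each gene is
    thresholded at its own mean or median across the samples,
    yielding 1 where the value exceeds T and 0 otherwise."""
    center = np.mean if method == "mean" else np.median
    t = center(expr, axis=1, keepdims=True)  # per-gene threshold T
    return (expr > t).astype(int)
```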
The simulated data are generated by a parameterized random signal model (14). It is assumed that the n genes on the m microarrays have the mean expression levels I_{1}, I_{2}, …, I_{n}, with I_{i} = 100 + ξ_{i}, where ξ_{i} follows an exponential distribution with mean 3000. The model assumes that the intensities U_{i1}, U_{i2}, …, U_{im} for a specific gene g_{i} follow a normal distribution N(I_{i}, (αI_{i})^{2}), where α is a fixed coefficient of variation. Here we set α=0.2, which is realistic. With the inclusion of normally distributed additive noise N_{ij}, observed intensities are given by U_{ij} + N_{ij} for samples in which gene g_{i} is not upregulated, and KU_{ij} + N_{ij} when gene g_{i} is upregulated. The model assumes that the additive noise varies from microarray to microarray, with N_{ij} ∼ N(0, η_{i}^{2}), where η_{i} ∼ N(μ_{a}, ς_{a}^{2}). Here we set μ_{a} = 30 and ς_{a} = 10. To complete the ratio model, we assume the reference channel has normally distributed values B_{ij} with mean equal to the mean, μ_{I}, of the intensity means and SD αμ_{I}. In essence, this means that we are assuming a uniform reference probe across all of the genes, and the mean probe intensity is normalized to the mean intensity of the expression means. Other reference models could be used, for instance, assuming each gene to have its own probe.
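The generator above can be sketched directly from the stated parameters. The following is our own illustration: the abs() guards against rare negative draws and the 50/50 upregulation split are our assumptions, not part of the published model.

```python
import numpy as np

def simulate_log_ratios(n=100, m=20, K=2.0, alpha=0.2,
                        mu_a=30.0, sigma_a=10.0, seed=0):
    """Generate (n genes x m arrays) log-ratios under the model above:
    I_i = 100 + Exp(mean 3000); U_ij ~ N(I_i, (alpha*I_i)^2);
    additive noise N_ij ~ N(0, eta_i^2) with eta_i ~ N(mu_a, sigma_a^2);
    reference B_ij ~ N(mu_I, (alpha*mu_I)^2), mu_I the mean of the I_i.
    Returns the log-ratios and the boolean upregulation mask."""
    rng = np.random.default_rng(seed)
    I = 100.0 + rng.exponential(3000.0, n)
    U = rng.normal(I[:, None], alpha * I[:, None], (n, m))
    eta = np.abs(rng.normal(mu_a, sigma_a, n))       # abs(): keep noise scales positive
    noise = rng.normal(0.0, eta[:, None], (n, m))
    up = rng.random((n, m)) < 0.5                    # assumed 50/50 up/not-up split
    observed = np.where(up, K * U, U) + noise
    mu_I = I.mean()
    B = rng.normal(mu_I, alpha * mu_I, (n, m))
    return np.log(np.abs(observed) / np.abs(B)), up  # abs(): guard rare negatives
```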
We test binarization performance by comparing the effects of the different methods on classification. On the basis of the simulation model, we take the log X_{ij} of the ratio of the observed intensity to the reference value B_{ij}, quantize the log to obtain q(X_{ij}), and then attempt to perform classification of the upregulated and not-upregulated samples based on gene expression. The testing principle is that binarization yielding good classification is a satisfactory binarization (at least for classification purposes). For classification we use a recently proposed method involving Bayesian variable selection in conjunction with a probit regression model to relate the gene expression with the class (15). This method has performed well on labeled data from hereditary breast cancer, and we will be able to compare those earlier results with those obtained from binary expression data.
Considering the high computational cost of the Bayesian gene selection, we use the principle that the smaller the sum of squares within groups and the larger the sum of squares between groups, the better the classification performance. Hence, we take the ratio of the between-group to within-group sum of squares and determine a threshold so that we keep only those genes for which the ratio exceeds the threshold. The leave-one-out method is used to estimate classification errors.
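The between-to-within sum-of-squares filter can be sketched as follows (our own illustration; the function name is ours):

```python
import numpy as np

def bw_ratios(expr, labels):
    """Per-gene ratio of between-group to within-group sum of squares
    for an (n_genes x m_samples) matrix and a length-m label vector;
    genes whose ratio exceeds a chosen threshold are kept for the
    (computationally costly) Bayesian selection step."""
    out = np.empty(expr.shape[0])
    groups = [labels == k for k in np.unique(labels)]
    for i, x in enumerate(expr):
        overall = x.mean()
        bss = sum(g.sum() * (x[g].mean() - overall) ** 2 for g in groups)
        wss = sum(((x[g] - x[g].mean()) ** 2).sum() for g in groups)
        out[i] = bss / wss if wss > 0 else np.inf
    return out
```

A gene whose values separate the two classes perfectly has zero within-group scatter, so its ratio is infinite and it survives any threshold.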
Subsequent to the simulation, we consider tumor classification using two different published data sets. Once again we are interested in which binarization method yields the best classification rate. It will be seen that MMB has no errors on either data set. Not only does this show the worth of using the mixture model (which is our main intent), but it also lends support to a proposition put forth by Shmulevich and Zhang (12) that binary data can often perform important data analysis tasks. There, the authors demonstrate the ability of binary data to form meaningful clusters; here, the experiments demonstrate the ability of binary data to perform successful classification.
The cancer data will be normalized to account for variation between microarrays. For the jth array we take the log transform of the ratios t_{1j}, t_{2j}, …, t_{nj} to obtain log t_{ij}, for i=1, 2, …, n. The mean, m_{j}, of the log-transformed ratios for the jth array is computed, and we define the normalized log ratio by log t̃_{ij} = log t_{ij} − m_{j}, for i=1, 2, …, n. The normalized ratios are used for classification.
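This normalization is simply per-array mean-centering of the log-ratios, which can be written in a few lines (our own sketch; the function name is ours):

```python
import numpy as np

def normalize_log_ratios(ratios):
    """Per-array normalization of an (n_genes x m_arrays) matrix of
    raw ratios t_ij: take logs and subtract each array's mean m_j,
    so the normalized log-ratios satisfy log t~_ij = log t_ij - m_j."""
    logs = np.log(ratios)
    return logs - logs.mean(axis=0, keepdims=True)
```

After normalization every array has mean log-ratio zero, which removes array-to-array intensity offsets before classification.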
The sensitivity of the Bayesian-variable-selection method was examined in the original study by adding synthetic Gaussian noise to the data (15). Here we will check sensitivity using only actual data by considering how the binarization methods perform when the data are not normalized for microarray-induced variation. This has the effect of making the classification more difficult because it tends to increase the dispersion of expression measurements for each gene.
Results and Discussion
Simulation Study.
The Bayesian selection algorithm ranks a gene according to the percentage of times it appears among the posterior samples generated by the algorithm (15). The five strongest (highest-ranked) genes are used for classification for each binarization method being tested. These will differ according to the method. In each case, the five strongest genes are used in the probit classification to classify the upregulated samples. The average recognition accuracies, based on 50 simulations, are shown in Table 1 for different upregulation factors. There is little difference for large K, but when K is small, the mixture-model method significantly outperforms the Mean and Median methods. Because classification accuracy depends on K in the upregulation model, this means that the mixture-model approach should help for more difficult classification problems.
Hereditary Breast Cancer Data.
First we consider hereditary breast cancer data, which can be downloaded from the web page of the original paper (16). In that paper, cDNA microarrays are used in conjunction with classification algorithms to show the feasibility of using differences in global gene expression profiles to separate BRCA1 and BRCA2 mutation-positive breast cancers. Twenty-two tumor samples from 21 patients are examined: 7 BRCA1, 8 BRCA2, and 7 sporadic. There are 3226 genes for each tumor sample. After binarization, we use the probit regression classifier with Bayesian gene selection to classify BRCA1 versus BRCA2 and sporadic.
Table 2 gives the leave-one-out error counts for the three binarization methods being tested and the four experiments under consideration: the two cancers, each with normalized and non-normalized data. The 5 top-ranked genes under Bayesian variable selection are used for classification for each binarization method.
Tables 3 through 5 give the top 10 scoring genes found by Bayesian variable selection for MMB, Median, and Mean, respectively, when the data has been normalized. Using the top 5 genes for classification for each binarization method, MMB yields no errors, whereas both Mean and Median result in one error each. Given that there are 22 microarrays, this represents a 5% improvement using MMB. This is significant, especially because MMB yields no errors at all.
By comparing the lists in Tables 3 through 5 with the list of the top 10 scoring genes in the original paper (15), shown in Table 6, we see that even with the compression of binarization, Bayesian selection may still find a number of similar significant genes. The top 10 scoring genes using MMB with normalized data include 3 of the genes on the original list, all 3 being in the original top 5. Included among the 3 are TOB1 and phosphofructokinase. Neither Mean nor Median has any of the original top 10 among their respective top 10.
For the non-normalized data, we see from Table 2 that the performance of both MMB and Mean remains the same, with MMB still perfect, but Median results in three errors, which is very poor given that MMB has no errors.
Small Round Blue-Cell Tumor Data.
Now we apply binarization and probit regression classification with Bayesian feature selection to a published data set for SRBCTs of childhood, which includes NB, RMS, non-Hodgkin's lymphoma, and the Ewing family of tumors (17). We classify the RMS and NB tumors. The data set for the two cancers is composed of 2308 genes and 35 samples: 23 samples for RMS and 12 samples for NB.
Tables 7 through 9 give the 10 strongest genes using Bayesian selection with MMB, Median, and Mean, respectively. For the top 5 genes on each list, the classification and recognition accuracies are shown in Table 2. Again, MMB yields no errors. Cadherin 2, which tops the significant gene list for MMB, is identified as an important gene in the original SRBCT study (17). In fact, it can perfectly classify RMS and NB by itself using the binary data. Even with perfect single-gene classification possible, Median still produces one error, although Mean does yield perfect classification, with cadherin 2 at the top of its list.
The situation is very different for the non-normalized data. Again, MMB identifies cadherin 2 as the strongest gene and produces no errors. However, both Mean and Median perform poorly, with four and three errors, respectively.
Appendix
This appendix describes parameter estimation for the log-of-normal mixture model. First, the samples are partitioned into Q clusters using fuzzy C-means clustering, and initial parameter values are estimated using the standard vector quantization method. We then estimate the mixture parameters c_{k}, μ_{k}, and ς_{k}^{2} using the expectation-maximization algorithm, iterating the following two steps until convergence, where Γ_{i} denotes the observations of gene i: in the E-step, the posterior probability that each observation in Γ_{i} belongs to each mixture component is computed from the current parameter estimates; in the M-step, the weights, means, and variances are re-estimated from the observations weighted by these posterior probabilities.
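Note that because each mixture component is the log of a normal, fitting the log-of-normal mixture to the log-ratios x is equivalent to fitting an ordinary Gaussian mixture to y = e^x. The sketch below (our own illustration) exploits this; its quantile-split initialization is a simple stand-in for the fuzzy C-means/vector-quantization step described above, an assumption made for brevity.

```python
import numpy as np

def em_log_of_normal_mixture(x, Q=2, n_iter=200):
    """Fit a Q-component log-of-normal mixture to log-ratios x via EM.
    Equivalent to Gaussian-mixture EM on y = exp(x), since each
    component of X is the log of a normal random variable.
    Returns (weights c_k, means mu_k, std devs sigma_k)."""
    y = np.exp(np.asarray(x, dtype=float))
    # Initialization: split the sorted data into Q equal groups
    # (a simple stand-in for fuzzy C-means + vector quantization).
    parts = np.array_split(np.sort(y), Q)
    c = np.array([len(p) / len(y) for p in parts])
    mu = np.array([p.mean() for p in parts])
    var = np.array([max(p.var(), 1e-6) for p in parts])
    for _ in range(n_iter):
        # E-step: posterior probability of each component per observation.
        dens = (c / np.sqrt(2.0 * np.pi * var)
                * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        c = nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / nk
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)
    return c, mu, np.sqrt(var)
```

On well-separated synthetic data the recovered component means and weights should closely match the generating parameters.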
We note that, if Q were not fixed at Q=2 for binary quantization, we could find an optimal number of distributions in the model via the minimum description length principle by building a binary tree with Q leaves, each representing a distribution in the mixture model (18).
Footnotes

↵1 Supported by the National Human Genome Research Institute, the University of Texas M. D. Anderson Cancer Center, and the National Science Foundation grant DMS-0225692.

↵3 The abbreviations used are: MMB, mixture model binarization; SRBCT, small, round blue-cell tumor; NB, neuroblastoma; RMS, rhabdomyosarcoma.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Received February 5, 2003.
 Revision received April 14, 2003.
 Accepted April 16, 2003.
 Molecular Cancer Therapeutics