The advancement of bioinformatics has attracted tremendous attention in the last decade. Using modern microarray technology, great achievements have been made in medical and biological research. Although the progress in microarray technology is prominent, many unresolved problems still hinder scientists from making further progress with microarray data. The imbalance between the large number of genes tested on a microarray chip and the relatively small number of samples is one of the most challenging problems. In microarray experiments, we are already able to monitor the expression levels of thousands of genes simultaneously. However, the sample size is significantly smaller than the number of genes. Processing such a large amount of data is time-consuming and inefficient. Thus, it is necessary to select the "important" genes from the gene dataset to reduce the dimension of the dataset, to generate a classifier, and to prepare for the clustering procedure.
There are thousands of genes in mammals that control protein metabolism and modulate the function of the body. Some genes keep their expression level constant under different conditions. These genes are called housekeeping genes. They are not the important genes we are looking for to construct a classifier. Other genes, on the contrary, have different expression levels under different conditions and in different tissues. These differentially expressed genes are called discriminatory genes. They contain valuable information that is related to the features of the samples. This step of discriminatory gene identification is generally referred to as gene selection. Gene selection is a prerequisite in many applications [8]. It should be noted that, often, the number of unrelated genes is much larger than the number of discriminatory genes.
Gene selection aims to detect a small subset of discriminatory genes that can effectively recognize the class to which a sample belongs; that is, the classification error of a given classifier is minimized. The known dataset that is used for learning the classifier, or equivalently for gene selection, is referred to as the training dataset. In a training dataset, every sample is labeled with its known class. Notice that when one class is significantly larger than the others, the trained classifier might be biased. Therefore, the desired gene selection methods are those that are not affected by the sizes of the classes in the training dataset. After picking a small subset of discriminatory genes, these genes will be used on the testing samples. Note that the testing samples are different from the training samples. The classification accuracy is defined as the percentage of correct decisions made by the classifier on the testing samples. A good gene selection method can obtain better classification accuracy.
Another major concern for gene selection and classification is over-fitting. Over-fitting refers to creating diagnostic models that may not generalize well to new data from the same sample types and data distribution despite good performance on the training set [12]. It is very easy to over-fit the classifiers and the gene selection procedures, because many algorithms are highly parametric and datasets consist of a relatively small number of high-dimensional samples, especially when using intensive model search and powerful learners [12]. Another problem is under-fitting, which leads to an inefficient classifier that is not optimally robust due to limited experimentation [12]. Under-fitting can easily happen under some circumstances, such as the application of a specific learning algorithm without consideration of alternatives, or the use of parametric learners with certain parameter values without searching for the best ones [12].
In this survey, we intend to review recent progress in gene selection research. To give a better picture of the whole body of work, Section 2 investigates some of the latest techniques developed for gene selection.
A variety of gene selection methods have been proposed in the last few years, some of which are based on statistical models. Xiong et al. [14] suggested selecting genes by searching the space of feature subsets using the classification error. Guyon et al. [6] proposed a gene selection approach utilizing support vector machines based on recursive feature elimination. These gene selection methods do not assume any specific data model for the gene expression data. However, it is reported [13] that their results may be influenced by the specific classification methods used for scoring the genes.
Another category of widely used gene selection methods works through model-based probabilistic analysis. For example, Baldi et al. [1] developed a Gaussian gene-independent model to process the gene expression data. They implemented a t-test combined with a full Bayesian treatment of the gene expression data. A disadvantage of this category of gene selection methods is the lack of adaptability, because it is unlikely that a universal probabilistic analysis model can be constructed that is suitable for all kinds of gene expression data, where noise and variance vary dramatically across datasets [13]. Thus, model-free methods are more desirable.
In the following part, we will review several classical and recent gene selection methods and their corresponding score functions.
Assume there are in total $p$ genes on the microarray chip, and in total $n$ samples that are grouped into $L$ classes in the training dataset. Let $a_{ij}$ denote the expression value of gene $j$ in sample $i$. The training dataset can thus be represented as a matrix $A = (a_{ij})_{n \times p}$.
Let $n_k$ denote the number of samples in class $C_k$, for $k = 1, 2, \ldots, L$ (i.e., $\sum_{k=1}^{L} n_k = n$). Let $\mu_{kj} = \frac{1}{n_k} \sum_{i \in C_k} a_{ij}$, which is the average expression value of gene $j$ in class $C_k$, for $k = 1, 2, \ldots, L$. The expression vector $\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{kp})$ is the centroid of class $C_k$. Correspondingly, the centroid matrix is $M = (\mu_{kj})_{L \times p}$.
The mean of the centroids is $\bar{\mu} = (\bar{\mu}_1, \bar{\mu}_2, \ldots, \bar{\mu}_p)$, where $\bar{\mu}_j = \frac{1}{L} \sum_{k=1}^{L} \mu_{kj}$.
For sample $i$ belonging to class $C_k$, the difference between the expression value of gene $j$ and the average expression value in class $C_k$ is $v_{ij} = a_{ij} - \mu_{kj}$. The matrix $V = (v_{ij})_{n \times p}$ is the variation matrix of the training dataset. Let $\sigma_{kj} = \sqrt{\frac{1}{n_k} \sum_{i \in C_k} v_{ij}^2}$ denote the average variation for samples in class $C_k$ with respect to the centroid. The centroid variation matrix is $\Sigma = (\sigma_{kj})_{L \times p}$.
Intuitively, most methods assume that feature genes have small intra-class variation and large inter-class variation.
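To make the notation concrete, here is a minimal NumPy sketch that computes the centroid, variation, and centroid variation matrices; it assumes the reconstructed definitions above, including the root-mean-square form of $\sigma_{kj}$:

```python
import numpy as np

def class_statistics(A, labels):
    """Compute the centroid matrix M, the variation matrix V, and the
    centroid variation matrix Sigma for an n-by-p expression matrix A
    whose rows are samples, following the notation above."""
    classes, inv = np.unique(labels, return_inverse=True)
    M = np.vstack([A[inv == k].mean(axis=0) for k in range(len(classes))])
    V = A - M[inv]                      # v_ij = a_ij - mu_kj for sample i in C_k
    Sigma = np.vstack([np.sqrt((V[inv == k] ** 2).mean(axis=0))
                       for k in range(len(classes))])
    return M, V, Sigma
```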
3.1 F-test Method
The F-test [3] is a classical single-gene scoring approach. Using the notations introduced above, it uses $s_{kj}^2$ to denote the variance of the expression values of gene $j$ in the $k$-th class:
$$s_{kj}^2 = \frac{1}{n_k - 1} \sum_{i \in C_k} (a_{ij} - \mu_{kj})^2,$$
and $\bar{a}_j = \frac{1}{n} \sum_{i=1}^{n} a_{ij}$ to denote the mean expression value of gene $j$ in the whole dataset. Gene $j$ has a score defined to be the one-way ANOVA F-statistic
$$F(j) = \frac{\sum_{k=1}^{L} n_k (\mu_{kj} - \bar{a}_j)^2 / (L-1)}{\sum_{k=1}^{L} (n_k - 1)\, s_{kj}^2 / (n-L)}.$$
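A minimal NumPy sketch of this score follows; it computes the standard one-way ANOVA statistic gene by gene, matching the reconstruction above, and should agree (up to floating point) with scipy.stats.f_oneway applied per column:

```python
import numpy as np

def f_score(A, labels):
    """One-way ANOVA F statistic of every gene (column of A)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, p = A.shape
    L = len(classes)
    overall = A.mean(axis=0)            # mean of each gene over all samples
    between = np.zeros(p)
    within = np.zeros(p)
    for c in classes:
        X = A[labels == c]
        mu = X.mean(axis=0)
        between += X.shape[0] * (mu - overall) ** 2
        within += ((X - mu) ** 2).sum(axis=0)   # (n_k - 1) * s_kj^2
    return (between / (L - 1)) / (within / (n - L))
```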
3.2 Cho’s Method
Using the same notations as above, Cho's method [2] defines a weight factor $w_i$ for sample $i$, which is $w_i = \frac{1}{L\, n_k}$ if sample $i$ belongs to class $C_k$ (so that $\sum_{i=1}^{n} w_i = 1$ and every class carries equal total weight). The weighted mean $mean(j)$ for gene $j$ is defined as
$$mean(j) = \sum_{i=1}^{n} w_i a_{ij}.$$
The weighted standard deviation is defined as
$$std(j) = \sqrt{\sum_{i=1}^{n} w_i \left(a_{ij} - mean(j)\right)^2}.$$
Then the score of gene $j$ is calculated as
$$score(j) = \frac{\sigma_c(j)}{std(j)},$$
where $\sigma_c(j)$ is the standard deviation of the centroid expression values $\mu_{1j}, \mu_{2j}, \ldots, \mu_{Lj}$.
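The exact weighting was lost in the original text; the NumPy sketch below assumes the reconstruction above, i.e., $w_i = 1/(L\,n_k)$ and the score $\sigma_c(j)/std(j)$:

```python
import numpy as np

def cho_score(A, labels):
    """Sketch of Cho's score as reconstructed above: every class gets equal
    total weight, and gene j is scored by the standard deviation of its
    class centroids divided by its weighted standard deviation."""
    classes, inv = np.unique(labels, return_inverse=True)
    L = len(classes)
    n_k = np.bincount(inv)                    # class sizes
    w = 1.0 / (L * n_k[inv])                  # one weight per sample, sums to 1
    mean_j = w @ A                            # weighted mean of each gene
    std_j = np.sqrt(w @ (A - mean_j) ** 2)    # weighted standard deviation
    M = np.vstack([A[inv == k].mean(axis=0) for k in range(L)])
    return M.std(axis=0, ddof=1) / std_j      # sigma_c(j) / std(j)
```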
3.3 GS1 and GS2
Both GS1 and GS2 [15] define the scatter of gene $j$ to capture the inter-class variations, which takes $\bar{\mu}_j$ in as a component:
$$scatter(j) = \sqrt{\frac{1}{L-1} \sum_{k=1}^{L} (\mu_{kj} - \bar{\mu}_j)^2},$$
in which the square root is the standard (estimated) deviation of all the centroids on gene $j$. For the intra-class variation, GS1 chooses the function
$$compact_1(j) = \max_{1 \le k \le L} \sigma_{kj}$$
and GS2 chooses the function
$$compact_2(j) = \frac{1}{L} \sum_{k=1}^{L} \sigma_{kj}$$
as the intra-class score for gene $j$. Therefore,
$$GS_1(j) = \frac{scatter(j)}{compact_1(j)}, \qquad GS_2(j) = \frac{scatter(j)}{compact_2(j)}$$
are the score functions for GS1 and GS2, respectively.
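A self-contained NumPy sketch of both scores, under the reconstructed definitions of $scatter(j)$, $compact_1(j)$, and $compact_2(j)$ above:

```python
import numpy as np

def gs_scores(A, labels):
    """Sketch of GS1 and GS2 as reconstructed above: both divide the
    scatter (standard deviation of the class centroids) by an intra-class
    compactness term: the maximum per-class variation for GS1, and the
    average per-class variation for GS2."""
    classes, inv = np.unique(labels, return_inverse=True)
    M = np.vstack([A[inv == k].mean(axis=0) for k in range(len(classes))])
    Sigma = np.vstack([np.sqrt(((A[inv == k] - M[k]) ** 2).mean(axis=0))
                       for k in range(len(classes))])      # sigma_kj
    scatter = M.std(axis=0, ddof=1)       # inter-class variation per gene
    gs1 = scatter / Sigma.max(axis=0)     # compact1(j) = max_k sigma_kj
    gs2 = scatter / Sigma.mean(axis=0)    # compact2(j) = average over classes
    return gs1, gs2
```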
3.4 SFS and SFFS
The SFS (sequential forward search) and SFFS (sequential forward floating search) algorithms were proposed by Pudil et al. [10] and were employed in Xiong et al.'s feature selection article [14]. SFS first finds the most significant feature and adds it to the selected feature pool. It then repeatedly finds the feature that forms the most significant combination with the feature(s) already in the pool. One problem with SFS is that once a feature is added to the pool, there is no chance that it will ever be removed. SFFS tries to solve this problem: it not only tries to find the best feature combination, but also tries to remove features from the selected feature pool to see whether that increases the significance, as sketched below.
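The following Python sketch illustrates both searches. The function `score` is a user-supplied subset evaluator (e.g., cross-validated classification accuracy); both names are illustrative, and the SFFS backward step is simplified (a full implementation tracks the best subset of each size to guarantee termination):

```python
def sfs(features, score, d):
    """Plain sequential forward search: greedily grow the selected pool
    to size d, at each step adding the feature whose addition gives the
    highest-scoring combination."""
    pool = []
    while len(pool) < d:
        best = max((f for f in features if f not in pool),
                   key=lambda f: score(pool + [f]))
        pool.append(best)
    return pool

def sffs(features, score, d):
    """Simplified floating variant: after each forward step, repeatedly
    drop a previously chosen feature (never the one just added) if
    dropping it improves the score, so earlier greedy choices can be
    undone."""
    pool = []
    while len(pool) < d:
        best = max((f for f in features if f not in pool),
                   key=lambda f: score(pool + [f]))
        pool.append(best)
        improved = True
        while improved and len(pool) > 2:
            improved = False
            for f in list(pool[:-1]):        # skip the just-added feature
                if score([g for g in pool if g != f]) > score(pool):
                    pool.remove(f)
                    improved = True
                    break
    return pool
```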
There are a variety of algorithms proposed to devise classifiers. The most widely used classifier is the support vector machine (SVM)-based classification method. SVMs have achieved excellent classification performance in most tasks [12]. The advantage of these approaches is that they are insensitive to the problem of dimensionality and are able to deal with large-scale classification in both samples and variables [12]. Early SVMs could only handle classification between two classes; these are called binary SVMs. The basic idea of binary SVMs is to use a kernel function to implicitly map data to a higher-dimensional space, and to identify the maximum-margin hyperplane that divides the training instances [12]. The simplest multiclass SVM method is one-versus-rest (OVR), which constructs k binary SVM classifiers, one separating each class from the rest [12]. A closely related scheme is the one-versus-one (OVO) classifier, proposed by Kressel in 1999, which trains one binary SVM for every pair of classes [12]. These two methods made tremendous progress in the development of SVMs, but they also have apparent limitations: they are computationally expensive, and they do not incorporate the analysis of generalization that is necessary for a good learning algorithm [12]. Later on, other researchers modified SVMs and improved the classifier.
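For illustration, scikit-learn exposes both schemes as wrappers around a binary SVM; the toy data below merely stands in for a real expression matrix:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# A toy stand-in for a microarray dataset: 60 samples, 200 genes, 3 classes.
X, y = make_classification(n_samples=60, n_features=200, n_informative=20,
                           n_classes=3, random_state=0)

# OVR trains one binary SVM per class; OVO trains one per pair of classes.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(ovr.score(X, y), ovo.score(X, y))
```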
Platt et al. devised a classifier named DAGSVM (2000) [9]. Hsu and Lin (2002) and Weston and Watkins (1999) also proposed their own classification methods based on SVMs [7]. In addition to SVMs, K-nearest neighbors (KNN) is another robust classifier. A KNN classifier decides the class of a testing sample by analyzing its K nearest neighbors in the training dataset [3]. In this method, samples are regarded as points in m-dimensional space, and then the Euclidean distance (or another distance measure) is used to classify the samples by a vote of the K nearest training instances [12].
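A minimal scikit-learn sketch of this voting scheme, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each sample is a point in 50-dimensional space; the class of a testing
# sample is decided by a majority vote of its 5 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```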
Another classification method is backpropagation neural networks (NNs). The network is composed of three parts: the input layer of units, the hidden layer(s) of units, and the output layer of units [12]. The algorithm adjusts the weights between units in the network until a vector of weights is found that best fits the training data [12]. After the training session is finished, the testing data can be fed into the input layer, and the network reports the class a sample belongs to through the output layer [12].
Probabilistic neural networks (PNNs) are a member of the Radial Basis Function (RBF) family of neural network methods [16]. PNNs have a structure very similar to NNs, comprising three layers. However, a PNN can have only one hidden layer, while an NN can have more than one. Another difference between the two methods is that in PNNs, the inputs pass directly to the hidden layer with no weights [12]. PNNs use a Gaussian density function as the activation function for the hidden layer, and a least-squares optimization algorithm is used to calculate the weights for the connections between the hidden layer and the output layer [12].
Among the various classification methods, SVMs and KNNs are the most widely used. When using classifiers, a few issues have to be considered in order to obtain the best classification performance. The first is the combination of algorithms. For different datasets, we can combine specific gene selection methods and classification methods. Different selection or classification methods can also be grouped together to improve the selection efficiency and classification accuracy. Statnikov et al. found that combining Decision Tree (DT) classifiers and MC-SVM can yield good classification accuracy [12]. We also found that combining the clustering of genes with other general gene selection methods can improve the quality of the genes selected and achieve better classification results.
The classification of cancer has been one of the most successful examples of the application of classification. The use of classification in cancer research is prominent because of the variety of cancer types, each of which requires different treatment. However, traditional diagnosis by histopathological analysis has a high chance of error.
Gene-expression-based classification can be used not only for class prediction, but also for class discovery [3]. Class discovery refers to identifying new tumor subtypes, while class prediction assigns particular tumor samples to existing classes, which could reflect current states or future outcomes [5]. Using microarray data, Golub et al. successfully discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these cancer types [5]. Fu et al. applied a gene classification method to small round blue cell tumor and leukemia microarray datasets and compared the classification accuracy with results obtained by artificial neural networks and statistical techniques [4]. They demonstrated the reliability of their method by its consistency with different methods and other biological evidence [4]. These studies strengthen our confidence in the prospect of using the molecular technique of microarray in cancer diagnosis.
Recently we found that another way to improve the accuracy of gene selection is to select genes from different groups. Because most current algorithms use only a limited number of discriminatory genes to devise the classifier, the quality of the genes selected is critically important. We select genes from different groups, which maximizes the diversity of the discriminatory gene set and enhances the accuracy of the classification. After gene selection, a classifier will be devised based on the discriminatory genes.
To explain the way we partition genes, we need to introduce the basic idea of a genetic network. A single gene can modulate (activate or inhibit) some genes and, at the same time, be modulated by other genes. The modulation of one gene by another can be condition-dependent and tissue-specific. Genes that share the same expression pattern or the same function, or that co-express in a specific tissue or under certain conditions, can be grouped together as a module. We then find the expression pattern of these modules in the microarray dataset, that is, we match each condition or tissue with the modules that have significantly higher or lower expression. By doing so, the gene modules can be connected with specific conditions or tissue types. Finally, biological evidence of regulatory relationships can be attached to different genes and modules to reflect the complex regulatory relations in the network. In a typical genetic network, genes are represented by vertices, and an edge between two genes stands for a regulatory program. The value of the edge (either 1 or 0) stands for activation or inhibition, as sketched below. The network can also be used to discover new relationships between conditions and genes, or between the genes themselves. For example, Segal et al. used a network with 2849 genes and found that the activation of some modules is specific to certain types of tumor; for example, a growth-inhibitory module is specifically repressed in acute lymphoblastic leukemias and may contribute to the deregulated proliferation in these cancers [11].
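As a toy illustration of this representation (all gene names and edges below are invented), such a network can be written down directly as a map from directed gene pairs to edge values:

```python
# Vertices are genes; a directed edge is a regulatory program, and its
# value marks the kind of regulation: 1 for activation, 0 for inhibition.
network = {
    ("geneA", "geneB"): 1,   # geneA activates geneB
    ("geneA", "geneC"): 0,   # geneA inhibits geneC
    ("geneD", "geneA"): 1,   # geneA is itself modulated by geneD
}
```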
For the grouping, we can assume that genes in the same module share the same expression pattern, signaling pathway, or function, and hence may play the same role in classification. Thus we limit the number of genes selected from any one group or module, to enhance the diversity of the selected genes and improve the accuracy of classification, as in the sketch below.
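A short Python sketch of this per-module cap; `scores` can come from any of the scoring functions above, and `modules[j]`, a module identifier for gene j, is a hypothetical input produced by a separate module-assignment step:

```python
import numpy as np

def select_with_module_cap(scores, modules, d, cap=1):
    """Rank genes by score, but take at most `cap` genes from any one
    module, so the selected set covers as many distinct modules as
    possible. Returns the indices of the d selected genes."""
    taken = {}
    selected = []
    for j in np.argsort(scores)[::-1]:          # genes, best score first
        m = modules[j]
        if taken.get(m, 0) < cap:
            selected.append(j)
            taken[m] = taken.get(m, 0) + 1
        if len(selected) == d:
            break
    return selected
```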
[1] P. Baldi and A. D. Long. A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics, 17:509–519, 2001.
[2] J. Cho, D. Lee, J. H. Park, and I. B. Lee. New gene selection method for classification of cancer subtypes considering within-class variation. FEBS Letters, 551:3–7, 2003.
[3] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97:77–87, 2002.
[4] L. M. Fu and C. S. Fu-Liu. Multi-class cancer subtype classification based on gene expression signatures with reliability analysis. FEBS Letters, 561:186–190, 2004.
[5] T. R. Golub et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[6] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
[7] C. W. Hsu and C. J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415–435, 2002.
[8] S. Mukherjee and S. J. Roberts. A theoretical analysis of gene selection. In The Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2004), 2004.
[9] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. Neural Information Processing Systems, 12:547–553, 2000.
[10] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125, 1994.
[11] E. Segal, N. Friedman, D. Koller, and A. Regev. A module map showing conditional activity of expression modules in cancer. Nature Genetics, 36:1090–1098, 2004.
[12] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21:631–643, 2005.
[13] O. G. Troyanskaya, M. E. Garber, P. O. Brown, D. Botstein, and R. B. Altman. Non-parametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18:1454–1461, 2002.
[14] M. Xiong, X. Fang, and J. Zhao. Biomarker identification by feature wrappers. Genome Research, 11:1878–1887, 2001.
[15] K. Yang, Z. Cai, J. Li, and G. Lin. A model-free and stable gene selection in microarray data analysis. In The Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2005), pages 3–10, 2005.