
Microarray Data Analysis for Genetic Network Prediction

 

Biclustering Survey

 

1.        Introduction

Gene clustering is one of the most important applications of DNA microarray data. Researchers have made prominent progress in classifying samples by similar gene expression patterns. In cancer research, where microarrays are most widely used, clustering has successfully been used to classify different cancer types and to identify new cancer types that were not recognized in the original data. Most existing clustering methods operate on the entire set of discriminatory genes; that is, the genes they use as the clustering criterion are the same across different groups. The emerging idea, called biclustering, is to cluster both genes and conditions. The biological motivation is that a gene may function in multiple pathways and only under some specific conditions, so its expression level is intimately related to the conditions. The biclustering method has demonstrated good performance on some cancer data [5].

 

Biclustering and traditional clustering are intimately related. However, several technical challenges remain notable for biclustering, and most biclustering algorithms are still struggling to solve them. Kung summarizes three main challenges: 1) the coherence models are not known in advance; 2) clustering genes and conditions simultaneously adds technical difficulty; 3) a gene or condition can belong to more than one group, i.e., biclusters can overlap [6]. In the following sections, this survey reviews several biclustering methods, focusing on their underlying ideas and on how they cope with these challenges. We mainly introduce two methods: a machine learning approach proposed by Kung and colleagues, and spectral biclustering, proposed by Kluger and colleagues [5, 6, 7].

 

2.        Methods

2.1    A machine learning approach of biclustering

 

1.       Preprocessing

This approach comprises three steps: preprocessing, biclustering metrics, and fusion. There are two biological coherence models, each corresponding to one of the two preprocessing methods, normalization and standardization [7]. The first is the additive coherence model; its corresponding preprocessing step is normalization. The additive coherence model assumes a relation between mRNA_a and mRNA_b of the form mRNA_b = k * mRNA_a, where k is a scaling factor [7]. Let

a = log(mRNA_a)    and    b = log(mRNA_b).

Then we have

b = a + c,    where c = log(k),

so that the multiplicative change is converted into an additive increment [7].

 

The goal of normalization is to diminish the uncertainty caused by the additive increments [6]. Mathematically, this is done by subtracting the mean from each row or column [7]. The other model is the multiplicative coherence model; its corresponding preprocessing step is standardization. This model assumes an exponential relation between mRNA_a and mRNA_b, expressed as mRNA_b = (mRNA_a)^c. After the logarithmic transformation, the exponential relation becomes a multiplicative one, and the corresponding relation between a and b is b = c * a [7]. The goal of standardization is to diminish the uncertainty incurred by the multiplicative increments [7]. Mathematically, this is done by dividing each row or column by its standard deviation [7].

Since biclustering involves clustering both rows and columns, the preprocessing has to transform the data along both rows and columns. Accordingly, there are two types of preprocessing schemes in terms of how row and column processing are combined: the symmetric preprocessing model and the asymmetric preprocessing model [7]. The basic idea of the symmetric model is that rows and columns receive the same preprocessing. Several biclustering models adopt symmetric processing; for example, Cheng and Church use symmetric processing in their model [1], and Tavazoie et al. apply normalization and standardization to both rows and columns [4]. Asymmetric preprocessing, on the other hand, applies different processing to rows and columns. Asymmetric preprocessing is more biologically meaningful and therefore more promising. Preprocessing is an important step in the entire biclustering algorithm; its effect on the accuracy and efficiency of the results has already been established for traditional clustering algorithms.
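
As a concrete illustration, the following Python sketch shows the two preprocessing operations described above for an expression matrix with genes as rows and conditions as columns. The function names, the axis choices, and the way they are combined into symmetric or asymmetric schemes are illustrative assumptions, not details taken from [6, 7].

```python
# Sketch of the two preprocessing steps for an expression matrix X
# (genes as rows, conditions as columns). Illustrative only.
import numpy as np

def normalize(X, axis=1):
    """Normalization for the additive coherence model:
    subtract the mean from each row (axis=1) or column (axis=0)."""
    return X - X.mean(axis=axis, keepdims=True)

def standardize(X, axis=1):
    """Standardization for the multiplicative coherence model:
    divide each row (axis=1) or column (axis=0) by its standard deviation."""
    return X / X.std(axis=axis, keepdims=True)

if __name__ == "__main__":
    X = np.log(np.random.default_rng(0).uniform(1, 10, size=(6, 4)))
    # Symmetric scheme: the same operation applied to rows and columns.
    X_sym = normalize(normalize(X, axis=1), axis=0)
    # Asymmetric scheme: different operations on rows and columns.
    X_asym = standardize(normalize(X, axis=1), axis=0)
    print(X_sym.shape, X_asym.shape)
```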

 

2. Biclustering metrics

After the preprocessing procedure, biclustering algorithms are applied to the data set to cluster rows and columns based on the homogeneity between them. The two most widely used types of metrics for this step are matrix norms and vector norms.

 

2.1 Matrix-norm metrics

In this method, A is defined as a submatrix of the entire data matrix, and the similarity between its rows and columns is measured by a matrix-norm function [7]. Most matrix-norm methods use the Frobenius norm, denoted ||A||_F and defined as the square root of the sum of squares of all the elements in the matrix [7]. The first matrix-norm metric was introduced by Hartigan [3]. The notion of this model is that, for a perfect bicluster, every matrix entry equals a constant denoted c [3]. If there is noise, then the deviation is

||A - c*E||_F,

where E denotes an all-ones matrix, i.e., every entry of E equals 1, and ||.||_F denotes the Frobenius norm [3].
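
A small Python sketch of how Hartigan's score could be computed for a candidate bicluster is given below. Taking the constant c to be the mean of the submatrix is an assumption that follows from minimizing the Frobenius-norm deviation, not a step quoted from [3].

```python
# Sketch of Hartigan's matrix-norm score for a candidate bicluster A.
import numpy as np

def hartigan_score(A):
    c = A.mean()                       # constant that minimizes ||A - c*E||_F
    E = np.ones_like(A)                # all-ones matrix of the same shape
    return np.linalg.norm(A - c * E, ord="fro")

if __name__ == "__main__":
    A = np.array([[2.0, 2.1, 1.9],
                  [2.0, 2.0, 2.2]])
    print(hartigan_score(A))           # small value -> nearly constant bicluster
```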

 

The advantage of this model is its conceptual simplicity. Nevertheless, it ignores the biologically justifiable coherence models, so good preprocessing is required before it can be applied [7]. Other matrix norms have also been devised according to the corresponding coherence models, e.g., the additive coherent matrix norm and the multiplicative coherent matrix norm; the details of these two models can be found in [7].

 

2.2 Vector-based metrics

The basic notion of vector-based metrics is to measure the distance between vectors: two genes are represented as two vectors, and each entry of a vector corresponds to one condition [7]. Traditionally, the Euclidean distance is used to measure the distance between vectors. For the additive coherence model, the distance is adjusted by a minimizing offset c, so that

d(a, b) = min_c sum_i (a_i - b_i - c)^2.

The minimum is attained at c = mean(a) - mean(b), which gives the minimum distance

d(a, b) = sum_i [ (a_i - mean(a)) - (b_i - mean(b)) ]^2,

where mean(a) and mean(b) are the means of a_i and b_i [7]. For additive and multiplicative coherence models, which are preprocessed with both normalization and standardization,

d(a, b) = sum_i [ (a_i - mean(a))/sigma_a - (b_i - mean(b))/sigma_b ]^2,

where sigma_a and sigma_b are the standard deviations of a_i and b_i [7]. Computationally, the vector-based metrics are much simpler than the matrix-norm metrics.
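
The two adjusted distances can be coded directly. The sketch below assumes two expression profiles of equal length and uses the population standard deviation, which is an implementation choice rather than a detail from [7].

```python
# Sketch of the vector-based distances for two expression profiles a and b.
import numpy as np

def additive_distance(a, b):
    """Distance under the additive coherence model:
    compare the mean-removed profiles."""
    return np.sum(((a - a.mean()) - (b - b.mean())) ** 2)

def additive_multiplicative_distance(a, b):
    """Distance when both normalization and standardization are applied:
    compare the mean-removed, std-scaled profiles."""
    return np.sum(((a - a.mean()) / a.std() - (b - b.mean()) / b.std()) ** 2)

if __name__ == "__main__":
    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = 2.0 * a + 5.0                  # same pattern, shifted and scaled
    print(additive_distance(a, b))                 # nonzero: scaling remains
    print(additive_multiplicative_distance(a, b))  # ~0: patterns match
```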

 

3. Fusion

The biological principle behind fusion is that the genes in the sample group may be co-expressed under different coherence models [6]. How do we decide which models to fuse? There are two basic principles for fusion: 1) each preprocessing model must deliver a sound performance on its own, and 2) the models to be fused must be complementary to each other [2]. The ROC curve provides useful information about possible complementary models [6]. After deciding which models to fuse, a direct or indirect fusion architecture can be adopted. Direct fusion concatenates multiple features to form an enlarged feature vector, while indirect fusion compresses each feature vector into a scalar feature or represents it by a local score [6].
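
As a toy illustration of the two architectures, the sketch below contrasts direct and indirect fusion for two hypothetical feature representations of the same gene; the scoring function used for indirect fusion is an arbitrary placeholder, not one prescribed in [6].

```python
# Toy sketch of direct vs. indirect fusion of two feature representations.
import numpy as np

def direct_fusion(f1, f2):
    """Direct fusion: concatenate features into one enlarged feature vector."""
    return np.concatenate([f1, f2])

def indirect_fusion(f1, f2, score_fn):
    """Indirect fusion: compress each feature vector into a scalar
    (local) score before combining."""
    return np.array([score_fn(f1), score_fn(f2)])

if __name__ == "__main__":
    f1, f2 = np.array([0.2, 0.8, 0.5]), np.array([1.1, 0.3])
    print(direct_fusion(f1, f2))
    print(indirect_fusion(f1, f2, score_fn=np.mean))
```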

 

2.2    Spectral Biclustering

The goal of spectral biclustering is to rearrange the microarray data matrix into a checkerboard structure by clustering the genes and conditions simultaneously [5]. This algorithm derives from Dhillon's formulation of the co-clustering of words and documents [2]. That earlier co-clustering method partitions the rows and columns into the same number of clusters, whereas spectral biclustering can divide rows and columns into different numbers of clusters [5, 2]. The basic assumption of the algorithm is that better estimates of the correlations between genes or conditions can be obtained by averaging a gene over various conditions of the same type, or by averaging sets of genes with similar expression profiles under a specific condition [5].

The spectral biclustering method therefore comprises the following steps: independent rescaling of genes and conditions; simultaneous normalization of genes and conditions; and postprocessing of the eigenvectors to find partitions [5]. The independent rescaling procedure processes the original data along rows and columns separately, while the simultaneous normalization of genes and conditions processes rows and columns together; this second step is also called bistochastization. The detailed procedures for these two steps can be found in [5]. The outcome of the process is a set of gene and condition eigenvectors, the eigengenes and eigenarrays [5]. In the postprocessing step, these eigenvectors are used to perform the partition.
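
The following Python sketch illustrates the checkerboard idea on synthetic data: an approximate bistochastization by alternating row and column rescaling (a Sinkhorn-style iteration), an SVD of the rescaled matrix, and k-means on a few leading singular vectors of each side. It is a simplified illustration, not the exact algorithm of Kluger et al. [5]; the iteration count, the number of singular vectors, and the use of scikit-learn's KMeans are assumptions.

```python
# Simplified checkerboard-style spectral biclustering on a synthetic matrix.
import numpy as np
from sklearn.cluster import KMeans

def bistochastize(X, n_iter=50):
    """Alternately rescale rows and columns of a positive matrix so that
    row sums and column sums become approximately constant."""
    A = X.astype(float).copy()
    for _ in range(n_iter):
        A = A / A.sum(axis=1, keepdims=True)   # rescale rows
        A = A / A.sum(axis=0, keepdims=True)   # rescale columns
    return A

def spectral_bicluster(X, n_gene_clusters=2, n_cond_clusters=2, n_vectors=3):
    A = bistochastize(X)
    # Singular vectors of the rescaled matrix reveal the checkerboard structure.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Skip the (nearly constant) leading singular vectors; keep the next few.
    gene_features = U[:, 1:1 + n_vectors]
    cond_features = Vt[1:1 + n_vectors, :].T
    gene_labels = KMeans(n_clusters=n_gene_clusters, n_init=10).fit_predict(gene_features)
    cond_labels = KMeans(n_clusters=n_cond_clusters, n_init=10).fit_predict(cond_features)
    return gene_labels, cond_labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic checkerboard: 2 gene groups x 2 condition groups, plus noise.
    block = np.kron(np.array([[5.0, 1.0], [1.0, 5.0]]), np.ones((20, 10)))
    X = block + rng.random(block.shape)
    print(spectral_bicluster(X))
```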

 

2.3    Application

In this section, this survey introduces an application of feature selection techniques. In this application, the authors propose to combine feature gene selection with pattern recognition to identify biomarkers [8]. A biomarker is a molecule that is characteristic of a biological phenomenon, especially a disease, and that can be used for disease assessment or detection. The main problems with previous biomarker identification algorithms are: 1) the classification accuracy based on biomarkers obtained from simple scoring calculations such as the t-test and F-test is very low; 2) most scoring methods do not adopt classification accuracy as the criterion for evaluating biomarkers; 3) some models adopt combinations of genes as biomarkers, which lack biological support [8]. Facing these problems, Xiong and colleagues proposed to combine the advantages of pattern recognition and feature selection to develop three wrappers: linear discriminant analysis, logistic regression, and support vector machines [8]. Feature gene selection serves two purposes in this model: the traditional role of reducing the dimension, and identifying genes related to the cause or phenotype of the disease that can be used as characteristics for diagnosis [8].

Some of the biomarkers found by this algorithm receive support from biological evidence. For example, from the microarray data of BRCA1 mutation-positive tumors the authors identified p53-binding protein, Ras suppressor protein, psoriasis-associated protein, and the DNA repair gene MSH2 as biomarkers [8]; these genes are all relevant to the development of tumors. In the BRCA2 mutation-positive tumor dataset, they identified MAPK1, MAPK7, suppression of tumorigenicity, and simian sarcoma viral oncogene homolog, which are all related to tumorigenesis [8]. Good algorithms for biomarker identification are still in demand. Since biclustering methods have the advantage of clustering genes and conditions together, and the conditions can here be related to disease types, biclustering is very promising in this application field.
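
To make the wrapper idea concrete, the sketch below performs a greedy forward selection of genes with a linear SVM, using cross-validated accuracy as the evaluation criterion. It is written in the spirit of the feature wrappers of [8] but is not their exact procedure; the greedy search strategy, the stopping rule, and the synthetic data are assumptions.

```python
# Hedged sketch of wrapper-style gene selection with a linear SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def greedy_gene_wrapper(X, y, max_genes=5, cv=5):
    """X: samples x genes expression matrix, y: class labels.
    Greedily add the gene that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best_acc = 0.0
    for _ in range(max_genes):
        scores = {
            g: cross_val_score(SVC(kernel="linear"), X[:, selected + [g]], y, cv=cv).mean()
            for g in remaining
        }
        g_best = max(scores, key=scores.get)
        if scores[g_best] <= best_acc:
            break                          # no remaining gene improves accuracy
        best_acc = scores[g_best]
        selected.append(g_best)
        remaining.remove(g_best)
    return selected, best_acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 20)
    X = rng.normal(size=(40, 50))
    X[:, 3] += y * 2.0                     # make gene 3 informative
    print(greedy_gene_wrapper(X, y, max_genes=3))
```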

 

References:

[1] Y. Cheng and G. M. Church. Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pages 93-103, 2000.
[2] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269-274, 2001.
[3] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67:123-129, 1972.
[4] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. Systematic determination of genetic network architecture. Nature Genetics, 22:281-285, 1999.
[5] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Research, 13:703-716, 2003.
[6] S. Y. Kung and M. W. Mak. Machine learning approach to DNA microarray biclustering analysis. In IEEE Workshop on Machine Learning for Signal Processing, 2005.
[7] S. Y. Kung, M. W. Mak, and I. Tagkopoulos. Multi-metric and multi-substructure biclustering analysis for gene expression data. In Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2005), 2005.
[8] M. Xiong, X. Fang, and J. Zhao. Biomarker identification by feature wrappers. Genome Research, 11:1878-1887, 2001.