In the current context, there is a clear need for Text Mining techniques to analyse the huge quantity of unstructured text documents available on the Internet. These textual data are often represented by sparse high dimensional matrices where rows and columns represent documents and terms respectively.
Thus, it would be worthwhile to simultaneously group these terms and documents into meaningful clusters, making this substantial amount of data easier to handle and interpret. Co-clustering techniques just serve this purpose.
Although many existing co-clustering approaches have been successful in revealing homogeneous blocks in several domains, these techniques are still challenged by the high dimensionality and sparsity characteristics exhibited by document-term matrices.
Due to this sparsity, several co-clusters are primarily composed of zeros. While homogeneous, these co-clusters are irrelevant and must be filtered out in a post-processing step to keep only the most significant ones.
The objective is to propose new co-clustering algorithms tailored to take into account these sparsity-related issues. The proposed algorithms seek a block diagonal structure and allow to straightaway identify the most useful co-clusters, which makes them specially effective for the text co-clustering task.
Our contributions can be summarized as follows:
First, we introduce and demonstrate the effectiveness of a novel co-clustering algorithm based on a direct maximization of graph modularity.
While existing graph-based co-clustering algorithms rely on spectral relaxation, the proposed algorithm uses an iterative alternating optimization procedure to reveal the most meaningful co-clusters in a document-term matrix. Moreover, the proposed optimization has the advantage of avoiding the computation of eigenvectors, a task which is prohibitive when considering high dimensional data. This is an improvement over spectral approaches, where the eigenvectors computation is necessary to perform the co-clustering.
Second, we use an even more powerful approach to discover block diagonal structures in document-term matrices. We rely on mixture models, which offer strong theoretical foundations and considerable flexibility that makes it possible to uncover various specific cluster structure. More precisely, we propose a rigorous probabilistic model based on the Poisson distribution and the well known Latent Block Model. Interestingly, this model includes the sparsity in its formulation, which makes it particularly effective for text data.
Setting the estimate of this model’s parameters under the Maximum Likelihood (ML) and the Classification Maximum Likelihood (CML) approaches, four co-clustering algorithms have been proposed, including a hard, a soft, a stochastic and a fourth algorithm which leverages the benefits of both the soft and stochastic variants, simultaneously.
As a last contribution, we propose a new biomedical text mining framework that includes some of the above mentioned co-clustering algorithms. This work shows the contribution of co-clustering in a real biomedical text mining problematic.
The proposed framework is able to propose new clues about the results of genome wide association studies (GWAS) by mining PUBMED abstracts.
This framework has been tested on asthma disease and allowed to assess the strength of associations between asthma genes reported in previous GWAS as well as discover new candidate genes likely associated to asthma.