H. Amoualian (LIG): Copula-based Approaches to Capture Topic Dependency between Documents and Cohere Topic Assignment within Document

Topic models are based on the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic. Latent Dirichlet Allocation (LDA; Blei et al., 2003) is a probabilistic Bayesian model used to describe a corpus of D documents associated with a vocabulary of size V. Some of the assumptions made in LDA are not realistic for certain applications.

Firstly, LDA assumes that the topic distributions of the documents in a corpus are independent, which does not hold when the documents arrive as a stream and their topics are correlated. We propose a new model for topic and word-topic dependencies between consecutive documents in document streams. The model is an extension of LDA that makes use of copulas, which constitute a generic tool for modeling dependencies between random variables. We rely here on Archimedean copulas, and more precisely on Frank copulas, as they are symmetric and associative and are thus appropriate for exchangeable random variables. Our experiments, conducted on three standard collections that have been used in several studies on topic modeling, show that our proposal outperforms previous ones (such as dynamic topic models and temporal LDA), both in terms of perplexity and for tracking similar topics in a document stream.

Secondly, LDA assumes that words are generated as a bag of words and are therefore exchangeable, whereas recent work has shown that capturing dependencies between words through units such as segments, sentences and chunks can improve the model. Here we propose an LDA-based model that generates topically coherent segments within documents by jointly segmenting documents and assigning topics to their words. The coherence between topics is ensured through a copula, binding the topics associated with the words of a segment. In addition, this model relies on both document-specific and segment-specific topic distributions so as to capture fine-grained differences in topic assignments. We show that the proposed model naturally encompasses other state-of-the-art LDA-based models designed for similar tasks. Furthermore, our experiments, conducted on six different publicly available datasets, show the effectiveness of our model in terms of perplexity, Normalised Point-wise Mutual Information, which captures the coherence between the generated topics, and the Micro F1 measure for text classification.
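
As an illustration only, the basic generative procedure of standard LDA described in the abstract (choose a distribution over topics for a document, then for each word choose a topic and draw a word from it) can be sketched in Python with NumPy. This is a minimal sketch of plain LDA, not the copula-based models presented in the talk; the function name `generate_corpus`, the corpus sizes and the Dirichlet hyperparameters `alpha` and `beta` are assumptions chosen for the example.

```python
import numpy as np


def generate_corpus(n_docs=5, n_topics=3, vocab_size=20, doc_len=50,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the standard LDA generative process.

    All sizes and hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # One distribution over the vocabulary per topic (each row sums to 1).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    docs, assignments = [], []
    for _ in range(n_docs):
        # Each document draws its own distribution over topics,
        # independently of the other documents (the LDA assumption).
        theta = rng.dirichlet(np.full(n_topics, alpha))

        # For each word position, choose a topic according to theta ...
        z = rng.choice(n_topics, size=doc_len, p=theta)

        # ... then draw a word from that topic's word distribution.
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])

        docs.append(w)
        assignments.append(z)

    return phi, docs, assignments
```

In this sketch each `theta` is drawn independently for every document and each topic in `z` independently for every word. These are the two independence assumptions that the abstract proposes to relax.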

Thursday, June 1, 2017 - 11:00 to 12:00
Inria B21
Hesam Amoualian
Université Grenoble Alpes