Stochastic Systems Group  

Scalable Topic Models
David Blei
Princeton University
Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents.
Most topic models are based on hierarchical mixed-membership models, where each document expresses a set of components (called topics) with individual per-document proportions. The computational problem is to condition on a collection of observed documents and estimate the posterior distribution of the topics and per-document proportions. In modern data sets, this amounts to posterior inference with billions of latent variables.
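To make the mixed-membership idea concrete, here is a minimal sketch of a generative process in which every document mixes the same shared topics in its own proportions. It assumes an LDA-style model with Dirichlet priors, which the abstract does not name explicitly; the function name `generate_corpus` and the parameters `alpha` and `eta` are illustrative choices, not taken from the talk.

```python
import numpy as np

def generate_corpus(num_docs=5, num_topics=3, vocab_size=20, doc_len=50,
                    alpha=0.5, eta=0.5, seed=0):
    """Sample a toy corpus from an LDA-style mixed-membership model (assumed)."""
    rng = np.random.default_rng(seed)
    # Shared topics: each row is a distribution over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=num_topics)
    # Per-document proportions: each row is a distribution over topics.
    proportions = rng.dirichlet(np.full(num_topics, alpha), size=num_docs)
    docs = []
    for d in range(num_docs):
        # Pick a topic for every word position, then a word from that topic.
        assignments = rng.choice(num_topics, size=doc_len, p=proportions[d])
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in assignments])
        docs.append(words)
    return topics, proportions, docs

topics, proportions, docs = generate_corpus()
```

Posterior inference runs this process in reverse: given only `docs`, estimate `topics` and `proportions`. The hidden per-word assignments are what push the number of latent variables into the billions for large collections.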
How can we cope with such data? In this talk, I will describe stochastic variational inference, an algorithm for computing with topic models that can handle very large document collections (and even endless streams of documents). I will demonstrate our algorithm with models fitted to millions of articles. I will show how stochastic variational inference can be generalized to many kinds of hierarchical models, including models of images and social networks, and Bayesian nonparametric models. I will highlight several open questions and outstanding issues.
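The abstract does not spell out the algorithm, so the following is only a rough sketch of the global update commonly associated with stochastic variational inference. It assumes a global variational parameter `lam` that is blended with a noisy estimate `lam_hat` computed from one sampled document (or minibatch) as if that sample were the whole collection. The names `lam`, `lam_hat`, `tau`, and `kappa` are illustrative, and the local step that produces `lam_hat` is model-specific and omitted here.

```python
import numpy as np

def step_size(t, tau=1.0, kappa=0.7):
    """Decaying step size rho_t = (t + tau)^(-kappa).

    kappa in (0.5, 1] satisfies the usual Robbins-Monro conditions;
    tau > 0 down-weights early iterations.
    """
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1]")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return (t + tau) ** (-kappa)

def global_update(lam, lam_hat, t, tau=1.0, kappa=0.7):
    """Blend the current global parameter with a noisy estimate from one sample."""
    rho = step_size(t, tau, kappa)
    return (1.0 - rho) * np.asarray(lam) + rho * np.asarray(lam_hat)
```

Because each update touches only the sampled documents, the cost per iteration does not grow with the size of the collection, which is what makes the approach suitable for streams.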
Biography: David Blei is an associate professor of Computer Science at Princeton University. His research involves probabilistic topic models, graphical models, approximate posterior inference, and Bayesian nonparametrics.