Finding scientific topics
Thomas L. Griffiths*†‡ and Mark Steyvers§
*Department of Psychology, Stanford University, Stanford, CA 94305; †Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology,
Cambridge, MA 02139-4307; and §Department of Cognitive Sciences, University of California, Irvine, CA 92697
A first step in identifying the content of a document is determining
which topics that document addresses. We describe a generative
model for documents, introduced by Blei, Ng, and Jordan [Blei,
D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3,
993–1022], in which each document is generated by choosing a
distribution over topics and then choosing each word in the
document from a topic selected according to this distribution. We
then present a Markov chain Monte Carlo algorithm for inference
in this model. We use this algorithm to analyze abstracts from
PNAS by using Bayesian model selection to establish the number of