Project author: duhaime

Project description:
Non-Negative Matrix Factorization
Primary language: Python
Repository URL: git://github.com/duhaime/nmf.git
Created: 2017-11-12T15:57:04Z
Project community: https://github.com/duhaime/nmf

License: MIT License


NMF Topic Models

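Non-Negative Matrix Factorization (NMF) is a dimension reduction technique that approximately factors a non-negative input matrix of shape m x n into the product of a non-negative matrix of shape m x k and a non-negative matrix of shape k x n, where k is typically much smaller than m and n.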
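As a rough illustration of the factorization, the sketch below uses the classic multiplicative update rules for the squared-error objective. It is a dependency-free toy, not the code in this repository: the function names (nmf, matmul, transpose, frobenius_error), the iteration count, and the sample matrix are all assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate v (m x n) by w (m x k) times h (k x n), all non-negative."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Random non-negative starting point (hypothetical choice of initialisation).
    w = [[rng.random() for _ in range(k)] for _ in range(m)]
    h = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the residual v - w h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Made-up 4 x 3 input, factored with k = 2.
v = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 4.0], [0.0, 1.0, 0.0]]
w, h = nmf(v, k=2)
print(len(w), len(w[0]), len(h), len(h[0]))  # 4 2 2 3
```

Because the updates only multiply by non-negative ratios, the factors stay non-negative throughout. For real workloads a numerical library would be the more practical choice.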

In text mining, one can use NMF to build topic models. Using NMF, one can factor a Term-Document Matrix of shape documents x word types into a matrix of shape documents x topics and another matrix of shape topics x word types. The former gives the weight of each topic in each document, and the latter gives the weight of each word type in each topic.

Given a collection of input documents, the source code in this repository builds a memory-efficient Term-Document Matrix, factors that matrix using NMF, then writes the resulting data structures as JSON outputs.
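The README does not show how the Term-Document Matrix is stored. One memory-efficient option, sketched here purely as an assumption, is to keep only the non-zero counts of each document in a dictionary keyed by column index. The helper names tokenize and build_term_document_matrix, the tokenization rule, and the sample documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_term_document_matrix(documents):
    """Build a sparse documents x word types count matrix.

    documents maps a document name to its text. Returns the sorted document
    names, the sorted vocabulary, and one {column index: count} dict per
    document that holds only the non-zero entries.
    """
    doc_names = sorted(documents)
    counts = [Counter(tokenize(documents[name])) for name in doc_names]
    vocabulary = sorted(set().union(*counts))
    term_index = {term: j for j, term in enumerate(vocabulary)}
    rows = [{term_index[term]: count for term, count in doc_counts.items()}
            for doc_counts in counts]
    return doc_names, vocabulary, rows


# Made-up documents for illustration.
docs = {"a.txt": "light light prism", "b.txt": "tree bark tree sap"}
names, vocab, rows = build_term_document_matrix(docs)
print(vocab)    # ['bark', 'light', 'prism', 'sap', 'tree']
print(rows[0])  # {1: 2, 2: 1}
```

Most documents use only a small share of the vocabulary, so storing non-zero entries alone keeps memory roughly proportional to the number of distinct words per document rather than to the full vocabulary size.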

Usage

Command Line Usage

    # Obtain sample documents
    wget https://s3.amazonaws.com/duhaime/github/nmf/texts.tar.gz
    tar -zxf texts.tar.gz && rm texts.tar.gz
    # Obtain nmf script
    git clone https://github.com/duhaime/nmf
    # Install dependencies
    cd nmf && pip install -r requirements.txt --user
    # Build a topic model with 20 topics using ./texts/ as the input directory
    python nmf/nmf.py -files texts -topics 20

Class Usage

To install, run pip install nmf.

Then, to build a topic model using all text files in texts, run:

    from nmf import NMF
    model = NMF(files='texts', topics=20)

The following attributes will then be present on model:

    # the top terms in each topic
    model.topics_to_words
    # the presence of each topic in each document
    model.doc_to_topics
    # the documents by topics matrix; shape = (documents, topics)
    model.documents_by_topics
    # the topics by terms matrix; shape = (topics, terms)
    model.topics_by_terms
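To illustrate how a topics by terms matrix relates to the top terms in each topic, the sketch below ranks the columns of each row by weight. It is not the library's own code: top_terms is a hypothetical helper, and the vocabulary and weights are invented for the example.

```python
def top_terms(topics_by_terms, vocabulary, n=3):
    """Return the n highest-weighted terms for each topic.

    topics_by_terms has shape (topics, terms); vocabulary[j] is the term
    that column j refers to. Both are assumed inputs for this sketch.
    """
    result = {}
    for topic_id, weights in enumerate(topics_by_terms):
        # Column indices sorted from the largest weight to the smallest.
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        result[topic_id] = [vocabulary[j] for j in ranked[:n]]
    return result


# Invented weights for two topics over six terms.
vocabulary = ["bark", "colours", "light", "prism", "sap", "tree"]
topics_by_terms = [
    [0.0, 0.9, 0.7, 0.5, 0.0, 0.1],
    [0.4, 0.0, 0.0, 0.0, 0.8, 0.6],
]
print(top_terms(topics_by_terms, vocabulary))
# {0: ['colours', 'light', 'prism'], 1: ['sap', 'tree', 'bark']}
```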

JSON Output

If you invoke NMF from the command line, or you build an NMF model and specify the write_output=True argument, the following output files will be generated in a directory named results:

topic_to_words.json maps each topic id to the top words in that topic:

    {
      "0": [
        "colours",
        "light",
        "prism", ...
      ],
      "1": [
        "sap",
        "tree",
        "bark", ...
      ], ...
    }

doc_to_topics.json maps each input document to each topic id and its weight in the document:

    {
      "texts/doc_1.txt": {
        "0": 0.52,
        "1": 0.0,
        "2": 0.0, ...
      },
      "texts/doc_2.txt": {
        "0": 0.0,
        "1": 0.67,
        "2": 0.0, ...
      }, ...
    }
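Once parsed, the doc_to_topics.json structure is a nested mapping, so downstream analysis needs only the standard json module. The sketch below finds the strongest topic for each document. The helper dominant_topics is hypothetical, and the sample string reproduces only the values shown above, with the elided topics left out.

```python
import json


def dominant_topics(doc_to_topics):
    """Map each document to its (topic id, weight) pair with the largest weight."""
    return {doc: max(weights.items(), key=lambda item: item[1])
            for doc, weights in doc_to_topics.items()}


# Truncated sample in the shape described above. With real output, the
# string would instead be read from the doc_to_topics.json file in the
# results directory.
sample = json.loads("""
{
  "texts/doc_1.txt": {"0": 0.52, "1": 0.0, "2": 0.0},
  "texts/doc_2.txt": {"0": 0.0, "1": 0.67, "2": 0.0}
}
""")
print(dominant_topics(sample))
# {'texts/doc_1.txt': ('0', 0.52), 'texts/doc_2.txt': ('1', 0.67)}
```

Topic ids come back as strings because JSON object keys are always strings. Convert them with int() if they need to index into a matrix.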