Project author: linguistic-dev

Project description:
A PHP library for extracting n-grams from text. Simple preprocessing tools (cleaning, tokenizing) are included.
Language: PHP
Repository: git://github.com/linguistic-dev/n-gram-extractor.git
Created: 2017-12-05T22:23:34Z
Project page: https://github.com/linguistic-dev/n-gram-extractor

License: GNU General Public License v2.0


NGramExtractor for PHP

Installation

Simple install via Composer:

  composer require linguistic/ngramextractor

Usage

Coming soon.

Example

  $tokenizer = new Tokenizer();
  $tokenizer->addRemovalRule('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/') # Removes HTML tags
      ->addRemovalRule('/[^a-z0-9]+/', ' ') # Replaces every run of characters other than a-z and 0-9 with a space
      ->setSeperator('/\s+/'); # Splits the text into tokens on whitespace

  $content = ""; # The text to tokenize
  $stopwords = array(); # (optional) array of stopwords
  $extractor = new NGramExtractor($content, $tokenizer, $stopwords);
  $unigrams = $extractor->getNGrams(1); # Gets all n-grams in the text for n = 1
  $unigramsFiltered = NGramExtractor::limitByOccurance($extractor->getNGramCount(1, true), 3); # Gets unigrams with their counts, keeping those that occur at least 3 times
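
To make the steps in the example concrete, here is a minimal sketch in plain PHP of what cleaning, tokenizing, n-gram extraction, and filtering by occurrence could look like. It is not the library's implementation: the function names (sketchTokenize, sketchNGrams, sketchFrequent), the simplified tag-stripping pattern, and the lowercasing step are all assumptions made for illustration. It requires PHP 7.4 or later because of the arrow function.

```php
<?php
// Hypothetical helpers for illustration only; none of these are part of NGramExtractor.

/** Lowercases the text, applies [pattern, replacement] rules in order, then splits on $separator. */
function sketchTokenize(string $text, array $removalRules, string $separator = '/\s+/'): array
{
    $text = strtolower($text); // assumption: case-insensitive matching is wanted
    foreach ($removalRules as [$pattern, $replacement]) {
        $text = preg_replace($pattern, $replacement, $text);
    }
    // PREG_SPLIT_NO_EMPTY drops empty tokens produced by leading or trailing whitespace.
    return preg_split($separator, $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Returns every run of $n consecutive tokens, joined by a single space. */
function sketchNGrams(array $tokens, int $n): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

/** Counts each n-gram and keeps those that occur at least $min times. */
function sketchFrequent(array $ngrams, int $min): array
{
    return array_filter(
        array_count_values($ngrams),
        fn (int $count): bool => $count >= $min
    );
}

$rules = [
    ['/<[^>]+>/', ' '],      // simplified tag stripping (not the pattern from the example above)
    ['/[^a-z0-9]+/', ' '],   // replace runs of other characters with a space
];

$tokens   = sketchTokenize('<p>The cat saw the cat. The cat ran.</p>', $rules);
$bigrams  = sketchNGrams($tokens, 2);
$frequent = sketchFrequent($bigrams, 2);

print_r($frequent); // Array ( [the cat] => 3 )
```

Joining tokens with a space keeps each n-gram usable as an array key, which is what lets array_count_values do the counting in a single call.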

Resources