A PHP library for extracting n-grams from text. Simple preprocessing tools (cleaning, tokenizing) are included.
Install via Composer:
composer require linguistic/ngramextractor
Coming soon.
$tokenizer = new Tokenizer();
$tokenizer->addRemovalRule('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/') # Removes HTML Tags
->addRemovalRule('/[^a-z0-9]+/', ' ') # Replaces every run of characters other than a-z and 0-9 with a single space
->setSeperator('/\s+/'); # Tokenizes text with whitespace as delimiter
$content = ""; # The text that should get tokenized
$stopwords = array(); # (optional) array of stopwords
$extractor = new NGramExtractor($content, $tokenizer, $stopwords);
$unigrams = $extractor->getNGrams(1); # gets all n-grams in the text, n = 1
$unigramsFiltered = NGramExtractor::limitByOccurance($extractor->getNGramCount(1, true), 3); # gets unigrams with their occurrence counts, keeping only those that occur at least 3 times
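To make it clearer what the calls above compute, here is a minimal standalone sketch in plain PHP. It does not use the library's classes and is not its implementation; the helper names `sketchTokenize`, `sketchNGrams` and `sketchLimitByCount` are hypothetical and exist only for this illustration. It assumes the text is lowercased before cleaning, which the rules above do not do by themselves.

```php
<?php
// Hypothetical helper (not part of the library): strips HTML tags, lowercases,
// replaces every run of non-alphanumeric characters with a space, then splits
// on whitespace.
function sketchTokenize(string $text): array
{
    $text = strtolower(strip_tags($text));
    $text = preg_replace('/[^a-z0-9]+/', ' ', $text);
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: slides a window of size $n over the tokens and returns
// each n-gram as a space-joined string.
function sketchNGrams(array $tokens, int $n): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

// Hypothetical helper: counts each n-gram and keeps those occurring at least
// $min times. Returns an array of the form [ngram => count].
function sketchLimitByCount(array $ngrams, int $min): array
{
    $counts = array_count_values($ngrams);
    return array_filter($counts, function ($count) use ($min) {
        return $count >= $min;
    });
}

$tokens   = sketchTokenize('<p>The cat sat. The cat ran.</p>');
$bigrams  = sketchNGrams($tokens, 2);
$frequent = sketchLimitByCount($bigrams, 2);

print_r($frequent); // only "the cat" occurs twice
```

The sketch keeps tokenizing, n-gram building and frequency filtering as three separate steps, mirroring the Tokenizer, `getNGrams` and `limitByOccurance` calls shown above. Stopword handling is left out for brevity.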