Morfologik dictionaries client in pure Ruby: POS tagging & spellcheck
MorMor is a pure Ruby¹ Morfologik dictionary client that can be used for POS (part-of-speech) tagging and simplistic spellchecking. The Morfologik format’s distinguishing feature is that it is the primary dictionary format for LanguageTool, so a lot of ready-made, high-quality dictionaries exist.
¹The only runtime dependency is backports, and that’s only because I am too fond of modern Ruby features to sacrifice them to the “no-dependencies” god.
Install the mormor gem (via bundler or just [sudo] gem install mormor).
require 'mormor'
dictionary = MorMor::Dictionary.new('path/to/english')
dictionary.lookup('meowing')
# => [#<struct MorMor::Dictionary::Word stem="meow", tags="VBG">]
dictionary.lookup('barks')
# => [#<struct MorMor::Dictionary::Word stem="bark", tags="NNS">,
# #<struct MorMor::Dictionary::Word stem="bark", tags="VBZ">]
dictionary.lookup('borogoves')
# => nil
dictionary = MorMor::Dictionary.new('path/to/ukrainian')
dictionary.lookup("солов'їна")
# => [#<struct MorMor::Dictionary::Word stem="солов'їний", tags="adj:f:v_kly">,
# #<struct MorMor::Dictionary::Word stem="солов'їний", tags="adj:f:v_naz">]
Dictionary#lookup returns an array of structs that describe all possible base forms plus their part-of-speech/word-form tags, or nil if the word is not in the dictionary. (For example, “barks” can be the third-person singular form of the verb “to bark” or the plural form of the noun “bark”.)
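Since lookup returns nil for unknown words (as with 'borogoves' above), calling code usually wants a nil-safe wrapper. The following is a minimal sketch, not part of mormor: stems_for and FakeDictionary are hypothetical names, and Word is a stand-in for MorMor::Dictionary::Word so that the example runs without the gem or a dictionary file.

```ruby
# Stand-in for MorMor::Dictionary::Word (assumption: same field names as in
# the struct output shown above), so the sketch runs without the gem.
Word = Struct.new(:stem, :tags)

# Hypothetical helper (not part of mormor): returns the distinct stems of a
# word, or an empty array when the dictionary does not know the word.
def stems_for(dictionary, word)
  # Array(nil) is [], so an unknown word needs no special casing.
  Array(dictionary.lookup(word)).map(&:stem).uniq
end

# Hypothetical fake dictionary that mimics the lookups shown above.
class FakeDictionary
  ENTRIES = {
    'barks' => [Word.new('bark', 'NNS'), Word.new('bark', 'VBZ')]
  }.freeze

  def lookup(word)
    ENTRIES[word] # nil for unknown words, like the 'borogoves' example
  end
end

dictionary = FakeDictionary.new
stems_for(dictionary, 'barks')     # => ["bark"]
stems_for(dictionary, 'borogoves') # => []
```

With the real gem, the same helper should work on a MorMor::Dictionary instance, since it relies only on lookup and the stem field.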
Tags depend on the particular dictionary used and are typically documented in free form alongside the dictionary.
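Because the tag format is dictionary-specific, any tag parsing has to be written per dictionary. As a rough sketch, assuming colon-separated tags like those in the Ukrainian output above (other dictionaries may use a different convention), a hypothetical tag_parts helper could look like this:

```ruby
# Hypothetical helper (not part of mormor): splits a tag string into its
# components. The ':' separator is an assumption based on the Ukrainian
# example; check your dictionary's tagset documentation before relying on it.
def tag_parts(tags, separator = ':')
  tags.split(separator)
end

tag_parts('adj:f:v_kly') # => ["adj", "f", "v_kly"]
tag_parts('VBG')         # => ["VBG"]
```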
A lot of dictionaries in Morfologik format can be found in LanguageTool’s repo. For example, the Polish dictionary is at languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/.
What you need from there:
polish.dict is the dictionary itself (a binary finite-state automaton);
polish.info is the dictionary metadata.
To use the Polish dictionary with mormor, place both files in the same folder, and then:
pl = MorMor::Dictionary.new('path/to/that/folder/polish') # without extension
pl.lookup('świetnie')
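Since both files must sit next to each other, it can help to check for them before constructing the dictionary. This is a minimal sketch using only the Ruby standard library. missing_dictionary_files is a hypothetical helper, not part of mormor, and how mormor itself reports a missing file is not covered here.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of mormor): given a base path without
# extension, returns the required files that do not exist yet.
def missing_dictionary_files(base_path)
  %w[.dict .info].map { |ext| base_path + ext }
                 .reject { |path| File.exist?(path) }
end

# Demonstration in a temporary folder, so no real dictionary is needed.
Dir.mktmpdir do |dir|
  base = File.join(dir, 'polish')
  FileUtils.touch("#{base}.dict") # only the automaton file is present
  p missing_dictionary_files(base).map { |path| File.basename(path) }
  # prints ["polish.info"]
end
```

In real code, one would call the helper with the same base path that is passed to MorMor::Dictionary.new and stop with a clear message if the result is not empty.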
You may also be interested in the tagset.txt file in the same folder, which explains all POS/form tags in natural language (Polish, in this case).
Sometimes (for example, for German and Ukrainian), the LanguageTool repo contains not the dictionary itself but a link to another repo or site where it can be downloaded.
Please carefully consider dictionary licenses when using them!
Note: the mormor repo contains copies of dictionary files from LanguageTool and the projects it refers to, but they are not part of the gem distribution and are used only for testing parser/lookup correctness and for demonstration purposes.
Most of the credit for the algorithms and the original code belongs to the original Morfologik authors and to the authors of the papers they based their work on.
The Ruby version is by Victor Shepelev.
The license is BSD, the same as the original Morfologik.