Project author: dkajtoch

Project description:
Language detection tool based on fastText pretrained model.
Primary language: Python
Project URL: git://github.com/dkajtoch/fast-lang.git
Created: 2019-02-24T15:45:28Z
Project community: https://github.com/dkajtoch/fast-lang

License: MIT License


fast-lang

Language detection tool based on fastText pretrained model.

Text preprocessing

Numbers, punctuation, and repeated whitespace are removed before the text is passed to the language detector.
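
The cleaning step described above can be approximated with the standard library. This is a minimal sketch, not fast-lang's actual implementation: the `preprocess` name is hypothetical, only ASCII punctuation is handled, and replacing removed characters with a space (so that "a,b" becomes "a b" rather than "ab") is an assumption.

```python
import re
import string

# Hypothetical helper, not part of fastlang's public API.
# Matches any digit or ASCII punctuation character.
_DIGITS_AND_PUNCT = re.compile(r"[\d" + re.escape(string.punctuation) + r"]")
# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Drop digits and ASCII punctuation, then collapse whitespace runs."""
    text = _DIGITS_AND_PUNCT.sub(" ", text)
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(preprocess("Where   is my mother?  Call 911!"))
# Where is my mother Call
```

Substituting a space instead of deleting characters outright keeps words that were separated only by punctuation from being merged; the later whitespace collapse removes any extra spaces this introduces.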

Examples

    from fastlang import FastLangDetect
    detector = FastLangDetect()
    detector.detect('Where is my mother?')
    # {'en': 0.996435284614563}
    detector.detect('Where is my mother?', k=3)
    # {'en': 0.996435284614563, 'th': 0.0005820714286528528, 'bn': 0.0005180443404242396}

As the examples demonstrate, the k parameter specifies how many labels to return along with their probabilities.
Output can also be controlled with the threshold parameter, which filters results by probability.

    detector.detect('Where is my mother?', k=3, threshold=0.5)
    # {'en': 0.996435284614563}
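
To make the interaction between k and threshold concrete, here is a minimal sketch of the selection logic on a plain dictionary of probabilities. The `select_labels` function is hypothetical and is not how fast-lang is known to implement it; in particular, whether the real comparison is `>=` or `>` is an assumption. The scores are the README values rounded to four decimals.

```python
from typing import Dict


def select_labels(
    probs: Dict[str, float], k: int = 1, threshold: float = 0.0
) -> Dict[str, float]:
    """Keep the k most probable labels, then drop those below threshold."""
    top_k = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    return {label: p for label, p in top_k if p >= threshold}


# Rounded probabilities from the example above.
scores = {"en": 0.9964, "th": 0.0006, "bn": 0.0005}

print(select_labels(scores, k=2))
# {'en': 0.9964, 'th': 0.0006}
print(select_labels(scores, k=2, threshold=0.5))
# {'en': 0.9964}
```

Under this reading, k caps the number of candidates and threshold can only shrink that set further, so a high threshold may return fewer than k labels.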

Labels are ISO 639-1 encoded. To look up the corresponding language name, use iso_codes:

    from fastlang import iso_codes
    iso_codes['en']
    # 'English'

The detector also accepts a list of strings and returns one result per string.

    from fastlang import FastLangDetect
    detector = FastLangDetect()
    detector.detect(['Where is my mother?', 'pies i kot na drodze.'])
    # [{'en': 0.996435284614563}, {'sl': 0.6256219148635864}]
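
When processing a batch, it is often enough to keep only the most probable label for each input. The sketch below works on the list-of-dicts shape shown in the example above; `best_labels` is a hypothetical helper, not part of fast-lang, and the handling of an empty dict (for instance, if a threshold filtered out every label) is an assumption.

```python
from typing import Dict, List, Optional


def best_labels(results: List[Dict[str, float]]) -> List[Optional[str]]:
    """Return the highest-probability label per input, or None for an empty dict."""
    return [max(r, key=r.get) if r else None for r in results]


# Output shape copied from the batch example above.
batch = [{'en': 0.996435284614563}, {'sl': 0.6256219148635864}]
print(best_labels(batch))
# ['en', 'sl']
```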

All 176 model labels can be retrieved with the get_labels() method.

    detector.get_labels()

To also get the associated frequencies, pass include_freq=True to the get_labels method.

Installation

pip install .

References

  • A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models