Project author: stphnlyd

Project description:
Perl wrapper for CppJieba (Chinese text segmentation)

Language: Perl

Repository: git://github.com/stphnlyd/perl5-jieba.git

Created: 2017-04-03T09:08:23Z

Project home: https://github.com/stphnlyd/perl5-jieba

License: free software, distributable under the same terms as Perl 5 itself


NAME

Lingua::ZH::Jieba - Perl wrapper for CppJieba (Chinese text segmentation)

VERSION

version 0.007

SYNOPSIS

  use Lingua::ZH::Jieba;

  binmode STDOUT, ":utf8";

  my $jieba = Lingua::ZH::Jieba->new();

  # default cut (mixed MP/HMM method)
  my $words = $jieba->cut("他来到了网易杭研大厦");
  print join('/', @$words), "\n";
  # 他/来到/了/网易/杭研/大厦

  # cut without HMM (MP method only)
  my $words_nohmm = $jieba->cut(
      "他来到了网易杭研大厦",
      { no_hmm => 1 } );
  print join('/', @$words_nohmm), "\n";
  # 他/来到/了/网易/杭/研/大厦

  # cut all (full method: emit every word found in the dictionary)
  my $words_cutall = $jieba->cut(
      "我来到北京清华大学",
      { cut_all => 1 } );
  print join('/', @$words_cutall), "\n";
  # 我/来到/北京/清华/清华大学/华大/大学

  # cut for search (first cut in mix mode, then further cut long words in full mode)
  my $words_cut4search = $jieba->cut_for_search(
      "小明硕士毕业于中国科学院计算所,后在日本京都大学深造" );
  print join('/', @$words_cut4search), "\n";
  # 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造

  # get word offset and length with cut_ex() or cut_for_search_ex()
  my $words_ex = $jieba->cut_ex("他来到了网易杭研大厦");
  # [
  #     [ "他",   0, 1 ],
  #     [ "来到", 1, 2 ],
  #     [ "了",   3, 1 ],
  #     [ "网易", 4, 2 ],
  #     [ "杭研", 6, 2 ],
  #     [ "大厦", 8, 2 ],
  # ]

  # part-of-speech tagging
  my $word_pos_tags = $jieba->tag("我是蓝翔技工拖拉机学院手扶拖拉机专业的。");
  for my $pair (@$word_pos_tags) {
      my ($word, $part_of_speech) = @$pair;
      print "$word:$part_of_speech\n";
  }
  # 我:r
  # 是:v
  # 蓝翔:nz
  # 技工:n
  # 拖拉机:n
  # ...

  # keyword extraction
  my $extractor = $jieba->extractor();
  my $word_scores = $extractor->extract(
      "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。",
      5
  );
  for my $pair (@$word_scores) {
      my ($word, $score) = @$pair;
      printf "%s:%.3f\n", $word, $score;
  }
  # CEO:11.739
  # 升职:10.856
  # 加薪:10.643
  # 手扶拖拉机:10.009
  # 巅峰:9.494

  # insert user word (dynamically add a user-defined word)
  my $words_before_insert = $jieba->cut("男默女泪");
  print join('/', @$words_before_insert), "\n";
  # 男默/女泪

  $jieba->insert_user_word("男默女泪");

  my $words_after_insert = $jieba->cut("男默女泪");
  print join('/', @$words_after_insert), "\n";
  # 男默女泪

DESCRIPTION

This module is the Perl wrapper for CppJieba, which is a C++ implementation of
the Jieba Chinese text segmentation library. The Perl/C++ binding is generated
via SWIG.

The module may contain several packages. Unless stated otherwise, you only
need to use Lingua::ZH::Jieba; in your programs.

At present this module is still in an alpha state. Its interface is subject
to change in the future, although I will maintain compatibility where
possible.

CONSTRUCTOR

new

  my $jieba = Lingua::ZH::Jieba->new;

By default the constructor uses the data files from the "share" dir of the
installation, but any of the data files can be overridden, as below.

  my $jieba = Lingua::ZH::Jieba->new(
      {
          dict_path      => $my_dict_path,
          hmm_path       => $my_hmm_path,
          user_dict_path => $my_user_dict_path,
          idf_path       => $my_idf_path,
          stop_word_path => $my_stop_word_path,
      }
  );

  # if you would like to override only the user dict
  my $jieba = Lingua::ZH::Jieba->new(
      {
          user_dict_path => $my_user_dict_path,
      }
  );
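
As a sketch of what a custom user dictionary might look like: jieba-style
dictionaries conventionally hold one entry per line, the word optionally
followed by a frequency and/or a POS tag. The entries below are purely
illustrative; consult the user.dict.utf8 bundled in the "share" dir for the
exact format your version expects.

  云计算
  蓝翔 nz
  区块链 10 n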

METHODS

cut

  my $words = $jieba->cut($sentence);

Default cut mode. Returns an arrayref of utf8 strings of words cut from
the sentence.

  my $words = $jieba->cut($sentence, { no_hmm => 1 });

Cut without HMM mode.

  my $words = $jieba->cut($sentence, { cut_all => 1 });

Cut all possible words in dictionary.

cut_ex

  my $words_ex = $jieba->cut_ex($sentence);

Similar to cut(), but returns an arrayref of complex data.
Each element in the result arrayref is [ word, offset, length ].
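
The offset/length pairs can be mapped back to positions in the original
string. A minimal sketch, assuming (as the SYNOPSIS output suggests) that
offsets and lengths are counted in characters of the decoded string:

  use utf8;
  binmode STDOUT, ":utf8";

  my $sentence = "他来到了网易杭研大厦";
  for my $triple (@{ $jieba->cut_ex($sentence) }) {
      my ($word, $offset, $length) = @$triple;
      # recover the word from the sentence by position
      print substr($sentence, $offset, $length), " at $offset\n";
  }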

cut_for_search

  my $words = $jieba->cut_for_search($sentence);
  my $words_nohmm = $jieba->cut_for_search($sentence, { no_hmm => 1 });

Cut mode for search engines: the sentence is first cut in the default mix
mode, and longer words are then further cut in full mode.

cut_for_search_ex

  my $words_ex = $jieba->cut_for_search_ex($sentence);

Similar to cut_for_search(), but returns an arrayref of complex data.
Each element in the result arrayref is [ word, offset, length ].

tag

  my $word_pos_tags = $jieba->tag($sentence);
  for my $pair (@$word_pos_tags) {
      my ($word, $part_of_speech) = @$pair;
      ...
  }

POS (part-of-speech) tagging. Returns an arrayref in which each element is
of the form [ $word, $part_of_speech ].
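
For instance, the pairs make it straightforward to keep only words of a
given part of speech, such as nouns (tags starting with "n" in the jieba
tag set):

  my $word_pos_tags = $jieba->tag($sentence);
  my @nouns = map  { $_->[0] }
              grep { $_->[1] =~ /^n/ }
              @$word_pos_tags;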

insert_user_word

  $jieba->insert_user_word($word);

Dynamically inserts a user word.

extractor

  my $extractor = $jieba->extractor();

Get the keyword extractor object. For more about the extractor, see
Lingua::ZH::Jieba::KeywordExtractor.

SEE ALSO

https://github.com/fxsjy/jieba - Jieba, the Chinese text segmentation
library

https://github.com/yanyiwu/cppjieba - CppJieba, Jieba implemented in C++

http://www.swig.org - SWIG, the Simplified Wrapper and Interface Generator

ACKNOWLEDGEMENTS

Thanks to Junyi Sun and Yanyi Wu. This Perl library would not exist without
their work on jieba and CppJieba.

AUTHOR

Stephan Loyd stephanloyd9@gmail.com

COPYRIGHT AND LICENSE

This software is copyright (c) 2017-2023 by Stephan Loyd.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.