One can do it like this:
    Reader reader = new StringReader(paragraphText);
    DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);
    TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
    // Silently drop characters the tokenizer cannot handle, without logging a warning.
    factory.setOptions("untokenizable=noneDelete");
    documentPreprocessor.setTokenizerFactory(factory);
Source: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500
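If you are using the Tokenizer directly, the answer given by Denis Kulagin is a good one. If you are working at the higher level of a StanfordCoreNLP pipeline, you can simply set this property (or pass the equivalent command-line option):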
tokenize.options = untokenizable=noneDelete
(this silently deletes all untokenizable characters), or, to silently keep them instead:
tokenize.options = untokenizable=noneKeep
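
For the pipeline case, the same property can also be set programmatically. The following is a minimal sketch using only `java.util.Properties`; the class name `TokenizeOptionsExample`, the helper `buildProps`, and the `tokenize,ssplit` annotator list are illustrative assumptions, not part of CoreNLP. Handing the result to the pipeline additionally assumes CoreNLP is on the classpath.

```java
import java.util.Properties;

public class TokenizeOptionsExample {

    // Hypothetical helper: builds pipeline properties with the given
    // untokenizable policy, e.g. "noneDelete" or "noneKeep".
    static Properties buildProps(String untokenizablePolicy) {
        Properties props = new Properties();
        // Assumed annotator list for illustration; adjust to your pipeline.
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.options", "untokenizable=" + untokenizablePolicy);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps("noneDelete");
        System.out.println(props.getProperty("tokenize.options"));
        // With CoreNLP available, the properties would be passed on like this:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Keeping the policy as a parameter makes it easy to switch between deleting and keeping untokenizable characters without touching the rest of the pipeline setup.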