Extracting only the body text from arXiv articles in .tex format

Author: 757461156
Posted: 2025-04-08 11:29:01 (3 months ago)

2 replies

0#
猫南北 | 2019-08-31 10-32

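<div class="post-text" itemprop="text">
<p>To get all the text from a document, <code>tree.descendants</code> is much friendlier here. It yields every text token in document order.</p>
<pre><code>def getText(section):
    # corpus is assumed to be a writable file object opened elsewhere
    for token in section.descendants:
        if isinstance(token, str):
            corpus.write(str(token))
</code></pre>
<p>To catch edge cases, I wrote a slightly more fleshed-out version. It includes checks for all of the conditions you listed.</p>
<pre><code>from TexSoup import RArg

def getText(section):
    # corpus is assumed to be a writable file object opened elsewhere
    for x in section.descendants:
        if isinstance(x, str):
            # skip inline math such as $x^2$
            if x.startswith('$') and x.endswith('$'):
                continue
            corpus.write(str(x))
        elif isinstance(x, RArg):
            # keep the contents of required (brace) arguments
            corpus.write(str(x))
        elif (hasattr(x, 'source') and hasattr(x.source, 'name')
                and x.source.name in ('acknowledgements', 'appendix')):
            # stop once the acknowledgements or appendix begins
            return
</code></pre>
</div>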
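
The filtering rule in the answer above can be tried without TexSoup installed. The following is only a minimal sketch of the same control flow over a hand-built token list: `FakeCommand`, `STOP_SECTIONS`, and `collect_text` are hypothetical names made up for illustration and are not part of TexSoup, and an `io.StringIO` buffer stands in for the `corpus` file.

```python
import io
from dataclasses import dataclass


@dataclass
class FakeCommand:
    """Hypothetical stand-in for a parsed LaTeX command node (not a TexSoup class)."""
    name: str


# Section names at which extraction stops (taken from the answer above).
STOP_SECTIONS = ("acknowledgements", "appendix")


def collect_text(tokens, out):
    """Write plain-text tokens to `out`, skipping inline math and
    stopping at the first acknowledgements/appendix marker."""
    for tok in tokens:
        if isinstance(tok, str):
            # Inline math like $x^2$ is dropped, as in the answer.
            if tok.startswith("$") and tok.endswith("$"):
                continue
            out.write(tok)
        elif isinstance(tok, FakeCommand) and tok.name in STOP_SECTIONS:
            # Everything after this marker is ignored.
            return


# Hand-built token stream, assumed to resemble what a parser would yield in order.
tokens = [
    "We study ",
    "$x^2$",
    " on the real line.",
    FakeCommand("appendix"),
    "Proof details.",
]

buf = io.StringIO()
collect_text(tokens, buf)
result = buf.getvalue()
print(repr(result))
```

Note that dropping the math token leaves the whitespace on both sides of it, so the output contains a double space; a real pipeline would probably normalize whitespace afterwards.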
