Project author: frankier

Project description:
Scrapes some Finnish word definitions from English Wiktionary.
Language: Python
Repository: git://github.com/frankier/wikiparse.git
Created: 2019-03-31T06:59:47Z
Project page: https://github.com/frankier/wikiparse

License: Apache License 2.0


Wikiparse

Scrapes some Finnish word definitions from English Wiktionary.

Usage

  $ poetry install
  $ DATABASE_URL=sqlite:///enwiktionary-20171001.db poetry run ./scrape_to_sqlite.sh ~/corpora/enwiktionary-20171001-pages-meta-current.xml
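
Once the scrape has finished, a quick way to check that something was written is to list the tables in the resulting SQLite file. The following is a minimal sketch using only the standard library; the helper name `list_tables` is hypothetical and not part of wikiparse, and it makes no assumption about the schema the scraper creates.

```python
import sqlite3
import sys


def list_tables(db_path):
    """Return (table_name, row_count) pairs for every table in the SQLite file.

    Hypothetical helper, not part of wikiparse.
    """
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names are read from sqlite_master, then quoted before reuse in a query.
        return [
            (name, conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0])
            for name in names
        ]
    finally:
        conn.close()


if __name__ == "__main__":
    # The default path matches the DATABASE_URL used in the command above.
    path = sys.argv[1] if len(sys.argv) > 1 else "enwiktionary-20171001.db"
    for name, count in list_tables(path):
        print(f"{name}: {count} rows")
```

An empty listing suggests the scrape did not write to the file you are inspecting, for example because DATABASE_URL pointed somewhere else.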

You can also pipe straight from lbunzip2 run on a multistream bzip2 file, which
should be about as fast on a multiprocessor machine (pbunzip2 segfaults when
piped directly into wikiparse):

  $ sudo apt install lbunzip2
  $ lbunzip2 -c ~/corpora/enwiktionary-latest-pages-articles-multistream.xml.bz2 | poetry run python parse.py parse-dump - --outdir enwiktionary.defns
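
If lbunzip2 is not available, the same pipe can be fed from Python's standard `bz2` module, which reads multistream files. This is only a sketch: the function name `stream_bz2` and the script name in the comment are hypothetical, and because `bz2` decompresses on a single core it will likely be slower than lbunzip2.

```python
import bz2
import shutil
import sys


def stream_bz2(src_path, dst, chunk_size=1 << 20):
    """Decompress a single- or multi-stream bzip2 file into the binary file object dst.

    Hypothetical helper, not part of wikiparse.
    """
    with bz2.open(src_path, "rb") as src:
        # Copy in fixed-size chunks so the whole dump is never held in memory.
        shutil.copyfileobj(src, dst, chunk_size)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed usage, with stream_bz2.py as a hypothetical file name:
    #   python stream_bz2.py dump.xml.bz2 | poetry run python parse.py parse-dump - --outdir enwiktionary.defns
    stream_bz2(sys.argv[1], sys.stdout.buffer)
```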

Coverage info

You can generate coverage info by passing e.g. --stats-db stats.db when
running parse-dump and then running:

  $ poetry run python parse.py parse-stats-agg stats.db stats.csv
  $ poetry run python parse.py parse-stats-cov stats.csv

You can get a breakdown of the top problems affecting the coverage like so:

  $ poetry run python parse.py parse-stats-probs stats.csv

For each of these problems, you can then get the most frequent words affected
by it (e.g. so it can be turned into a test later):

  $ poetry run python parse.py parse-stats-probs parse-stats-top10 "my-problem"

Please consult the source code for more information on what the different
problems mean.