Project author: ag-gipp

Project description:
Find the most common formula of a Wikipedia page across its available languages.
Language: Python
Project URL: git://github.com/ag-gipp/ss19-sem-most-common-formula-across-wikipedia-languages.git


Cross-language mathematical formulae in Wikipedia

  • Research task: Find formulae that are shared across the versions of a Wikipedia page in different languages.
  • Goal: Extract the defining formula of a Wikipedia page more often than the baseline of taking the first formula on the page.

Obtain and filter the Wikipedia full text

Use “wikiFilter.py” to filter Wikipedia dumps of different languages for all pages that contain a given tag (e.g. the math tag). The results are stored in “Dumps filtered for tags”. These can then be filtered further with “wikiFilter.py” for pages belonging to specific QIDs (passed via --QID_file; use “Gold Standard.txt”). The results of this second pass are in “Dumps filtered for tags/filtered 100 QIDs”.
Use “wikiFilter.py” as follows:

    usage: wikiFilter.py [-h] [-f [FILENAMES [FILENAMES ...]]] [-s SIZE] [-d DIR]
                         [-t [TAGS [TAGS ...]]] [-k [KEYWORDS [KEYWORDS ...]]]
                         [-K KEYWORD_FILE] [-Q QID_FILE] [-v] [-T]

    Extract wikipages that contain the math tag.

    optional arguments:
      -h, --help            show this help message and exit
      -f [FILENAMES [FILENAMES ...]], --filename [FILENAMES [FILENAMES ...]]
                            The bz2-file(s) to be split and filtered. You may use
                            one/multiple file(s) or e.g. "*.bz2" as input.
                            (default: enwiki-latest-pages-articles.xml.bz2)
      -s SIZE, --splitsize SIZE
                            The number of pages contained in each split. (default:
                            1000000)
      -d DIR, --outputdir DIR
                            The directory name where the files go. (default: wout)
      -t [TAGS [TAGS ...]], --tagname [TAGS [TAGS ...]]
                            Tags to search for, e.g. use -t TAG1 TAG2 TAG3
                            (default: ['math', 'ce', 'chem', 'math chem'])
      -k [KEYWORDS [KEYWORDS ...]], --keyword [KEYWORDS [KEYWORDS ...]]
                            Keywords to search for, e.g. use -k KEYWORD1 KEYWORD2
                            KEYWORD3 You might want to disable tags = specify
                            empty tags (""), if you don`t want pages containing a
                            tag OR a keyword! (default: [])
      -K KEYWORD_FILE, --keyword_file KEYWORD_FILE
                            Another way to specify keywords. Use a keyword file
                            containing one keyword (e.g.
                            "<title>formulae</title>") in each line. (default: )
      -Q QID_FILE, --QID_file QID_FILE
                            QID-file, containing one QID (e.g. "Q1234") in each
                            line. They will be translated to the titles in their
                            respective languages and "<title>SOME_TITLE</title>"
                            will be used as keywords. Specify languages with "-l".
                            The languages will be taken from the beginning of the
                            filenames, which thus must start with
                            "enwiki"/"dewiki"/... for english/german/... !
                            (default: )
      -v, --verbosity
      -T, --template        Include all templates. (default: False)
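
The tag filtering described above can be illustrated with a minimal sketch. It is not the code of “wikiFilter.py”; the function names (`iter_pages`, `contains_tag`, `filter_dump`) are hypothetical. It assumes that each `<page>` element of the dump starts and ends on its own line and that tags inside the wikitext may appear XML-escaped (`&lt;math&gt;`).

```python
import bz2
import re

# Hypothetical default, loosely mirroring the documented -t default.
DEFAULT_TAGS = ("math", "ce", "chem")


def iter_pages(lines):
    """Yield each <page>...</page> block from an iterable of text lines."""
    buffer = []
    inside = False
    for line in lines:
        if "<page>" in line:
            inside = True
            buffer = []
        if inside:
            buffer.append(line)
            if "</page>" in line:
                inside = False
                yield "".join(buffer)


def contains_tag(page, tags=DEFAULT_TAGS):
    """Return True if the page contains an opening tag such as <math> or &lt;math&gt;."""
    alternatives = "|".join(re.escape(tag) for tag in tags)
    # \b keeps "ce" from matching longer tag names such as "center".
    pattern = re.compile(r"(?:<|&lt;)(?:%s)\b" % alternatives)
    return pattern.search(page) is not None


def filter_dump(path, tags=DEFAULT_TAGS):
    """Stream a bz2-compressed dump and yield only the pages containing a tag."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for page in iter_pages(handle):
            if contains_tag(page, tags):
                yield page
```

Streaming line by line keeps memory use bounded by the size of a single page, which matters because the full dumps are far too large to load at once.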

These filtered results can then be passed to “find_most_common_formula.py”, which quickly extracts small bz2 files for all languages. Each file contains the page titles (corresponding to the given QIDs) together with all formulae from those pages; see “Dumps filtered for tags/filtered 100 QIDs/filtered titles and formulae”. These files are then used automatically to find the most common formula of a page across its different languages; see “terminal output.txt”.
Use “find_most_common_formula.py” as follows:

    usage: find_most_common_formula.py [-h] [-f [FILE [FILE ...]]] [-s SIZE]
                                       [-d DIR] [-Q QID_FILE] [-t TAGS] [-v] [-T]

    Extract all formulae (defined as having a formula_indicator) from the
    wikipages that contain the titles corresponding to the given QIDs(loaded via
    "-Q"), in all specified languages(corresponding to the beginning of the
    bz2-filenames, e.g. "enwiki....bz2"). Afterwards extracts the most common
    formula for a wikipedia page (in all languages specified). Formulae occuring
    multiple times for a wikipedia page(in a single language) are counted only
    once!

    optional arguments:
      -h, --help            show this help message and exit
      -f [FILE [FILE ...]], --filename [FILE [FILE ...]]
                            The bz2-file(s) to be filtered. Default: Use all
                            bz2-files in current folder. (default: )
      -s SIZE, --splitsize SIZE
                            The number of pages contained in each split. (default:
                            1000000)
      -d DIR, --outputdir DIR
                            The output directory name. (default: wout)
      -Q QID_FILE, --QID_file QID_FILE
                            QID-file, containing one QID (e.g. "Q1234") in each
                            line(other lines without QIDs can be mixed in). They
                            will be translated to the titles in their respective
                            languages and "<title>SOME_TITLE</title>" will be used
                            as keywords. The languages will be taken from the
                            beginning of the filenames, which thus must start with
                            "enwiki"/"dewiki"/... for english/german/... !
                            "enwikibooks", "enwikiquote" etc. are not allowed!!!
                            (default: )
      -t TAGS, --tagname TAGS
                            Comma separated string of the tag names to search for;
                            no spaces allowed. (default: math,ce,chem,math chem)
      -v, --verbosity
      -T, --template        include all templates (default: False)
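
The counting rule stated in the help text (a formula that occurs several times on a page in a single language is counted only once) can be sketched as follows. This is an illustration, not the script's actual implementation. The function names and the whitespace normalization are assumptions; the project may compare formulae differently.

```python
from collections import Counter


def normalize(formula):
    """Collapse runs of whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(formula.split())


def most_common_formula(formulae_by_language):
    """Return (formula, number_of_languages) for the most widely shared formula.

    formulae_by_language maps a language code (e.g. "en") to the list of
    formulae found on that language's version of the page.
    Returns None if no formula was found in any language.
    """
    counts = Counter()
    for formulae in formulae_by_language.values():
        # A set ensures that repeats within one language count only once.
        counts.update({normalize(formula) for formula in formulae})
    if not counts:
        return None
    return counts.most_common(1)[0]


page = {
    "en": ["E = mc^2", "E = mc^2", "p = mv"],
    "de": ["E  =  mc^2", "F = ma"],
    "fr": ["E = mc^2", "p = mv"],
}
print(most_common_formula(page))  # ('E = mc^2', 3)
```

In this sketch, ties between formulae shared by the same number of languages are broken arbitrarily; a real comparison would need an explicit tie-breaking rule.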

Both “wikiFilter.py” and “find_most_common_formula.py” must be located in the same folder as the bz2 input files they are run on.
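
Because the scripts look for their input next to themselves and derive each language from the start of the filename, it can help to check a folder before running them. The sketch below is a hypothetical helper, not part of the project. It assumes language prefixes consist of lowercase letters and underscores, and it rejects names such as "enwikibooks" or "enwikiquote", as the help text requires.

```python
import re
from pathlib import Path

# "enwiki-..." or "dewiki-..." match; "enwikibooks-..." does not, because
# "wiki" must not be followed by another lowercase letter.
_PREFIX = re.compile(r"^([a-z_]+?)wiki(?![a-z])")


def language_of(filename):
    """Return the language prefix of a dump filename, or None if it has none."""
    match = _PREFIX.match(Path(filename).name)
    return match.group(1) if match else None


def dumps_by_language(folder="."):
    """Group the bz2 files in a folder by the language prefix of their names."""
    result = {}
    for path in sorted(Path(folder).glob("*.bz2")):
        language = language_of(path.name)
        if language is not None:
            result.setdefault(language, []).append(path)
    return result


print(language_of("enwiki-latest-pages-articles.xml.bz2"))  # en
print(language_of("enwikibooks-latest-pages-articles.xml.bz2"))  # None
```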

To use other languages, download the original full dumps via the links given in “links to dumps.txt” and run “wikiFilter.py” on them to produce filtered dumps like those in “Dumps filtered for tags”.
Because of the maximum file size on GitHub, the tag-filtered results for multiple languages are not uploaded to “Dumps filtered for tags”; the further filtered results are included in “Dumps filtered for tags/filtered 100 QIDs”.

The folder “miscellaneous” contains files that were useful during the development of the project; they are included for the sake of completeness.