Cross-language mathematical formulae in Wikipedia

Research task: Find shared formulae across Wikipedia pages that are available in different languages.
Goal: Extract defining formula from Wikipedia pages more often than by extracting the first formula on a Wikipedia page.

Obtain and filter Wikipedias fulltext

Use “wikiFilter.py” to filter wikidumps of different languages for all pages that contain a tag (e.g. math-tag), results are found in “Dumps filtered for tags”. Those can then be further filtered via “wikiFilter.py” for pages belonging to certain QIDs (given via —QID_file, use “Gold Standard.txt”), found in “Dumps filtered for tags/filtered 100 QIDs”.
Use “wikiFilter.py” as follows:

usage: wikiFilter.py [-h] [-f [FILENAMES [FILENAMES ...]]] [-s SIZE] [-d DIR]
                     [-t [TAGS [TAGS ...]]] [-k [KEYWORDS [KEYWORDS ...]]]
                     [-K KEYWORD_FILE] [-Q QID_FILE] [-v] [-T]
Extract wikipages that contain the math tag.
optional arguments:
  -h, --help            show this help message and exit
  -f [FILENAMES [FILENAMES ...]], --filename [FILENAMES [FILENAMES ...]]
                        The bz2-file(s) to be split and filtered. You may use
                        one/multiple file(s) or e.g. "*.bz2" as input.
                        (default: enwiki-latest-pages-articles.xml.bz2)
  -s SIZE, --splitsize SIZE
                        The number of pages contained in each split. (default:
                        1000000)
  -d DIR, --outputdir DIR
                        The directory name where the files go. (default: wout)
  -t [TAGS [TAGS ...]], --tagname [TAGS [TAGS ...]]
                        Tags to search for, e.g. use -t TAG1 TAG2 TAG3
                        (default: ['math', 'ce', 'chem', 'math chem'])
  -k [KEYWORDS [KEYWORDS ...]], --keyword [KEYWORDS [KEYWORDS ...]]
                        Keywords to search for, e.g. use -k KEYWORD1 KEYWORD2
                        KEYWORD3 You might want to disable tags = specify
                        empty tags (""), if you don`t want pages containing a
                        tag OR a keyword! (default: [])
  -K KEYWORD_FILE, --keyword_file KEYWORD_FILE
                        Another way to specify keywords. Use a keyword file
                        containing one keyword (e.g.
                        "<title>formulae</title>") in each line. (default: )
  -Q QID_FILE, --QID_file QID_FILE
                        QID-file, containing one QID (e.g. "Q1234") in each
                        line. They will be translated to the titles in their
                        respective languages and "<title>SOME_TITLE</title>"
                        will be used as keywords. Specify languages with "-l".
                        The languages will be taken from the beginning of the
                        filenames, which thus must start with
                        "enwiki"/"dewiki"/... for english/german/... !
                        (default: )
  -v, --verbosity
  -T, --template        Include all templates. (default: False)

These filtered results can then be used to quickly extract small bz2-files via “find_most_common_formula.py” for all languages containing the titles (corresponding to the given QIDs) together with all formulae from said pages, see “Dumps filtered for tags/filtered 100 QIDs/filtered titles and formulae”. These are then automatically used to find the most common formula from a page across its different languages, see “terminal output.txt”.
Use as follows:

usage: find_most_common_formula.py [-h] [-f [FILE [FILE ...]]] [-s SIZE]
                                   [-d DIR] [-Q QID_FILE] [-t TAGS] [-v] [-T]
Extract all formulae (defined as having a formula_indicator) from the
wikipages that contain the titles corresponding to the given QIDs(loaded via
"-Q"), in all specified languages(corresponding to the beginning of the
bz2-filenames, e.g. "enwiki....bz2"). Afterwards extracts the most common
formula for a wikipedia page (in all languages specified). Formulae occuring
multiple times for a wikipedia page(in a single language) are counted only
once!
optional arguments:
  -h, --help            show this help message and exit
  -f [FILE [FILE ...]], --filename [FILE [FILE ...]]
                        The bz2-file(s) to be filtered. Default: Use all
                        bz2-files in current folder. (default: )
  -s SIZE, --splitsize SIZE
                        The number of pages contained in each split. (default:
                        1000000)
  -d DIR, --outputdir DIR
                        The output directory name. (default: wout)
  -Q QID_FILE, --QID_file QID_FILE
                        QID-file, containing one QID (e.g. "Q1234") in each
                        line(other lines without QIDs can be mixed in). They
                        will be translated to the titles in their respective
                        languages and "<title>SOME_TITLE</title>" will be used
                        as keywords. The languages will be taken from the
                        beginning of the filenames, which thus must start with
                        "enwiki"/"dewiki"/... for english/german/... !
                        "enwikibooks", "enwikiquote" etc. are not allowed!!!
                        (default: )
  -t TAGS, --tagname TAGS
                        Comma separated string of the tag names to search for;
                        no spaces allowed. (default: math,ce,chem,math chem)
  -v, --verbosity
  -T, --template        include all templates (default: False)

To use both “wikiFilter.py” as well as “find_most_common_formula.py”, they need to be in the same folder as the bz2-input-files you are using them for.

To use other languages, download the original, big dumps via the links given in “links to dumps.txt” and use “wikiFilter.py” to get the filtered dumps as in “Dumps filtered for tags”.
Due to the maximum file size on GitHub, the filtered results for multiple languages are not uploaded to “Dumps filtered for tags”, but the further filtered results are included in “Dumps filtered for tags/filtered 100 QIDs”.

In the folder “miscellaneous”, files useful during the development of the project are included for the sake of completeness.