NLP resources for Serbian and Serbo-Croatian

Wikipedia and Wiktionary resources are distributed under the CC BY-SA 3.0 license.

All dictionaries were parsed with Python 3.8.3. Regular expressions from the standard-library re module were used to match the structure of dictionary entries and extract the data.
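As an illustration of this kind of regex-based extraction, here is a minimal sketch. The raw entry layout ("headword (grammatical info): synonym, synonym") and the names ENTRY_RE and parse_entry are hypothetical, chosen only for the example; the actual dictionaries each have their own layout and patterns.

```python
import re

# Hypothetical raw entry layout (not taken from the actual dictionaries):
#   headword (grammatical info): synonym1, synonym2
ENTRY_RE = re.compile(
    r"^(?P<orth>[^()]+?)\s*\((?P<gram>[^)]*)\)\s*:\s*(?P<syns>.+)$"
)


def parse_entry(line):
    """Return a dict for one raw entry line, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "orth": match.group("orth").strip(),
        "gram": match.group("gram").strip(),
        # Split the synonym list on commas and drop empty items.
        "synonyms": [s.strip() for s in match.group("syns").split(",") if s.strip()],
    }


if __name__ == "__main__":
    print(parse_entry("kuća (imenica): dom, zgrada"))
```

Lines that do not follow the assumed layout return None, so they can be logged and inspected rather than silently dropped.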

The XML structure is specified according to TEI Lex0, with a few additional tags.

The Wiktionary synonyms and definitions were collected from the Wiktionary dump of September 2020. The \s contain the orthographic form (\) of a word and a grammatical information group (\). Entries contain senses inside \ tags, each with a list of \ tags cross-referencing synonyms. Since we extracted both synonyms and definitions from SH Wiktionary, there are two corresponding files. Definitions are either given as text or derived from synonyms. In the definitions file, in addition to the orthographic form, we also recorded syllables (\) and pronunciation information (\). We kept all styling information, abbreviations and node descriptions inside \ tags.
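The entry layout described above can be sketched with Python's xml.etree.ElementTree. The element names used here (entry, form, orth, gramGrp, gram, sense, xr, ref) are common TEI Lex0 names and are an assumption; the files themselves may use different or additional tags. The helper names build_entry and synonyms_by_sense are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_entry(orth, pos, senses):
    """Build one dictionary entry.

    senses is a list of synonym lists, one list per sense.
    Element names follow TEI Lex0 conventions and are assumed, not confirmed.
    """
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = orth
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "gram", {"type": "pos"}).text = pos
    for synonyms in senses:
        sense = ET.SubElement(entry, "sense")
        for synonym in synonyms:
            # One cross-reference per synonym within the sense.
            xr = ET.SubElement(sense, "xr", {"type": "synonymy"})
            ET.SubElement(xr, "ref").text = synonym
    return entry


def synonyms_by_sense(entry):
    """Read the synonyms back, grouped by sense."""
    return [[ref.text for ref in sense.iter("ref")] for sense in entry.findall("sense")]


if __name__ == "__main__":
    example = build_entry("kuća", "noun", [["dom"], ["zgrada", "objekat"]])
    print(ET.tostring(example, encoding="unicode"))
    print(synonyms_by_sense(example))
```

Reading the synonyms back per sense is a quick way to check that the cross-references ended up under the intended sense.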

The Systematic Dictionary of Serbo-Croatian contains words related to each other via systematic groups. We chose the first words from a group as the group name. The sets of words in a group are divided into subgroups, which were recorded as senses, with cross-referenced words in \ tags. Subgroups that were not synonymous with the group (antonyms, for example) were recorded under a \ tag, with \ representing a related group that can contain its own senses and words.
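The grouping scheme can be sketched as a plain data structure before it is serialized to XML. This is only an illustration: the input format (a list of (relation, words) pairs), the relation labels, the function name build_group, and the choice of the very first word as the group name are all assumptions made for the example.

```python
def build_group(subgroups):
    """Turn a list of (relation, words) subgroups into a group record.

    Subgroups labeled "synonym" become senses of the group; any other
    relation (for example "antonym") is stored as a related group that
    carries its own senses. The labels are hypothetical.
    """
    if not subgroups or not subgroups[0][1]:
        raise ValueError("a group needs at least one non-empty subgroup")
    group = {
        # Assumption: the group is named after the first word of its first subgroup.
        "name": subgroups[0][1][0],
        "senses": [],
        "related": [],
    }
    for relation, words in subgroups:
        if relation == "synonym":
            group["senses"].append(list(words))
        else:
            group["related"].append({"relation": relation, "senses": [list(words)]})
    return group


if __name__ == "__main__":
    print(build_group([
        ("synonym", ["dobar", "valjan"]),
        ("antonym", ["loš", "rđav"]),
    ]))
```

Keeping non-synonymous subgroups in a separate list mirrors the description above, where they are recorded apart from the group's own senses.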