NLP resources for Serbian and Serbo-Croatian. This repository currently contains parsed Serbo-Croatian Wiktionary data (words with definitions and synonyms) and data from the Systematic Dictionary of Serbo-Croatian. Entries are stored as XML that complies with the TEI Lex0 specification.
Wikipedia and Wiktionary resources are distributed under the CC BY-SA 3.0 license.
All dictionaries were parsed using Python 3.8.3. Python's standard regular-expression module, re, was used to match the structure of dictionary entries and extract the data.
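As a rough illustration of this kind of regex-based extraction, here is a minimal sketch. The entry layout in `sample` (a `==headword==` line, `#` definition lines, and a `Sinonimi:` line) is a simplified, hypothetical format made up for this example; it is not the actual Wiktionary dump markup, and `parse_entry` is not a function from this repository.

```python
import re

# Hypothetical, simplified entry layout used only for this sketch.
HEADWORD_RE = re.compile(r"^==[ \t]*(?P<headword>[^=\n]+?)[ \t]*==[ \t]*$", re.MULTILINE)
DEFINITION_RE = re.compile(r"^#[ \t]*(?P<definition>.+?)[ \t]*$", re.MULTILINE)
SYNONYMS_RE = re.compile(r"^Sinonimi:[ \t]*(?P<synonyms>.+?)[ \t]*$", re.MULTILINE)


def parse_entry(text):
    """Extract the headword, definitions and synonyms from one entry."""
    headword = HEADWORD_RE.search(text)
    synonyms = SYNONYMS_RE.search(text)
    return {
        "headword": headword.group("headword") if headword else None,
        # Every line starting with '#' is treated as one definition.
        "definitions": [m.group("definition") for m in DEFINITION_RE.finditer(text)],
        # Synonyms are assumed to be a comma-separated list on one line.
        "synonyms": [s.strip() for s in synonyms.group("synonyms").split(",")]
        if synonyms
        else [],
    }


sample = """\
==kuća==
# zgrada za stanovanje
# dom
Sinonimi: dom, zgrada
"""

entry = parse_entry(sample)
print(entry)
```

Named groups keep the patterns readable, and `re.MULTILINE` lets `^` and `$` anchor on each line of an entry instead of on the whole string.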
The XML structure is specified according to TEI Lex0, with a few additional tags.
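To give a feel for what such an entry might look like, the sketch below builds one with Python's xml.etree.ElementTree. The element choices (`entry`, `form`, `orth`, `sense`, `def`, `xr`, `ref`) follow common TEI Lex0 usage, but the exact layout, the `sh` language code, the id scheme, and the `build_entry` helper are assumptions made for illustration. The repository's additional tags are not shown, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace of the reserved xml: prefix (used for xml:id and xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_entry(entry_id, lemma, definitions, synonyms, lang="sh"):
    """Build a TEI Lex0-style <entry> element (illustrative layout only)."""
    entry = ET.Element(
        "entry",
        {f"{{{XML_NS}}}id": entry_id, f"{{{XML_NS}}}lang": lang},
    )
    # The headword goes into <form type="lemma"><orth>...</orth></form>.
    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma
    # One <sense> per definition, with a derived id (hypothetical scheme).
    for i, definition in enumerate(definitions, start=1):
        sense = ET.SubElement(entry, "sense", {f"{{{XML_NS}}}id": f"{entry_id}.{i}"})
        ET.SubElement(sense, "def").text = definition
    # Synonyms are recorded as cross-references.
    if synonyms:
        xr = ET.SubElement(entry, "xr", {"type": "synonymy"})
        for synonym in synonyms:
            ET.SubElement(xr, "ref", {"type": "entry"}).text = synonym
    return entry


entry = build_entry("kuca", "kuća", ["zgrada za stanovanje"], ["dom"])
print(ET.tostring(entry, encoding="unicode"))
```

ElementTree serializes the `{http://www.w3.org/XML/1998/namespace}` attributes with the `xml:` prefix, so the output carries `xml:id` and `xml:lang` without an extra namespace declaration.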
The Wiktionary synonyms and definitions were collected from the Wiktionary dump of September 2020. The \
The Systematic Dictionary of Serbo-Croatian contains words related to each other via systematic groups. We chose the first word of each group as the group name. The words in a group are divided into subgroups, which were recorded as senses, with cross-referenced words in \