Project author: dracor-org

Project description:
Russian Drama Corpus (in TEI-P5)
Primary language: CSS
Project URL: git://github.com/dracor-org/rusdracor.git
Created: 2017-09-19T09:28:12Z
Project community: https://github.com/dracor-org/rusdracor



RusDraCor

Corpus Description

We are building a Russian Drama Corpus with files encoded in TEI-P5. Our corpus
comprises 212 plays to date, originating from ilibrary, Wikisource, РВБ, lib.ru,
ФЕБ, СовЛит and Wikilivres, which we converted to TEI, corrected, and enhanced.
More will follow.

If you want to cite the corpus, please use this publication:

  • Fischer, Frank, et al. (2019). Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of DH2019: “Complexities”, Utrecht University, doi:10.5281/zenodo.4284002.

RusDraCor was first presented on June 29, 2017, at the Corpora 2017 conference
in St. Petersburg (slides available), on July 11, 2017, at the “Digitizing the
stage” conference in Oxford, and on November 14, 2017, at the TEI 2017
conference in Victoria. The social network data we extract from plays may also
be explored on our website dracor.org/rus or via our Shiny app.

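If you just want to download the corpus in its current state in XML-TEI, the
following exports the tei directory (it relies on GitHub's Subversion support,
which GitHub has since discontinued; if it fails, clone the repository with git
and use its tei directory):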

svn export https://github.com/dracor-org/rusdracor/trunk/tei

API

An easy way to download the network data (instead of the actual TEI files) is
to use our API (see the API documentation). If you have jq installed, it works
like this:

  # List all play names in the corpus, then fetch each play's network data as CSV
  for play in $(curl -s 'https://dracor.org/api/corpora/rus' | jq -r '.dramas[].name'); do
    wget -O "$play.csv" "https://dracor.org/api/corpora/rus/play/$play/networkdata/csv"
  done

The API info page is at https://dracor.org/api/info.
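
For readers without jq, the same download can be sketched in Python using only the standard library. This is a minimal sketch, not the project's own tooling: the function names `play_names`, `network_csv_url` and `download_all` are hypothetical, and the response layout (a `dramas` list whose items have a `name` key) is an assumption taken from the jq filter in the shell loop above.

```python
import json
import urllib.request

# Base URL copied from the shell loop above.
API_BASE = "https://dracor.org/api/corpora/rus"


def play_names(corpus):
    """Return the play names from a parsed corpus response.

    Assumes the response has a "dramas" list whose items carry a "name"
    key, as implied by the jq filter in the shell example.
    """
    return [drama["name"] for drama in corpus["dramas"]]


def network_csv_url(name, base=API_BASE):
    """Build the network-data CSV URL for one play."""
    return f"{base}/play/{name}/networkdata/csv"


def download_all(base=API_BASE):
    """Fetch the corpus listing, then save one CSV file per play."""
    with urllib.request.urlopen(base) as response:
        corpus = json.load(response)
    for name in play_names(corpus):
        urllib.request.urlretrieve(network_csv_url(name, base), f"{name}.csv")
```

Calling `download_all()` should write one `<name>.csv` file per play into the current directory, mirroring what the wget loop does.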

Simple Visualisation with R

To have a first look at the distribution of the number of speakers per play over
time, you could feed the metadata table into R:

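  library(data.table)
  library(ggplot2)
  # Read the corpus metadata table straight from the DraCor API
  rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")
  # One point per play: normalized year vs. number of speakers
  ggplot(rusdracor, aes(x = yearNormalized, y = numOfSpeakers)) + geom_point()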

Result:

[Figure: number of speakers per play over time]

Here is a barplot showing the number of plays per decade:

[Figure: number of plays per decade]
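
The README does not show how the per-decade counts were produced, so the following is only one possible way to derive them from the metadata table, sketched in Python rather than R. It assumes the `yearNormalized` column (the name used in the R example) holds integer years; `plays_per_decade` and `text_barplot` are hypothetical helper names, and rows without a usable year are skipped.

```python
import csv
import io
from collections import Counter


def plays_per_decade(csv_text, year_column="yearNormalized"):
    """Count plays per decade from metadata CSV text.

    Rows with an empty or non-integer year are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(year_column) or "").strip()
        try:
            year = int(value)
        except ValueError:
            continue
        # Floor the year to the start of its decade, e.g. 1836 -> 1830.
        counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))


def text_barplot(counts):
    """Render the counts as one line of '#' marks per decade."""
    return "\n".join(f"{decade}s {'#' * n} ({n})" for decade, n in counts.items())
```

To try it on the real data, download the text of https://dracor.org/api/corpora/rus/metadata.csv (for example with `urllib.request.urlopen`), decode it as UTF-8, and pass the result to `plays_per_decade`; `print(text_barplot(...))` then gives a rough text rendering of the barplot.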

(README last updated on July 26, 2021.)