Search and download public domain texts from Project Gutenberg
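Download and process public domain works from the Project Gutenberg collection. Includes:
gutenberg_download(), a function that downloads one or more works by ID. For example, gutenberg_download(84) downloads the text of work 84 (Frankenstein).
gutenberg_metadata, which contains information about each work, pairing its Gutenberg ID with fields such as title, author, and language.
gutenberg_authors, which contains information about each author, such as aliases and birth and death dates.
gutenberg_subjects, which contains pairings of works with Library of Congress subjects.
Install the development version from GitHub using pak: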
# install.packages("pak")
pak::pak("ropensci/gutenbergr")
The gutenberg_works() function retrieves, by default, a table of
metadata for all unique English-language Project Gutenberg works that
have text associated with them. (The gutenberg_metadata dataset has
all Gutenberg works, unfiltered.)
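To make the difference between the two concrete, here is a minimal sketch. The manual filter only approximates the defaults of gutenberg_works(), which applies further steps such as de-duplication, so row counts may differ. The languages argument is assumed to behave as described in the function's help page.

```r
library(dplyr)
library(gutenbergr)

# Rough manual approximation of the default filter:
# English-language works that have text available.
english_with_text <- gutenberg_metadata |>
  filter(language == "en", has_text)

# Assumption: the `languages` argument selects another language
# instead of the default English.
french_works <- gutenberg_works(languages = "fr")

nrow(english_with_text)
nrow(french_works)
```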
Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We
could find the book’s ID by filtering:
library(dplyr)
library(gutenbergr)
gutenberg_works() |>
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <chr>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <chr> <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE
# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <chr>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <chr> <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE
Since we see that it has gutenberg_id 768, we can download it with the
gutenberg_download() function:
wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#> gutenberg_id text
#> <int> <chr>
#> 1 768 "Wuthering Heights"
#> 2 768 ""
#> 3 768 "by Emily Brontë"
#> 4 768 ""
#> 5 768 ""
#> 6 768 ""
#> 7 768 ""
#> 8 768 "CHAPTER I"
#> 9 768 ""
#> 10 768 ""
#> # ℹ 12,332 more rows
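The result has one row per line of the book, including blank lines. As a minimal sketch of a common first clean-up step, the example below drops blank lines and adds a line number. It uses a small hand-built tibble with the same two columns (illustrative rows, not a real download) so it runs without network access.

```r
library(dplyr)

# Stand-in for the output of gutenberg_download(768):
# same columns, only a few illustrative rows.
sample_lines <- tibble(
  gutenberg_id = 768L,
  text = c("Wuthering Heights", "", "by Emily Brontë", "", "CHAPTER I")
)

# Drop blank lines, then number the remaining lines in order.
non_blank <- sample_lines |>
  filter(text != "") |>
  mutate(line = row_number())

non_blank
```

The same two verbs should apply unchanged to the full wuthering_heights tibble.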
gutenberg_download() can download multiple books when given multiple
IDs. It also takes a meta_fields argument that will add variables from
the metadata.
# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#> gutenberg_id text title
#> <int> <chr> <chr>
#> 1 768 "Wuthering Heights" Wuthering Heights
#> 2 768 "" Wuthering Heights
#> 3 768 "by Emily Brontë" Wuthering Heights
#> 4 768 "" Wuthering Heights
#> 5 768 "" Wuthering Heights
#> 6 768 "" Wuthering Heights
#> 7 768 "" Wuthering Heights
#> 8 768 "CHAPTER I" Wuthering Heights
#> 9 768 "" Wuthering Heights
#> 10 768 "" Wuthering Heights
#> # ℹ 33,333 more rows
books |>
  count(title)
#> # A tibble: 2 × 2
#> title n
#> <chr> <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights 12342
It can also take the output of gutenberg_works() directly. For example,
we could get the text of all Aristotle’s works, each annotated with both
gutenberg_id and title, using:
aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")
aristotle_books
#> # A tibble: 43,801 × 3
#> gutenberg_id text
#> <int> <chr>
#> 1 1974 "THE POETICS OF ARISTOTLE"
#> 2 1974 ""
#> 3 1974 "By Aristotle"
#> 4 1974 ""
#> 5 1974 "A Translation By S. H. Butcher"
#> 6 1974 ""
#> 7 1974 ""
#> 8 1974 "[Transcriber's Annotations and Conventions: the translator left"
#> 9 1974 "intact some Greek words to illustrate a specific point of the original"
#> 10 1974 "discourse. In this transcription, in order to retain the accuracy of"
#> title
#> <chr>
#> 1 The Poetics of Aristotle
#> 2 The Poetics of Aristotle
#> 3 The Poetics of Aristotle
#> 4 The Poetics of Aristotle
#> 5 The Poetics of Aristotle
#> 6 The Poetics of Aristotle
#> 7 The Poetics of Aristotle
#> 8 The Poetics of Aristotle
#> 9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 43,791 more rows
You could match the wikipedia column in gutenberg_authors to Wikipedia content.
If you are analyzing author names, note that the humaniformat package has a format_reverse function for reversing “Last, First” names.
See the data-raw directory for the scripts that generate these datasets.
The current versions were generated from the Project Gutenberg catalog on 14 September 2024.
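The bundled datasets can be combined with ordinary dplyr joins. The sketch below attaches the wikipedia column from gutenberg_authors to a work and looks up its subjects. It assumes that gutenberg_author_id is the shared key between works and authors, and that gutenberg_subjects has a gutenberg_id column; check names() on each dataset if your version differs.

```r
library(dplyr)
library(gutenbergr)

# Look up one work, then attach its author's Wikipedia link.
# Assumption: gutenberg_author_id is the join key in both tables.
author_links <- gutenberg_works(title == "Wuthering Heights") |>
  select(gutenberg_id, title, gutenberg_author_id) |>
  left_join(
    select(gutenberg_authors, gutenberg_author_id, author, wikipedia),
    by = "gutenberg_author_id"
  )

# Subjects recorded for the same work.
# Assumption: gutenberg_subjects is keyed by gutenberg_id.
work_subjects <- gutenberg_subjects |>
  filter(gutenberg_id %in% author_links$gutenberg_id)

author_links
work_subjects
```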
The package respects Project Gutenberg's rules on automated access and
complies with them to the best of our ability. Namely:
Book text is retrieved from a mirror using links of the form https://www.gutenberg.lib.md.us/8/84/84.zip.
The .zip file is retrieved to minimize bandwidth.
.txt files are only retrieved if there is no .zip.
Still, this package is not the right way to download the entire
Project Gutenberg corpus (or all works in a particular language). For that,
follow their recommendation to set up a mirror. This package is
recommended for downloading a single work, or works for a particular
author or topic. See their Terms of Service for details.
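To illustrate the link format mentioned above, here is a minimal sketch of how such a mirror URL could be built from a work's ID. The helper mirror_zip_url() is hypothetical and not part of gutenbergr. The directory rule (every digit but the last becomes a nested folder) is an assumption generalized from the single example for work 84, so the package's actual logic may differ.

```r
# Hypothetical helper, not part of gutenbergr.
# Builds a .zip URL for one Gutenberg ID on a given mirror.
mirror_zip_url <- function(gutenberg_id,
                           mirror = "https://www.gutenberg.lib.md.us") {
  id <- as.character(as.integer(gutenberg_id))
  # Assumed layout: all digits except the last become nested
  # directories (768 -> "7/6"); single-digit IDs sit under "0".
  leading <- if (nchar(id) > 1) {
    paste(strsplit(substr(id, 1, nchar(id) - 1), "")[[1]], collapse = "/")
  } else {
    "0"
  }
  paste0(mirror, "/", leading, "/", id, "/", id, ".zip")
}

mirror_zip_url(84)
```

The helper handles one ID at a time. For real downloads, use gutenberg_download() so that mirror selection and the .txt fallback are handled by the package.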
Please note that the gutenbergr project is released with a Contributor
Code of
Conduct.
By contributing to this project, you agree to abide by its terms.