Project author: ropensci

Project description:
Search and download public domain texts from Project Gutenberg

Language: R
Repository: git://github.com/ropensci/gutenbergr.git
Created: 2016-04-28T14:25:04Z
Community: https://github.com/ropensci/gutenbergr

License:


gutenbergr

Project Status: Active – The project has reached a stable, usable
state and is being actively developed.

Download and process public domain works from the Project
Gutenberg
collection. Includes

  • A function gutenberg_download() that downloads one or more works
    from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads
    the text of Frankenstein.
  • Metadata for all Project Gutenberg works as R datasets, so that they
    can be searched and filtered:
    • gutenberg_metadata contains information about each work, pairing
      Gutenberg ID with title, author, language, etc.
    • gutenberg_authors contains information about each author, such as
      aliases and birth/death years
    • gutenberg_subjects contains pairings of works with Library of
      Congress subjects and topics
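
As a quick sketch of how these datasets can be combined (assuming the
gutenberg_subjects columns gutenberg_id, subject_type, and subject, and
the subject_type value "lcsh" for Library of Congress Subject Headings),
you can join them to list every work tagged with a given subject:

``` r
library(dplyr)
library(gutenbergr)

# Works tagged with the Library of Congress Subject Heading "Horror tales"
gutenberg_subjects |>
  filter(subject_type == "lcsh", subject == "Horror tales") |>
  inner_join(gutenberg_metadata, by = "gutenberg_id") |>
  select(gutenberg_id, title, author)
```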

Installation



Install the released version of gutenbergr from
CRAN:

``` r
install.packages("gutenbergr")
```



Install the development version of gutenbergr from
GitHub:

``` r
# install.packages("pak")
pak::pak("ropensci/gutenbergr")
```

Examples

The gutenberg_works() function retrieves, by default, a table of
metadata for all unique English-language Project Gutenberg works that
have text associated with them. (The gutenberg_metadata dataset has
all Gutenberg works, unfiltered.)

Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We
could find the book’s ID by filtering:

``` r
library(dplyr)
library(gutenbergr)

gutenberg_works() |>
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>
#> 1          768 Wuthering Heights Brontë, Emily                 405 en
#>   gutenberg_bookshelf                                                               rights has_text
#>   <chr>                                                                             <chr>  <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE

# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>
#> 1          768 Wuthering Heights Brontë, Emily                 405 en
#>   gutenberg_bookshelf                                                               rights has_text
#>   <chr>                                                                             <chr>  <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE
```

Since we see that it has gutenberg_id 768, we can download it with the
gutenberg_download() function:

``` r
wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#>    gutenberg_id text
#>           <int> <chr>
#>  1          768 "Wuthering Heights"
#>  2          768 ""
#>  3          768 "by Emily Brontë"
#>  4          768 ""
#>  5          768 ""
#>  6          768 ""
#>  7          768 ""
#>  8          768 "CHAPTER I"
#>  9          768 ""
#> 10          768 ""
#> # ℹ 12,332 more rows
```

gutenberg_download() can download multiple books when given multiple
IDs. It also takes a meta_fields argument that adds variables from
the metadata to the result.

``` r
# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#>    gutenberg_id text                title
#>           <int> <chr>               <chr>
#>  1          768 "Wuthering Heights" Wuthering Heights
#>  2          768 ""                  Wuthering Heights
#>  3          768 "by Emily Brontë"   Wuthering Heights
#>  4          768 ""                  Wuthering Heights
#>  5          768 ""                  Wuthering Heights
#>  6          768 ""                  Wuthering Heights
#>  7          768 ""                  Wuthering Heights
#>  8          768 "CHAPTER I"         Wuthering Heights
#>  9          768 ""                  Wuthering Heights
#> 10          768 ""                  Wuthering Heights
#> # ℹ 33,333 more rows

books |>
  count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights           12342
```

It can also take the output of gutenberg_works() directly. For example,
we could get the text of all of Aristotle’s works, each annotated with
both gutenberg_id and title, using:

``` r
aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")
aristotle_books
#> # A tibble: 43,801 × 3
#>    gutenberg_id text
#>           <int> <chr>
#>  1         1974 "THE POETICS OF ARISTOTLE"
#>  2         1974 ""
#>  3         1974 "By Aristotle"
#>  4         1974 ""
#>  5         1974 "A Translation By S. H. Butcher"
#>  6         1974 ""
#>  7         1974 ""
#>  8         1974 "[Transcriber's Annotations and Conventions: the translator left"
#>  9         1974 "intact some Greek words to illustrate a specific point of the original"
#> 10         1974 "discourse. In this transcription, in order to retain the accuracy of"
#>    title
#>    <chr>
#>  1 The Poetics of Aristotle
#>  2 The Poetics of Aristotle
#>  3 The Poetics of Aristotle
#>  4 The Poetics of Aristotle
#>  5 The Poetics of Aristotle
#>  6 The Poetics of Aristotle
#>  7 The Poetics of Aristotle
#>  8 The Poetics of Aristotle
#>  9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 43,791 more rows
```

FAQ

What do I do with the text once I have it?

  • The Natural Language Processing CRAN Task View suggests many R
    packages related to text mining, especially around the tm package.
  • The tidytext package is
    useful for tokenization and analysis, especially since gutenbergr
    downloads books as a data frame already.
  • You could match the wikipedia column in gutenberg_authors to
    Wikipedia content with the WikipediR package, or to pageview
    statistics with the wikipediatrend package.
  • If you’re considering an analysis based on author name, you may find
    the humaniformat (for extraction of first names) and gender
    (prediction of gender from first names) packages useful. (Note that
    humaniformat has a format_reverse function for reversing “Last,
    First” names.)
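
As a minimal sketch of that tidytext workflow (using a toy tibble in
place of a real gutenberg_download() result, so the example is
self-contained), unnest_tokens() splits the text column into one word
per row, ready for counting:

``` r
library(dplyr)
library(tidytext)

# Toy stand-in for a gutenberg_download() result: one row per line of text
book <- tibble(
  gutenberg_id = 768L,
  text = c("Wuthering Heights", "", "by Emily Brontë", "CHAPTER I")
)

# unnest_tokens() lowercases and strips punctuation by default
book |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)
```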

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these
datasets. As of now, they were generated from the Project Gutenberg
catalog on 14 September 2024.

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies with them to the
best of our ability. Namely:

  • Project Gutenberg allows harvesting with automated software using
    this list of links. The gutenbergr package visits that page once to
    find the recommended mirror for the user’s location.
  • We retrieve the book text directly from that mirror using links in the
    same format. For example, Frankenstein (book 84) is retrieved from
    https://www.gutenberg.lib.md.us/8/84/84.zip.
  • We give priority to retrieving the .zip file to minimize bandwidth
    on the mirror. A .txt file is retrieved only if no .zip is
    available.
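
The link format above can be sketched as a small helper (hypothetical,
for illustration only; gutenbergr computes this internally): every digit
of a multi-digit book ID except the last becomes a directory, followed
by a directory named for the full ID.

``` r
# Build the mirror-relative path for a multi-digit Project Gutenberg book ID,
# e.g. 84 -> "8/84/84.zip" and 1260 -> "1/2/6/1260/1260.zip"
gutenberg_path <- function(id, ext = "zip") {
  digits <- strsplit(as.character(id), "")[[1]]
  dirs <- paste(digits[-length(digits)], collapse = "/")
  sprintf("%s/%s/%s.%s", dirs, id, id, ext)
}

gutenberg_path(84)
#> [1] "8/84/84.zip"
gutenberg_path(1260)
#> [1] "1/2/6/1260/1260.zip"
```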

Still, this package is not the right way to download the entire
Project Gutenberg corpus (or all works in a particular language). For
that, follow their recommendation to set up a mirror. This package is
intended for downloading a single work, or the works of a particular
author or topic. See their Terms of Service for details.

Code of Conduct

Please note that the gutenbergr project is released with a Contributor
Code of Conduct.
