Project author: tdude92

Project description:
4,308 short stories (4 million words) scraped from https://reddit.com/r/WritingPrompts
Primary language:
Repository: git://github.com/tdude92/reddit-short-stories.git
Created: 2021-04-28T04:36:07Z
Project community: https://github.com/tdude92/reddit-short-stories

License: MIT License


reddit-short-stories

A small unlabelled dataset of 4,308 short stories (4 million words) scraped from https://reddit.com/r/WritingPrompts for your machine learning needs.

Scraped and formatted by Trevor Du

Dataset description

  • Each line of reddit_short_stories.txt is one full short story.
  • Each short story begins with an “<sos>” token and ends with an “<eos>” token (e.g. “<sos> once upon a time, the end <eos>”).
  • Newline characters in a story are replaced with the “<nl>” token (e.g. “<sos> line 1 <nl> line 2 <eos>”).
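
The format above can be decoded with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the token strings are passed in as arguments rather than hard-coded (the `<sos>`, `<eos>`, and `<nl>` values in the usage note are assumed placeholders, so substitute whatever tokens appear in your copy of the file), and UTF-8 encoding is assumed.

```python
def decode_story(line, start_token, end_token, newline_token):
    """Strip the boundary tokens from one dataset line and restore newlines."""
    if not (start_token and end_token and newline_token):
        raise ValueError("all three tokens must be non-empty strings")
    text = line.strip()
    # Drop the leading start token and trailing end token if present.
    if text.startswith(start_token):
        text = text[len(start_token):]
    if text.endswith(end_token):
        text = text[:-len(end_token)]
    # Each newline token marks a line break in the original story.
    parts = [part.strip() for part in text.split(newline_token)]
    return "\n".join(parts).strip()


def iter_stories(path, start_token, end_token, newline_token):
    """Yield one decoded story per non-empty line of the dataset file."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield decode_story(line, start_token, end_token, newline_token)
```

For example, `decode_story("<sos> line 1 <nl> line 2 <eos>", "<sos>", "<eos>", "<nl>")` gives a two-line string, and `iter_stories("reddit_short_stories.txt", "<sos>", "<eos>", "<nl>")` streams the stories one at a time instead of loading the whole file into memory.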

Data Collection Method

r/WritingPrompts is a forum on the popular discussion website https://reddit.com. By convention, users start threads titled with a writing prompt, and other users reply in the comments with a short story they’ve written based on that prompt.

The scraper saved a comment on an r/WritingPrompts post only if all of the following conditions were satisfied:

  • The post is flaired “Writing Prompt”.
  • The post has >=1,000 upvotes.
  • The author of the comment is not a moderator of r/WritingPrompts (to avoid scraping automod posts and mod announcements).
  • The comment has >=200 upvotes.
  • The comment has >=200 words.
  • Fewer than 20 comments have already been scraped from the comment’s parent post.
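
The conditions above can be sketched as a single predicate. This is not the author's scraper code: the `Post` and `Comment` classes, their field names, and `should_save` are hypothetical, and counting words by splitting on whitespace is an assumption about how the 200-word threshold was measured.

```python
from dataclasses import dataclass


@dataclass
class Post:
    flair: str      # hypothetical field: the post's flair text
    upvotes: int    # hypothetical field: the post's score


@dataclass
class Comment:
    body: str                  # the story text
    upvotes: int               # the comment's score
    author_is_moderator: bool  # True if a subreddit moderator wrote it


def should_save(post, comment, already_saved_from_post):
    """Return True if the comment passes every filter listed above."""
    # Assumption: a "word" is any whitespace-separated chunk of text.
    word_count = len(comment.body.split())
    return (
        post.flair == "Writing Prompt"
        and post.upvotes >= 1000
        and not comment.author_is_moderator
        and comment.upvotes >= 200
        and word_count >= 200
        and already_saved_from_post < 20
    )
```

Here `already_saved_from_post` is the number of comments already kept from the same parent post, so a caller would increment it each time `should_save` returns `True`.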

Note: Only a portion of r/WritingPrompts was scraped, not the entire subreddit.

I hope to scrape more of r/WritingPrompts and other subreddits in the future.