Project author: mkearney

Project description:
Word Factor Vectors

Language: R
Repository: git://github.com/mkearney/wactor.git
Created: 2019-07-11T19:58:57Z
Project community: https://github.com/mkearney/wactor

License: Other


" class="reference-link">wactor

[Badges: Travis build status, AppVeyor build status, CRAN status, Lifecycle: experimental, Codecov test coverage]

A user-friendly factor-like interface for converting strings of text
into numeric vectors and rectangular data structures.

Installation

You can install the development version from
GitHub with:

    ## install {remotes} if not already installed
    if (!requireNamespace("remotes")) {
      install.packages("remotes")
    }

    ## install wactor from GitHub
    remotes::install_github("mkearney/wactor")

Example

Here’s some basic text (e.g., natural language) data:

    ## load wactor package
    library(wactor)

    ## text data (sentences)
    x <- c(
      "This test is a test",
      "This one will be a test",
      "This this was a test",
      "And this is the fourth test",
      "Fifth: the test!",
      "And the sixth test",
      "This is the seventh test",
      "This test is going to be a test",
      "This one will have been a test",
      "This this has been a test"
    )

    ## for demonstration purposes, store as a data frame as well
    data <- tibble::tibble(
      text = x,
      value = rnorm(length(x)),
      z = c(rep(TRUE, 7), rep(FALSE, 3))
    )
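
Note that rnorm() (and the test/train splits below) involve random draws, so exact values and row assignments will differ between runs; fixing the seed first makes the examples reproducible:

    ## optional: fix the RNG seed for reproducible draws and splits
    set.seed(1234)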

split_test_train()

A convenience function for splitting an input object into test and train
data frames. This is often useful for splitting a single data frame:

    ## split into test/train data sets
    split_test_train(data)
    #> $train
    #> # A tibble: 8 x 3
    #>   text                             value z
    #>   <chr>                            <dbl> <lgl>
    #> 1 This test is a test              0.523 TRUE
    #> 2 This one will be a test          0.362 TRUE
    #> 3 This this was a test            -1.61  TRUE
    #> 4 And this is the fourth test     -0.923 TRUE
    #> 5 Fifth: the test!                -1.17  TRUE
    #> 6 This test is going to be a test -0.192 FALSE
    #> 7 This one will have been a test   0.531 FALSE
    #> 8 This this has been a test       -1.85  FALSE
    #>
    #> $test
    #> # A tibble: 2 x 3
    #>   text                     value z
    #>   <chr>                    <dbl> <lgl>
    #> 1 And the sixth test        1.25 TRUE
    #> 2 This is the seventh test  1.13 TRUE
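
The return value is an ordinary named list of two tibbles, so the pieces can be inspected or extracted directly:

    ## the result is a named list of two tibbles
    s <- split_test_train(data)
    names(s)      ## "train" "test"
    nrow(s$train) ## 8 of the 10 input rows (80% by default)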

By default, split_test_train() returns 80% of the input data in the train data set and 20% of the input data in the test data set. The proportion of data used in the returned training set can be adjusted via the .p argument:

    ## split into test/train data sets, with 70% of data in the training set
    split_test_train(data, .p = 0.70)
    #> $train
    #> # A tibble: 7 x 3
    #>   text                             value z
    #>   <chr>                            <dbl> <lgl>
    #> 1 This test is a test              0.523 TRUE
    #> 2 This one will be a test          0.362 TRUE
    #> 3 This this was a test            -1.61  TRUE
    #> 4 And this is the fourth test     -0.923 TRUE
    #> 5 This test is going to be a test -0.192 FALSE
    #> 6 This one will have been a test   0.531 FALSE
    #> 7 This this has been a test       -1.85  FALSE
    #>
    #> $test
    #> # A tibble: 3 x 3
    #>   text                     value z
    #>   <chr>                    <dbl> <lgl>
    #> 1 Fifth: the test!         -1.17 TRUE
    #> 2 And the sixth test        1.25 TRUE
    #> 3 This is the seventh test  1.13 TRUE

When predicting categorical variables, it's often desirable to ensure the training data set has an equal number of observations for each level of the response variable. This can be achieved by indicating the column name of the categorical response variable using tidy evaluation. This will prioritize evenly balanced observations over the specified proportion of training data:

    ## ensure evenly balanced groups in `train` data set
    split_test_train(data, .p = 0.70, z)
    #> $train
    #> # A tibble: 6 x 3
    #>   text                             value z
    #>   <chr>                            <dbl> <lgl>
    #> 1 This one will be a test          0.362 TRUE
    #> 2 And this is the fourth test     -0.923 TRUE
    #> 3 Fifth: the test!                -1.17  TRUE
    #> 4 This test is going to be a test -0.192 FALSE
    #> 5 This one will have been a test   0.531 FALSE
    #> 6 This this has been a test       -1.85  FALSE
    #>
    #> $test
    #> # A tibble: 4 x 3
    #>   text                      value z
    #>   <chr>                     <dbl> <lgl>
    #> 1 This test is a test       0.523 TRUE
    #> 2 This this was a test     -1.61  TRUE
    #> 3 And the sixth test        1.25  TRUE
    #> 4 This is the seventh test  1.13  TRUE
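
A quick way to verify the balance is to tabulate the response column of the returned training set (the split itself is random, so the selected rows may differ between runs, but the group counts should match):

    ## confirm the training set is balanced on z (3 TRUE and 3 FALSE above)
    table(split_test_train(data, .p = 0.70, z)$train$z)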

split_test_train() doesn't only work on data frames. It's also
possible to split atomic vectors (i.e., character, numeric, or logical):

    ## OR split character vector into test/train data sets
    (d <- split_test_train(x))
    #> $train
    #> # A tibble: 8 x 1
    #>   x
    #>   <chr>
    #> 1 This one will be a test
    #> 2 This this was a test
    #> 3 And this is the fourth test
    #> 4 Fifth: the test!
    #> 5 And the sixth test
    #> 6 This is the seventh test
    #> 7 This test is going to be a test
    #> 8 This one will have been a test
    #>
    #> $test
    #> # A tibble: 2 x 1
    #>   x
    #>   <chr>
    #> 1 This test is a test
    #> 2 This this has been a test
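
When an atomic vector is split, each piece is returned as a one-column tibble whose column is named after the input object (here, x). The raw character vectors can therefore be recovered with the usual $ accessors, which is how the next section feeds the training text to wactor():

    ## recover the raw character vectors from the split
    d$train$x
    d$test$x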

wactor()

Use wactor() to convert a character vector into a wactor object. The
code below uses the text data d, split into test/train sets above.

    ## create wactor
    w <- wactor(d$train$x)

dtm()

Get the document-term matrix:

    ## document term matrix
    dtm(w)
    #> 8 x 18 sparse Matrix of class "dgCMatrix"
    #> [[ suppressing 18 column names 'test', 'this', 'a' ... ]]
    #>
    #> 1 1 1 1 . . . 1 1 1 . . . . . . . . .
    #> 2 1 2 1 . . . . . . . . . 1 . . . . .
    #> 3 1 1 . 1 1 1 . . . . . . . . . . 1 .
    #> 4 1 . . 1 . . . . . . . 1 . . . . . .
    #> 5 1 . . 1 . 1 . . . . . . . . . 1 . .
    #> 6 1 1 . 1 1 . . . . . . . . . . . . 1
    #> 7 2 1 1 . 1 . . 1 . . . . . 1 1 . . .
    #> 8 1 1 1 . . . 1 . 1 1 1 . . . . . . .

    ## same thing as dtm
    predict(w)
    #> 8 x 18 sparse Matrix of class "dgCMatrix"
    #> [[ suppressing 18 column names 'test', 'this', 'a' ... ]]
    #>
    #> 1 1 1 1 . . . 1 1 1 . . . . . . . . .
    #> 2 1 2 1 . . . . . . . . . 1 . . . . .
    #> 3 1 1 . 1 1 1 . . . . . . . . . . 1 .
    #> 4 1 . . 1 . . . . . . . 1 . . . . . .
    #> 5 1 . . 1 . 1 . . . . . . . . . 1 . .
    #> 6 1 1 . 1 1 . . . . . . . . . . . . 1
    #> 7 2 1 1 . 1 . . 1 . . . . . 1 1 . . .
    #> 8 1 1 1 . . . 1 . 1 1 1 . . . . . . .

tfidf()

Or get the term frequency–inverse document frequency (TF-IDF) matrix:

    ## create tf-idf matrix
    tfidf(w)
    #> test this a the is and one be will
    #> 1 0.7604596 0.13103552 0.9993288 -0.8761975 -0.7050698 -0.5262117 1.7793527 1.9030735 1.7793527
    #> 2 0.2213996 1.98205976 1.3806741 -0.8761975 -0.7050698 -0.5262117 -0.5379438 -0.5328606 -0.5379438
    #> 3 0.7604596 0.13103552 -0.9073974 0.3535534 1.2069839 1.1576657 -0.5379438 -0.5328606 -0.5379438
    #> 4 -1.9348401 -1.19112465 -0.9073974 1.5833043 -0.7050698 -0.5262117 -0.5379438 -0.5328606 -0.5379438
    #> 5 -0.5871903 -1.19112465 -0.9073974 0.9684289 -0.7050698 1.9996044 -0.5379438 -0.5328606 -0.5379438
    #> 6 0.2213996 0.39546755 -0.9073974 0.5995036 1.5893946 -0.5262117 -0.5379438 -0.5328606 -0.5379438
    #> 7 -0.5871903 -0.19950453 0.5226473 -0.8761975 0.7289704 -0.5262117 -0.5379438 1.2940900 -0.5379438
    #> 8 1.1455024 -0.05784451 0.7269394 -0.8761975 -0.7050698 -0.5262117 1.4483103 -0.5328606 1.4483103
    #> been have fifth was to going sixth fourth seventh
    #> 1 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534
    #> 2 -0.3535534 -0.3535534 -0.3535534 2.4748737 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534
    #> 3 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 2.4748737 -0.3535534
    #> 4 -0.3535534 -0.3535534 2.4748737 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534
    #> 5 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 2.4748737 -0.3535534 -0.3535534
    #> 6 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 2.4748737
    #> 7 -0.3535534 -0.3535534 -0.3535534 -0.3535534 2.4748737 2.4748737 -0.3535534 -0.3535534 -0.3535534
    #> 8 2.4748737 2.4748737 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534
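
The repeated values (for example, -0.3535534 throughout the rarer terms) suggest each column is centered and scaled using statistics from the training data. A quick check, assuming tfidf() returns an ordinary dense matrix as the print above suggests:

    ## columns appear standardized: means near 0, standard deviations near 1
    round(colMeans(tfidf(w)), 4)
    round(apply(tfidf(w), 2, sd), 4)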

Or apply the wactor to new data:

    ## document term frequency of new data
    dtm(w, d$test$x)
    #> 2 x 18 sparse Matrix of class "dgCMatrix"
    #> [[ suppressing 18 column names 'test', 'this', 'a' ... ]]
    #>
    #> 1 2 1 1 . 1 . . . . . . . . . . . . .
    #> 2 1 2 1 . . . . . . 1 . . . . . . . .

    ## same thing as dtm
    predict(w, d$test$x)
    #> 2 x 18 sparse Matrix of class "dgCMatrix"
    #> [[ suppressing 18 column names 'test', 'this', 'a' ... ]]
    #>
    #> 1 2 1 1 . 1 . . . . . . . . . . . . .
    #> 2 1 2 1 . . . . . . 1 . . . . . . . .

    ## term frequency–inverse document frequency of new data
    tfidf(w, d$test$x)
    #> test this a the is and one be will
    #> 1 -3.0129600 0.3954676 1.380674 -0.8761975 1.5893946 -0.5262117 -0.5379438 -0.5328606 -0.5379438
    #> 2 0.2213996 1.9820598 1.380674 -0.8761975 -0.7050698 -0.5262117 -0.5379438 -0.5328606 -0.5379438
    #> been have fifth was to going sixth fourth seventh
    #> 1 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534
    #> 2 3.6062446 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534 -0.3535534

xgb_mat()

The wactor package also makes it easy to work with the
{xgboost} package:

    ## convert tfidf matrix into xgb.DMatrix
    xgb_mat(tfidf(w, d$test$x))
    #> xgb.DMatrix dim: 2 x 18 info: NA colnames: yes

The xgb_mat() function also allows users to specify a
response/label/outcome vector, e.g.:

    ## include a response variable
    xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)))
    #> xgb.DMatrix dim: 8 x 18 info: label colnames: yes

To return split (into test and train) data, specify a value between 0 and 1 for the split argument to set the proportion of observations that should appear in the training data set:

    ## split into test/train
    xgb_data <- xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)), split = 0.8)
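
As with split_test_train(), the returned object is a named list, here holding a train and a test xgb.DMatrix, which is why it can be passed directly as the watchlist in the training call below:

    ## the split result is a named list of xgb.DMatrix objects
    xgb_data$train
    xgb_data$test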

The object returned by xgb_mat() can then easily be passed to
{xgboost} functions for powerful and fast machine learning!

    ## specify hyper params
    params <- list(
      max_depth = 2,
      eta = 0.25,
      objective = "binary:logistic"
    )

    ## init training
    xgboost::xgb.train(
      params,
      xgb_data$train,
      nrounds = 4,
      watchlist = xgb_data
    )
    #> [1] train-error:0.500000 test-error:0.500000
    #> [2] train-error:0.500000 test-error:0.500000
    #> [3] train-error:0.500000 test-error:0.500000
    #> [4] train-error:0.500000 test-error:0.500000
    #> ##### xgb.Booster
    #> raw: 1.1 Kb
    #> call:
    #>   xgboost::xgb.train(params = params, data = xgb_data$train, nrounds = 4,
    #>     watchlist = xgb_data)
    #> params (as set within xgb.train):
    #>   max_depth = "2", eta = "0.25", objective = "binary:logistic", silent = "1"
    #> xgb.attributes:
    #>   niter
    #> callbacks:
    #>   cb.print.evaluation(period = print_every_n)
    #>   cb.evaluation.log()
    #> # of features: 18
    #> niter: 4
    #> nfeatures : 18
    #> evaluation_log:
    #>  iter train_error test_error
    #>     1         0.5        0.5
    #>     2         0.5        0.5
    #>     3         0.5        0.5
    #>     4         0.5        0.5
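
From here, scoring new text follows the usual {xgboost} workflow: transform it with the fitted wactor, wrap the result with xgb_mat(), and call predict(). A minimal sketch under those assumptions (the training call above wasn't assigned, so a model object is created here):

    ## keep the fitted booster this time (mirrors the training call above)
    model <- xgboost::xgb.train(params, xgb_data$train, nrounds = 4)

    ## predicted probabilities for the held-out text
    predict(model, xgb_mat(tfidf(w, d$test$x)))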