Project author: bensadeghi

Project description:
Julia implementation of Decision Tree (CART) and Random Forest algorithms
Language: Julia
Repository: git://github.com/bensadeghi/DecisionTree.jl.git
Created: 2013-01-14T01:29:53Z
Project community: https://github.com/bensadeghi/DecisionTree.jl

License: Other

DecisionTree.jl


Attention: this package is now maintained by JuliaAI, and its new home is the JuliaAI organization on GitHub.

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Available via:

  • AutoMLPipeline.jl - create complex ML pipeline structures using simple expressions
  • CombineML.jl - a heterogeneous ensemble learning package
  • MLJ.jl - a machine learning framework for Julia
  • ScikitLearn.jl - Julia implementation of the scikit-learn API

Classification

  • pre-pruning (max depth, min leaf size)
  • post-pruning (pessimistic pruning)
  • multi-threaded bagging (random forests)
  • adaptive boosting (decision stumps)
  • cross validation (n-fold)
  • support for ordered features (encoded as Reals or Strings)

Regression

  • pre-pruning (max depth, min leaf size)
  • multi-threaded bagging (random forests)
  • cross validation (n-fold)
  • support for numerical features

Note that regression is implied if the labels/targets are floating-point values (e.g. of type Array{Float64}).
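For instance (a minimal sketch with synthetic data, using the native build_tree function described below; the variable names are illustrative), the same call trains a classifier for String labels and a regressor for floating-point labels:

    using DecisionTree

    features = rand(100, 4)                  # 100 samples, 4 numerical features

    # String labels -> build_tree trains a classification tree
    class_labels = rand(["a", "b"], 100)
    clf_tree = build_tree(class_labels, features)

    # Float64 labels -> build_tree trains a regression tree
    reg_labels = rand(100)
    reg_tree = build_tree(reg_labels, features)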

Installation

You can install DecisionTree.jl using Julia’s package manager:

    Pkg.add("DecisionTree")
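On Julia 1.0 and later, the package manager lives in the Pkg standard library, so the call needs a preceding using Pkg (or can be run from the Pkg REPL mode):

    using Pkg
    Pkg.add("DecisionTree")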

ScikitLearn.jl API

DecisionTree.jl supports the ScikitLearn.jl interface and algorithms (cross-validation, hyperparameter tuning, pipelines, etc.).

Available models: DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, AdaBoostStumpClassifier.
See each model’s help (e.g. ?DecisionTreeRegressor at the REPL) for more information.
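As a brief illustration (a minimal sketch, not part of the original examples; the synthetic data and the max_depth value are arbitrary), DecisionTreeRegressor follows the same fit!/predict pattern shown for the classifier below:

    using DecisionTree

    # synthetic regression data: 5 numerical features with linear targets
    n, m = 10^3, 5
    features = randn(n, m)
    weights = rand(-2:2, m)
    labels = features * weights

    # train a depth-limited regression tree and predict on one sample
    model = DecisionTreeRegressor(max_depth=4)
    fit!(model, features, labels)
    predict(model, [-0.9, 3.0, 5.1, 1.9, 0.0])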

Classification Example

Load DecisionTree package

    using DecisionTree

Separate Fisher’s Iris dataset features and labels

    features, labels = load_data("iris")    # also see "adult" and "digits" datasets

    # the data loaded are of type Array{Any}
    # cast them to concrete types for better performance
    features = float.(features)
    labels   = string.(labels)

Pruned Tree Classifier

    # train depth-truncated classifier
    model = DecisionTreeClassifier(max_depth=2)
    fit!(model, features, labels)
    # pretty print of the tree, to a depth of 5 nodes (optional)
    print_tree(model, 5)
    # apply learned model
    predict(model, [5.9, 3.0, 5.1, 1.9])
    # get the probability of each label
    predict_proba(model, [5.9, 3.0, 5.1, 1.9])
    println(get_classes(model)) # returns the ordering of the columns in predict_proba's output

    # run n-fold cross validation over 3 CV folds
    # see ScikitLearn.jl for installation instructions
    using ScikitLearn.CrossValidation: cross_val_score
    accuracy = cross_val_score(model, features, labels, cv=3)

Also, have a look at the classification and regression example notebooks.

Native API

Classification Example

Decision Tree Classifier

    # train full-tree classifier
    model = build_tree(labels, features)
    # prune tree: merge leaves having >= 90% combined purity (default: 100%)
    model = prune_tree(model, 0.9)
    # pretty print of the tree, to a depth of 5 nodes (optional)
    print_tree(model, 5)
    # apply learned model
    apply_tree(model, [5.9, 3.0, 5.1, 1.9])
    # apply model to all the samples
    preds = apply_tree(model, features)
    # generate confusion matrix, along with accuracy and kappa scores
    confusion_matrix(labels, preds)
    # get the probability of each label
    apply_tree_proba(model, [5.9, 3.0, 5.1, 1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
    # run 3-fold cross validation of pruned tree
    n_folds = 3
    accuracy = nfoldCV_tree(labels, features, n_folds)

    # set of classification parameters and respective default values
    # pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
    # max_depth: maximum depth of the decision tree (default: -1, no maximum)
    # min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
    # min_samples_split: the minimum number of samples needed for a split (default: 2)
    # min_purity_increase: minimum purity needed for a split (default: 0.0)
    # n_subfeatures: number of features to select at random (default: 0, keep all)
    # keyword rng: the random number generator or seed to use (default: Random.GLOBAL_RNG)
    n_subfeatures = 0; max_depth = -1; min_samples_leaf = 1; min_samples_split = 2
    min_purity_increase = 0.0; pruning_purity = 1.0; seed = 3

    model = build_tree(labels, features,
                       n_subfeatures,
                       max_depth,
                       min_samples_leaf,
                       min_samples_split,
                       min_purity_increase;
                       rng = seed)

    accuracy = nfoldCV_tree(labels, features,
                            n_folds,
                            pruning_purity,
                            max_depth,
                            min_samples_leaf,
                            min_samples_split,
                            min_purity_increase;
                            verbose = true,
                            rng = seed)

Random Forest Classifier

    # train random forest classifier
    # using 2 random features, 10 trees, 0.5 portion of samples per tree, and a maximum tree depth of 6
    model = build_forest(labels, features, 2, 10, 0.5, 6)
    # apply learned model
    apply_forest(model, [5.9, 3.0, 5.1, 1.9])
    # get the probability of each label
    apply_forest_proba(model, [5.9, 3.0, 5.1, 1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
    # run 3-fold cross validation for forests, using 2 random features per split
    n_folds = 3; n_subfeatures = 2
    accuracy = nfoldCV_forest(labels, features, n_folds, n_subfeatures)

    # set of classification parameters and respective default values
    # n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
    # n_trees: number of trees to train (default: 10)
    # partial_sampling: fraction of samples to train each tree on (default: 0.7)
    # max_depth: maximum depth of the decision trees (default: no maximum)
    # min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
    # min_samples_split: the minimum number of samples needed for a split (default: 2)
    # min_purity_increase: minimum purity needed for a split (default: 0.0)
    # keyword rng: the random number generator or seed to use (default: Random.GLOBAL_RNG)
    # multi-threaded forests must be seeded with an `Int`
    n_subfeatures = -1; n_trees = 10; partial_sampling = 0.7; max_depth = -1
    min_samples_leaf = 5; min_samples_split = 2; min_purity_increase = 0.0; seed = 3

    model = build_forest(labels, features,
                         n_subfeatures,
                         n_trees,
                         partial_sampling,
                         max_depth,
                         min_samples_leaf,
                         min_samples_split,
                         min_purity_increase;
                         rng = seed)

    accuracy = nfoldCV_forest(labels, features,
                              n_folds,
                              n_subfeatures,
                              n_trees,
                              partial_sampling,
                              max_depth,
                              min_samples_leaf,
                              min_samples_split,
                              min_purity_increase;
                              verbose = true,
                              rng = seed)

Adaptive-Boosted Decision Stumps Classifier

    # train adaptive-boosted stumps, using 7 iterations
    model, coeffs = build_adaboost_stumps(labels, features, 7);
    # apply learned model
    apply_adaboost_stumps(model, coeffs, [5.9, 3.0, 5.1, 1.9])
    # get the probability of each label
    apply_adaboost_stumps_proba(model, coeffs, [5.9, 3.0, 5.1, 1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
    # run 3-fold cross validation for boosted stumps, using 7 iterations
    n_iterations = 7; n_folds = 3
    accuracy = nfoldCV_stumps(labels, features,
                              n_folds,
                              n_iterations;
                              verbose = true)

Regression Example

    n, m = 10^3, 5
    features = randn(n, m)
    weights = rand(-2:2, m)
    labels = features * weights

Regression Tree

    # train regression tree
    model = build_tree(labels, features)
    # apply learned model
    apply_tree(model, [-0.9, 3.0, 5.1, 1.9, 0.0])
    # run 3-fold cross validation, returns array of coefficients of determination (R^2)
    n_folds = 3
    r2 = nfoldCV_tree(labels, features, n_folds)

    # set of regression parameters and respective default values
    # pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
    # max_depth: maximum depth of the decision tree (default: -1, no maximum)
    # min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
    # min_samples_split: the minimum number of samples needed for a split (default: 2)
    # min_purity_increase: minimum purity needed for a split (default: 0.0)
    # n_subfeatures: number of features to select at random (default: 0, keep all)
    # keyword rng: the random number generator or seed to use (default: Random.GLOBAL_RNG)
    n_subfeatures = 0; max_depth = -1; min_samples_leaf = 5
    min_samples_split = 2; min_purity_increase = 0.0; pruning_purity = 1.0; seed = 3

    model = build_tree(labels, features,
                       n_subfeatures,
                       max_depth,
                       min_samples_leaf,
                       min_samples_split,
                       min_purity_increase;
                       rng = seed)

    r2 = nfoldCV_tree(labels, features,
                      n_folds,
                      pruning_purity,
                      max_depth,
                      min_samples_leaf,
                      min_samples_split,
                      min_purity_increase;
                      verbose = true,
                      rng = seed)

Regression Random Forest

    # train regression forest, using 2 random features, 10 trees,
    # averaging of 5 samples per leaf, and 0.7 portion of samples per tree
    model = build_forest(labels, features, 2, 10, 0.7, 5)
    # apply learned model
    apply_forest(model, [-0.9, 3.0, 5.1, 1.9, 0.0])
    # run 3-fold cross validation on regression forest, using 2 random features per split
    n_subfeatures = 2; n_folds = 3
    r2 = nfoldCV_forest(labels, features, n_folds, n_subfeatures)

    # set of regression build_forest() parameters and respective default values
    # n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
    # n_trees: number of trees to train (default: 10)
    # partial_sampling: fraction of samples to train each tree on (default: 0.7)
    # max_depth: maximum depth of the decision trees (default: no maximum)
    # min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
    # min_samples_split: the minimum number of samples needed for a split (default: 2)
    # min_purity_increase: minimum purity needed for a split (default: 0.0)
    # keyword rng: the random number generator or seed to use (default: Random.GLOBAL_RNG)
    # multi-threaded forests must be seeded with an `Int`
    n_subfeatures = -1; n_trees = 10; partial_sampling = 0.7; max_depth = -1
    min_samples_leaf = 5; min_samples_split = 2; min_purity_increase = 0.0; seed = 3

    model = build_forest(labels, features,
                         n_subfeatures,
                         n_trees,
                         partial_sampling,
                         max_depth,
                         min_samples_leaf,
                         min_samples_split,
                         min_purity_increase;
                         rng = seed)

    r2 = nfoldCV_forest(labels, features,
                        n_folds,
                        n_subfeatures,
                        n_trees,
                        partial_sampling,
                        max_depth,
                        min_samples_leaf,
                        min_samples_split,
                        min_purity_increase;
                        verbose = true,
                        rng = seed)

Saving Models

Models can be saved to disk and loaded back with the use of the JLD2.jl package.

    using JLD2
    @save "model_file.jld2" model
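To restore a saved model, JLD2's matching @load macro can be used (a minimal sketch; the file name matches the example above, and the restored model can then be used with apply_tree, predict, etc. as before):

    using JLD2
    # restores the variable `model` from the file saved above
    @load "model_file.jld2" model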

Note that even though features and labels of type Array{Any} are supported, it is highly recommended that data be cast to explicit types (i.e. with float.(), string.(), etc.). This significantly improves model training and prediction execution times, and also drastically reduces the size of saved models.