Project author: const-ae
Project description: Cluster high dimensional categorical datasets
Language: R
Repository: git://github.com/const-ae/mixdir.git
Created: 2018-01-20T19:17:24Z
Project page: https://github.com/const-ae/mixdir


mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

  • handle missing data
  • infer a reasonable number of latent classes (try mixdir(select_latent=TRUE))
  • cluster datasets with more than 70,000 observations and 60 features
  • propagate uncertainty and produce a soft clustering
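
As a minimal sketch of the first two points, the call below blanks out a few entries of the bundled mushroom data and lets mixdir choose the number of classes. It assumes that missing values can be passed as plain NA and that, with select_latent=TRUE, n_latent acts as an upper bound on the number of classes; both assumptions are worth checking against ?mixdir. The subset size and the value n_latent = 10 are arbitrary choices for illustration.

```r
# Guarded so the snippet is a no-op when mixdir is not installed
if (requireNamespace("mixdir", quietly = TRUE)) {
  library(mixdir)
  data("mushroom")
  set.seed(1)

  # Small subset to keep the run short
  X <- mushroom[1:200, 1:5]
  # Introduce a few missing values (assumption: NA is accepted as-is)
  X[1:5, "cap-color"] <- NA

  # n_latent is treated here as an upper bound (assumption, see ?mixdir)
  res <- mixdir(X, n_latent = 10, select_latent = TRUE)

  # How many observations ended up in each class
  print(table(res$pred_class))
}
```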

A detailed description of the algorithm and the features of the package can be found in the accompanying paper. If you find the package useful, please cite:

C. Ahlmann-Eltze and C. Yau, “MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data”, 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

Installation

    install.packages("mixdir")
    # Or to get the latest version from GitHub
    devtools::install_github("const-ae/mixdir")

Example

Clustering the mushroom data set.

    # Loading the library and the data
    library(mixdir)
    set.seed(1)
    data("mushroom")
    # High dimensional dataset: 8124 mushrooms and 23 different features
    mushroom[1:10, 1:5]
    #>    bruises cap-color cap-shape cap-surface    edible
    #> 1  bruises     brown    convex      smooth poisonous
    #> 2  bruises    yellow    convex      smooth    edible
    #> 3  bruises     white      bell      smooth    edible
    #> 4  bruises     white    convex       scaly poisonous
    #> 5       no      gray    convex      smooth    edible
    #> 6  bruises    yellow    convex       scaly    edible
    #> 7  bruises     white      bell      smooth    edible
    #> 8  bruises     white      bell       scaly    edible
    #> 9  bruises     white    convex       scaly poisonous
    #> 10 bruises    yellow      bell      smooth    edible

Calling the clustering function mixdir on a subset of the data:

    # Clustering into 3 latent classes
    result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)

Analyzing the result

    # Latent class of the first 10 mushrooms
    head(result$pred_class, n=10)
    #> [1] 3 1 1 3 2 1 1 1 3 1
    # Soft clustering for the first 10 mushrooms
    head(result$class_prob, n=10)
    #>               [,1]         [,2]         [,3]
    #>  [1,] 3.103495e-07 1.055098e-05 9.999891e-01
    #>  [2,] 9.998594e-01 4.683764e-06 1.359291e-04
    #>  [3,] 9.998944e-01 3.111462e-06 1.025194e-04
    #>  [4,] 5.778033e-04 7.114603e-08 9.994221e-01
    #>  [5,] 3.662625e-07 9.999992e-01 4.183025e-07
    #>  [6,] 9.996461e-01 8.764031e-08 3.537838e-04
    #>  [7,] 9.998944e-01 3.111462e-06 1.025194e-04
    #>  [8,] 9.997331e-01 5.822320e-08 2.668420e-04
    #>  [9,] 5.778033e-04 7.114603e-08 9.994221e-01
    #> [10,] 9.999999e-01 5.850067e-09 9.845112e-08
    pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                       labels_col = paste("Class", 1:3))
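
The two outputs above are closely related. As a sketch (assuming, as the printed values suggest, that pred_class is simply the most probable class in each row), the hard labels can be recovered from class_prob, and the row-wise entropy gives a rough per-observation uncertainty score. The matrix copies the first five rows of the output above; entropy is an illustrative helper, not part of mixdir.

```r
# First five rows of result$class_prob, copied from the output above
class_prob <- rbind(
  c(3.103495e-07, 1.055098e-05, 9.999891e-01),
  c(9.998594e-01, 4.683764e-06, 1.359291e-04),
  c(9.998944e-01, 3.111462e-06, 1.025194e-04),
  c(5.778033e-04, 7.114603e-08, 9.994221e-01),
  c(3.662625e-07, 9.999992e-01, 4.183025e-07)
)

# Hard clustering: index of the largest probability in each row
hard_class <- apply(class_prob, 1, which.max)
print(hard_class)
#> [1] 3 1 1 3 2

# Illustrative helper (not part of mixdir): Shannon entropy in nats,
# treating 0 * log(0) as 0
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# Values near 0 mean the assignment is almost certain
uncertainty <- apply(class_prob, 1, entropy)
print(round(uncertainty, 5))
```

The recovered labels agree with the first five entries of result$pred_class shown above.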

    # Structure of latent class 1
    # (bruises, cap color either yellow or white, edible etc.)
    purrr::map(result$category_prob, 1)
    #> $bruises
    #>      bruises           no
    #> 0.9998223256 0.0001776744
    #>
    #> $`cap-color`
    #>        brown         gray          red        white       yellow
    #> 0.0001775934 0.0001819672 0.0001776373 0.4079822666 0.5914805356
    #>
    #> $`cap-shape`
    #>      bell    convex      flat    sunken
    #> 0.3926736 0.4767291 0.1304197 0.0001776
    #>
    #> $`cap-surface`
    #>   fibrous     scaly    smooth
    #> 0.0568571 0.4871396 0.4560033
    #>
    #> $edible
    #>       edible    poisonous
    #> 0.9998223174 0.0001776826

    # The most predictive features for each class
    find_predictive_features(result, top_n=3)
    #>       column    answer class probability
    #> 19 cap-color    yellow     1   0.9993990
    #> 22 cap-shape      bell     1   0.9990947
    #> 1    bruises   bruises     1   0.7089533
    #> 48    edible poisonous     3   0.9980468
    #> 15 cap-color       red     3   0.8462032
    #> 9  cap-color     brown     3   0.6473043
    #> 5    bruises        no     2   0.9990364
    #> 11 cap-color      gray     2   0.9978218
    #> 32 cap-shape    sunken     2   0.9936162

    # For example: if all I know about a mushroom is that it has a
    # yellow cap, then I am 99% certain that it will be in class 1
    predict(result, c(`cap-color`="yellow"))
    #>          [,1]         [,2]         [,3]
    #> [1,] 0.999399 0.0003004692 0.0003004907

    # Note that the most predictive features are different from the most typical ones
    find_typical_features(result, top_n=3)
    #>         column  answer class probability
    #> 1      bruises bruises     1   0.9998223
    #> 43      edible  edible     1   0.9998223
    #> 19   cap-color  yellow     1   0.5914805
    #> 3      bruises bruises     3   0.9995546
    #> 27   cap-shape  convex     3   0.7460615
    #> 9    cap-color   brown     3   0.6746224
    #> 44      edible  edible     2   0.9995310
    #> 5      bruises      no     2   0.9713177
    #> 35 cap-surface fibrous     2   0.7355413
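
The difference between the two rankings can be reproduced with Bayes' rule. The sketch below assumes that "typical" refers to P(answer | class) and "predictive" to P(class | answer), which matches how the outputs above read but is not taken from the package source. The class weights and probabilities are made-up numbers for a single hypothetical feature, not values from the fitted model.

```r
# Hypothetical mixing weights P(class) for three classes
class_weight <- c(0.5, 0.3, 0.2)

# Hypothetical P(answer | class); rows are classes, columns are answers
typical <- rbind(
  c(yes = 0.95, no = 0.05, unknown = 0.00),
  c(yes = 0.90, no = 0.05, unknown = 0.05),
  c(yes = 0.10, no = 0.10, unknown = 0.80)
)
rownames(typical) <- paste0("class", 1:3)

# Bayes' rule: P(class | answer) is proportional to P(class) * P(answer | class)
joint <- typical * class_weight  # recycling scales row k by class_weight[k]
predictive <- sweep(joint, 2, colSums(joint), "/")

print(round(typical, 3))
print(round(predictive, 3))
```

In this toy setup, "yes" is the most typical answer for class 2 (probability 0.90) yet predicts class 2 only weakly (about 0.35), because the larger class 1 produces the same answer even more often.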

Dimensionality Reduction

    # Defining features
    def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
    print(def_feat)
    #> $features
    #> [1] "cap-color" "bruises"   "edible"
    #>
    #> $quality
    #> [1] 74.35146

    # Plotting the most important features gives an immediate impression
    # of how the clusters differ
    plot_features(def_feat$features, result$category_prob)
    #> Loading required namespace: ggplot2
    #> Loading required namespace: tidyr

Underlying Model

The package implements a variational inference algorithm to solve a Bayesian latent class model (LCM).
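
To make the model concrete, here is a minimal generative sketch of a latent class model with Dirichlet priors, written in base R. It illustrates the kind of model described, not the package's inference code: the hyperparameters alpha and beta, the sizes, and the helper rdirichlet are all assumptions chosen for illustration.

```r
set.seed(1)

n_obs      <- 100  # observations
n_latent   <- 3    # latent classes
n_features <- 4    # categorical features
n_cat      <- 3    # categories per feature (same for all, for simplicity)
alpha <- 1         # hypothetical Dirichlet concentration for class weights
beta  <- 0.5       # hypothetical Dirichlet concentration for category probs

# Illustrative helper: one Dirichlet draw via normalized gamma variates
rdirichlet <- function(conc) {
  g <- rgamma(length(conc), shape = conc)
  g / sum(g)
}

# Mixing weights over the latent classes
lambda <- rdirichlet(rep(alpha, n_latent))

# Category probabilities: U[[j]][k, ] is the distribution of feature j in class k
U <- lapply(seq_len(n_features), function(j) {
  t(replicate(n_latent, rdirichlet(rep(beta, n_cat))))
})

# Latent class of each observation, then its features given that class
z <- sample(n_latent, n_obs, replace = TRUE, prob = lambda)
X <- sapply(seq_len(n_features), function(j) {
  sapply(z, function(k) sample(n_cat, 1, prob = U[[j]][k, ]))
})

dim(X)  # n_obs rows, n_features columns
```

Inference runs in the opposite direction: given only X, the algorithm approximates the posterior over lambda, U and z, which is where the soft class probabilities come from.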