Project author: sandroradovanovic

Project description:
White Box Cluster Algorithm Design
Primary language: R
Project address: git://github.com/sandroradovanovic/whiboclustering.git
Created: 2018-10-29T16:18:13Z
Project community: https://github.com/sandroradovanovic/whiboclustering

License: GNU General Public License v3.0

White-Box Clustering Algorithm Design

This is a framework for representative-based, component-based clustering algorithm design. Representative-based clustering algorithms form one of the largest classes of clustering algorithms (K-Means being a typical example) and can be decomposed into RCs (Reusable Components). The identified RCs are:

  • (Optional) Normalization,
  • Initialization of representatives,
  • Measuring distance (and assigning of data points to representatives),
  • Updating representatives,
  • Cluster Evaluation.

Each RC has accumulated different implementations as representative-based clustering algorithms evolved: every improvement tried to address a major issue of the K-Means algorithm by improving one RC while leaving the others unchanged. This R package aims to collect all of those improvements in such a way that one can extend it by implementing a new solution for a specific RC and combine it with the already developed solutions for the other RCs.
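
To make the decomposition concrete, below is a minimal, self-contained R sketch of how such RCs compose into a K-Means-style loop. The function names and implementations are illustrative only and do not mirror the package's internal API.

  # Illustrative RC implementations (hypothetical names, not the package's API)
  rc_normalize <- function(x) scale(x)                                   # Normalization: Z transformation
  rc_initialize <- function(x, k) x[sample(nrow(x), k), , drop = FALSE]  # Initialization: Forgy-style
  rc_assign <- function(x, reps) {                                       # Distance + assignment: Euclidean
    d <- as.matrix(dist(rbind(reps, x)))[-seq_len(nrow(reps)), seq_len(nrow(reps))]
    max.col(-d)  # index of the nearest representative for every row of x
  }
  rc_update <- function(x, cl, k) {                                      # Update representatives: Mean
    t(sapply(seq_len(k), function(i) colMeans(x[cl == i, , drop = FALSE])))
  }
  rc_evaluate <- function(x, reps, cl) {                                 # Evaluation: Within Sum of Squares
    sum(sapply(seq_len(nrow(reps)), function(i)
      sum(sweep(x[cl == i, , drop = FALSE], 2, reps[i, ])^2)))
  }

  # Composing the RCs into a clustering run on the numeric iris columns
  # (fixed number of iterations, no empty-cluster handling; this is a sketch)
  set.seed(42)
  x <- rc_normalize(iris[, 1:4])
  reps <- rc_initialize(x, k = 3)
  for (iter in 1:10) {
    cl <- rc_assign(x, reps)
    reps <- rc_update(x, cl, k = 3)
  }
  rc_evaluate(x, reps, cl)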

Getting Started

Hopefully, this section provides all the instructions needed for running the whiboclustering R package and for developing improvements to it.

Installing

Installation of the R package should be as easy as for any other package:

  install.packages("whiboclustering")
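
If the package is not available from CRAN for your R version, it can usually be installed directly from the repository listed above; this assumes the devtools package is available:

  install.packages("devtools")
  devtools::install_github("sandroradovanovic/whiboclustering")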

After that, you can load it by running:

  library("whiboclustering")

Using the package

Once you have installed and loaded the library, you can use it for your clustering experiments.

Simple example:

  data <- iris[, 1:4] # Select only the numeric columns
  model <- whiboclustering(data = data, k = 3)

With this step you have created a WhiBo cluster model that holds information about the clustering. You can print brief information about the cluster model by running:

  print(model)
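
The exact components stored in the returned model are best checked in the package documentation; generic R inspection functions work regardless of that structure:

  names(model) # components stored in the WhiBo cluster model (if it is a named list)
  str(model)   # compact overview of the whole object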

Implementations of RCs

Here one can find the implemented solutions for each RC. Formulas and more detailed explanations will be available on the GitHub page (coming soon). If you know of any other RC solution that is available but not yet implemented, please contact us.

Normalization:

  • No normalization
  • Z transformation
  • $L_{2}$ normalization
  • $L_{1}$ normalization
  • $L_{\infty}$ normalization
  • Max-Min normalization
  • Mean normalization
  • Logarithmic normalization
  • Non-monotonic normalization
  • Comprehensive normalization
  • Decimal Scaling
  • Sigmoid transformation
  • Softmax transformation
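
As an illustration, here is what two of the normalization solutions listed above compute, sketched in base R (the package applies its selected normalization internally):

  x <- as.matrix(iris[, 1:4])
  z_norm  <- scale(x)                                                  # Z transformation: (x - mean) / sd per column
  mm_norm <- apply(x, 2, function(v) (v - min(v)) / (max(v) - min(v))) # Max-Min normalization to [0, 1]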

Cluster Initialization:

  • Random
  • Forgy
  • K-Means++
  • KKZ
  • PCA
  • AGNES
  • DIANA
  • Ward
  • Quantile
  • CCIA
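
For intuition, the usual textbook definitions of two of these initializations can be sketched in base R as follows (this does not claim to match the package's exact implementations):

  set.seed(1)
  x <- as.matrix(iris[, 1:4])
  k <- 3

  # Forgy: use k distinct data points as the initial representatives
  forgy_reps <- x[sample(nrow(x), k), , drop = FALSE]

  # Random partition: assign each point to a random cluster, then average
  part <- sample(k, nrow(x), replace = TRUE)
  random_reps <- t(sapply(seq_len(k), function(i) colMeans(x[part == i, , drop = FALSE])))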

Measuring Distance and Cluster Assignment:

  • Euclidean
  • Squared Euclidean
  • Manhattan
  • Canberra
  • Chebyshev
  • Cosine
  • Correlation
  • Sorensen
  • Kulczynski
  • Lorentzian
  • Gower
  • Intersection
  • Tanimoto
  • Inner Product
  • Fidelity
  • Bhattacharyya
  • Hellinger
  • Whittaker
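
To show how the distance RC interacts with cluster assignment, here is a base R sketch that assigns every point to its nearest representative under two of the distances above; the representative rows and function names are chosen purely for illustration:

  x    <- as.matrix(iris[, 1:4])
  reps <- x[c(1, 51, 101), ] # three arbitrary rows used as representatives

  euclidean <- function(p, q) sqrt(sum((p - q)^2))
  manhattan <- function(p, q) sum(abs(p - q))

  assign_to_nearest <- function(x, reps, dist_fun) {
    apply(x, 1, function(row) which.min(apply(reps, 1, dist_fun, q = row)))
  }

  # Compare the assignments produced by the two distance measures
  table(Euclidean = assign_to_nearest(x, reps, euclidean),
        Manhattan = assign_to_nearest(x, reps, manhattan))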

Update (Recalculate) Cluster Representatives:

  • Mean
  • Median
  • Trimmed mean
  • Geometric mean
  • Harmonic mean
  • Quadratic mean
  • Trimean
  • Midhinge
  • Midrange
  • Online Mean
  • Online Median
  • Online Trimmed mean
  • Online Geometric mean
  • Online Harmonic mean
  • Online Quadratic mean
  • Online Trimean
  • Online Midhinge
  • Online Midrange
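
The batch update solutions differ only in which location statistic is recomputed from the points currently assigned to a cluster; for example, in base R:

  members <- as.matrix(iris[1:50, 1:4]) # points assigned to one cluster (example)

  rep_mean    <- colMeans(members)                   # Mean
  rep_median  <- apply(members, 2, median)           # Median
  rep_trimmed <- apply(members, 2, mean, trim = 0.1) # Trimmed mean (10% from each tail)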

Cluster evaluation:

  • Total Sum of Squares
  • Within Sum of Squares
  • Between Sum of Squares
  • Ball-Hall index
  • Banfeld-Raftery index
  • C index
  • Calinski-Harabasz index
  • Davies-Bouldin score
  • Det ratio
  • Dunn index
  • Gamma index
  • G+ index
  • Silhouette score
  • Xie-Beni index
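
The sum-of-squares measures at the top of this list decompose as Total = Within + Between; a base R sketch, using an arbitrary assignment vector for illustration:

  x  <- as.matrix(iris[, 1:4])
  cl <- as.integer(iris$Species) # any vector of cluster assignments

  centroid <- colMeans(x)
  tss <- sum(sweep(x, 2, centroid)^2)         # Total Sum of Squares
  wss <- sum(sapply(unique(cl), function(i) { # Within Sum of Squares
    xi <- x[cl == i, , drop = FALSE]
    sum(sweep(xi, 2, colMeans(xi))^2)
  }))
  bss <- tss - wss                            # Between Sum of Squares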

Extending whiboclustering

This part will be explained in more detail on the GitHub page (coming soon), or you can contact us.

Authors

Acknowledgements

We had help from other members: