项目作者: cmtt

项目描述 :
k-means clustering algorithm with k-means++ initialization.
高级语言: JavaScript
项目地址: git://github.com/cmtt/kmpp.git
创建时间: 2012-08-15T18:45:03Z
项目社区:https://github.com/cmtt/kmpp

开源协议:

下载


kmpp

Travis CI

When dealing with lots of data points, clustering algorithms may be used to group them. The k-means algorithm partitions n data points into k clusters and finds the centroids of these clusters incrementally.

The algorithm assigns data points to the closest cluster, and the centroids of each cluster are re-calculated. These steps are repeated until the centroids do not changing anymore.

The basic k-means algorithm is initialized with k centroids at random positions. This implementation addresses some disadvantages of the arbitrary initialization method with the k-means++ algorithm (see “Further reading” at the end).

Installation

Installing via npm

Install kmpp as Node.js module via NPM:

  1. $ npm install kmpp

Example

  1. var kmpp = require('kmpp');
  2. kmpp([
  3. [x1, y1, ...],
  4. [x2, y2, ...],
  5. [x3, y3, ...],
  6. ...
  7. ], {
  8. k: 4
  9. });
  10. // =>
  11. // { converged: true,
  12. // centroids: [[xm1, ym1, ...], [xm2, ym2, ...], [xm3, ym3, ...]],
  13. // counts: [ 7, 6, 7 ],
  14. // assignments: [ 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 ]
  15. // }

API

kmpp(points[, opts)

Exectes the k-means++ algorithm on points.

Arguments:

  • points (Array): An array-of-arrays containing the points in format [[x1, y1, ...], [x2, y2, ...], [x3, y3, ...], ...]
  • opts: object containing configuration parameters. Parameters are
    • distance (function): Optional function that takes two points and returns the distance between them.
    • initialize (Boolean): Perform initialization. If false, uses the initial state provided in centroids and assignments. Otherwise discards any initial state and performs initialization.
    • k (Number): number of centroids. If not provided, sqrt(n / 2) is used, where n is the number of points.
    • kmpp (Boolean, default: true): If true, uses k-means++ initialization. Otherwise uses naive random assignment.
    • maxIterations (Number, default: 100): Maximum allowed number of iterations.
    • norm (Number, default: 2): L-norm used for distance computation. 1 is Manhattan norm, 2 is Euclidean norm. Ignored if distance function is provided.
    • centroids (Array): An array of centroids. If initialize is false, used as initialization for the algorithm, otherwise overwritten in-place if of the correct size.
    • assignments (Array): An array of assignments. Used for initialization, otherwise overwritten.
    • counts (Array): An output array used to avoid extra allocation. Values are discarded and overwritten.

Returns an object containing information about the centroids and point assignments. Values are:

  • converged: true if the algorithm converged successfully
  • centroids: a list of centroids
  • counts: the number of points assigned to each respective centroid
  • assignments: a list of integer assignments of each point to the respective centroid
  • iterations: number of iterations used

Credits

  • Jared Harkins improved the performance by
    reducing the amount of function calls, reverting to Manhattan distance
    for measurements and improved the random initialization by choosing from
    points

  • Ricky Reusser refactored API

Further reading

License

© 2017-2019. MIT License.