I ran the following example:
https://github.com/technobium/mahout-clustering/blob/master/src/main/java/com/technobium/ClusteringDemo.java#L64
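Document 1 -> John saw a red car. Document 2 -> …

As Anony-Mousse said in the first reply, the data I supplied belongs to a single cluster. After some soul searching over the last few weeks (more specifically, after trying the distance measure classes directly), I found a data set that results in multiple clusters: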
    Text id1 = new Text("Document 1");
    Text text1 = new Text("Atletico Madrid win");
    writer.append(id1, text1);

    Text id6 = new Text("Document 6");
    Text text6 = new Text("Both apple and orange are fruit");
    writer.append(id6, text6);

    Text id7 = new Text("Document 7");
    Text text7 = new Text("Both orange and apple are fruit");
    writer.append(id7, text7);
    Vector v1 = toVector("Atletico Madrid win");
    Vector v2 = toVector("Both apple and orange are fruit");
    Vector v3 = toVector("Both orange and apple are fruit");

    List<Vector> of = ImmutableList.of(v1, v2, v3);
    // mutable copy of the immutable list
    List<Vector> vectorList = new LinkedList<>();
    vectorList.addAll(of);

    List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList, new CosineDistanceMeasure(), 0.3, 0.3);
    for (Canopy canopy : canopies) {
        System.out.println("DistanceMeasureMain.main() " + canopy.asFormatString());
    }
This produces:
    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() distance is 0.5281191379648771
    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() C0: {0:1.1,117724:1.0,378550445:1.0,1997849123:1.0}
    DistanceMeasureMain.main() C1: {0:1.1,96727:1.0,96852:1.0,2076577:1.0,93029210:1.0,97711124:1.0,1008851410:1.0}
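For intuition about why the two fruit sentences end up together, here is a minimal plain-Java sketch of cosine distance on word counts. It does not use Mahout: `CosineSketch`, `termCounts` and `cosineDistance` are names I made up for illustration, and it counts raw lower-cased words rather than whatever encoding `toVector` applies, so its numbers will not match the distances printed above. It also assumes non-empty input text.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Count how often each lower-cased, whitespace-separated token occurs.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity of the two count vectors.
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return 1.0 - dot / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termCounts("Atletico Madrid win");
        Map<String, Integer> d6 = termCounts("Both apple and orange are fruit");
        Map<String, Integer> d7 = termCounts("Both orange and apple are fruit");
        System.out.println("d1 vs d6: " + cosineDistance(d1, d6));
        System.out.println("d6 vs d7: " + cosineDistance(d6, d7));
    }
}
```

Word order is ignored, so the two fruit sentences get the same counts and a distance of (numerically almost) zero, while the football sentence shares no words with them and sits at distance 1.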
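I think the t1 and t2 values (0.2 and 0.2) passed to CanopyDriver.run() are also important, although I don't know in detail what effect all the numeric parameters in the calls below have: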
    // CosineDistanceMeasure
    CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids),
            new CosineDistanceMeasure(), 0.2, 0.2, true, 1, true);

    FuzzyKMeansDriver.run(new Path(vectorsFolder),
            new Path(canopyCentroids, "clusters-0-final"),
            new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
    Document 1 -> Atletico Madrid win
    Document 6 -> Both apple and orange are fruit
    Document 7 -> Both orange and apple are fruit
    Clusters:
    0 -> wt: 1.0 distance: 0.0 vec: Document 1 = [1:1.405, 4:1.405, 6:1.405]
    1 -> wt: 1.0 distance: 0.0 vec: Document 6 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
    1 -> wt: 1.0 distance: 0.0 vec: Document 7 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
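To get a feel for what t1 and t2 do, here is a simplified sketch of the canopy idea in plain Java. It is not Mahout's implementation: `CanopySketch` and `canopies` are hypothetical names, the first point of each canopy is kept as its center (no centroid update), and the distance matrix is made up so that the two fruit documents are at distance 0 from each other and at distance 1 from the football one. A point closer than t1 to a canopy center joins that canopy. A point closer than t2 to any center is not allowed to start a new canopy.

```java
import java.util.ArrayList;
import java.util.List;

class CanopySketch {

    // dist[i][j] is the precomputed distance between points i and j.
    // Returns, for each canopy, the indices of the points it contains.
    static List<List<Integer>> canopies(double[][] dist, double t1, double t2) {
        List<Integer> centers = new ArrayList<>();
        List<List<Integer>> members = new ArrayList<>();
        for (int p = 0; p < dist.length; p++) {
            boolean stronglyBound = false;
            for (int c = 0; c < centers.size(); c++) {
                double d = dist[p][centers.get(c)];
                if (d < t1) {
                    members.get(c).add(p);   // loosely bound: joins this canopy
                }
                if (d < t2) {
                    stronglyBound = true;    // tightly bound: may not start a new canopy
                }
            }
            if (!stronglyBound) {
                // Not close enough to any existing center, so p starts its own canopy.
                centers.add(p);
                List<Integer> fresh = new ArrayList<>();
                fresh.add(p);
                members.add(fresh);
            }
        }
        return members;
    }

    public static void main(String[] args) {
        // Hypothetical distances: point 0 is unrelated to points 1 and 2,
        // which are identical to each other.
        double[][] dist = {
            {0.0, 1.0, 1.0},
            {1.0, 0.0, 0.0},
            {1.0, 0.0, 0.0}
        };
        System.out.println(canopies(dist, 0.3, 0.3)); // prints [[0], [1, 2]]
        System.out.println(canopies(dist, 1.5, 1.5)); // prints [[0, 1, 2]]
    }
}
```

With small thresholds the unrelated document gets its own canopy. Once the thresholds exceed the largest distance, everything collapses into one canopy, which is the single-cluster behaviour I was seeing with my original data.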