-
Notifications
You must be signed in to change notification settings - Fork 0
/
k-means
35 lines (27 loc) · 1.33 KB
/
k-means
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
Pseudo-code:
-----------------
Select K random points from data, assign to a variable named "centroids"
def update_clusters():
for c in centroids:
for data_point in data:
Assign data_point with minimum distance(data_point, c) to clusters[c]
update_clusters()
def update_centroids():
new_centroids = []
for c in centroids:
Calculate "mean" for clustters[c]
new_centroids.append([mean, clustters[c].data_points])
return new_centroids
while(true):
new_centroids = update_centroids()
if new_centroids == centroids:
break
centroids = new_centroids:
update_clusters()
# clustered data_points available in "clusters"
-----------------
The "update_clusters" function where we calculate distance for each data point and also calculating "mean" in "update_centroids" are the main two bottlenecks for large datasets.
Implementing these loops with a Big Data stack like RDD.map parallelizes the computation.
Since the key component of kmeans is the "distance" calculation, it is sensitive to initialization. Example in the link below shows a case when bad initialization can hinder finding the optimum solution:
[LINK]
In the "Bad initialization" example for this scenario, distance function computes about the same distance for all outer data points.