Reduce memory requirements in KMedoids #23
Comments
Too good an idea not to be implemented. It would be especially worthwhile for binary data clustering with …
+1, I've hit this snag and I'm having trouble thinking of a workaround (sampling from the data is a bit tough in my situation). I was hoping there was a way to continue training via fit or some such.
I vote it's worth it. Many clustering algorithms are super heavy on training for many rows anyway; it's inference we usually care about. At the least, if it works for now as a holdover, more people could use the package.
I tried:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)

# attempt 1
dists = np.vstack(list(pdc))  # vstack needs a sequence, not a bare generator

# attempt 2: chunks are not guaranteed to be equal-sized, so track the offset
dists = np.empty((x.shape[0], x.shape[0]))
start = 0
for chunk in pdc:
    stop = start + chunk.shape[0]
    dists[start:stop] = chunk
    start = stop
```

Both attempts still have to materialize the full matrix, and fail with:

```
MemoryError: Unable to allocate 43.0 GiB for an array with shape (75970, 75970) and data type float64
```

I haven't sat with the KMedoids code yet (hopefully I'll find time); maybe it doesn't need the full n*n after all.
pairwise_distances_chunked is useful where there is a "reduce operation" following the distance computation, such that the final memory required is smaller than n^2.
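As a minimal sketch of that pattern (the data and the candidate medoid indices here are hypothetical, not from this issue): a reduce_func shrinks each chunk to an n-by-k block of distances to k candidate medoids, so the full n-by-n matrix is never held in memory.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# hypothetical data and candidate medoid indices, for illustration only
rng = np.random.default_rng(0)
x = rng.random((10_000, 20))
medoid_idx = np.array([1, 42, 137])

def reduce_func(chunk, start):
    # chunk has shape (chunk_size, n); keep only the k medoid columns
    return chunk[:, medoid_idx]

# each yielded block is (chunk_size, k), so stacking needs n*k memory, not n*n
dists_to_medoids = np.vstack(list(
    pairwise_distances_chunked(x, reduce_func=reduce_func,
                               metric='cosine', working_memory=64)
))
print(dists_to_medoids.shape)  # (10000, 3)
```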
We could also implement CLARA, which aggregates PAM runs on sub-samples (see Wikipedia). It is an approximate algorithm, so there is a risk the result would not be as good, but at least it would be tractable; the algorithm trades efficiency against complexity (there is a parameter to tune this trade-off). This could easily be implemented as yet another method for KMedoids. Ref: Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). doi:10.1002/9780470316801.ch2. This algorithm is already implemented in R, MATLAB, ELKI... Another possibility is this article, which from what I understood should not decrease the efficiency of KMedoids but may save time/space; that would be more experimental as the method is much more recent, so maybe for later.
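A rough sketch of the CLARA idea, assuming this package's KMedoids as the PAM solver (the clara function and its defaults are mine, not from the book): run PAM on several random sub-samples and keep the medoids with the lowest total assignment cost on the full data.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn_extra.cluster import KMedoids

def clara(x, n_clusters, n_samplings=5, sample_size=1000, random_state=0):
    """Rough CLARA sketch: PAM on sub-samples, keep the best medoids."""
    rng = np.random.default_rng(random_state)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samplings):
        idx = rng.choice(len(x), size=min(sample_size, len(x)), replace=False)
        medoids = KMedoids(n_clusters=n_clusters).fit(x[idx]).cluster_centers_
        # total cost of assigning every point to its nearest medoid
        _, dist = pairwise_distances_argmin_min(x, medoids)
        if dist.sum() < best_cost:
            best_cost, best_medoids = dist.sum(), medoids
    return best_medoids, best_cost
```

The memory high-water mark is then roughly sample_size^2 for each PAM run plus an n-vector for the assignment step, rather than n^2.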
K-Medoids will need all distances several times. So unless you have a very cheap distance that you can afford to compute several times, you will likely be trading memory for even heavier CPU usage.
KMedoids currently pre-computes a full distance matrix with pairwise_distances, resulting in large memory usage and making it unsuitable for datasets with more than 20-50k samples. To improve the situation somewhat, the following approaches could be possible:
- use pairwise_distances_chunked
- for float32 input, make the distance matrix 32-bit as well
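On the float32 point, a small illustration of the expected saving (assuming the metric's implementation preserves the input dtype, as sklearn's euclidean path does): a float32 matrix for 75970 samples would need about 21.5 GiB instead of the 43 GiB reported above.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
x = rng.random((5_000, 20)).astype(np.float32)

# euclidean distances on float32 input come back as float32,
# halving the memory of the precomputed matrix
d = pairwise_distances(x, metric='euclidean')
print(d.dtype, d.nbytes / 2**20, "MiB")  # float32, ~95 MiB vs ~191 for float64
```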