hdbscan and sparse precomputed distance matrix #636
Comments
@KukumavMozolo any success with this? My (limited) experience with this is that hdbscan does not support sparse matrices for the distance matrix.
Unfortunately I gave up on it for now and just uniformly down-sampled the data :(
However, I revisited the problem and managed to precompute a sparse distance matrix semi-efficiently using numba (and numba-progress for verbosity), storing only distances smaller than a threshold. I also put an epsilon value on the diagonal of the distance matrix; I am not sure this is necessary. The example code is for data encoded in scipy sparse CSR format. One has to pass data, indices, ind_ptr directly, since csr_matrix is not well supported by numba. With the distance matrix computed, one has to set the hdbscan metric to "precomputed".
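For readers who want something concrete, here is a minimal sketch of the thresholding idea. It is not the numba code described above: it swaps the numba loop for blockwise scipy sparse products, assumes Euclidean distance, and the function name `thresholded_distances` and the parameters `eps` and `block` are made up for illustration.

```python
import numpy as np
from scipy import sparse


def thresholded_distances(X, threshold, eps=1e-12, block=1024):
    """Sparse matrix of Euclidean distances <= threshold (hypothetical helper).

    Assumption: Euclidean metric. Exact zeros (e.g. the diagonal) are stored
    as `eps` so they are not confused with "distance not stored".
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    n = X.shape[0]
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()

    rows, cols, vals = [], [], []
    for start in range(0, n, block):
        stop = min(start + block, n)
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, one block of rows at a time
        dots = (X[start:stop] @ X.T).toarray()
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * dots
        d = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative round-off
        r, c = np.nonzero(d <= threshold)
        rows.append(r + start)
        cols.append(c)
        vals.append(np.maximum(d[r, c], eps))

    return sparse.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    )
```

Each block still materialises a dense `block x n` array, so for millions of points `block` has to be small; this only illustrates the logic. The result is intended to be passed to hdbscan with metric="precomputed", for example `hdbscan.HDBSCAN(metric="precomputed").fit(D)`, once the graph is connected.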
This can produce a disconnected graph in some cases, so one needs to join the disconnected components.
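One possible way to join the components, sketched with scipy's `connected_components`. This is an assumption about how it could be done, not what was actually used in the thread: the helper name `join_components` is hypothetical, `X` is assumed to be a dense numpy array, and the bridging rule (link one arbitrary representative of every component to a representative of the first component, using their true Euclidean distance) is a crude choice.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components


def join_components(D, X):
    """Add bridge edges so the sparse distance graph is connected (hypothetical helper).

    Assumptions: D is a symmetric sparse distance matrix, X is a dense array
    of the original points, and the metric is Euclidean.
    """
    D = sparse.csr_matrix(D)
    n_comp, labels = connected_components(D, directed=False)
    if n_comp == 1:
        return D

    # One representative per component: the lowest-index member.
    reps = np.array([np.flatnonzero(labels == k)[0] for k in range(n_comp)])
    anchor, others = reps[0], reps[1:]
    dist = np.linalg.norm(X[others] - X[anchor], axis=1)

    # Insert each bridge in both directions to keep the matrix symmetric.
    anchors = np.full(len(others), anchor)
    rows = np.concatenate([others, anchors])
    cols = np.concatenate([anchors, others])
    bridges = sparse.csr_matrix(
        (np.concatenate([dist, dist]), (rows, cols)), shape=D.shape
    )
    return (D + bridges).tocsr()
```

The bridges are real distances but generally not the shortest links between components, so they may affect how the top of the cluster hierarchy looks; a nearest-pair search between components would be a more careful variant.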
Hi!
I am working on the following problem: I have millions of sparse data points that are very high-dimensional.
Using a sparse precomputed distance matrix seems to be one way to feed this data into hdbscan.
My current idea is to store only those distances that are below a certain threshold, or a fixed number of distances for every point, and then to ensure that there are no disconnected components in the resulting graph.
How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? E.g., given a fixed min_cluster_size, min_samples and cluster_selection_epsilon, how would that constrain the threshold or the number of distances per point so that the resulting clustering is no different from the one obtained with the full distance matrix?
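As a starting point for the min_samples part of the question, a small sanity check can at least flag rows that look too sparse. The reasoning, stated as an assumption: HDBSCAN's core distance is the distance to the min_samples-th nearest neighbour, so a row that stores fewer than min_samples distances cannot supply that value. This is a necessary condition only, not a guarantee that the clustering matches the full-matrix result, and whether a stored diagonal entry counts as a neighbour should be checked against the implementation. The helper name is hypothetical.

```python
import numpy as np
from scipy import sparse


def rows_with_too_few_neighbors(D, min_samples):
    """Indices of rows storing fewer than `min_samples` distances (hypothetical helper)."""
    D = sparse.csr_matrix(D)
    stored_per_row = np.diff(D.indptr)  # number of stored entries in each row
    return np.flatnonzero(stored_per_row < min_samples)
```

Any row this returns would need a larger threshold or more stored neighbours before the sparse matrix can be expected to behave like the full one for that point.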