Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate graph objects into hdbscan #539

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

JanRhoKa
Copy link

Adapt the HDBSCAN function for graphs.

Create a new metric called "graph", which run HDBSCAN on graphs using an adjacency matrix of the graph, in scipy.sparse.csr_matrix format.

Modified the _hdbscan_sparse_distance_matrix function for this, as no graph has to be calculated from the data. Additionally an example of the function is provided under examples\plot_hdbscan_graph.py which shows the communities calculated by HDBSCAN, the communities calculated using the precomputed metric and the time save by integrating the graph directly into the function.

Create a new option for metrics calles "graph", which takes graph objects as a csr adjacency matrix and runs the HDBSCAN function on the csgraph min_span_tree of the given graph.

Example plot file (plot_hdbscan_graph.py) shows the working of the function and displays the example plots of the graphs.
Modifed the _hdbscan_sparse_distance_matrix function to integrate the graph metric into the hdbscan code.

Started working on the test for the graph metric.
Use the test_hdbscan_sparse_distance_matrix test to create the test_hdbscan_graph function. Added the requirments for the metric to the requirements.txt (igraph and networkx).
Added commentary and proving the test.
Set requirements back to the latest HDBSCAN version.
Copy link

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see this PR opened @JanRhoKa!

I've made a few comments about code organization.

My main thinking here is that the whole idea of this PR is to allow users to weight the graphs by means other than mutual reachability of a distance matrix. I think this could be done with smaller changes to the code in a few places.

For example, the changes I suggested to the test is more "If a user passed mutual reachability weights, the result should be equivalent to passing distances and letting hdbscan compute mutual reachability weights." Right now it looks like you're testing that the result of using mutual reachability weights vs. just the distances directly is minimal.


Minor other point: I would see if you could revert the unrelated formatting changes.

hdbscan/hdbscan_.py Outdated Show resolved Hide resolved
hdbscan/hdbscan_.py Outdated Show resolved Hide resolved
hdbscan/tests/test_hdbscan.py Outdated Show resolved Hide resolved
JanRhoKa and others added 2 commits May 5, 2022 15:47
Updated the hdbscan_ function for metric="graph" and test_hdbscan by integrating the comments made by ivirshup
@lmcinnes
Copy link
Collaborator

lmcinnes commented May 9, 2022

Thanks for the updates. This looks like it is starting to come together. The test failures are due to importing networkx in the test module; but it doesn't seem to be actually used there. Perhaps you can remove the import.

It might also be useful (to help ensure the feature sees usage) to include some documentation for this. Perhaps a short tutorial on clustering a graph with hdbscan? Perhaps based on the example you already have?

Removed the networkx as nx import from the test_hdbscan file to avoid test failures, as it wasn't used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants