-
Notifications
You must be signed in to change notification settings - Fork 504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate graph objects into hdbscan #539
base: master
Are you sure you want to change the base?
Integrate graph objects into hdbscan #539
Conversation
Create a new option for metrics calles "graph", which takes graph objects as a csr adjacency matrix and runs the HDBSCAN function on the csgraph min_span_tree of the given graph. Example plot file (plot_hdbscan_graph.py) shows the working of the function and displays the example plots of the graphs.
Modifed the _hdbscan_sparse_distance_matrix function to integrate the graph metric into the hdbscan code. Started working on the test for the graph metric.
Use the test_hdbscan_sparse_distance_matrix test to create the test_hdbscan_graph function. Added the requirments for the metric to the requirements.txt (igraph and networkx).
Added commentary and proving the test.
Set requirements back to the latest HDBSCAN version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see this PR opened @JanRhoKa!
I've made a few comments about code organization.
My main thinking here is that the whole idea of this PR is to allow users to weight the graphs by means other than mutual reachability of a distance matrix. I think this could be done with smaller changes to the code in a few places.
For example, the changes I suggested to the test is more "If a user passed mutual reachability weights, the result should be equivalent to passing distances and letting hdbscan compute mutual reachability weights." Right now it looks like you're testing that the result of using mutual reachability weights vs. just the distances directly is minimal.
Minor other point: I would see if you could revert the unrelated formatting changes.
Co-authored-by: Isaac Virshup <[email protected]>
Updated the hdbscan_ function for metric="graph" and test_hdbscan by integrating the comments made by ivirshup
Thanks for the updates. This looks like it is starting to come together. The test failures are due to importing networkx in the test module; but it doesn't seem to be actually used there. Perhaps you can remove the import. It might also be useful (to help ensure the feature sees usage) to include some documentation for this. Perhaps a short tutorial on clustering a graph with hdbscan? Perhaps based on the example you already have? |
Removed the networkx as nx import from the test_hdbscan file to avoid test failures, as it wasn't used.
Adapt the HDBSCAN function for graphs.
Create a new metric called
"graph"
, which run HDBSCAN on graphs using an adjacency matrix of the graph, inscipy.sparse.csr_matrix
format.Modified the
_hdbscan_sparse_distance_matrix
function for this, as no graph has to be calculated from the data. Additionally an example of the function is provided underexamples\plot_hdbscan_graph.py
which shows the communities calculated by HDBSCAN, the communities calculated using the precomputed metric and the time save by integrating the graph directly into the function.