Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SilhouetteVisualizer add support for more estimators #1294

Merged
merged 10 commits into from
Jul 5, 2023
Merged

SilhouetteVisualizer add support for more estimators #1294

merged 10 commits into from
Jul 5, 2023

Conversation

stergion
Copy link
Contributor

@stergion stergion commented Jan 19, 2023

This PR closes #1182 .

  1. It adds support for clustering estimators that do not implement a predict() method,
  2. for estimators without attribute n_clusters infer clusters from labels,
  3. uses the estimator's metric or affinity as the silhouette metric.

I decided not to implement special handling for estimators that produce outlier values, eg DBSCAN,
as sklearn doesn't do neither in their examples.

I wasn't sure weather to use the estimator's metric attribute for the silhouette metric,
or add a parameter metric in SilhouetteVisualizer constructor. I chose the first option because it didn't
alter the class signature and seemed like the safer option. Although, I do believe the second option to be the better one.

@codecov
Copy link

codecov bot commented Jan 19, 2023

Codecov Report

Merging #1294 (e50a829) into develop (7a3c94c) will decrease coverage by 0.19%.
The diff coverage is 58.62%.

@@             Coverage Diff             @@
##           develop    #1294      +/-   ##
===========================================
- Coverage    90.89%   90.70%   -0.19%     
===========================================
  Files           93       93              
  Lines         5303     5327      +24     
===========================================
+ Hits          4820     4832      +12     
- Misses         483      495      +12     
Impacted Files Coverage Δ
yellowbrick/cluster/silhouette.py 85.22% <58.62%> (-13.22%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bbengfort
Copy link
Member

@stergion thank you so much for your interest in Yellowbrick and for opening this PR; we really appreciate all contributions to Yellowbrick. We'll find a reviewer for this PR as soon as possible so that we can include it in our next release.

Copy link
Contributor

@lwgray lwgray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR. I like how you simply added support for more estimators. The only thing left is to write some tests. Let me know if you need help with that. After that I will approve.

if check_fitted(self.estimator, is_fitted_by=self.is_fitted) and hasattr(self.estimator, "predict"):
labels = self.estimator.predict(X)
else: # if estimator is NOT fitted, OR estimator does NOT implement predict()
labels = self.estimator.fit_predict(X, y, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great 👍 way to cover fit_predict here.

yellowbrick/cluster/silhouette.py Outdated Show resolved Hide resolved
@stergion
Copy link
Contributor Author

@lwgray Thanks for the comments and the review. Sorry, I took so long to reply.

If you could help me with the tests, it would be great, since I don't have any experience.

@lwgray
Copy link
Contributor

lwgray commented Jun 25, 2023

I went back and reviewed this again and realized that all but two clustering algorithm work immediately with this fix. I am unsure why I get the ValueError with SpectralClustering. FeatureAgglomeration gives back an AttributeError and it is because it doesn't have a fit_predict method. See table below:

Works Error
<class 'sklearn.cluster._kmeans.KMeans'> Yes
<class 'sklearn.cluster._kmeans.MiniBatchKMeans'> Yes
<class 'sklearn.cluster._affinity_propagation.AffinityPropagation'> Yes
<class 'sklearn.cluster._mean_shift.MeanShift'> Yes
<class 'sklearn.cluster._spectral.SpectralClustering'> NO ValueError: Unknown metric rbf. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or 'precomputed', or a callable
<class 'sklearn.cluster._dbscan.DBSCAN'> Yes
<class 'sklearn.cluster._optics.OPTICS'> Yes
<class 'sklearn.cluster._agglomerative.AgglomerativeClustering'> Yes
<class 'sklearn.cluster._birch.Birch'> Yes
<class 'sklearn.cluster._agglomerative.FeatureAgglomeration'> NO AttributeError: - Does not have a fit_predict method

@lwgray
Copy link
Contributor

lwgray commented Jun 25, 2023

A couple of questions...

  1. What do we do about tests? Do we write individual test for each clustering algorithm?
  2. After tests have been dealt with, Can we write up issues for the two remaining clustering algorithms that aren't fixed by this PR? I suggest we merge this PR after test added.

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwgray I think it is simple enough to add a test for all the clusters using pytest parameterize -- we do that for a lot of tests, and I agree it should be part of this PR.

We do need to fix SpectralClustering since the bug was introduced in this PR. FeatureAgglomoration is a transformer not a model -- so I think we can omit that from the tests.

Comment on lines 154 to 155
elif hasattr(self.estimator, "affinity"):
metric = self.estimator.affinity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwgray this is where the error is occurring for SpectralClustering - SpectralClustering does have an attribute affinity which is used to compute the adjacency matrix between instances. For spectral clustering the attribute affinity is not a distance metric and defaults to "rbf" -- which is why that test is failing.

@stergion what model prompted you to add this metric selector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was for the AffinityPropagation, AgglomerativeClustering, FeatureAgglomeration.

Since sklearn version 1.2, affinity is deprecated in AgglomerativeClustering and FeatureAgglomeration
and metric is used instead, like the other clustering algorithms.

AffinityPropagation was not updated, it still uses affinity. Although, AffinityPropagation uses the negative
squared euclidean distance between points, when affinity='euclidean'

…d are implementing fit_predict(). Added condition to make sure Spectral Clustering metric is not being set to
@lwgray lwgray requested a review from bbengfort June 30, 2023 19:50
@lwgray
Copy link
Contributor

lwgray commented Jun 30, 2023

@stergion @bbengfort Can you review my changes?

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwgray thank you for making those changes! I made a few changes of my own to try to help with the CI tests and linting. Would you and @stergion please review those changes?

@bbengfort
Copy link
Member

@lwgray RE: the conda tests; in a separate PR we're going to have to update our Python versions for testing. The Conda 3.8 and 3.9 test failures can be ignored.

Copy link
Contributor Author

@stergion stergion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Your changes also made the code cleaner and easier to understand.

@bbengfort bbengfort merged commit f7a8e95 into DistrictDataLabs:develop Jul 5, 2023
9 of 14 checks passed
@bbengfort
Copy link
Member

@stergion thank you so much for your contribution to Yellowbrick!

@stergion stergion deleted the SilhouetteVisualizer-add-support-for-more-estimators branch July 5, 2023 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add more estimators to Silhouette Visualizer
3 participants