diff --git a/README.md b/README.md
index 692ae33..b3e3096 100644
--- a/README.md
+++ b/README.md
@@ -307,18 +307,20 @@ The whole package is under construction and the documentation is progressively e
 + Mengqi Gao (China University of Geosciences, Beijing, China)
 + Chengtu Li(Trenki, Henan Polytechnic University, Beijing, China)
 + Yucheng Yan (Andy, University of Sydney, Australia)
++ Ruitao Chang (China University of Geosciences Beijing, China)
++ Panyan Weng (The University of Sydney, Australia)
 
 **Product Group**:
 
 + Yang Lyu (Daisy, Zhejiang University, China)
-+ Bailun Jiang (EPSI / Lille University, France)
-+ Ruitao Chang (China University of Geosciences Beijing, China)
-+ Panyan Weng (The University of Sydney, Australia)
 + Siqi Yao (Clara, Dongguan University of Technology, China)
 + Zhelan Lin(Lan, Fuzhou University, China)
 + ShuYi Li (Communication University Of China, Beijing, China)
 + Junbo Wang (China University Of Geosciences, Beijing, China)
 + Haibin Wang(Watson, University of Sydney, Australia)
++ Guoqiang Qiu (Elsen, Fuzhou University, China)
++ Yating Dong (Yetta, Dongguan University of Technology, China)
++ Haibin Lai (Michael, Southern University of Science and Technology, China)
 
 ## Join Us :)
@@ -398,3 +400,4 @@ More Videos will be recorded soon.
 + Zhenglin Xu (Garry, Jilin University, China)
 + Jianing Wang (National University of Singapore, Singapore)
 + Junchi Liao(Roceda, University of Electronic Science and Technology of China, China)
++ Bailun Jiang (EPSI / Lille University, France)
diff --git a/docs/source/For Developer/Add New Model To Framework.md b/docs/source/For Developer/Add New Model To Framework.md
index 5f4a2a0..010cd78 100644
--- a/docs/source/For Developer/Add New Model To Framework.md
+++ b/docs/source/For Developer/Add New Model To Framework.md
@@ -865,8 +865,46 @@ Only for those algorithms, they belong to either regression or classification an
 
 ## 5. Test Model Workflow Class
 
-After the model workflow class is added, you can test it through running the command `python start_cli_pipeline.py` on the terminal. If the test reports an error, you need to debug and fix it. If there is no error, it can be submitted.
+After the model workflow class is added, you can test it by running the command `python start_cli_pipeline.py` on the terminal.
+
+If you can successfully run the pipeline, there are three aspects to verify the correctness of your modification:
+
+(1) Check whether the output info in the console is what you expect.
+
+image
+
+(2) Check whether the produced artifacts (e.g., dataset, images) are saved properly in the `geopi_output` folder and whether their content is what you expect. You can find the `geopi_output` folder via the path printed in the console.
+
+image
+
+(3) Check whether the same artifacts (e.g., dataset, images) are saved properly in MLflow. You can use the command `mlflow ui --backend-store-uri file:/path/to/geopi_tracking --port PORT_NUMBER` to launch the web interface supported by MLflow. Copy the link `http://127.0.0.1:PORT_NUMBER` into the browser. Click the corresponding experiment and the run you created, then check the artifacts accordingly.
+
+image
+
+image
+
+image
+
+For more details on how to use MLflow, you can watch the video below:
+
+MLflow UI user guide - Geochemistry π v0.5.0 [[Bilibili]](https://b23.tv/CW5Rjmo) | [[YouTube]](https://www.youtube.com/watch?v=Yu1nzNeLfRY)
+
+If you fail to run the pipeline, you need to debug and fix it. Here is a recommended way - **breakpoint debugging**.
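+
+As a convenience, you can launch the pipeline under the debugger with a dedicated configuration. Below is a minimal `.vscode/launch.json` sketch, assuming `start_cli_pipeline.py` sits at the repository root (the configuration name is illustrative; VSCode's built-in "Python: Current File" configuration works just as well):
+
+```json
+{
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Debug Geochemistry Pi CLI",
+            "type": "python",
+            "request": "launch",
+            "program": "${workspaceFolder}/start_cli_pipeline.py",
+            "console": "integratedTerminal"
+        }
+    ]
+}
+```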
+In VSCode, open the file `start_cli_pipeline.py`, set a breakpoint next to the line you want to inspect, and click the Run and Debug button that VSCode provides.
+
+image
+
+You can search for the benefits of using **breakpoint debugging**. There are two major ones:
+
+(1) Inspect the values of variables in the current stack frame directly.
+
+image
+
+(2) Create temporary watch expressions to evaluate in the current stack frame.
+
+image
+
+After fixing the problem, don't forget to verify the produced artifacts in the three aspects listed above.
 
 ## 6. Completed Pull Request
diff --git a/docs/source/Home/Introduction.md b/docs/source/Home/Introduction.md
index 469c345..55bd1a2 100644
--- a/docs/source/Home/Introduction.md
+++ b/docs/source/Home/Introduction.md
@@ -308,18 +308,20 @@ The whole package is under construction and the documentation is progressively e
 + Mengqi Gao (China University of Geosciences, Beijing, China)
 + Chengtu Li(Trenki, Henan Polytechnic University, Beijing, China)
 + Yucheng Yan (Andy, University of Sydney, Australia)
++ Ruitao Chang (China University of Geosciences Beijing, China)
++ Panyan Weng (The University of Sydney, Australia)
 
 **Product Group**:
 
 + Yang Lyu (Daisy, Zhejiang University, China)
-+ Bailun Jiang (EPSI / Lille University, France)
-+ Ruitao Chang (China University of Geosciences Beijing, China)
-+ Junchi Liao(Roceda, University of Electronic Science and Technology of China, China)
-+ Panyan Weng (The University of Sydney, Australia)
 + Siqi Yao (Clara, Dongguan University of Technology, China)
 + Zhelan Lin(Lan, Fuzhou University, China)
 + ShuYi Li (Communication University Of China, Beijing, China)
 + Junbo Wang (China University Of Geosciences, Beijing, China)
++ Haibin Wang (Watson, University of Sydney, Australia)
++ Guoqiang Qiu (Elsen, Fuzhou University, China)
++ Yating Dong (Yetta, Dongguan University of Technology, China)
++ Haibin Lai (Michael, Southern University of Science and Technology, China)
 
 ## Join Us :)
diff --git a/docs/source/python_apis/geochemistrypi.data_mining.model.func.algo_anomalydetection.rst b/docs/source/python_apis/geochemistrypi.data_mining.model.func.algo_anomalydetection.rst
index 011e4b8..206e7d5 100644
--- a/docs/source/python_apis/geochemistrypi.data_mining.model.func.algo_anomalydetection.rst
+++ b/docs/source/python_apis/geochemistrypi.data_mining.model.func.algo_anomalydetection.rst
@@ -1,5 +1,5 @@
 geochemistrypi.data\_mining.model.func.algo\_anomalydetection package
-======================================================================
+=====================================================================
 
 Module contents
 ---------------
diff --git a/geochemistrypi/data_mining/model/_base.py b/geochemistrypi/data_mining/model/_base.py
index 113f980..38bd161 100644
--- a/geochemistrypi/data_mining/model/_base.py
+++ b/geochemistrypi/data_mining/model/_base.py
@@ -377,11 +377,10 @@ class ClusteringMetricsMixin:
     """Mixin class for clustering metrics."""
 
     @staticmethod
-    def _get_num_clusters(func_name: str, algorithm_name: str, trained_model: object, store_path: str) -> None:
-        """Get and log the number of clusters."""
-        labels = trained_model.labels_
-        num_clusters = len(np.unique(labels))
+    def _get_num_clusters(labels: pd.Series, func_name: str, algorithm_name: str, store_path: str) -> None:
+        """Get and log the number of clusters. It is only used for algorithms that do not allow setting the number of clusters in advance."""
         print(f"-----* {func_name} *-----")
+        num_clusters = len(np.unique(labels.to_numpy()))
         print(f"{func_name}: {num_clusters}")
         num_clusters_dict = {f"{func_name}": num_clusters}
         mlflow.log_metrics(num_clusters_dict)
diff --git a/geochemistrypi/data_mining/model/clustering.py b/geochemistrypi/data_mining/model/clustering.py
index 47c0e1a..b9943ec 100644
--- a/geochemistrypi/data_mining/model/clustering.py
+++ b/geochemistrypi/data_mining/model/clustering.py
@@ -29,7 +29,8 @@ class ClusteringWorkflowBase(WorkflowBase):
 
     def __init__(self):
         super().__init__()
-        self.clustering_result = None
+        self.cluster_labels = None
+        self.cluster_centers = None
         self.mode = "Clustering"
 
     def fit(self, X: pd.DataFrame, y: Optional[pd.DataFrame] = None) -> None:
@@ -43,33 +44,43 @@ def manual_hyper_parameters(cls) -> Dict:
         """Manual hyper-parameters specification."""
         return dict()
 
-    # TODO(Samson 1057266013@qq.com): This function might need to be rethought.
-    def get_cluster_centers(self) -> np.ndarray:
+    @staticmethod
+    def _get_cluster_centers(func_name: str, trained_model: object, algorithm_name: str, local_path: str, mlflow_path: str) -> Optional[pd.DataFrame]:
         """Get the cluster centers."""
-        print("-----* Clustering Centers *-----")
-        print(getattr(self.model, "cluster_centers_", "This class don not have cluster_centers_"))
-        return getattr(self.model, "cluster_centers_", "This class don not have cluster_centers_")
+        print(f"-----* {func_name} *-----")
+        cluster_centers = getattr(trained_model, "cluster_centers_", None)
+        if cluster_centers is None:
+            print("This algorithm does not provide cluster centers")
+        else:
+            column_name = []
+            for i in range(cluster_centers.shape[1]):
+                column_name.append(f"Dimension {i+1}")
+            print(cluster_centers)
 
-    def get_labels(self):
+            cluster_centers = pd.DataFrame(cluster_centers, columns=column_name)
+            save_data(cluster_centers, f"{func_name} - {algorithm_name}", local_path, mlflow_path)
+        return cluster_centers
+
+    @staticmethod
+    def _get_cluster_labels(func_name: str, trained_model: object, algorithm_name: str, local_path: str, mlflow_path: str) -> pd.DataFrame:
         """Get the cluster labels."""
-        print("-----* Clustering Labels *-----")
-        # self.X['clustering result'] = self.model.labels_
-        self.clustering_result = pd.DataFrame(self.model.labels_, columns=["clustering result"])
-        print(self.clustering_result)
-        GEOPI_OUTPUT_ARTIFACTS_DATA_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_DATA_PATH")
-        save_data(self.clustering_result, f"{self.naming} Result", GEOPI_OUTPUT_ARTIFACTS_DATA_PATH, MLFLOW_ARTIFACT_DATA_PATH)
+        print(f"-----* {func_name} *-----")
+        cluster_label = pd.DataFrame(trained_model.labels_, columns=[func_name])
+        print(cluster_label)
+        save_data(cluster_label, f"{func_name} - {algorithm_name}", local_path, mlflow_path)
+        return cluster_label
 
     @staticmethod
-    def _score(data: pd.DataFrame, labels: pd.DataFrame, func_name: str, algorithm_name: str, store_path: str) -> None:
+    def _score(data: pd.DataFrame, labels: pd.Series, func_name: str, algorithm_name: str, store_path: str) -> None:
         """Calculate the score of the model."""
         print(f"-----* {func_name} *-----")
         scores = score(data, labels)
         scores_str = json.dumps(scores, indent=4)
-        save_text(scores_str, f"{func_name}- {algorithm_name}", store_path)
+        save_text(scores_str, f"{func_name} - {algorithm_name}", store_path)
         mlflow.log_metrics(scores)
 
     @staticmethod
-    def _scatter2d(data: pd.DataFrame, labels: pd.DataFrame, cluster_centers_: np.ndarray, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
+    def _scatter2d(data: pd.DataFrame, labels: pd.Series, cluster_centers_: pd.DataFrame, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
         """Plot the two-dimensional diagram of the clustering result."""
         print("-----* Cluster Two-Dimensional Diagram *-----")
         scatter2d(data, labels, cluster_centers_, algorithm_name)
@@ -78,7 +89,7 @@ def _scatter2d(data: pd.DataFrame, labels: pd.DataFrame, cluster_centers_: np.nd
         save_data(data_with_labels, f"Cluster Two-Dimensional Diagram - {algorithm_name}", local_path, mlflow_path)
 
     @staticmethod
-    def _scatter3d(data: pd.DataFrame, labels: pd.DataFrame, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
+    def _scatter3d(data: pd.DataFrame, labels: pd.Series, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
         """Plot the three-dimensional diagram of the clustering result."""
         print("-----* Cluster Three-Dimensional Diagram *-----")
         scatter3d(data, labels, algorithm_name)
@@ -87,7 +98,7 @@ def _scatter3d(data: pd.DataFrame, labels: pd.DataFrame, algorithm_name: str, lo
         save_data(data_with_labels, f"Cluster Two-Dimensional Diagram - {algorithm_name}", local_path, mlflow_path)
 
     @staticmethod
-    def _plot_silhouette_diagram(data: pd.DataFrame, labels: pd.DataFrame, model: object, cluster_centers_: np.ndarray, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
+    def _plot_silhouette_diagram(data: pd.DataFrame, labels: pd.Series, model: object, cluster_centers_: pd.DataFrame, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
         """Plot the silhouette diagram of the clustering result."""
         print("-----* Silhouette Diagram *-----")
         plot_silhouette_diagram(data, labels, cluster_centers_, model, algorithm_name)
@@ -99,7 +110,7 @@ def _plot_silhouette_diagram(data: pd.DataFrame, labels: pd.DataFrame, model: ob
         save_data(cluster_center_data, "Silhouette Diagram - Cluster Centers", local_path, mlflow_path)
 
     @staticmethod
-    def _plot_silhouette_value_diagram(data: pd.DataFrame, labels: pd.DataFrame, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
+    def _plot_silhouette_value_diagram(data: pd.DataFrame, labels: pd.Series, algorithm_name: str, local_path: str, mlflow_path: str) -> None:
         """Plot the silhouette value diagram of the clustering result."""
         print("-----* Silhouette value Diagram *-----")
         plot_silhouette_value_diagram(data, labels, algorithm_name)
@@ -111,9 +122,24 @@ def common_components(self) -> None:
         """Invoke all common application functions for clustering algorithms."""
         GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
         GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH")
+        GEOPI_OUTPUT_ARTIFACTS_DATA_PATH = os.getenv("GEOPI_OUTPUT_ARTIFACTS_DATA_PATH")
+        self.cluster_centers = self._get_cluster_centers(
+            func_name=ClusteringCommonFunction.CLUSTER_CENTERS.value,
+            trained_model=self.model,
+            algorithm_name=self.naming,
+            local_path=GEOPI_OUTPUT_ARTIFACTS_DATA_PATH,
+            mlflow_path=MLFLOW_ARTIFACT_DATA_PATH,
+        )
+        self.cluster_labels = self._get_cluster_labels(
+            func_name=ClusteringCommonFunction.CLUSTER_LABELS.value,
+            trained_model=self.model,
+            algorithm_name=self.naming,
+            local_path=GEOPI_OUTPUT_ARTIFACTS_DATA_PATH,
+            mlflow_path=MLFLOW_ARTIFACT_DATA_PATH,
+        )
         self._score(
             data=self.X,
-            labels=self.clustering_result["clustering result"],
+            labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
             func_name=ClusteringCommonFunction.MODEL_SCORE.value,
             algorithm_name=self.naming,
             store_path=GEOPI_OUTPUT_METRICS_PATH,
@@ -123,8 +149,8 @@ def common_components(self) -> None:
             two_dimen_axis_index, two_dimen_data = self.choose_dimension_data(self.X, 2)
             self._scatter2d(
                 data=two_dimen_data,
-                labels=self.clustering_result["clustering result"],
-                cluster_centers_=self.get_cluster_centers(),
+                labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
+                cluster_centers_=self.cluster_centers,
                 algorithm_name=self.naming,
                 local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
                 mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
             )
@@ -134,7 +160,7 @@ def common_components(self) -> None:
             three_dimen_axis_index, three_dimen_data = self.choose_dimension_data(self.X, 3)
             self._scatter3d(
                 data=three_dimen_data,
-                labels=self.clustering_result["clustering result"],
+                labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
                 algorithm_name=self.naming,
                 local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
                 mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
             )
@@ -144,8 +170,8 @@ def common_components(self) -> None:
             two_dimen_axis_index, two_dimen_data = self.choose_dimension_data(self.X, 2)
             self._scatter2d(
                 data=two_dimen_data,
-                labels=self.clustering_result["clustering result"],
-                cluster_centers_=self.get_cluster_centers(),
+                labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
+                cluster_centers_=self.cluster_centers,
                 algorithm_name=self.naming,
                 local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
                 mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
             )
@@ -154,7 +180,7 @@ def common_components(self) -> None:
             # no need to choose
             self._scatter3d(
                 data=self.X,
-                labels=self.clustering_result["clustering result"],
+                labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
                 algorithm_name=self.naming,
                 local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
                 mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
             )
         elif self.X.shape[1] == 2:
             self._scatter2d(
                 data=self.X,
-                labels=self.clustering_result["clustering result"],
-                cluster_centers_=self.get_cluster_centers(),
+                labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
+                cluster_centers_=self.cluster_centers,
                 algorithm_name=self.naming,
                 local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
                 mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
             )
@@ -173,8 +199,8 @@ def common_components(self) -> None:
         self._plot_silhouette_diagram(
             data=self.X,
-            labels=self.clustering_result["clustering result"],
-            cluster_centers_=self.get_cluster_centers(),
+            labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
+            cluster_centers_=self.cluster_centers,
             model=self.model,
             algorithm_name=self.naming,
             local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
@@ -182,7 +208,7 @@ def common_components(self) -> None:
         )
         self._plot_silhouette_value_diagram(
             data=self.X,
-            labels=self.clustering_result["clustering result"],
+            labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
             algorithm_name=self.naming,
             local_path=GEOPI_OUTPUT_ARTIFACTS_IMAGE_MODEL_OUTPUT_PATH,
             mlflow_path=MLFLOW_ARTIFACT_IMAGE_MODEL_OUTPUT_PATH,
@@ -429,9 +455,9 @@ def special_components(self, **kwargs: Union[Dict, np.ndarray, int]) -> None:
         """Invoke all special application functions for this algorithm by Scikit-learn framework."""
         GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
         self._get_num_clusters(
+            labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
             func_name=MeanShiftSpecialFunction.NUM_CLUSTERS.value,
             algorithm_name=self.naming,
-            trained_model=self.model,
             store_path=GEOPI_OUTPUT_METRICS_PATH,
         )
 
@@ -767,24 +793,12 @@ def special_components(self, **kwargs: Union[Dict, np.ndarray, int]) -> None:
         """Invoke all special application functions for this algorithm by Scikit-learn framework."""
         GEOPI_OUTPUT_METRICS_PATH = os.getenv("GEOPI_OUTPUT_METRICS_PATH")
         self._get_num_clusters(
+            labels=self.cluster_labels[ClusteringCommonFunction.CLUSTER_LABELS.value],
             func_name=MeanShiftSpecialFunction.NUM_CLUSTERS.value,
             algorithm_name=self.naming,
-            trained_model=self.model,
             store_path=GEOPI_OUTPUT_METRICS_PATH,
         )
 
-    @staticmethod
-    def _get_num_clusters(func_name: str, algorithm_name: str, trained_model: object, store_path: str) -> None:
-        """Get and log the number of clusters."""
-        labels = trained_model.labels_
-        num_clusters = len(np.unique(labels))
-        print(f"-----* {func_name} *-----")
-        print(f"{func_name}: {num_clusters}")
-        num_clusters_dict = {f"{func_name}": num_clusters}
-        mlflow.log_metrics(num_clusters_dict)
-        num_clusters_str = json.dumps(num_clusters_dict, indent=4)
-        save_text(num_clusters_str, f"{func_name} - {algorithm_name}", store_path)
-
 
 class SpectralClustering(ClusteringWorkflowBase):
     name = "Spectral"
diff --git a/geochemistrypi/data_mining/model/func/algo_clustering/_common.py b/geochemistrypi/data_mining/model/func/algo_clustering/_common.py
index b9f7e44..234f376 100644
--- a/geochemistrypi/data_mining/model/func/algo_clustering/_common.py
+++ b/geochemistrypi/data_mining/model/func/algo_clustering/_common.py
@@ -11,7 +11,7 @@
 from sklearn.metrics import calinski_harabasz_score, silhouette_samples, silhouette_score
 
 
-def score(data: pd.DataFrame, labels: pd.DataFrame) -> Dict:
+def score(data: pd.DataFrame, labels: pd.Series) -> Dict:
     """Calculate the scores of the clustering model.
 
     Parameters
@@ -19,7 +19,7 @@
     data : pd.DataFrame (n_samples, n_components)
         The true values.
 
-    labels : pd.DataFrame (n_samples, n_components)
+    labels : pd.Series (n_samples,)
         Labels of each point.
 
     Returns
@@ -38,7 +38,7 @@
     return scores
 
 
-def scatter2d(data: pd.DataFrame, labels: pd.DataFrame, cluster_centers_: np.ndarray, algorithm_name: str) -> None:
+def scatter2d(data: pd.DataFrame, labels: pd.Series, cluster_centers_: pd.DataFrame, algorithm_name: str) -> None:
     """
     Draw the result-2D diagram for analysis.
 
     Parameters
@@ -47,10 +47,10 @@
     data : pd.DataFrame (n_samples, n_components)
         The features of the data.
 
-    labels : pd.DataFrame (n_samples,)
+    labels : pd.Series (n_samples,)
         Labels of each point.
 
-    cluster_centers_: np.ndarray (n_samples,)
+    cluster_centers_: pd.DataFrame (n_clusters, n_components)
         Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
     algorithm_name : str
@@ -94,12 +94,12 @@
         plt.scatter(cluster_data.iloc[:, 0], cluster_data.iloc[:, 1], c=color, marker=marker)
 
     # Plot the cluster centers
-    if not isinstance(cluster_centers_, str):
+    if cluster_centers_ is not None:
         # Draw white circles at cluster centers
-        plt.scatter(cluster_centers_[:, 0], cluster_centers_[:, 1], c="white", marker="o", alpha=1, s=200, edgecolor="k")
+        plt.scatter(cluster_centers_.iloc[:, 0], cluster_centers_.iloc[:, 1], c="white", marker="o", alpha=1, s=200, edgecolor="k")
         # Label the cluster centers
-        for i, c in enumerate(cluster_centers_):
+        for i, c in enumerate(cluster_centers_.to_numpy()):
             plt.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")
 
     plt.xlabel(f"{data.columns[0]}")
@@ -107,7 +107,7 @@
     plt.title(f"Cluster Data Bi-plot - {algorithm_name}")
 
 
-def scatter3d(data: pd.DataFrame, labels: pd.DataFrame, algorithm_name: str) -> None:
+def scatter3d(data: pd.DataFrame, labels: pd.Series, algorithm_name: str) -> None:
     """
     Draw the result-3D diagram for analysis.
 
     Parameters
@@ -116,7 +116,7 @@
     data : pd.DataFrame (n_samples, n_components)
         The features of the data.
 
-    labels : pd.DataFrame (n_samples,)
+    labels : pd.Series (n_samples,)
         Labels of each point.
 
     algorithm_name : str
@@ -177,7 +177,7 @@
     ax2.set_title(f"Cluster Data Tri-plot - {algorithm_name}")
 
 
-def plot_silhouette_diagram(data: pd.DataFrame, labels: pd.DataFrame, cluster_centers_: np.ndarray, model: object, algorithm_name: str) -> None:
+def plot_silhouette_diagram(data: pd.DataFrame, labels: pd.Series, cluster_centers_: pd.DataFrame, model: object, algorithm_name: str) -> None:
     """
     Draw the silhouette diagram for analysis.
 
     Parameters
@@ -186,10 +186,10 @@
     data : pd.DataFrame (n_samples, n_components)
         The true values.
 
-    labels : pd.DataFrame (n_samples,)
+    labels : pd.Series (n_samples,)
         Labels of each point.
 
-    cluster_centers_: np.ndarray (n_samples,)
+    cluster_centers_: pd.DataFrame (n_clusters, n_components)
         Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
     model : sklearn algorithm model
@@ -286,11 +286,11 @@
         colors = cm.nipy_spectral(labels.astype(float) / n_clusters)
         ax2.scatter(data.iloc[:, 0], data.iloc[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k")
 
-        if not isinstance(cluster_centers_, str):
+        if cluster_centers_ is not None:
             # Draw white circles at cluster centers
             ax2.scatter(
-                cluster_centers_[:, 0],
-                cluster_centers_[:, 1],
+                cluster_centers_.iloc[:, 0],
+                cluster_centers_.iloc[:, 1],
                 marker="o",
                 c="white",
                 alpha=1,
@@ -299,7 +299,7 @@
             )
             # Label the cluster centers
-            for i, c in enumerate(cluster_centers_):
+            for i, c in enumerate(cluster_centers_.to_numpy()):
                 ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")
 
         ax2.set_title("The visualization of the clustered data.")
@@ -312,7 +312,7 @@
     )
 
 
-def plot_silhouette_value_diagram(data, labels, algorithm_name: str):
+def plot_silhouette_value_diagram(data: pd.DataFrame, labels: pd.Series, algorithm_name: str) -> None:
     """Calculate the scores of the clustering model.
 
     Parameters
@@ -320,7 +320,7 @@
     data : pd.DataFrame (n_samples, n_components)
         The true values.
 
-    labels : pd.DataFrame (n_samples, n_components)
+    labels : pd.Series (n_samples,)
         Labels of each point.
 
     algorithm_name : str
diff --git a/geochemistrypi/data_mining/model/func/algo_clustering/_meanshift.py b/geochemistrypi/data_mining/model/func/algo_clustering/_meanshift.py
index 9ed374c..6813eaf 100644
--- a/geochemistrypi/data_mining/model/func/algo_clustering/_meanshift.py
+++ b/geochemistrypi/data_mining/model/func/algo_clustering/_meanshift.py
@@ -17,7 +17,7 @@ def meanshift_manual_hyper_parameters() -> Dict:
     print("Bandwidth: The bandwidth of the kernel used in the algorithm. This parameter can greatly influence the results.")
     print("If you do not have a specific value in mind, you can leave this as 0, and the algorithm will estimate it automatically.")
     print("A good starting point could be around 0.5 to 1.5, depending on your data's scale.")
-    bandwidth_input = num_input(SECTION[2], "Enter Bandwidth (or 0 for automatic estimation): ")
+    bandwidth_input = num_input(SECTION[2], "Bandwidth: ")
     bandwidth = None if bandwidth_input == 0 else bandwidth_input
 
     print("Cluster All: By default, only points at least as close to a cluster center as the given bandwidth are assigned to that cluster.")
@@ -31,16 +31,16 @@
 
     print("Min Bin Frequency: To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds.")
     print("A typical value is 1, but you might increase this for very large datasets to reduce the number of seeds.")
-    min_bin_freq = num_input(SECTION[2], "Enter Min Bin Frequency (default is 1): ")
+    min_bin_freq = num_input(SECTION[2], "Min Bin Frequency: ")
 
     print("Number of Jobs: The number of jobs to use for the computation. 1 means using all processors.")
     print("If you are unsure, use 1 to utilize all available processors.")
-    n_jobs = num_input(SECTION[2], "Enter Number of Jobs (or None): ")
+    n_jobs = num_input(SECTION[2], "Number of Jobs: ")
     n_jobs = -1 if n_jobs == 1 else int(n_jobs)
 
     print("Max Iterations: Maximum number of iterations, per seed point before the clustering operation terminates (for that seed point), if has not converged yet.")
     print("The default value is 300, which is sufficient for most use cases. You might increase this for very complex data.")
-    max_iter = num_input(SECTION[2], "Enter Max Iterations (default is 300): ")
+    max_iter = num_input(SECTION[2], "Max Iterations: ")
 
     hyper_parameters = {
         "bandwidth": bandwidth,
diff --git a/geochemistrypi/data_mining/process/cluster.py b/geochemistrypi/data_mining/process/cluster.py
index e650c46..1f1663a 100644
--- a/geochemistrypi/data_mining/process/cluster.py
+++ b/geochemistrypi/data_mining/process/cluster.py
@@ -79,9 +79,6 @@ def activate(
 
         # Use Scikit-learn style API to process input data
         self.clt_workflow.fit(X)
-        # TODO: Move this into common_components()
-        self.clt_workflow.get_cluster_centers()
-        self.clt_workflow.get_labels()
 
         # Save the model hyper-parameters
         self.clt_workflow.save_hyper_parameters(hyper_parameters, self.model_name, os.getenv("GEOPI_OUTPUT_PARAMETERS_PATH"))
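For reviewers, here is a small, self-contained sketch (not part of the patch; the toy data, column names, and `KMeans` are illustrative) of the data-shape conventions this refactor introduces: labels travel as a `pd.Series` selected from a one-column `pd.DataFrame`, and cluster centers travel as a `pd.DataFrame` with generated `Dimension i` column names.

```python
# Illustrative only: mirrors the conventions introduced by this patch.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame(np.random.rand(20, 2), columns=["Feature 1", "Feature 2"])
model = KMeans(n_clusters=3, n_init=10).fit(data)

# Labels are stored as a one-column DataFrame; selecting the column yields a pd.Series,
# which is what _score, _scatter2d, _scatter3d, and the silhouette plots now expect.
cluster_labels = pd.DataFrame(model.labels_, columns=["Cluster Labels"])
labels = cluster_labels["Cluster Labels"]

# Centers are wrapped in a DataFrame with "Dimension i" column names,
# as in ClusteringWorkflowBase._get_cluster_centers.
cluster_centers = pd.DataFrame(
    model.cluster_centers_,
    columns=[f"Dimension {i+1}" for i in range(model.cluster_centers_.shape[1])],
)

# Number of clusters, computed from the labels as in ClusteringMetricsMixin._get_num_clusters.
num_clusters = len(np.unique(labels.to_numpy()))
print(num_clusters)  # 3
```

Passing labels as a `pd.Series` rather than a `pd.DataFrame`, and centers as a `pd.DataFrame` (or `None` for algorithms without centers) rather than an `np.ndarray`-or-string, lets the plotting helpers use a single `is not None` check and `.iloc` indexing throughout.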