Taylor G Smith edited this page May 2, 2016 · 4 revisions

Pre-processing your input data

As with all families of machine learning algorithms, performance is a function of the quality of the input data. Clust4j includes the following Transformer classes:

  • BoxCoxTransformer
  • MeanCenterer
  • MedianCenterer
  • MinMaxScaler
  • PCA
  • RobustScaler
  • StandardScaler
  • WeightTransformer
  • YeoJohnsonTransformer

As with many other clust4j classes, the interface is written to be familiar to sklearn users. All Transformer classes follow the pattern shown in this pseudo-code:

RealMatrix X1 = some_data;
RealMatrix X2 = some_other_data;

// initialize and fit
Transformer t = new StandardScaler().fit(X1);

// transform train and test
RealMatrix train = t.transform(X1);
RealMatrix test  = t.transform(X2);

// you can also inverse transform
RealMatrix inverse_train = t.inverseTransform(train); // should equal X1
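To make the fit/transform/inverseTransform contract concrete, here is a plain-Java sketch of a StandardScaler-style transformer. The class name and internals are illustrative only (clust4j's actual implementation operates on RealMatrix and may differ); the point is that fit() learns per-column statistics from the training data, transform() applies them, and inverseTransform() reverses them.

```java
// Illustrative sketch only -- NOT clust4j's implementation.
// fit() learns each column's mean and standard deviation;
// transform() applies (x - mean) / std; inverseTransform() undoes it.
public class StandardScalerSketch {
    private double[] means;
    private double[] stds;

    public StandardScalerSketch fit(double[][] X) {
        int n = X.length, m = X[0].length;
        means = new double[m];
        stds = new double[m];
        for (double[] row : X)
            for (int j = 0; j < m; j++)
                means[j] += row[j] / n;
        for (double[] row : X)
            for (int j = 0; j < m; j++)
                stds[j] += Math.pow(row[j] - means[j], 2) / n;
        for (int j = 0; j < m; j++)
            stds[j] = Math.sqrt(stds[j]);
        return this;
    }

    public double[][] transform(double[][] X) {
        double[][] out = new double[X.length][X[0].length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < X[0].length; j++)
                out[i][j] = (X[i][j] - means[j]) / stds[j];
        return out;
    }

    public double[][] inverseTransform(double[][] X) {
        double[][] out = new double[X.length][X[0].length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < X[0].length; j++)
                out[i][j] = X[i][j] * stds[j] + means[j];
        return out;
    }
}
```

Note that the statistics are learned from the training data only, then reused on the test data — exactly why the pseudo-code above calls fit(X1) once but transform() on both X1 and X2.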

Pre-processing in practice (in conjunction with the Pipeline object)

Clust4j includes a toy DataSet of intertwining crescents for benchmarking various algorithms (see ExampleDataSets.loadToyMoons()).

Plot: X1 vs. X2 (the interlocking crescents)

Plot: X1 vs. X3 (notice that in this dimension, we can achieve linear separability!)

Head:

       X1         X2        X3   labels
 1.582023  -0.445815  0.461456        1
 0.066045   0.439207  0.480332        1
 0.736631  -0.398963  0.501694        1
-1.056928   0.242456  0.025548        0
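clust4j's generator itself is not shown here, but interleaving crescents like these are conventionally built from two offset semicircles plus Gaussian noise (sketched in 2D below; the actual toy set above carries a third dimension). Everything in this block — class name, parameters, offsets — is a hypothetical illustration, not the loadToyMoons() source.

```java
import java.util.Random;

// Hypothetical sketch of generating two interlocking crescents
// (in the style of sklearn's make_moons), not clust4j's generator.
public class MoonsSketch {
    // Returns 2*n rows of [x, y, label]: n points per crescent.
    public static double[][] makeMoons(int n, double noise, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[2 * n][3];
        for (int i = 0; i < n; i++) {
            double t = Math.PI * i / (n - 1); // sweep half a circle

            // upper crescent: points on the unit semicircle
            pts[i][0] = Math.cos(t) + noise * rng.nextGaussian();
            pts[i][1] = Math.sin(t) + noise * rng.nextGaussian();
            pts[i][2] = 0;

            // lower crescent: mirrored and shifted so the two interlock
            pts[n + i][0] = 1.0 - Math.cos(t) + noise * rng.nextGaussian();
            pts[n + i][1] = 0.5 - Math.sin(t) + noise * rng.nextGaussian();
            pts[n + i][2] = 1;
        }
        return pts;
    }
}
```

Because the crescents curl into one another, no straight line in the 2D plane separates them — which is precisely why centroid-based methods struggle on this data without help.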

Here's the setup for an example you can run:

// load the dataset
DataSet moons = ExampleDataSets.loadToyMoons();
final int[] actual_labels = moons.getLabels();

Most algorithms cannot segment the two classes without any pre-processing:

RealMatrix data = moons.getData();
KMeansParameters params = new KMeansParameters(2);

KMeans model = params.fitNewModel(data);
int[] predicted_labels = model.getLabels(); // maybe 50% accurate (depending on random state)?

However, using a WeightTransformer, we can emphasize the importance of the X3 feature over the others:

// With just a bit of preprocessing...
UnsupervisedPipeline<KMeans> pipe = new UnsupervisedPipeline<KMeans>(
    params,
    new WeightTransformer(new double[]{0.5, 0.0, 2.0})
);

predicted_labels = pipe.fit(data).getLabels();
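To see why those weights help: scaling column j by a weight w_j scales its contribution to squared Euclidean distances by w_j², so a weight of 0.0 removes X2 from the distance computation entirely, while 2.0 amplifies X3 (the separable dimension). A minimal standalone sketch of this column scaling (illustrative only; clust4j's WeightTransformer internals may differ):

```java
// Illustrative sketch of feature weighting -- not clust4j's implementation.
// Multiplies each column j of X by weights[j]: a weight of 0 drops the
// feature, a weight > 1 amplifies its influence on Euclidean distances.
public class WeightSketch {
    public static double[][] applyWeights(double[][] X, double[] weights) {
        double[][] out = new double[X.length][weights.length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < weights.length; j++)
                out[i][j] = X[i][j] * weights[j];
        return out;
    }
}
```

Inside the pipeline, the transformer is fit and applied before KMeans ever sees the data, so the model clusters in the reweighted space.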

Final thoughts

Though this is a trivial example, and your data will rarely contain a perfectly linearly-separable hyperplane, it underscores the importance of exploring your data before modeling, and of applying transformations or pre-processing techniques where appropriate to get the most out of your clustering.