Taylor G Smith edited this page May 2, 2016 · 4 revisions

Pre-processing your input data

As with all families of machine learning algorithms, performance is a function of the quality of the input data. Clust4j includes the following Transformer classes:

  • BoxCoxTransformer
  • MeanCenterer
  • MedianCenterer
  • MinMaxScaler
  • PCA
  • RobustScaler
  • StandardScaler
  • WeightTransformer
  • YeoJohnsonTransformer

As with many other clust4j classes, the interface is written to be familiar to sklearn users. All Transformer classes follow the pattern shown in this pseudo-code:

RealMatrix X1 = some_data;
RealMatrix X2 = some_other_data;

// initialize and fit
Transformer t = new StandardScaler().fit(X1);

// transform train and test
RealMatrix train = t.transform(X1);
RealMatrix test  = t.transform(X2);

// you can also inverse transform
RealMatrix inverse_train = t.inverseTransform(train); // should equal X1
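To make the fit/transform/inverseTransform contract concrete, here is a plain-Java sketch of a StandardScaler-style transformer. The class name and internals are illustrative only (clust4j's actual implementation operates on RealMatrix and may differ); the point is that fit() learns per-column statistics from the training data, transform() applies them, and inverseTransform() reverses them.

```java
// Illustrative sketch only -- NOT clust4j's implementation.
// fit() learns each column's mean and standard deviation;
// transform() applies (x - mean) / std; inverseTransform() undoes it.
public class StandardScalerSketch {
    private double[] means;
    private double[] stds;

    public StandardScalerSketch fit(double[][] X) {
        int n = X.length, m = X[0].length;
        means = new double[m];
        stds = new double[m];
        for (double[] row : X)
            for (int j = 0; j < m; j++)
                means[j] += row[j] / n;
        for (double[] row : X)
            for (int j = 0; j < m; j++)
                stds[j] += Math.pow(row[j] - means[j], 2) / n;
        for (int j = 0; j < m; j++)
            stds[j] = Math.sqrt(stds[j]);
        return this;
    }

    public double[][] transform(double[][] X) {
        double[][] out = new double[X.length][X[0].length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < X[0].length; j++)
                out[i][j] = (X[i][j] - means[j]) / stds[j];
        return out;
    }

    public double[][] inverseTransform(double[][] X) {
        double[][] out = new double[X.length][X[0].length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < X[0].length; j++)
                out[i][j] = X[i][j] * stds[j] + means[j];
        return out;
    }
}
```

Note that the statistics are learned from the training data only, then reused on the test data — exactly why the pseudo-code above calls fit(X1) once but transform() on both X1 and X2.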

Pre-processing in practice (in conjunction with the Pipeline object)

Clust4j includes a toy DataSet of intertwining crescents for benchmarking various algorithms (see ExampleDataSets.loadToyMoons()).

Plot: X1 vs. X2 (the interlocking crescents)

Plot: X1 vs. X3 (notice that in this dimension, we can achieve linear separability!)

Head:

       X1         X2        X3   labels
 1.582023  -0.445815  0.461456        1
 0.066045   0.439207  0.480332        1
 0.736631  -0.398963  0.501694        1
-1.056928   0.242456  0.025548        0
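clust4j's generator itself is not shown here, but interleaving crescents like these are conventionally built from two offset semicircles plus Gaussian noise (sketched in 2D below; the actual toy set above carries a third dimension). Everything in this block — class name, parameters, offsets — is a hypothetical illustration, not the loadToyMoons() source.

```java
import java.util.Random;

// Hypothetical sketch of generating two interlocking crescents
// (in the style of sklearn's make_moons), not clust4j's generator.
public class MoonsSketch {
    // Returns 2*n rows of [x, y, label]: n points per crescent.
    public static double[][] makeMoons(int n, double noise, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[2 * n][3];
        for (int i = 0; i < n; i++) {
            double t = Math.PI * i / (n - 1); // sweep half a circle

            // upper crescent: points on the unit semicircle
            pts[i][0] = Math.cos(t) + noise * rng.nextGaussian();
            pts[i][1] = Math.sin(t) + noise * rng.nextGaussian();
            pts[i][2] = 0;

            // lower crescent: mirrored and shifted so the two interlock
            pts[n + i][0] = 1.0 - Math.cos(t) + noise * rng.nextGaussian();
            pts[n + i][1] = 0.5 - Math.sin(t) + noise * rng.nextGaussian();
            pts[n + i][2] = 1;
        }
        return pts;
    }
}
```

Because the crescents curl into one another, no straight line in the 2D plane separates them — which is precisely why centroid-based methods struggle on this data without help.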

Here's the setup for an example you can run:

// load the dataset
DataSet moons = ExampleDataSets.loadToyMoons();
final int[] actual_labels = moons.getLabels();

Most algorithms cannot segment the two classes without any pre-processing:

RealMatrix data = moons.getData();
KMeansParameters params = new KMeansParameters(2);

KMeans model = params.fitNewModel(data);
int[] predicted_labels = model.getLabels(); // maybe 50% accurate (depending on random state)?

However, using a WeightTransformer, we can emphasize the importance of the X3 feature over the others:

// With just a bit of preprocessing...
UnsupervisedPipeline<KMeans> pipe = new UnsupervisedPipeline<KMeans>(
    params,
    new WeightTransformer(new double[]{0.5, 0.0, 2.0})
);

predicted_labels = pipe.fit(data).getLabels();
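To see why those weights help: scaling column j by a weight w_j scales its contribution to squared Euclidean distances by w_j², so a weight of 0.0 removes X2 from the distance computation entirely, while 2.0 amplifies X3 (the separable dimension). A minimal standalone sketch of this column scaling (illustrative only; clust4j's WeightTransformer internals may differ):

```java
// Illustrative sketch of feature weighting -- not clust4j's implementation.
// Multiplies each column j of X by weights[j]: a weight of 0 drops the
// feature, a weight > 1 amplifies its influence on Euclidean distances.
public class WeightSketch {
    public static double[][] applyWeights(double[][] X, double[] weights) {
        double[][] out = new double[X.length][weights.length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < weights.length; j++)
                out[i][j] = X[i][j] * weights[j];
        return out;
    }
}
```

Inside the pipeline, the transformer is fit and applied before KMeans ever sees the data, so the model clusters in the reweighted space.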

Final thoughts

Though this is a trivial example, and your data will rarely contain a perfectly linearly-separable hyperplane, it underscores the importance of exploring your data before modeling, and of applying transformations or pre-processing techniques where appropriate to get the most out of your clustering.