- To implement the activation functions `relu`, `identity`, `tanh` and `sigmoid`, we simply use `numpy` to implement the respective mathematical formulae, as sketched below.
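A minimal sketch of what these could look like, assuming plain `numpy` arrays as inputs:

```python
import numpy as np

def identity(z):
    # Identity activation: returns the input unchanged.
    return z

def relu(z):
    # ReLU: zeroes out negative values, keeps positive values as-is.
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid: squashes values into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Tanh: squashes values into the (-1, 1) range.
    return np.tanh(z)
```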
- The `cross_entropy` function which calculates the loss is given by `-sum(yi * log(pi))`, for all `yi ∈ Y` (true values) and `pi ∈ P` (predicted values). In a 2D space this result can be calculated by `-np.sum(np.sum(y * np.log(p), axis=1)) / y.shape[0]`. The true values are encoded using `one_hot_encoding`.
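A hedged sketch of the loss and encoding helpers; the exact signatures here are assumptions, but the loss follows the formula above:

```python
import numpy as np

def one_hot_encoding(labels, n_classes):
    # Turn integer class labels into rows of 0s and 1s,
    # e.g. label 2 with 4 classes -> [0, 0, 1, 0].
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1
    return encoded

def cross_entropy(y, p):
    # Mean of -sum(yi * log(pi)) over all samples; equivalent to
    # -np.sum(np.sum(y * np.log(p), axis=1)) / y.shape[0].
    return -np.sum(y * np.log(p)) / y.shape[0]
```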
- Given our training input `X`, we feed it to the hidden layer and calculate its weighted sum `Z1 = (X.wh) + bh`, where `X.wh` is the dot product of the input and the weights vector `wh` for the hidden layer, and `bh` is the bias vector. The output `Z1` is fed to the activation function to give us `G1` (see the sketch below).
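A sketch of this hidden-layer step with toy shapes (the sizes are assumptions, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 4 samples, 3 input features, 5 hidden units.
X = rng.normal(size=(4, 3))
wh = rng.normal(size=(3, 5))   # hidden layer weights
bh = np.zeros(5)               # hidden layer bias

# Weighted sum of the hidden layer: Z1 = (X . wh) + bh
Z1 = np.dot(X, wh) + bh
# Feed Z1 through the hidden activation (relu here) to get G1.
G1 = np.maximum(0, Z1)
```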
- Subsequently, `G1` is fed to the output layer to calculate `Z2 = (G1.wo) + bo`, where `G1.wo` is the dot product of the hidden layer output and the weights vector `wo` for the output layer, and `bo` is the bias vector. `Z2` then goes through the output activation function (which in this case is `softmax`) to give us `G2`. `G2` is the output of our MLP.
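Continuing the toy example above, a sketch of the output-layer step with an assumed `softmax` helper (2 output classes assumed):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax: each row becomes a probability distribution.
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # shift for numerical stability
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

wo = np.random.default_rng(1).normal(size=(5, 2))  # output layer weights
bo = np.zeros(2)                                   # output layer bias

# Weighted sum of the output layer: Z2 = (G1 . wo) + bo
Z2 = np.dot(G1, wo) + bo
# G2 holds the predicted class probabilities for each sample.
G2 = softmax(Z2)
```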
- Naturally, the MLP has to learn to predict accurate classes. It does so by minimizing its errors in the backpropagation phase, which happens immediately after the forward propagation phase.
- In very simple terms, we propagate the error from the last layer to the input layer by 'reversing' the order of operations.
- To start with, the error is calculated by a simple matrix subtraction `EO = G2 - y`. Then, we calculate the gradient of the weighted sum `Z2` by passing it through the derivative of the `softmax` function, `SO = softmax_derivative(Z2)`. We then calculate the difference (delta) `DO = SO * EO` (element-wise matrix multiplication). `DO` is back-propagated to calculate the error in the hidden layer `EH = DO.woT`, where `woT` is the transpose of the output layer weights vector `wo`. The gradient of the hidden layer is given by `SH = hidden_activation_derivative(Z1)`, and the delta for the hidden layer is `DH = SH * EH`. These steps are sketched below.
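Continuing the same toy example (reusing `softmax`, `Z1`, `Z2`, `G1`, `G2` and `wo` from the sketches above), the delta calculations might look like this; the derivative helpers are assumptions that mirror the activations used earlier:

```python
import numpy as np

def softmax_derivative(z):
    # Simplified element-wise derivative s * (1 - s), as commonly used
    # in from-scratch implementations.
    s = softmax(z)
    return s * (1 - s)

def relu_derivative(z):
    # 1 where the input to relu was positive, 0 elsewhere.
    return (z > 0).astype(float)

# Toy one-hot targets for the 4 samples and 2 classes used above (assumed).
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

EO = G2 - y                     # error at the output layer
SO = softmax_derivative(Z2)     # gradient of the output weighted sum
DO = SO * EO                    # output delta (element-wise product)
EH = np.dot(DO, wo.T)           # error propagated back to the hidden layer
SH = relu_derivative(Z1)        # gradient of the hidden weighted sum
DH = SH * EH                    # hidden delta
```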
- After doing these calculations, we have the necessary ingredients to update the weights and biases, which are the essential steps that allow the MLP to make better predictions.
- The updates are given by `wh = wh - (XT.DH) * LR`, `wo = wo - (G1T.DO) * LR`, `bh = bh - sum(DH) * LR` and `bo = bo - sum(DO) * LR`, where `XT` is the transpose of the input, `G1T` is the transpose of `G1` and `LR` is the learning rate. The learning rate essentially controls how large a step each update takes.
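Under the same toy assumptions, the gradient-descent updates could be written as:

```python
LR = 0.01  # learning rate (value assumed for illustration)

# Nudge the parameters against their gradients, scaled by the learning rate.
wh -= np.dot(X.T, DH) * LR       # wh update: XT . DH
wo -= np.dot(G1.T, DO) * LR      # wo update: G1T . DO
bh -= np.sum(DH, axis=0) * LR    # bh update: column-wise sum of DH
bo -= np.sum(DO, axis=0) * LR    # bo update: column-wise sum of DO
```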
- The `fit` method is responsible for running the forward and back propagation phases for a number of iterations. After every 20 iterations, the loss computed by `cross_entropy` should have decreased, signifying that the MLP is improving its predictions. A sketch of such a loop follows.
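A hedged sketch of what such a training loop might look like; the `forward` and `backward` method names here are assumptions, not the actual implementation:

```python
def fit(self, X, y, iterations=200, lr=0.01):
    # Alternate forward and back propagation, logging the loss periodically.
    for i in range(iterations):
        G2 = self.forward(X)       # forward propagation (assumed helper)
        self.backward(X, y, lr)    # back propagation + weight updates (assumed helper)
        if i % 20 == 0:
            print(f"iteration {i}: loss = {cross_entropy(y, G2):.4f}")
```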
- Once our MLP has finished training via the `fit` method, the new testing data is propagated forward to give us the predictions (which are probabilities of the class labels). To calculate the class labels, we simply take the `argmax` of these probabilities: `np.array([np.argmax(x) for x in self.mlp_output])`.
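For completeness, a sketch of this prediction step under the same assumptions (the `forward` helper and `mlp_output` attribute names are assumed):

```python
def predict(self, X_test):
    # Forward-propagate the unseen samples to get class probabilities...
    self.mlp_output = self.forward(X_test)
    # ...then take the index of the largest probability as the class label.
    return np.array([np.argmax(x) for x in self.mlp_output])
```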