# Introduction

In this post, I discuss the element-wise operations of the entire Neural Network algorithm before doing any vectorization, because I believe it is easier to focus on one thing at a time.

# 1. Architecture & Notations:

## 1.1 Architecture

A Neural Network is a set of interconnected layers of nodes/neurons. In this post, a simple 3-layer Neural Network is discussed, as shown in Fig. 1. It takes 2 feature values, outputs 2 prediction values, and its hidden layer consists of 3 neurons.

## 1.2 Activation function & Neuron

Each neuron represents a function that takes 1 value and outputs 1 value. This function is called the activation function. The following shows how the activation functions are used from layer 0 to layer 2: $\begin{array}{lcl} a^{(0)}_i & = & f^{(0)}(z^{(0)}_i) \\ a^{(1)}_j & = & f^{(1)}(z^{(1)}_j) \\ a^{(2)}_k & = & f^{(2)}(z^{(2)}_k) \end{array}$

where

• $i$, $j$ and $k$ are the neuron indices in layers 0, 1 and 2 respectively.

The neurons in the input layer basically do nothing to the input values, as $f^{(0)}(x) = x$. The use of the activation function can be generalized to more than 3 layers: $a^{(l)}_p = f^{(l)}(z^{(l)}_p) \quad (1)$

where

• $p$ is the p-th neuron in the l-th layer.
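As a concrete sketch of equation (1), each activation is applied one neuron at a time. The sigmoid below is an assumed choice of $f^{(l)}$ for illustration; the post does not fix a particular activation function:

```python
import math

def identity(x):
    # Layer-0 activation: the input layer passes values through unchanged
    return x

def sigmoid(x):
    # An assumed choice of activation for hidden/output layers
    return 1.0 / (1.0 + math.exp(-x))

# a^(l)_p = f^(l)(z^(l)_p), applied one neuron at a time
z_hidden = [0.5, -1.2, 2.0]
a_hidden = [sigmoid(z) for z in z_hidden]
```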

## 1.3 Interconnection & Weight

Each connection between 2 neurons represents a weight. The input of a neuron is the weighted sum of the activations of the neurons in the previous layer, using the weights on the connections between them. $\begin{array}{lcl} z^{(1)}_j & = & \sum_i w^{(1)}_{ij} a^{(0)}_i \\ z^{(2)}_k & = & \sum_j w^{(2)}_{jk} a^{(1)}_j \end{array}$

where

• $i$, $j$ and $k$ are the neuron indices in layers 0, 1 and 2 respectively.

This operation is called Feed-forward, as shown in Fig. 2. It can be generalized to more than 3 layers: $z^{(l)}_p = \sum_q w^{(l)}_{qp} a^{(l-1)}_q \quad (2)$

where

• $p$ is the p-th neuron in the l-th layer and $q$ is the q-th neuron in the (l-1)-th layer.
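Equation (2) is just a loop over the previous layer. A minimal sketch, assuming the weights are stored as a nested list with `weights[q][p]` holding $w^{(l)}_{qp}$:

```python
def layer_input(weights, prev_activations, p):
    # z^(l)_p = sum_q w^(l)_{qp} * a^(l-1)_q
    # weights[q][p] holds w^(l)_{qp} (an assumed storage convention)
    return sum(weights[q][p] * prev_activations[q]
               for q in range(len(prev_activations)))

# 2 input neurons -> 3 hidden neurons: weights form a 2x3 nested list
w1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]
a0 = [1.0, 2.0]
z1 = [layer_input(w1, a0, p) for p in range(3)]
```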

# 2 Feed-forward

Feed-forward is the operation that generates predictions by feeding values forward through the Neural Network, as described in 1.3. The overall Feed-forward operation is animated in Fig. 3.

By applying equations (1) & (2), the following equations can be obtained to describe the animation in Fig. 3. $\begin{array}{lcl} a^{(2)}_k & = & f^{(2)}\left(\sum_j w^{(2)}_{jk} a^{(1)}_j\right) \\ a^{(2)}_k & = & f^{(2)}\left(\sum_j w^{(2)}_{jk} f^{(1)}\left(\sum_i w^{(1)}_{ij} a^{(0)}_i\right)\right) \quad (3) \end{array}$
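Putting equations (1) & (2) together, the feed-forward of the 2-3-2 network can be sketched with plain loops (sigmoid activations and the `w[q][p]` storage convention are assumed choices for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(a0, w1, w2):
    # a^(1)_j = f^(1)( sum_i w^(1)_{ij} a^(0)_i )
    a1 = [sigmoid(sum(w1[i][j] * a0[i] for i in range(len(a0))))
          for j in range(len(w1[0]))]
    # a^(2)_k = f^(2)( sum_j w^(2)_{jk} a^(1)_j )
    a2 = [sigmoid(sum(w2[j][k] * a1[j] for j in range(len(a1))))
          for k in range(len(w2[0]))]
    return a1, a2

# Example: 2 inputs -> 3 hidden neurons -> 2 outputs
a0 = [1.0, 2.0]
w1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w2 = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
a1, a2 = feed_forward(a0, w1, w2)
```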

# 3 Error / Cost

In the supervised learning setting, the ground truth is given. The objective of supervised learning is to minimize the error/cost of the output.

# 4 Gradient Descent

## 4.1 Gradient

The gradient/slope/derivative of a function tells whether the curve is going up, going down or stationary at a given point, as shown in Fig. 4.
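The slope at a point can also be estimated numerically, which is handy for sanity checks later. A minimal sketch, assuming the simple example curve $f(x) = x^2$:

```python
def slope(f, x, h=1e-6):
    # Central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x * x  # a simple curve with its minimum at x = 0

# Left of the minimum the curve goes down (negative slope),
# right of it the curve goes up (positive slope),
# and at the minimum it is stationary (zero slope).
```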

## 4.2 Minimizing a function

As mentioned in section 3, the objective of training a Neural Network is to find the minimum point of the error/cost function. Gradient Descent is a relatively efficient way to do that.

Gradient Descent updates the value of $x$ iteratively, moving it towards lower values of $f(x)$, until it finds a point where $f'(x) = 0$.

The update direction of $x$ is always opposite to the sign of $f'$, therefore a Gradient Descent step can be described as: $x_{new} = x_{old} - \alpha f'(x_{old}) \quad (4)$

where

• $\alpha$ is the learning rate (the update step size)
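Equation (4) can be iterated directly. A minimal sketch, assuming $f(x) = (x - 3)^2$ (so $f'(x) = 2(x - 3)$, with the minimum at $x = 3$) as the function to minimize:

```python
def gradient_descent(f_prime, x, alpha=0.1, steps=200):
    # Repeat x_new = x_old - alpha * f'(x_old) until near-stationary
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x

# f(x) = (x - 3)^2 has f'(x) = 2 * (x - 3); the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x=0.0)
```

Each step multiplies the distance to the minimum by $(1 - 2\alpha)$, so with $\alpha = 0.1$ the iterates converge quickly.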

## 4.3 Minimizing the Neural Network’s Error function

Similar to equation (4) in section 4.2, the weights of the network can be updated by Gradient Descent to find the set of weights that minimizes the error. Since the error is a function of many weights rather than a single variable $x$, Gradient Descent is done by calculating partial derivatives, as shown below.

Let’s define the error of the whole model as: $E = \sum_k e_k$ where $e_k$ is the error of the k-th output neuron.

• Weights between layers 1 and 2, as shown in Fig. 6. Since only $e_k$ depends on $w^{(2)}_{jk}$, the sum in $E$ reduces to a single term: $w^{(2)'}_{jk} = w^{(2)}_{jk} - \alpha \frac{\partial E}{\partial w^{(2)}_{jk}}$ $w^{(2)'}_{jk} = w^{(2)}_{jk} - \alpha \frac{\partial e_k}{\partial w^{(2)}_{jk}}$

• Weights between layers 0 and 1, as shown in Fig. 7. Every $e_k$ depends on $w^{(1)}_{ij}$ through the hidden layer, so the sum over $k$ remains: $w^{(1)'}_{ij} = w^{(1)}_{ij} - \alpha \frac{\partial E}{\partial w^{(1)}_{ij}}$ $w^{(1)'}_{ij} = w^{(1)}_{ij} - \alpha \sum_k\frac{\partial e_k}{\partial w^{(1)}_{ij}}$

Therefore, the weight update can be generalized for more than 3 layers:

For the output layer (the L-th layer), where $p$ indexes an output neuron: $w^{(L)'}_{qp} = w^{(L)}_{qp} - \alpha \frac{\partial e_p}{\partial w^{(L)}_{qp}} \quad (5)$

For hidden layers: $w^{(l)'}_{qp} = w^{(l)}_{qp} - \alpha \sum_k \frac{\partial e_k}{\partial w^{(l)}_{qp}} \quad (6)$

where

• $p$ is the p-th neuron in the l-th layer
• $q$ is the q-th neuron in the (l-1)-th layer.
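The generalized update in equations (5) & (6) is applied to every weight independently. A sketch of the element-wise update, assuming the gradients have already been computed (how to compute them is the subject of section 5) and weights stored as a nested list with `w[q][p]` holding $w^{(l)}_{qp}$:

```python
def update_weights(w, grad, alpha):
    # w^(l)'_{qp} = w^(l)_{qp} - alpha * dE/dw^(l)_{qp}, one entry at a time
    return [[w[q][p] - alpha * grad[q][p] for p in range(len(w[0]))]
            for q in range(len(w))]

w = [[0.5, -0.2], [0.1, 0.3]]
g = [[1.0, 0.0], [0.0, -1.0]]   # assumed, precomputed gradients
w_new = update_weights(w, g, alpha=0.1)
```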

# 5 Back Propagation

## 5.1 Calculate the weight update with chain rule

Recall equations (5) & (6); the gradient update step can be calculated by the chain rule:

• Weights between layers 1 and 2: $\frac{\partial e_k}{\partial w^{(2)}_{jk}} = \frac{\partial e_k}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k} \frac{\partial z^{(2)}_k}{\partial w^{(2)}_{jk}} \quad (7)$

• Weights between layers 0 and 1: $\sum_k\frac{\partial e_k}{\partial w^{(1)}_{ij}} = \sum_k (\frac{\partial e_k}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k}) \frac{\partial z^{(2)}_k}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial z^{(1)}_j} \frac{\partial z^{(1)}_j}{\partial w^{(1)}_{ij}} \quad (8)$

• For induction purposes, let us assume there is one more layer $l = -1$. Weights between layers -1 and 0: $\sum_k\frac{\partial e_k}{\partial w^{(0)}_{mi}} = \sum_k \sum_j ((\frac{\partial e_k}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k}) \frac{\partial z^{(2)}_k}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial z^{(1)}_j}) \frac{\partial z^{(1)}_j}{\partial a^{(0)}_{i}} \frac{\partial a^{(0)}_i}{\partial z^{(0)}_{i}} \frac{\partial z^{(0)}_i}{\partial w^{(0)}_{mi}} \quad (9)$

## 5.2 Backpropagate

There are common factors shared between equations (7) and (8), and between (8) and (9); these common factors are backpropagated from layer to layer to save computation time (Dynamic Programming).
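The whole backpropagation for the 2-3-2 network can be sketched in plain Python, reusing the common factors exactly as described. The squared error $e_k = \frac{1}{2}(a^{(2)}_k - y_k)^2$ and sigmoid activations are assumed choices, since the post does not fix them:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop(a0, y, w1, w2):
    # Forward pass (equation (3)), keeping the intermediate activations
    a1 = [sigmoid(sum(w1[i][j] * a0[i] for i in range(len(a0))))
          for j in range(len(w1[0]))]
    a2 = [sigmoid(sum(w2[j][k] * a1[j] for j in range(len(a1))))
          for k in range(len(w2[0]))]

    # Common factor of equation (7): de_k/da2_k * da2_k/dz2_k,
    # assuming e_k = 0.5 * (a2_k - y_k)^2 and sigmoid (da/dz = a * (1 - a))
    delta2 = [(a2[k] - y[k]) * a2[k] * (1 - a2[k]) for k in range(len(a2))]

    # Equation (7): de_k/dw2_jk = delta2_k * a1_j
    g2 = [[delta2[k] * a1[j] for k in range(len(a2))]
          for j in range(len(a1))]

    # Equation (8): the bracketed factor delta2 is reused (backpropagated)
    delta1 = [sum(delta2[k] * w2[j][k] for k in range(len(a2)))
              * a1[j] * (1 - a1[j]) for j in range(len(a1))]
    g1 = [[delta1[j] * a0[i] for j in range(len(a1))]
          for i in range(len(a0))]
    return g1, g2

# Example: gradients for a 2-3-2 network (assumed toy values)
a0 = [1.0, -0.5]
y = [0.0, 1.0]
w1 = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]
w2 = [[0.2, -0.1], [0.3, 0.4], [-0.2, 0.1]]
g1, g2 = backprop(a0, y, w1, w2)
```

The `delta2` list is computed once and reused for both layers' gradients; that reuse is exactly the saving that backpropagation buys over evaluating equations (7)-(9) independently.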

# Next

1. Maths in a Neural Network: Element-wise
2. Maths in a Neural Network: Vectorization
3. Code a Neural Network with Numpy
4. Maths in a Neural Network: Batch Training