# Introduction

Aren’t there too many notations in a Neural Network? Many tutorials discuss the element-wise and vector operations at the same time. When I was reading them, I just couldn’t remember all the notations for the scalars and vectors in the Neural Network.

In this post, I discuss the element-wise operations of the entire Neural Network algorithm first, before doing any vectorization, because I believe it is easier to focus on one thing at a time.

# 1 Architecture & Notations

## 1.1 Architecture

A Neural Network consists of interconnected layers of nodes/neurons. In this post, a simple 3-layer Neural Network is discussed, as shown in Fig. 1. It takes 2 feature values and outputs 2 prediction values, and its hidden layer consists of 3 neurons.

## 1.2 Activation function & Neuron

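Each neuron represents a function that takes 1 value and outputs 1 value. This function is called the activation function. The following is how the activation functions are used from layer 0 to layer 2:

$$a^{(0)}_i = f(z^{(0)}_i), \qquad a^{(1)}_j = f(z^{(1)}_j), \qquad a^{(2)}_k = f(z^{(2)}_k)$$

where

- $i$, $j$ and $k$ are the neuron indices in layer 0, 1 and 2 respectively,
- $z$ is the input of a neuron, $a$ is its output and $f$ is its activation function.

The neurons in the input layer do nothing to the input values, i.e. $a^{(0)}_i = z^{(0)}_i$. The use of the activation function can be generalized for more than 3 layers:

$$a^{(l)}_p = f(z^{(l)}_p) \tag{1}$$

where

- $p$ indexes the neurons in the $l$-th layer, so $z^{(l)}_p$ and $a^{(l)}_p$ belong to the $p$-th neuron in the $l$-th layer.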
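As a small illustration of "one value in, one value out", here is a sketch in plain Python. The post does not fix a particular activation function, so the sigmoid used for layers 1 and 2 is an assumption, and the names `sigmoid`, `identity` and `neuron_output` are made up for this example.

```python
import math


def sigmoid(z):
    """Sigmoid activation (an assumed choice; the post does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Input-layer neurons pass their value through unchanged."""
    return z


# One activation per layer: layer 0 is the identity,
# layers 1 and 2 use the (assumed) sigmoid.
activations = [identity, sigmoid, sigmoid]


def neuron_output(layer, z):
    """Output of any neuron in `layer`, given that neuron's input z."""
    return activations[layer](z)


print(neuron_output(0, 0.7))  # input layer: the value is unchanged
print(neuron_output(1, 0.0))  # hidden layer: sigmoid of the input
```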

## 1.3 Interconnection & Weight

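Each connection between 2 neurons carries a weight. The input of a neuron is the sum of the outputs of the neurons in the previous layer, each multiplied by the weight of its connection:

$$z^{(1)}_j = \sum_i w^{(1)}_{ij}\, a^{(0)}_i, \qquad z^{(2)}_k = \sum_j w^{(2)}_{jk}\, a^{(1)}_j$$

where

- $i$, $j$ and $k$ are the neuron indices in layer 0, 1 and 2 respectively,
- $w^{(1)}_{ij}$ is the weight between neuron $i$ in layer 0 and neuron $j$ in layer 1, and $w^{(2)}_{jk}$ is the weight between neuron $j$ in layer 1 and neuron $k$ in layer 2,
- $a$ is the output of a neuron and $z$ is the input of a neuron.

This operation is called Feed-forward, as shown in Fig. 2. It can be generalized for more than 3 layers:

$$z^{(l)}_p = \sum_q w^{(l)}_{qp}\, a^{(l-1)}_q \tag{2}$$

where

- $p$ indexes the neurons in the $l$-th layer, $q$ indexes the neurons in the $(l-1)$-th layer, and $w^{(l)}_{qp}$ is the weight between them.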
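The weighted sum that forms a neuron's input can be sketched for a single neuron as follows. The function name `neuron_input` and the numbers are illustrative only.

```python
def neuron_input(prev_outputs, weights_into_neuron):
    """Input of one neuron: each previous-layer output times the weight
    of its connection to this neuron, summed over the previous layer."""
    return sum(a * w for a, w in zip(prev_outputs, weights_into_neuron))


# Two neurons in the previous layer feeding one neuron in the next layer.
z = neuron_input([1.0, 2.0], [0.5, -0.25])
print(z)
```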

# 2 Feed-forward

Feed-forward is the operation that generates a prediction by feeding values forward through the Neural Network, as described in 1.3. The animated overall Feed-forward operation is shown in Fig. 3.

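By applying equations (1) & (2), the following equations can be obtained to describe the animation in Fig. 3:

$$
\begin{aligned}
a^{(0)}_i &= z^{(0)}_i && i = 1, 2 \\
z^{(1)}_j &= \sum_{i=1}^{2} w^{(1)}_{ij}\, a^{(0)}_i, \quad a^{(1)}_j = f(z^{(1)}_j) && j = 1, 2, 3 \\
z^{(2)}_k &= \sum_{j=1}^{3} w^{(2)}_{jk}\, a^{(1)}_j, \quad a^{(2)}_k = f(z^{(2)}_k) && k = 1, 2
\end{aligned}
$$

Here $z$ is the input of a neuron, $a$ is its output, $w$ is the weight of a connection and $f$ is the activation function.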
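The whole Feed-forward pass of the 2-3-2 network can be written element-wise in a few lines. This is only a sketch: the sigmoid activation, the weight layout (`w1[i][j]` from layer 0 to 1, `w2[j][k]` from layer 1 to 2) and the example weights are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Assumed activation for layers 1 and 2."""
    return 1.0 / (1.0 + math.exp(-z))


def feed_forward(x, w1, w2):
    """Element-wise forward pass through a 3-layer network.

    x        : feature values fed into layer 0
    w1[i][j] : weight from neuron i (layer 0) to neuron j (layer 1)
    w2[j][k] : weight from neuron j (layer 1) to neuron k (layer 2)
    """
    a0 = list(x)  # input layer: output equals input
    z1 = [sum(a0[i] * w1[i][j] for i in range(len(a0)))
          for j in range(len(w1[0]))]
    a1 = [sigmoid(z) for z in z1]
    z2 = [sum(a1[j] * w2[j][k] for j in range(len(a1)))
          for k in range(len(w2[0]))]
    a2 = [sigmoid(z) for z in z2]
    return a0, z1, a1, z2, a2


# Arbitrary example weights for 2 inputs, 3 hidden neurons, 2 outputs.
w1 = [[0.1, -0.2, 0.3],
      [0.4, 0.1, -0.3]]
w2 = [[0.2, -0.1],
      [-0.3, 0.4],
      [0.1, 0.2]]
prediction = feed_forward([0.5, -1.0], w1, w2)[-1]
print(prediction)
```

Returning the intermediate inputs and outputs is a deliberate choice here, since the later weight-update steps need them.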

# 3 Error / Cost

In the supervised learning setting, the ground truth is given. The objective of supervised learning is to minimize the error/cost between the network’s output and the ground truth.
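To make "error" concrete, the sketch below uses the halved sum of squared differences between outputs and ground truth. This particular error function is an assumption chosen for illustration; the post's own choice may differ. The name `squared_error` is made up for this example.

```python
def squared_error(outputs, targets):
    """Half the sum of squared differences between the network outputs
    and the ground truth (an assumed error function)."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(outputs, targets))


# Two outputs compared against a ground truth of [1.0, 0.0].
error = squared_error([0.5, 0.5], [1.0, 0.0])
print(error)
```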

# 4 Gradient Descent

## 4.1 Gradient

The gradient/slope/derivative of a function tells whether its curve is going up (positive), going down (negative) or stationary (zero) at a given point, as shown in Fig. 4.
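The sign of the slope can be checked numerically with a central difference. The parabola used here is just an example function, and `slope` is a hypothetical helper, not part of any library.

```python
def slope(func, x, h=1e-6):
    """Approximate the derivative of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


def parabola(x):
    return x * x


going_down = slope(parabola, -1.0)  # negative: the curve goes down here
stationary = slope(parabola, 0.0)   # zero: the bottom of the curve
going_up = slope(parabola, 1.0)     # positive: the curve goes up here
print(going_down, stationary, going_up)
```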

## 4.2 Minimizing a function

As mentioned in section 3, the objective of training a Neural Network is to find the minimum point of the error/cost function. Gradient Descent is one relatively efficient way to do that.

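For a function $g(x)$, Gradient Descent iteratively updates the value of $x$ to move down the slope until it finds a point where the gradient is zero, $\frac{dg}{dx} = 0$.

The update direction of $x$ is always opposite to the sign of $\frac{dg}{dx}$, therefore a Gradient Descent step can be described as:

$$x \leftarrow x - \eta \frac{dg}{dx} \tag{4}$$

where

- $\eta$ is the learning rate (the update step size)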
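A Gradient Descent loop for a one-variable function can be sketched as follows. The function $(x - 3)^2$, the starting point and the learning rate are arbitrary choices for illustration.

```python
def derivative(x):
    """Derivative of the example function (x - 3)^2."""
    return 2.0 * (x - 3.0)


learning_rate = 0.1  # the update step size
x = 0.0              # arbitrary starting point

for _ in range(100):
    # Step against the sign of the gradient.
    x = x - learning_rate * derivative(x)

print(x)  # approaches 3, where the gradient is zero
```

With this learning rate the distance to the minimum shrinks by a constant factor each step. A learning rate that is too large can overshoot the minimum instead.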

## 4.3 Minimizing the Neural Network’s Error function

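Similar to equation (4) in section 4.2, the weights of the network can be updated by Gradient Descent to find the set of weights that minimizes the error. Since the error depends on many weights rather than on a single variable, Gradient Descent is done by calculating the partial derivative of the error with respect to each weight, as shown below.

- Weights between layer 1 and 2, as shown in Fig. 6:

$$w^{(2)}_{jk} \leftarrow w^{(2)}_{jk} - \eta \frac{\partial E}{\partial w^{(2)}_{jk}} \tag{5}$$

- Weights between layer 0 and 1, as shown in Fig. 7:

$$w^{(1)}_{ij} \leftarrow w^{(1)}_{ij} - \eta \frac{\partial E}{\partial w^{(1)}_{ij}} \tag{6}$$

Therefore, the weight update can be generalized for more than 3 layers:

For output layer (L-th layer):

$$w^{(L)}_{qp} \leftarrow w^{(L)}_{qp} - \eta \frac{\partial E}{\partial w^{(L)}_{qp}}$$

For hidden layers:

$$w^{(l)}_{qp} \leftarrow w^{(l)}_{qp} - \eta \frac{\partial E}{\partial w^{(l)}_{qp}}, \qquad l < L$$

where

- $E$ is the error/cost and $\eta$ is the learning rate
- $p$ indexes the neurons in the $l$-th layer
- $q$ indexes the neurons in the $(l-1)$-th layer, and $w^{(l)}_{qp}$ is the weight between neuron $q$ and neuron $p$.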
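The idea of updating every weight with its own partial derivative can be sketched without any network at all. Below, the partial derivatives are estimated numerically, and `toy_error` is a made-up two-weight error function standing in for the network's error. Both are assumptions for illustration, not the method used later in the post.

```python
def partial_derivative(error_fn, weights, index, h=1e-6):
    """Estimate dE/dw for one weight while holding the others fixed."""
    up = list(weights)
    down = list(weights)
    up[index] += h
    down[index] -= h
    return (error_fn(up) - error_fn(down)) / (2 * h)


def gradient_descent_step(error_fn, weights, learning_rate):
    """Update every weight against its own partial derivative."""
    grads = [partial_derivative(error_fn, weights, n)
             for n in range(len(weights))]
    return [w - learning_rate * g for w, g in zip(weights, grads)]


def toy_error(weights):
    """Stand-in error with its minimum at weights [1.0, -2.0]."""
    return (weights[0] - 1.0) ** 2 + (weights[1] + 2.0) ** 2


weights = [0.0, 0.0]
for _ in range(200):
    weights = gradient_descent_step(toy_error, weights, 0.1)
print(weights)
```

Estimating each derivative numerically costs two error evaluations per weight, which is why an analytic rule is preferable for a real network.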

# 5 Back Propagation

## 5.1 Calculate the weight update with chain rule

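Recall equations (5) & (6): the gradient in each update step can be calculated with the chain rule:

- Weights between layer 1 and 2:

$$\frac{\partial E}{\partial w^{(2)}_{jk}} = \frac{\partial E}{\partial a^{(2)}_k} \, \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k} \, \frac{\partial z^{(2)}_k}{\partial w^{(2)}_{jk}} \tag{7}$$

- Weights between layer 0 and 1:

$$\frac{\partial E}{\partial w^{(1)}_{ij}} = \left( \sum_k \frac{\partial E}{\partial a^{(2)}_k} \, \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k} \, \frac{\partial z^{(2)}_k}{\partial a^{(1)}_j} \right) \frac{\partial a^{(1)}_j}{\partial z^{(1)}_j} \, \frac{\partial z^{(1)}_j}{\partial w^{(1)}_{ij}} \tag{8}$$

Here $E$ is the error, $z$ is the input of a neuron, $a$ is its output and $w$ is the weight of a connection.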
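A chain-rule gradient can be sanity-checked against a numerical estimate. The sketch below does this for a single weight feeding one output neuron, assuming a sigmoid activation and a halved squared error. Both are assumptions, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error_for_weight(w, prev_output, target):
    """Error of one output neuron as a function of one incoming weight."""
    z = prev_output * w          # neuron input
    a = sigmoid(z)               # neuron output
    return 0.5 * (a - target) ** 2


def chain_rule_gradient(w, prev_output, target):
    """dE/dw as the product of the three chain-rule factors."""
    z = prev_output * w
    a = sigmoid(z)
    dE_da = a - target           # error w.r.t. the neuron output
    da_dz = a * (1.0 - a)        # sigmoid derivative w.r.t. the input
    dz_dw = prev_output          # input w.r.t. the weight
    return dE_da * da_dz * dz_dw


w, prev_output, target, h = 0.3, 0.8, 1.0, 1e-6
analytic = chain_rule_gradient(w, prev_output, target)
numeric = (error_for_weight(w + h, prev_output, target)
           - error_for_weight(w - h, prev_output, target)) / (2 * h)
print(analytic, numeric)  # the two values should agree closely
```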

## 5.2 Backpropagate

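There is a common factor between equations (7) & (8): $\frac{\partial E}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k}$. To avoid duplicate computation, this value is reused/backpropagated to earlier layers. This is called Back Propagation.

Therefore, the backpropagated value into the $l$-th layer can be generalized into $\delta^{(l)}_p = \frac{\partial E}{\partial z^{(l)}_p}$, and the weight update can be generalized for more than 3 layers:

$$\delta^{(L)}_p = \frac{\partial E}{\partial a^{(L)}_p} \, f'(z^{(L)}_p), \qquad \delta^{(l)}_p = \left( \sum_r \delta^{(l+1)}_r \, w^{(l+1)}_{pr} \right) f'(z^{(l)}_p) \quad (l < L)$$

$$w^{(l)}_{qp} \leftarrow w^{(l)}_{qp} - \eta \, \delta^{(l)}_p \, a^{(l-1)}_q$$

Here $E$ is the error, $z^{(l)}_p$ and $a^{(l)}_p$ are the input and output of the $p$-th neuron in the $l$-th layer, $f$ is the activation function, $w^{(l)}_{qp}$ is the weight from the $q$-th neuron in the $(l-1)$-th layer to the $p$-th neuron in the $l$-th layer, and $\eta$ is the learning rate.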
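The reuse of the common factor can be sketched element-wise for the 2-3-2 network. This is a minimal illustration, not a reference implementation: the sigmoid activation, the halved squared error, the starting weights and the names `forward`, `error` and `train_step` are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Outputs of the hidden layer (a1) and the output layer (a2)."""
    a1 = [sigmoid(sum(x[i] * w1[i][j] for i in range(2))) for j in range(3)]
    a2 = [sigmoid(sum(a1[j] * w2[j][k] for j in range(3))) for k in range(2)]
    return a1, a2


def error(a2, t):
    return 0.5 * sum((a2[k] - t[k]) ** 2 for k in range(2))


def train_step(x, t, w1, w2, lr):
    a1, a2 = forward(x, w1, w2)
    # Output layer: the common factor, computed once per output neuron.
    delta2 = [(a2[k] - t[k]) * a2[k] * (1.0 - a2[k]) for k in range(2)]
    # Hidden layer: reuse delta2 instead of recomputing it (backpropagation).
    delta1 = [sum(delta2[k] * w2[j][k] for k in range(2))
              * a1[j] * (1.0 - a1[j]) for j in range(3)]
    # Each weight moves against (its neuron's delta) * (its incoming output).
    new_w2 = [[w2[j][k] - lr * delta2[k] * a1[j] for k in range(2)]
              for j in range(3)]
    new_w1 = [[w1[i][j] - lr * delta1[j] * x[i] for j in range(3)]
              for i in range(2)]
    return new_w1, new_w2


x, t = [0.5, -1.0], [1.0, 0.0]
w1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.3]]
w2 = [[0.2, -0.1], [-0.3, 0.4], [0.1, 0.2]]
error_before = error(forward(x, w1, w2)[1], t)
for _ in range(100):
    w1, w2 = train_step(x, t, w1, w2, lr=0.5)
error_after = error(forward(x, w1, w2)[1], t)
print(error_before, error_after)
```

Note that the hidden-layer values are computed from the old layer-2 weights, before any weight is updated.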

# 6 Summary

## 6.1 Feed-forward

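$$z^{(l)}_p = \sum_q w^{(l)}_{qp}\, a^{(l-1)}_q, \qquad a^{(l)}_p = f(z^{(l)}_p)$$

where:

- $p$ indexes the neurons in the $l$-th layer
- $q$ indexes the neurons in the $(l-1)$-th layer
- $w^{(l)}_{qp}$ is the weight between neuron $q$ and neuron $p$
- $a$ is the output of a neuron
- $z$ is the input of a neuron
- $f$ is the activation function of a neuron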

## 6.2 Weight update

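For output layer (L-th layer):

$$\delta^{(L)}_p = \frac{\partial E}{\partial a^{(L)}_p} \, f'(z^{(L)}_p), \qquad w^{(L)}_{qp} \leftarrow w^{(L)}_{qp} - \eta \, \delta^{(L)}_p \, a^{(L-1)}_q$$

For hidden layers:

$$\delta^{(l)}_p = \left( \sum_r \delta^{(l+1)}_r \, w^{(l+1)}_{pr} \right) f'(z^{(l)}_p), \qquad w^{(l)}_{qp} \leftarrow w^{(l)}_{qp} - \eta \, \delta^{(l)}_p \, a^{(l-1)}_q$$

where:

- $p$ indexes the neurons in the $l$-th layer
- $q$ indexes the neurons in the $(l-1)$-th layer, and $r$ indexes the neurons in the $(l+1)$-th layer
- $w^{(l)}_{qp}$ is the weight between neuron $q$ and neuron $p$
- $a$ is the output of a neuron
- $z$ is the input of a neuron
- $f$ is the activation function of a neuron
- $E$ is the error/cost, $\eta$ is the learning rate and $\delta$ is the backpropagated value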

# Next

- Maths in a Neural Network: Element-wise
- Maths in a Neural Network: Vectorization
- Maths in a Neural Network: Batch Training
- Code a Neural Network with Numpy