Maths in a Neural Network: Vectorization

Introduction

The element-wise operations were covered in the previous post, Maths in a Neural Network: Element-wise. This post focuses on rewriting those equations in vector and matrix form.

1 Feed-forward

Let’s consider a 2-3-2 network (2 input neurons, 3 hidden neurons, 2 output neurons).

1.1 Element-wise operations

a^{(l)}_p = f^{(l)}(z^{(l)}_p)\quad (1)

z^{(l)}_p = \sum_q w^{(l)}_{qp} a^{(l-1)}_q \quad (2)
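
To make the loop structure concrete, here is a minimal Python sketch of equations (1) and (2) for the hidden layer; the sigmoid activation and the random weights are assumed for illustration, not taken from the post.

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))  # assumed activation: sigmoid

rng = np.random.default_rng(0)
a0 = rng.standard_normal(2)       # A(0): the 2 input activations
w1 = rng.standard_normal((2, 3))  # w1[q, p]: weight from neuron q in layer 0 to neuron p in layer 1

# equation (2): z_p = sum over q of w_qp * a_q, then equation (1): a_p = f(z_p)
z1 = np.zeros(3)
for p in range(3):
    for q in range(2):
        z1[p] += w1[q, p] * a0[q]
a1 = f(z1)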

1.2 Vectorization

Applying equation (1):

A^{(2)}= f^{(2)}(Z^{(2)})

Applying equation (2):

\begin{array}{lcl} \begin{bmatrix}z^{(2)}_0 \\z^{(2)}_1 \end{bmatrix} & = & \begin{bmatrix} w^{(2)}_{00} a^{(1)}_0 + w^{(2)}_{10} a^{(1)}_1 + w^{(2)}_{20} a^{(1)}_2 \\ w^{(2)}_{01} a^{(1)}_0 + w^{(2)}_{11} a^{(1)}_1 + w^{(2)}_{21} a^{(1)}_2 \\ \end{bmatrix} \\ \begin{bmatrix}z^{(2)}_0 \\z^{(2)}_1  \end{bmatrix} & = & \begin{bmatrix} w^{(2)}_{00} & w^{(2)}_{01} \\ w^{(2)}_{10} & w^{(2)}_{11} \\ w^{(2)}_{20} & w^{(2)}_{21} \\ \end{bmatrix} ^T \begin{bmatrix} a^{(1)}_0 \\ a^{(1)}_1 \\ a^{(1)}_2 \end{bmatrix} \\ Z^{(2)} & = & W^{(2) T} A^{(1)} \end{array}

These can be generalized to any layer l of a deeper network:

A^{(l)}= f^{(l)}(Z^{(l)})\quad (3)

Z^{(l)} = W^{(l) T} A^{(l-1)} \quad (4)
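
In NumPy, equations (3) and (4) become one matrix product per layer. The sketch below runs the full 2-3-2 network and checks the result against the element-wise loop; the sigmoid activation and random weights are again assumptions.

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))  # assumed activation

rng = np.random.default_rng(0)
a0 = rng.standard_normal(2)
w1 = rng.standard_normal((2, 3))  # W(1): layer 0 -> layer 1
w2 = rng.standard_normal((3, 2))  # W(2): layer 1 -> layer 2

# equations (3)-(4): Z(l) = W(l)T A(l-1), then A(l) = f(Z(l))
a1 = f(w1.T @ a0)
a2 = f(w2.T @ a1)

# sanity check: the matrix product matches the element-wise sum of equation (2)
z1_loop = np.array([sum(w1[q, p] * a0[q] for q in range(2)) for p in range(3)])
assert np.allclose(a1, f(z1_loop))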

2 Weight Update

Let’s consider a 2-3-2 network.

2.1 Element-wise operations

For output layer:

w^{(2)'}_{jk} = w^{(2)}_{jk} - \alpha \frac{\partial e_k}{\partial w^{(2)}_{jk}} \quad (5)

\frac{\partial e_k}{\partial w^{(2)}_{jk}} = \frac{\partial e_k}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k} \frac{\partial z^{(2)}_k}{\partial w^{(2)}_{jk}} = \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) a^{(1)}_j \quad (6)
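
Written as loops, equations (5) and (6) compute one gradient per output-layer weight. The sketch below assumes the squared-error derivative de_k/da_k = a_k - t_k, a sigmoid activation, and a made-up target and learning rate.

import numpy as np

def f(z):       return 1.0 / (1.0 + np.exp(-z))  # assumed activation: sigmoid
def f_prime(z): return f(z) * (1.0 - f(z))       # its derivative

rng = np.random.default_rng(0)
a0, t = rng.standard_normal(2), np.array([0.0, 1.0])  # input and an assumed target
w1, w2 = rng.standard_normal((2, 3)), rng.standard_normal((3, 2))
alpha = 0.1                                           # assumed learning rate

z1 = w1.T @ a0; a1 = f(z1)   # forward pass, equations (3)-(4)
z2 = w2.T @ a1; a2 = f(z2)

# equation (6): de_k/dw_jk = de_k/da_k * f'(z_k) * a_j, one entry per weight
dw2 = np.zeros_like(w2)
for j in range(3):
    for k in range(2):
        dw2[j, k] = (a2[k] - t[k]) * f_prime(z2[k]) * a1[j]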

For hidden layers:

w^{(1)'}_{ij} = w^{(1)}_{ij} - \alpha \sum_k \frac{\partial e_k}{\partial w^{(1)}_{ij}} \quad (7)

\sum_k\frac{\partial e_k}{\partial w^{(1)}_{ij}} = \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} \frac{\partial a^{(2)}_k}{\partial z^{(2)}_k} \frac{\partial z^{(2)}_k}{\partial a^{(1)}_j} \frac{\partial a^{(1)}_j}{\partial z^{(1)}_j} \frac{\partial z^{(1)}_j}{\partial w^{(1)}_{ij}} = \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) w^{(2)}_{jk} f^{(1)'}(z^{(1)}_j) a^{(0)}_i \quad (8)
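
Continuing the same sketch, equations (7) and (8) gather the error from every output neuron k into each hidden-layer weight; the updates are applied last so that equation (8) still sees the pre-update w2.

# equation (8): every e_k flows back into w1[i, j]
dw1 = np.zeros_like(w1)
for i in range(2):
    for j in range(3):
        dw1[i, j] = sum((a2[k] - t[k]) * f_prime(z2[k]) * w2[j, k]
                        for k in range(2)) * f_prime(z1[j]) * a0[i]

# equations (5) and (7): apply both updates only after both gradients exist
w2_new = w2 - alpha * dw2
w1_new = w1 - alpha * dw1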

2.2 Vectorization

For output layer:

Let’s vectorize the weight update equation by applying equation (5):
\begin{bmatrix} w^{(2)'}_{00} & w^{(2)'}_{01} \\ w^{(2)'}_{10} & w^{(2)'}_{11} \\ w^{(2)'}_{20} & w^{(2)'}_{21} \\ \end{bmatrix} = \begin{bmatrix} w^{(2)}_{00} & w^{(2)}_{01} \\ w^{(2)}_{10} & w^{(2)}_{11} \\ w^{(2)}_{20} & w^{(2)}_{21} \\ \end{bmatrix} - \alpha \begin{bmatrix} \frac{\partial e_0}{\partial w^{(2)}_{00}} & \frac{\partial e_1}{\partial w^{(2)}_{01}} \\ \frac{\partial e_0}{\partial w^{(2)}_{10}} & \frac{\partial e_1}{\partial w^{(2)}_{11}} \\ \frac{\partial e_0}{\partial w^{(2)}_{20}} & \frac{\partial e_1}{\partial w^{(2)}_{21}} \\ \end{bmatrix}\\

Let’s represent the weights as a matrix and apply equation (6):

W^{(2)'} = W^{(2)} - \alpha \begin{bmatrix} \frac{\partial e_0}{\partial a^{(2)}_0} f^{(2)'}(z^{(2)}_0) a^{(1)}_0 & \frac{\partial e_1}{\partial a^{(2)}_1} f^{(2)'}(z^{(2)}_1) a^{(1)}_0 \\ \frac{\partial e_0}{\partial a^{(2)}_0} f^{(2)'}(z^{(2)}_0) a^{(1)}_1 & \frac{\partial e_1}{\partial a^{(2)}_1} f^{(2)'}(z^{(2)}_1) a^{(1)}_1 \\ \frac{\partial e_0}{\partial a^{(2)}_0} f^{(2)'}(z^{(2)}_0) a^{(1)}_2 & \frac{\partial e_1}{\partial a^{(2)}_1} f^{(2)'}(z^{(2)}_1) a^{(1)}_2 \end{bmatrix}

W^{(2)'} = W^{(2)} - \alpha A^{(1)} (E^{'}(A^{(2)}) \diamond f^{(2)'}(Z^{(2)}))^T

Where:

E^{'}(A^{(2)}) = \begin{bmatrix} \frac{\partial e_0}{\partial a^{(2)}_0} \\ \frac{\partial e_1}{\partial a^{(2)}_1} \end{bmatrix}

Define the backpropagated value delta:

\delta^{(2)} = E^{'}(A^{(2)}) \diamond f^{(2)'}(Z^{(2)}) \quad (9)

W^{(2)'} = W^{(2)} - \alpha A^{(1)} \delta^{(2)T} \quad (10)
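
In NumPy, with the variables from the loop sketch in section 2.1, equations (9) and (10) collapse into a Hadamard product followed by an outer product; the squared-error derivative E'(A) = A - T is still an assumption.

# equation (9): delta2 = E'(A(2)) ⋄ f(2)'(Z(2)); ⋄ is NumPy's element-wise *
delta2 = (a2 - t) * f_prime(z2)

# equation (10): the whole gradient matrix is one outer product A(1) delta2^T
assert np.allclose(np.outer(a1, delta2), dw2)  # agrees with the loop version
w2_new = w2 - alpha * np.outer(a1, delta2)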

For hidden layers:

Applying equation (7):

\begin{bmatrix} w^{(1)'}_{00} & w^{(1)'}_{01} & w^{(1)'}_{02} \\ w^{(1)'}_{10} & w^{(1)'}_{11} & w^{(1)'}_{12} \end{bmatrix} = \begin{bmatrix} w^{(1)}_{00} & w^{(1)}_{01} & w^{(1)}_{02} \\ w^{(1)}_{10} & w^{(1)}_{11} & w^{(1)}_{12} \end{bmatrix} - \alpha \begin{bmatrix} \sum_k \frac{\partial e_k}{\partial w^{(1)}_{00}} & \sum_k \frac{\partial e_k}{\partial w^{(1)}_{01}} & \sum_k \frac{\partial e_k}{\partial w^{(1)}_{02}} \\ \sum_k \frac{\partial e_k}{\partial w^{(1)}_{10}} & \sum_k \frac{\partial e_k}{\partial w^{(1)}_{11}} & \sum_k \frac{\partial e_k}{\partial w^{(1)}_{12}} \end{bmatrix}

Let’s represent the weights as a matrix and apply equation (8):

W^{(1)'} = W^{(1)} - \alpha \begin{bmatrix} \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) w^{(2)}_{0k} f^{(1)'}(z^{(1)}_0) a^{(0)}_0 & \cdots & \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) w^{(2)}_{2k} f^{(1)'}(z^{(1)}_2) a^{(0)}_0 \\ \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) w^{(2)}_{0k} f^{(1)'}(z^{(1)}_0) a^{(0)}_1 & \cdots & \sum_k \frac{\partial e_k}{\partial a^{(2)}_k} f^{(2)'}(z^{(2)}_k) w^{(2)}_{2k} f^{(1)'}(z^{(1)}_2) a^{(0)}_1 \end{bmatrix}

W^{(1)'} = W^{(1)} - \alpha A^{(0)} (f^{(1)'}(Z^{(1)}) \diamond W^{(2)} (E^{'}(A^{(2)}) \diamond f^{(2)'}(Z^{(2)})))^T

W^{(1)'} = W^{(1)} - \alpha A^{(0)} (f^{(1)'}(Z^{(1)}) \diamond W^{(2)} \delta^{(2)})^T

Define the backpropagated value delta:

\delta^{(1)} = f^{(1)'}(Z^{(1)}) \diamond W^{(2)} \delta^{(2)} \quad (11)

W^{(1)'} = W^{(1)} - \alpha A^{(0)} \delta^{(1)T} \quad (12)
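
The hidden layer follows the same pattern. Note that equation (11) uses the pre-update W(2); continuing the sketch, the result matches the summed loops of equation (8) exactly.

# equation (11): delta1 = f(1)'(Z(1)) ⋄ W(2) delta2
delta1 = f_prime(z1) * (w2 @ delta2)

# equation (12): again a single outer product
assert np.allclose(np.outer(a0, delta1), dw1)  # agrees with the loop version
w1_new = w1 - alpha * np.outer(a0, delta1)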

These can be generalized to deeper networks, with layers indexed from 0 (input) to L (output):

Output layer:

\delta^{(L)} = E^{'}(A^{(L)}) \diamond f^{(L)'}(Z^{(L)}) \quad (13) \\ W^{(L)'} = W^{(L)} - \alpha A^{(L-1)} \delta^{(L)T}\quad (14)

Hidden layer:

\delta^{(l)} = f^{(l)'}(Z^{(l)}) \diamond W^{(l+1)} \delta^{(l+1)}\quad (15) \\ W^{(l)'} = W^{(l)} - \alpha A^{(l-1)} \delta^{(l)T} \quad (16)
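
Equations (13) through (16) fit in one backward loop over the layers. Below is a minimal sketch: weights[l] is assumed to have shape (size of layer l, size of layer l+1), every layer shares one activation f, and the error derivative is the squared-error choice A - T.

import numpy as np

def f(z):       return 1.0 / (1.0 + np.exp(-z))  # assumed activation for every layer
def f_prime(z): return f(z) * (1.0 - f(z))

def train_step(weights, x, t, alpha):
    # feed-forward, keeping every Z(l) and A(l) (equations 3-4)
    a, zs = [x], []
    for w in weights:
        zs.append(w.T @ a[-1])
        a.append(f(zs[-1]))
    # output layer: equations (13)-(14)
    delta = (a[-1] - t) * f_prime(zs[-1])
    updated = [weights[-1] - alpha * np.outer(a[-2], delta)]
    # hidden layers: equations (15)-(16), walking backwards with the old weights
    for l in range(len(weights) - 2, -1, -1):
        delta = f_prime(zs[l]) * (weights[l + 1] @ delta)
        updated.insert(0, weights[l] - alpha * np.outer(a[l], delta))
    return updated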

3 Summary

3.1 Feed-forward:

A^{(l)}= f^{(l)}(Z^{(l)})

Z^{(l)} = W^{(l) T} A^{(l-1)}

3.2 Weight update:

Output layer:

\delta^{(L)} = E^{'}(A^{(L)}) \diamond f^{(L)'}(Z^{(L)}) \\ W^{(L)'} = W^{(L)} - \alpha A^{(L-1)} \delta^{(L)T}

Hidden layer:

\delta^{(l)} = f^{(l)'}(Z^{(l)}) \diamond W^{(l+1)} \delta^{(l+1)} \\ W^{(l)'} = W^{(l)} - \alpha A^{(l-1)} \delta^{(l)T}
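
As a quick end-to-end check of the summary equations, the snippet below repeatedly applies the train_step sketch from section 2.2 to the 2-3-2 network on one made-up sample; the target, step count, and learning rate are arbitrary.

rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((3, 2))]  # the 2-3-2 network
x, t = rng.standard_normal(2), np.array([0.0, 1.0])

for _ in range(200):
    weights = train_step(weights, x, t, alpha=0.5)

a = x
for w in weights:  # one last feed-forward pass
    a = f(w.T @ a)
print(a, t)        # the output should have moved toward the target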

Next

  1. Maths in a Neural Network: Element-wise
  2. Maths in a Neural Network: Vectorization
  3. Code a Neural Network with Numpy
  4. Maths in a Neural Network: Batch Training
