CIL6 Neural Network


Composition

  • layer

    • neurons/units
      • input
      • weight
      • activation function
      • output $x^{(l)}=\sigma^{(l)}(W^{(l)}x^{(l-1)})$​
  • network depth: the feature hierarchy

  • layer width: the number of features

Activation function

Sigmoid

change the linearity to non-linearity

ReLU

change the linearity in some way

simple derivative

Training

Loss function

Define the training objective, can be chosen by the output type. $y^*$ is the target output, $y$ is the predicted output

  • $y^*$​​ is categorical: cross-entropy loss:
    $$
    l(y^*,y)=-y^* \log y - (1-y^*)\log(1-y)
    $$
  • $y^*$ is numerical: squared loss:
    $$
    l(y^*,y)=\frac{1}{2}(y^*-y)^2
    $$

Regularization

to penalize the parameters

  • L2 regularization: $L_{\lambda}(X;\theta)=L(X;\theta)+\frac{\lambda}{2}|\theta|_2^2$

Backpropagation

SGD

different from past SGD, here we also include a step size $\eta$, because the steepest / original descent is too expensive for large data sets.

In the past, it should be $\theta\leftarrow (1-\lambda)\theta - \nabla_\theta l(y_t^*, y(x_t, \theta))$

But with $\eta$​​, it becomes $\theta\leftarrow (1-\eta\lambda)\theta - \eta\nabla_\theta l(y_t^*, y(x_t, \theta))$​​

Chain rule

$$
\frac{\partial x^{(l)}}{\partial x^{(l-n)}} = J^{(l)}\cdot J^{(l-1)}\cdots J^{(l-n+1)}\
\nabla_{x^{(l)}}^T l = \nabla_y^T l\cdot J^{(L)}\cdots J^{(l+1)}
$$

Weights influence

$$
\frac{\partial x_i^{(l)}}{\partial w_{ij}^{(l)}} = \sigma’([w_i^l]^Tx^{(l-1)})x_j^{(l-1)}
$$

is composed of two parts:

  • sensitivity
  • activation

Comparison to Logistics Regression

Logistic Regression:

  • linear

MLP: multi-layer perceptron:

  • learn intermediate feature representation

  • include non-linearity

Convolutional Neural Network

Receptive field

The creative point of convolutional neural network is how it chooses and organizes the input $x^{(l)}$

This variant in some way complicates the neural network

Weight sharing

This simplifies the neural network. Neurons share the same weights.

Weights define a filter mask. A filter mask corresponds to a vector of $y$, in CNN, which is called channel.

Building blocks

  • convolutional layer
  • pooling layer
  • fully-connected layer

Convolutional layer

from Medium The objective of the Convolution Operation is to extract the high-level features such as edges, from the input image. ConvNets need not be limited to only one Convolutional Layer. Conventionally,

  • the first ConvLayer is responsible for capturing the Low-Level features such as edges, color, gradient orientation, etc.
  • With added layers, the architecture adapts to the High-Level features as well, giving us a network which has the wholesome understanding of images in the dataset, similar to how we would.

Formula

$$
F_{n,m}(x; w)=\sigma(b+\sum_{k=-2}^2\sum_{l=-2}^2w_{k,l}\cdot x_{n+k, m+l})
$$

Pooling layer

the Pooling layer is responsible

  • for reducing the spatial size of the Convolved Feature
  • to decrease the computational power required to process the data through dimensionality reduction.
  • useful for extracting dominant features which are rotational and positional invariant, thus maintaining the process of effectively training of the model.

Methods

  • max pooling
  • average pooling

Fully-connected layer

Fully-Connected layer is a (usually) cheap way

  • of learning non-linear combinations of the high-level features as represented by the output of the convolutional layer.
  • it combines all the output from the previous layer. Different from the convolutional layer, which only uses part of the output from the previous layer.

Variants

Deeper Network

the number of layers grows from 10+ to 100+.

not only use the output from the previous layer but also the input of the previous layer

Semantic Segmentation

add de-convolutional layers.


Author: Fululu
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Fululu !
  TOC