Machine Learning

1.5 Neural Networks Part 1: Setting up the Architecture

Imron Rosyadi

Understanding Neural Networks: From Neurons to Layers

Sources

Stanford University CS231n: Deep Learning for Computer Vision


Table of Contents

  1. Quick Intro: Linear vs NN
  2. Modeling One Neuron
  3. Commonly Used Activation Functions
  4. Neural Network Architectures
  5. Summary
  6. Additional References

1. Quick Intro: Beyond Linear Models

Linear Classification:

  • Scores obtained directly from weighted input sum.
  • Example: \( s = W x \)
    • \(x\): input vector (e.g., image pixels).
    • \(W\): weight matrix.
    • \(s\): class scores.

Note

Issue: Limited to linear decision boundaries.

Neural Network Approach (2-layer):

  • Introduces a non-linear transformation.
  • Example: \( s = W_2 \max(0, W_1 x) \)
    • \(W_1\): First layer weights.
    • \(\max(0, \cdot)\): Element-wise non-linearity (ReLU).
    • \(W_2\): Second layer weights.

Important

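The non-linearity is crucial: without it, the two weight matrices collapse into a single matrix and the scores are again a linear function of the input. The non-linearity is what gives the model its “wiggle”.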
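A small worked example may make this concrete. The sketch below uses hypothetical toy weights (chosen only for illustration, not taken from the course) and plain Python lists as matrices. Without the ReLU, the product \(W_2 W_1\) here is the zero matrix, so every input scores 0. With the ReLU, the same weights compute \(|x_0 - x_1|\), which no single linear map can represent.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x (a list of numbers)."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n), both lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, v_i) for v_i in v]


# Hypothetical toy weights: 2 inputs -> 2 hidden units -> 1 score.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]


def linear_scores(x):
    # No non-linearity: the two layers collapse into the single matrix W2 W1.
    return matvec(matmul(W2, W1), x)


def two_layer_scores(x):
    # s = W2 max(0, W1 x)
    return matvec(W2, relu(matvec(W1, x)))


for x in ([2.0, 0.0], [0.0, 2.0], [1.0, 1.0]):
    print(x, linear_scores(x), two_layer_scores(x))
```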

Quick Intro: Visualizing the Flow

graph LR
    Input["Input Image (x)"] --> W1_Layer("Layer 1 - W1")
    W1_Layer --> Nonlinearity["Non-Linear Activation (max(0, .))"]
    Nonlinearity --> W2_Layer("Layer 2 - W2")
    W2_Layer --> Scores["Output Scores (s)"]

    style Input fill:#f9f,stroke:#333,stroke-width:2px;
    style Scores fill:#bbf,stroke:#333,stroke-width:2px;

  • Deeper Networks:
    • A three-layer network: \( s = W_3 \max(0, W_2 \max(0, W_1 x)) \)
    • Adds more non-linear transformations.
    • Parameters \(W_i\) are learned via stochastic gradient descent and backpropagation.

2. Modeling One Neuron: Biological Motivation

Biological Neuron (Left):

  • Dendrites: Receive input signals.
  • Cell Body: Integrates signals.
  • Axon: Transmits output signals.
  • Synapses: Connections to other neurons, with variable strengths.
A cartoon drawing of a biological neuron.

Computational Model (Right):

  • Inputs (\(x_i\)) correspond to signals from other neurons.
  • Weights (\(w_i\)) represent synaptic strengths (learnable).
  • Summation: \(\sum_i w_i x_i + b\) (cell body processing).
  • Activation function (\(f\)): Simulates firing rate.
Mathematical model of a neuron.

Modeling One Neuron: Interactive Forward Pass

Let’s simulate a single neuron’s forward computation. You can modify inputs, weights, and bias to observe the firing_rate.
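A minimal sketch of that forward computation, assuming a sigmoid activation as the firing-rate function. The input, weight, and bias values are arbitrary illustration values, and `neuron_forward` is a name introduced here.

```python
import math


def neuron_forward(inputs, weights, bias):
    """Forward pass of one neuron: weighted sum, then sigmoid activation."""
    # Cell body: sum_i w_i * x_i + b
    cell_body_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation function f: sigmoid squashes the sum into (0, 1)
    firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))
    return firing_rate


# Arbitrary example values; change them to see how firing_rate responds.
inputs = [1.0, -2.0, 0.5]
weights = [0.5, 0.25, -1.0]
bias = 0.5

firing_rate = neuron_forward(inputs, weights, bias)
print(firing_rate)
```

With these particular values the weighted sum plus bias is exactly 0, so the sigmoid returns 0.5. Increasing the bias or any weight on a positive input pushes the firing rate toward 1.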

Single Neuron as a Linear Classifier

A single neuron’s output \(\sigma(\sum_i w_i x_i + b)\) can be interpreted as a probability.

  • Binary Softmax Classifier (Logistic Regression)
    • \(P(y=1 \mid x; w) = \sigma(\sum_i w_i x_i + b)\)
    • \(P(y=0 \mid x; w) = 1 - P(y=1 \mid x; w)\)
    • Optimized with cross-entropy loss.
  • Binary SVM Classifier
    • The neuron’s output can be combined with a max-margin hinge loss.
    • The neuron effectively “fires” if the input falls into one class, and not for the other.
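The two losses above can be sketched side by side. This is an illustration under stated assumptions: labels are taken to be in {0, 1} for the cross-entropy loss and in {-1, +1} for the hinge loss, the margin is fixed at 1, and all function names are introduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Raw neuron output before any activation: sum_i w_i * x_i + b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def cross_entropy_loss(x, y, w, b):
    """Binary cross-entropy (logistic regression); y is 0 or 1."""
    p1 = sigmoid(score(x, w, b))  # P(y = 1 | x; w)
    return -math.log(p1) if y == 1 else -math.log(1.0 - p1)


def hinge_loss(x, y, w, b):
    """Max-margin hinge loss (binary SVM); y is -1 or +1, margin of 1."""
    return max(0.0, 1.0 - y * score(x, w, b))


w, b = [2.0, -1.0], 0.0
print(cross_entropy_loss([0.0, 0.0], 1, w, b))  # score 0, so P(y=1) = 0.5
print(hinge_loss([1.0, 0.0], +1, w, b))         # score 2, beyond the margin
print(hinge_loss([1.0, 0.0], -1, w, b))         # score 2 on the wrong side
```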

Tip

Regularization: In this context, regularization loss (e.g., L2) can be seen as “gradual forgetting” of synaptic weights.
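One way to see the “forgetting” interpretation: for an L2 penalty \(\frac{1}{2} \lambda \sum_i w_i^2\), the gradient with respect to \(w_i\) is \(\lambda w_i\), so a gradient step on the penalty alone shrinks every weight toward zero. The sketch below assumes that form of the penalty and uses arbitrary values for the learning rate and \(\lambda\); `l2_decay_step` is a name introduced here.

```python
def l2_decay_step(weights, learning_rate, reg_strength):
    """One gradient step on the L2 penalty 0.5 * reg_strength * sum(w^2) only."""
    # The penalty's gradient w.r.t. each weight is reg_strength * w.
    return [w - learning_rate * reg_strength * w for w in weights]


weights = [1.0, -2.0, 0.5]
for step in range(3):
    # With these illustrative settings each step halves every weight.
    weights = l2_decay_step(weights, learning_rate=0.5, reg_strength=1.0)
    print(step, weights)
```

In real training this decay is combined with the data-loss gradient, so only weights that the data keeps reinforcing stay large.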

Note

A single neuron can be used to implement a binary classifier (e.g., binary Softmax or binary SVM classifiers).

3. Commonly Used Activation Functions

An activation function (or non-linearity) takes a single number and performs a fixed mathematical operation.

Sigmoid Function \(\sigma(x) = 1 / (1 + e^{-x})\)

  • Squashes real numbers to range [0, 1].
  • Historically popular for “firing rate” interpretation.
Sigmoid non-linearity.

Warning

Drawbacks:

  • Saturates and kills gradients: At tails (0 or 1), gradient is near zero, hindering learning.
  • Non-zero-centered output: Can lead to zig-zagging gradient updates.
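The saturation drawback follows from the sigmoid's derivative, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which peaks at 0.25 for \(x = 0\) and vanishes at both tails. A short sketch that evaluates it at a few arbitrarily chosen inputs:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Local gradient of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    # The output is always positive (not zero-centered), and the gradient
    # is largest at x = 0 and shrinks rapidly as |x| grows.
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

During backpropagation this local gradient multiplies the upstream gradient, so a saturated neuron passes almost no signal back to its weights.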

Tanh Function \(\tanh(x) = 2 \sigma(2x) - 1\)

  • Squashes real numbers to range [-1, 1].
  • Zero-centered output - an improvement over sigmoid.

Tanh non-linearity.

Note

Preferred over Sigmoid: Due to its zero-centered output, it generally performs better than sigmoid. Still suffers from saturation.

Activation Functions: ReLU and Variants

Rectified Linear Unit (ReLU) \(f(x) = \max(0, x)\)

  • Output is 0 for negative input, \(x\) for positive input.
  • Pros:
    • Accelerates convergence of stochastic gradient descent compared with sigmoid/tanh (Krizhevsky et al. report a factor of 6).
    • Computationally efficient (simple thresholding).
    • Does not saturate in the positive region.
ReLU activation function.

ReLU Cons & Variants:

  • “Dying ReLU” problem: A large gradient update can push a neuron’s weights to a point where it outputs 0 for every input. Its gradient is then zero as well, so the neuron never recovers.
  • Leaky ReLU: \(f(x) = \mathbb{1}(x < 0)(\alpha x) + \mathbb{1}(x \ge 0)(x)\)
    • Introduces a small positive slope \(\alpha\) for negative inputs (e.g., 0.01).
    • Aims to prevent dying ReLUs.
  • Maxout: \(\max(w_1^T x + b_1, w_2^T x + b_2)\)
    • Generalizes ReLU and its leaky version.
    • No saturation, no dying problem.
    • Drawback: Doubles the number of parameters per neuron.
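The variants above can be sketched as plain functions. The code also checks one case of the “generalizes ReLU” claim: setting \(w_1 = 0\) and \(b_1 = 0\) in Maxout recovers ReLU applied to \(w_2^T x + b_2\). The weights and inputs are arbitrary illustration values, and the helper names are introduced here.

```python
def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Slope 1 for z >= 0, small slope alpha for z < 0."""
    return z if z >= 0 else alpha * z


def maxout(x, w1, b1, w2, b2):
    """Max of two affine functions of x: two weight vectors per neuron."""
    return max(dot(w1, x) + b1, dot(w2, x) + b2)


w, b = [1.0, 2.0], -0.5   # arbitrary weights for the illustration
zeros = [0.0, 0.0]

for x in ([3.0, -1.0], [-3.0, 1.0]):
    z = dot(w, x) + b
    # Maxout with w1 = 0, b1 = 0 computes max(0, w^T x + b), i.e. ReLU.
    print(x, relu(z), leaky_relu(z), maxout(x, zeros, 0.0, w, b))
```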

Warning

For ReLU, monitor “dead” units. High learning rates can exacerbate the dying ReLU problem.

Interactive Activation Functions

Explore how Sigmoid, Tanh, and ReLU functions respond to different input values. Adjust the Input X Value slider to see the corresponding output for each function.
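The same comparison can be done offline with a few lines of code. This sketch prints each function's output for a handful of arbitrarily chosen inputs and computes tanh through the identity \(\tanh(x) = 2\sigma(2x) - 1\) to show that the two agree.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh expressed as a scaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    return max(0.0, x)


print(f"{'x':>6} {'sigmoid':>9} {'tanh':>9} {'relu':>6}")
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"{x:6.1f} {sigmoid(x):9.4f} {tanh_via_sigmoid(x):9.4f} {relu(x):6.1f}")
```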

4. Neural Network Architectures

Layer-wise Organization

Neural Networks are collections of neurons connected in an acyclic graph, so the outputs of some neurons become inputs to others and cycles are not allowed. The most common organization is into distinct layers.

Fully-Connected Layer:

  • Neurons between adjacent layers are fully pairwise connected.
  • Neurons within a single layer share no connections.
Two-layer Neural Network topology.

Example - 3-Layer Network:

  • Three inputs.
  • Two hidden layers, each with 4 neurons.
  • One output layer.
Three-layer Neural Network topology.

Neural Network Architecture: A Deeper Look

Let’s visualize the connections in a 3-layer neural network with 3 inputs, two hidden layers of 4 neurons, and 1 output.

Figure: 3-layer fully-connected network. Input Layer (x0, x1, x2) → Hidden Layer 1 (H1_0 to H1_3) → Hidden Layer 2 (H2_0 to H2_3) → Output Layer (O0). Every node connects to every node in the next layer.

Naming Conventions & Sizing Neural Networks

  • N-layer network: Number of hidden layers + output layer (input layer is not usually counted).
    • Single-layer network: No hidden layers (e.g., Logistic Regression).
    • Artificial Neural Networks (ANN) or Multi-Layer Perceptrons (MLP) are common synonyms.
    • “Units” is a more general term than “neurons”.
  • Output Layer: Typically has no activation function (or linear identity) for class scores or regression targets.

Sizing Metrics:

  1. Number of neurons (excluding input).
  2. Number of parameters (weights + biases) - more common.

Example 1 (2-layer NN):

  • 3 inputs, 4 hidden neurons, 2 output neurons.
  • Weights: \([3 \times 4] + [4 \times 2] = 12 + 8 = 20\)
  • Biases: \(4 + 2 = 6\)
  • Total Parameters: \(20 + 6 = 26\)

Example 2 (3-layer NN):

  • 3 inputs, 4 hidden (H1), 4 hidden (H2), 1 output neuron.
  • Weights: \([3 \times 4] + [4 \times 4] + [4 \times 1] = 12 + 16 + 4 = 32\)
  • Biases: \(4 + 4 + 1 = 9\)
  • Total Parameters: \(32 + 9 = 41\)
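The arithmetic in both examples follows one rule: each pair of adjacent layers contributes (inputs × outputs) weights, and each non-input neuron contributes one bias. A small sketch, where `count_parameters` is a helper introduced here for illustration:

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the width of every layer, input first,
    e.g. [3, 4, 2] for 3 inputs, 4 hidden neurons, 2 outputs.
    """
    # One weight per connection between adjacent layers.
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron, excluding the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases, weights + biases


print(count_parameters([3, 4, 2]))     # Example 1 (2-layer NN)
print(count_parameters([3, 4, 4, 1]))  # Example 2 (3-layer NN)
```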

Tip

Modern Convolutional Networks can have 10-20 layers and >100 million parameters (“deep learning”).

Example: Feed-Forward Computation

The layered structure allows for efficient computation using matrix-vector operations.

Consider a 3-layer network:

  • Input \(x\): [3x1] vector.
  • First hidden layer weights \(W_1\): [4x3] matrix, biases \(b_1\): [4x1] vector.
  • Second hidden layer weights \(W_2\): [4x4] matrix, biases \(b_2\): [4x1] vector.
  • Output layer weights \(W_3\): [1x4] matrix, biases \(b_3\): [1x1] vector.

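The full forward pass, with \(f\) the element-wise activation function:

  • \( h_1 = f(W_1 x + b_1) \): first hidden layer activations, [4x1].
  • \( h_2 = f(W_2 h_1 + b_2) \): second hidden layer activations, [4x1].
  • \( s = W_3 h_2 + b_3 \): output score, [1x1] (no activation on the output layer).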
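A minimal sketch of this forward pass for the shapes listed above. It makes several assumptions: weights are drawn at random, biases start at zero, the hidden layers use a sigmoid, and matrices are plain Python lists so the example has no dependencies. In practice the same three lines are usually written with a numerical library such as NumPy.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)                          # fixed seed for repeatability
W1, b1 = random_matrix(4, 3, rng), [0.0] * 4    # [4x3], [4x1]
W2, b2 = random_matrix(4, 4, rng), [0.0] * 4    # [4x4], [4x1]
W3, b3 = random_matrix(1, 4, rng), [0.0]        # [1x4], [1x1]


def forward(x):
    h1 = [sigmoid(z) for z in affine(W1, x, b1)]   # first hidden layer (4 values)
    h2 = [sigmoid(z) for z in affine(W2, h1, b2)]  # second hidden layer (4 values)
    out = affine(W3, h2, b3)                       # output score, no activation
    return h1, h2, out


h1, h2, out = forward([1.0, -0.5, 2.0])            # arbitrary 3-number input
print(len(h1), len(h2), out)
```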

Note

The forward pass of a fully-connected layer corresponds to one matrix multiplication followed by a bias offset and an activation function.

Representational Power: Universal Approximators

Are there functions a Neural Network cannot model?

  • Universal Approximation Theorem:
    • A Neural Network with one hidden layer (and a reasonable non-linearity, e.g., sigmoid) can approximate any continuous function to an arbitrary degree of accuracy.
    • \( \forall f(x), \epsilon > 0 \;\; \exists g(x) \text{ such that } \forall x, \; |f(x) - g(x)| < \epsilon \)

Note

Mathematically, a single hidden layer is sufficient, but this doesn’t tell us about practical learning or performance.

Why go deeper then?

  • The “universal approximator” statement is theoretically comforting but practically weak.
  • Deeper networks (multiple hidden layers) often perform empirically better.
  • They learn more compact, hierarchical, and abstract representations.
  • Especially true for data with inherent hierarchical structure (e.g., images: edges \(\rightarrow\) textures \(\rightarrow\) objects).

See Cybenko (1989) for the formal result and Michael Nielsen’s intuitive explanation for a visual walkthrough.

Setting Number of Layers and Their Sizes

Network Capacity: The ability of a model to approximate complex functions. Increases with more layers and neurons.

Larger NNs can represent more complicated functions. Circles are data points, colors are classes, and the shaded decision regions are those of the trained networks. (ConvNetJS demo)
  • Overfitting: When a high-capacity model learns noise in training data instead of underlying patterns.
    • Left: 1 hidden neuron - too low capacity, underfits.
    • Middle: 3 hidden neurons - good balance.
    • Right: 20 hidden neurons - very high capacity, potentially overfits by creating complex, disjoint decision regions.

Controlling Overfitting: Prioritizing Regularization

Counterintuitive Advice:

  • Don’t use smaller networks to prevent overfitting.
  • Smaller networks are harder to train effectively with gradient descent; they often converge to “bad” local minima.
  • Larger networks have many more local minima, but these tend to be better in terms of actual loss.

Note

Always use as big of a neural network as your computational budget allows!

Preferred Strategy:

  • Use a large network to ensure high capacity.
  • Control overfitting with robust regularization techniques.
Effects of regularization strength (20 hidden neurons each). Stronger regularization yields smoother decision regions. (ConvNetJS demo)

Summary

  • Introduced a coarse model of a biological neuron and its computational counterpart.
  • Explored various activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, Maxout), with ReLU being the most common choice today.
  • Defined Neural Networks with Fully-Connected layers, characterized by pairwise connections between adjacent layers.
  • Understood how this layered architecture allows for efficient feed-forward computation via matrix multiplications.
  • Discussed that Neural Networks are universal function approximators, and why deep layers are still empirically preferred in practice.
  • Emphasized using large networks and controlling overfitting with strong regularization rather than limiting network size.

Additional References